The Science of Value-Added Evaluation

"A value-added analysis constitutes a series of personal, high-stakes experiments conducted under extremely uncontrolled conditions".

If drug experiments were conducted like VAM evaluations, we might all have three legs, or worse.

Value-added teacher evaluation has been extensively criticized and strongly defended, but less frequently examined from a dispassionate scientific perspective. Among the value-added movement's most fervent advocates is a respected scientific school of thought that believes reliable causal conclusions can be teased out of huge data sets by economists or statisticians using sophisticated statistical models that control for extraneous factors.

Another scientific school of thought, especially prevalent in medical research, holds that the most reliable method for arriving at defensible causal conclusions involves conducting randomized controlled trials, or RCTs, in which (a) individuals are premeasured on an outcome, (b) randomly assigned to receive different treatments, and (c) measured again to ascertain if changes in the outcome differed based upon the treatments received.

The purpose of this brief essay is not to argue the pros and cons of the two approaches, but to frame value-added teacher evaluation from the latter, experimental perspective. For conceptually, what else is an evaluation of perhaps 500 4th grade teachers in a moderate-size urban school district but 500 high-stakes individual experiments? Are not students premeasured, assigned to receive a particular intervention (the teacher), and measured again to see which teachers were the more (or less) efficacious?

Granted, a number of structural differences exist between a medical randomized controlled trial and a districtwide value-added teacher evaluation. Medical trials normally employ only one intervention instead of 500, but the basic logic is the same. Each medical RCT also has its own dedicated comparison group, while individual teachers share a common one (consisting of the entire district's average 4th grade results).

From a methodological perspective, however, both medical and teacher-evaluation trials are designed to generate causal conclusions: namely, that the intervention was statistically superior to the comparison group, statistically inferior, or just the same. But a degree in statistics shouldn't be required to recognize that an individual medical experiment is designed to produce a more defensible causal conclusion than the collected assortment of 500 teacher-evaluation experiments.

How? Let us count the ways:

  • Random assignment is considered the gold standard in medical research because it helps to ensure that the participants in different experimental groups are initially equivalent and therefore have the same propensity to change relative to a specified variable. In controlled clinical trials, the process involves a rigidly prescribed computerized procedure whereby every participant is afforded an equal chance of receiving any given treatment. Public school students cannot be randomly assigned to teachers between schools for logistical reasons and are seldom if ever truly randomly assigned within schools because of (a) individual parent requests for a given teacher; (b) professional judgments regarding which teachers might benefit certain types of students; (c) grouping of classrooms by ability level; and (d) other, often unknown, possibly idiosyncratic reasons. Suffice it to say that no medical trial that assigned its patients in the haphazard manner in which students are assigned to teachers at the beginning of a school year would ever be published in any reputable journal (or reputable newspaper).
  • Medical experiments are designed to purposefully minimize the occurrence of extraneous events that might potentially influence changes on the outcome variable. (In drug trials, for example, it is customary to ensure that only the experimental drug is received by the intervention group, only the placebo is received by the comparison group, and no auxiliary treatments are received by either.) However, no comparable procedural control is attempted in a value-added teacher-evaluation experiment (either for the current year or for prior student performance), so any student assigned to any teacher can receive auxiliary tutoring, be helped at home, be team-taught, or be subjected to any number of naturally occurring positive or disruptive learning experiences.
  • When medical trials are reported in the scientific literature, their statistical analysis involves only the patients assigned to an intervention and its comparison group (which could quite conceivably constitute a comparison between two groups of 30 individuals). This means that statistical significance is computed to facilitate a single causal conclusion based upon a total of 60 observations. The statistical analyses reported for a teacher evaluation, on the other hand, would be reported in terms of all 500 combined experiments, which in this example would constitute a total of 15,000 observations (30 students times 500 teachers). The 500 causal conclusions published in the newspaper (or on a school district website), by contrast, are based upon separate contrasts of 500 "treatment groups" (each composed of changes in outcomes for a single teacher's 30 students) versus essentially the same "comparison group."
  • Explicit guidelines exist for the reporting of medical experiments, such as the (a) specification of how many observations were lost between the beginning and the end of the experiment (which is seldom done in value-added experiments, but would entail reporting student transfers, dropouts, missing test data, scoring errors, improperly marked test sheets, clerical errors resulting in incorrect class lists, and so forth for each teacher); and (b) whether statistical significance was obtained—which is impractical for each teacher in a value-added experiment since the reporting of so many individual results would violate multiple statistical principles.
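That last point is easy to see in a toy simulation. In this purely hypothetical sketch (all parameters are invented: 500 teachers, 30 students per class, and every simulated teacher exactly average), running one significance test per teacher against a shared district comparison flags a handful of identical teachers purely by chance:

```python
# Hypothetical sketch: why 500 teacher-level significance tests behave
# differently from one medical trial's single test. All numbers
# (500 teachers, 30 students each, zero true effects) are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_teachers, n_students = 500, 30

# Assume every teacher is exactly average: true effect = 0 for all.
# Each "experiment" compares one class's score gains to the district mean.
gains = rng.normal(loc=0.0, scale=1.0, size=(n_teachers, n_students))
district_mean = gains.mean()

# One t-test per teacher against the shared district comparison.
p_values = np.array([
    stats.ttest_1samp(class_gains, district_mean).pvalue
    for class_gains in gains
])

# At alpha = .05, roughly 5% of these identical teachers are flagged
# as significantly "better" or "worse" purely by chance.
flagged = (p_values < 0.05).sum()
print(f"Teachers flagged by chance alone: {flagged} of {n_teachers}")
```

Because roughly five percent of tests cross the .05 threshold by chance, dozens of "significant" teacher effects can appear even when no teacher differs from the district average at all, which is exactly why mass reporting of individual significance tests runs afoul of multiple-comparison principles.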

[readon2 url=""]Continue reading...[/readon2]

Do Different Value-Added Models Tell Us the Same Things?



  • Statistical models that evaluate teachers based on growth in student achievement differ in how they account for student backgrounds, school, and classroom resources. They also differ by whether they compare teachers across a district (or state) or just within schools.
  • Statistical models that do not account for student background factors produce estimates of teacher quality that are highly correlated with estimates from value-added models that do control for student backgrounds, as long as each includes a measure of prior student achievement.
  • Even when correlations between models are high, different models will categorize many teachers differently.
  • Teachers of advantaged students benefit from models that do not control for student background factors, while teachers of disadvantaged students benefit from models that do.
  • The type of teacher comparisons, whether within or between schools, generally has a larger effect on teacher rankings than statistical adjustments for differences in student backgrounds across classrooms.
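The pattern described in these bullet points can be illustrated with a small simulation. The sketch below is hypothetical (the 0.5 background coefficient, class sizes, and the shortcut of subtracting a known background effect rather than estimating it by regression are all assumptions for demonstration): two specifications that both use prior achievement produce highly correlated teacher estimates, yet still place many teachers in different quintiles.

```python
# Illustrative simulation (not any specific state's model): two value-added
# specifications, one adjusting for a student-background covariate and one
# not, estimated on the same simulated data.
import numpy as np

rng = np.random.default_rng(1)
n_teachers, n_students = 200, 30

true_effect = rng.normal(0, 1, n_teachers)          # teacher quality
background = rng.normal(0, 1, n_teachers)           # classroom-level SES proxy
prior = rng.normal(0, 1, (n_teachers, n_students))  # prior achievement

# Current score = prior + teacher effect + background effect + noise.
score = (prior + true_effect[:, None] + 0.5 * background[:, None]
         + rng.normal(0, 1, (n_teachers, n_students)))

gain = (score - prior).mean(axis=1)        # both models use prior scores

est_no_controls = gain                     # model A: raw gains only
est_with_controls = gain - 0.5 * background  # model B: adjusts for background

# The two sets of estimates correlate highly...
r = np.corrcoef(est_no_controls, est_with_controls)[0, 1]

# ...yet quintile ratings still disagree for many teachers.
q_a = np.digitize(est_no_controls, np.quantile(est_no_controls, [.2, .4, .6, .8]))
q_b = np.digitize(est_with_controls, np.quantile(est_with_controls, [.2, .4, .6, .8]))
disagree = (q_a != q_b).mean()
print(f"correlation = {r:.2f}, teachers rated in a different quintile: {disagree:.0%}")
```

High correlation between models, in other words, is fully compatible with many individual teachers landing in different rating categories, and teachers of low-background classrooms fare better under the adjusting model.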


There are good reasons for re-thinking teacher evaluation. As we know, evaluation systems in most school districts appear to be far from rigorous. A recent study showed that more than 99 percent of teachers in a number of districts were rated “satisfactory,” which does not comport with empirical evidence that teachers differ substantially from each other in terms of their effectiveness. Likewise, the ratings do not reflect the assessment of the teacher workforce by administrators, other teachers, or students.

Evaluation systems that fail to recognize the true differences that we know exist among teachers greatly hamper the ability of school leaders and policymakers to make informed decisions about which teachers to hire, which to support, which to promote, and which to dismiss. Thus it is encouraging that policymakers are developing more rigorous evaluation systems, many of which are partly based on student test scores.

Yet while the idea of using student test scores for teacher evaluations may be conceptually appealing, there is no universally accepted methodology for translating student growth into a measure of teacher performance. In this brief, we review what is known about how measures that use student growth align with one another, and what that agreement or disagreement might mean for policy.

[readon2 url=""]Continue reading...[/readon2]

New Research Uncovers Fresh Trouble for VAM Evaluations

As more and more schools implement various forms of value-added model (VAM) evaluation systems, we are learning some disturbing things about how reliable these methods are.

Education Week's Stephen Sawchuk, in "'Value-Added' Measures at Secondary Level Questioned," explains that value-added statistical modeling was once limited to analyzing large sets of data. These statistical models projected students' test score growth based on their past performance and thus estimated a growth target. But now 30 states require teacher evaluations to use student performance, and that has expanded the use of these algorithms for high-stakes purposes. Value-added estimates are now being applied to secondary schools, even though the vast majority of research on their use has been limited to elementary schools.

Sawchuk reports on two major studies that should slow this rush to evaluate all teachers with experimental models. This month, Douglas Harris will be presenting "Bias of Public Sector Worker Performance Monitoring," based on six years of Florida middle school data covering 1.3 million math students.

Harris divides classes into three types: remedial, midlevel, and advanced. After controlling for tracking, he finds that between 30 and 70 percent of teachers would be placed in the wrong category by normative value-added models. Moreover, Harris discovers that teachers who taught more remedial classes tended to have lower value-added scores than teachers who taught mainly higher-level classes. "That phenomenon was not due to the best teachers' disproportionately teaching the more-rigorous classes, as is often asserted. Instead, the paper shows, even those teachers who taught courses at more than one level of rigor did better when their performance teaching the upper-level classes was compared against that from the lower-level classes."
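A hedged sketch can show how this kind of tracking bias arises mechanically. All numbers below (three tracks, growth shifts, class sizes) are invented assumptions, not figures from the Florida study; every simulated teacher has the same true effectiveness, and only the course level taught differs.

```python
# Toy illustration of tracking bias: identical teachers, different tracks.
# Track-level growth shifts and class sizes are invented for demonstration.
import numpy as np

rng = np.random.default_rng(2)
n_per_track, n_students = 100, 30
track_shift = {"remedial": -0.4, "midlevel": 0.0, "advanced": 0.4}

raw_vam = {}
for track, shift in track_shift.items():
    # Measured gain = track-level growth pattern + classroom noise;
    # teacher quality is held constant across all tracks.
    gains = shift + rng.normal(0.0, 1.0, size=(n_per_track, n_students))
    raw_vam[track] = gains.mean(axis=1)

# A normative model that ranks all teachers together, ignoring track,
# pushes remedial-course teachers toward the bottom of the rankings.
all_scores = np.concatenate(list(raw_vam.values()))
cutoff = np.quantile(all_scores, 1 / 3)   # bottom third = "low value-added"
shares = {t: float((s < cutoff).mean()) for t, s in raw_vam.items()}
for track, share in shares.items():
    print(f"{track:9s}: {share:.0%} of identical teachers rated in the bottom third")
```

Even though every simulated teacher is equally effective, the teachers assigned remedial classes dominate the bottom of the normative ranking, which is the shape of the bias Harris reports.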

[readon2 url=""]Continue reading...[/readon2]

VIDEO: Merit Pay, Teacher Pay, and Value Added Measures

Value-added measures sound fair, but they are not. In this video Prof. Daniel Willingham describes six problems (some conceptual, some statistical) with evaluating teachers by comparing student achievement in the fall and in the spring.

A Worthington teacher testifies against HB153

OEA member and WEA President Mark Hill's written testimony against HB 153

Chairman Widener, Ranking Member Skindell, and members of the Senate Finance Committee, my name is Mark Hill. I am a math teacher in the Worthington City Schools currently serving as president of the Worthington Education Association. Thank you for allowing me to offer testimony on HB153.

I come today to talk to you about the teacher accountability provisions in HB 153. I have some concerns about the structure for accountability that is in the version passed out of the House.

I would like to begin by saying that I don’t have a problem with a rigorous evaluation system for teachers nor do I disagree with the notion of removing ineffective teachers from the classroom. That may sound unusual coming from a leader of a local teachers union but I am a parent, too, and I care about access to a high quality education for my kids. The teachers I represent take a great deal of pride in teaching in an excellent school district; many of them live in the district and all of them want it to remain excellent; none of them want to work alongside a bad teacher.

HB153, as passed by the House, goes too far. It requires teachers to be rated highly effective, effective, needs improvement, or unsatisfactory based on an evaluation in which 50% of the score measures student growth through value-added scores averaged over three years. It requires the Superintendent of Public Instruction to set a minimum value-added level for each of the rating levels. Furthermore, it imposes draconian penalties on teachers rated unsatisfactory or needs improvement, including the imposition of unpaid leave on a teacher rated at those levels if their principal does not consent to placing them in their building the next year, effectively ending their careers.

Value-added scores are a great concept, but as a statistical measure they are fraught with error. Scores fluctuate with random error: in Houston's value-added system, only 38% of the top fifth remained in the top rating the next year, and 23% of the top fifth in performance ended up in the bottom fifth the next year, and vice versa. Fluctuations like that defy reason; it is highly unlikely that nearly a quarter of the top teachers in Houston one year were poor performers the next.

Another study, conducted for the US Department of Education’s National Center for Education Evaluation, found that, using three years of data, a teacher who should be rated as average has a 25% chance of being rated significantly below average. A teacher who should be rated as a top performer has a 10% chance of being rated significantly below average. This means that under HB 153, 25% of the average teachers in Ohio and 10% of the good teachers in Ohio would be in jeopardy of losing their jobs due to statistical error. I hope the Ohio General Assembly would not want to add a “Wheel of Fortune” element to teachers’ careers.

Under this system, who would take care of the kids? There are teachers who ask for the students with behavior problems and learning disabilities because they care about them and believe they deserve an education. Under HB 153, these teachers would be putting their careers at risk to do so. My own son has Asperger’s syndrome, a condition on the autism spectrum – who will want to teach him? Under HB153, math and reading teachers are far more at risk of losing their jobs than other teachers because those are the only areas with enough scores to build a value-added model. Who would want to work in an area where you are constantly worried about losing your job to a statistical error?

I don’t come just to complain but to offer solutions. First, you’ve already passed this framework for evaluation in Senate Bill 5. There is no logical reason to duplicate it in HB153 – frankly, I don’t believe it belongs in either bill but should be a subject of debate on its own.

Second, instead of mandating 50% value added, allow the local education agency to decide how best to fit value added into its evaluations. This is the system under Race to the Top – Worthington is a Race to the Top district, so we have already agreed to rate teachers’ effectiveness through evaluation using value-added modeling. A top-down statewide approach will have serious unintended consequences.

Thank you for listening.

Please contact your State Senator and ask them to remove the SB5 provisions from HB153 (the budget bill).