How Should Educators Interpret Value-Added Scores?



  • Each teacher, in principle, possesses one true value-added score each year, but we never see that "true" score. Instead, we see a single estimate within a range of plausible scores.
  • The range of plausible value-added scores (the confidence interval) can overlap considerably for many teachers. Consequently, for many pairs of teachers we cannot readily distinguish between their true value-added scores.
  • Two conditions would enable us to achieve value-added estimates with high reliability: first, if teachers' value-added measurements were more precise, and second, if teachers' true value-added scores varied more dramatically than they do.
  • Two kinds of errors of interpretation are possible when classifying teachers based on value-added: a) “false identifications” of teachers who are actually above a certain percentile but who are mistakenly classified as below it; and b) “false non-identifications” of teachers who are actually below a certain percentile but who are classified as above it. Falsely identifying teachers as being below a threshold poses risk to teachers, but failing to identify teachers who are truly ineffective poses risks to students.
  • Districts can conduct a procedure to identify how uncertainty about true value-added scores contributes to potential errors of classification. First, specify the group of teachers you wish to identify. Then, specify the fraction of false identifications you are willing to tolerate. Finally, specify the likely correlation between value-added scores this year and next year. In most real-world settings, the degree of uncertainty will lead to considerable rates of teacher misclassification.
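The classification errors described above can be illustrated with a small simulation. This is an illustrative sketch, not any district's actual procedure: it assumes true value-added scores are standard normal, that each estimate is the true score plus independent normal noise, and that a chosen reliability (here 0.4, an assumption for illustration) fixes the noise variance. The 25th-percentile threshold is likewise arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000          # simulated teachers
reliability = 0.4    # assumed ratio of true-score variance to estimate variance
threshold_pct = 25   # flag the bottom quartile as "below threshold"

# True scores and noisy estimates; noise_sd is chosen so that
# var(true) / var(estimate) equals the assumed reliability.
true = rng.normal(0.0, 1.0, n)
noise_sd = np.sqrt(1.0 / reliability - 1.0)
estimate = true + rng.normal(0.0, noise_sd, n)

cut_true = np.percentile(true, threshold_pct)
cut_est = np.percentile(estimate, threshold_pct)

flagged = estimate < cut_est        # classified as below threshold
truly_below = true < cut_true       # actually below threshold

# "False identifications": flagged teachers whose true score is above the cut.
false_id_rate = np.mean(~truly_below[flagged])
# "False non-identifications": truly-below teachers who were not flagged.
missed_rate = np.mean(~flagged[truly_below])

print(f"false identification rate:     {false_id_rate:.2f}")
print(f"false non-identification rate: {missed_rate:.2f}")
```

Under these assumptions, both error rates are substantial, which is the bullet's point: even a plausible level of measurement error produces many misclassified teachers on either side of the threshold.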


A teacher's value-added score is intended to convey how much that teacher has contributed to student learning in a particular subject in a particular year. Different school districts define and compute value-added scores in different ways. But all of them share the idea that teachers who are particularly successful will help their students make large learning gains, that these gains can be measured by students' performance on achievement tests, and that the value-added score isolates the teacher's contribution to these gains.

A variety of people may see value-added estimates, and each group may use them for different purposes. Teachers themselves may want to compare their scores with those of others and use them to improve their work. Administrators may use them to make decisions about teaching assignments, professional development, pay, or promotion. Parents, if they see the scores, may use them to request particular teachers for their children. And, finally, researchers may use the estimates for studies on improving instruction.

Using value-added scores in any of these ways can be controversial. Some people doubt the validity of the achievement tests on which the scores are based, some question the emphasis on test scores to begin with, and others challenge the very idea that student learning gains reflect how well teachers do their jobs.


Our purpose is not to settle these controversies, but, rather, to answer a more limited, but essential, question: How might educators reasonably interpret value-added scores? Social science has yet to come up with a perfect measure of teacher effectiveness, so anyone who makes decisions on the basis of value-added estimates will be doing so in the midst of uncertainty. Making choices in the face of doubt is hardly unusual – we routinely contend with weather forecasts, financial predictions, medical diagnoses, and election polls. But as in these other areas, in order to sensibly interpret value-added scores, it is important to do two things: understand the sources of uncertainty and quantify its extent. Our aim is to identify possible errors of interpretation, to consider how likely these errors are to arise, and to help educators assess how consequential they are for different decisions.

We'll begin by asking how value-added scores are defined and computed. Next, we'll consider two sources of error: statistical bias and statistical imprecision.


Two new studies question value-added measures

The evidence is overwhelming: yet more studies show that using value-added to measure teacher quality is fraught with error.

Academic tracking in secondary education appears to confound an increasingly common method for gauging differences in teacher quality, according to two recently released studies.

Failing to account for how students are sorted into more- or less-rigorous classes—as well as the effect different tracks have on student learning—can lead to biased "value added" estimates of middle and high school teachers' ability to boost their students' standardized-test scores, the papers conclude.

"I think it suggests that we're making even more errors than we need to—and probably pretty large errors—when we're applying value-added to the middle school level," said Douglas N. Harris, an associate professor of economics at Tulane University in New Orleans, whose study examines the application of a value-added approach to middle school math scores.

High-school-level findings from a separate second study, by C. Kirabo Jackson, an associate professor of human development and social policy at Northwestern University in Evanston, Ill., complement Mr. Harris' paper.

"At the elementary level, [value-added] is a pretty reliable measure, in terms of predicting how teachers will perform the following year," Mr. Jackson said. "At the high school level, it is quite a bit less reliable, so the scope for using this to improve student outcomes is much more limited."

The first study mentioned in this article concludes (emphasis ours):

We test the degree to which variation in measured performance is due to misalignment versus selection bias in a statewide sample of middle schools where students and teachers are assigned to explicit “tracks,” reflecting heterogeneous student ability and/or preferences. We find that failing to account for tracks leads to large biases in teacher value-added estimates.

A teacher of all lower track courses whose measured value-added is at the 50th percentile could increase her measured value-added to the 99th percentile simply by switching to all upper-track courses. We estimate that 75-95 percent of the bias is due to student sorting and the remainder due to test misalignment.

We also decompose the remaining bias into two parts, metric and multidimensionality misalignment, which work in opposite directions. Even after accounting for explicit tracking, the standard method for estimating teacher value-added may yield biased estimates.

The second study replicates the findings and concludes:

Unlike in elementary-school, high-school teacher effects may be confounded with both selection to tracks and unobserved track-level treatments. I document sizable confounding track effects, and show that traditional tests for the existence of teacher effects are likely biased. After accounting for these biases, algebra teachers have modest effects and there is little evidence of English teacher effects.

Unlike in elementary-school, value-added estimates are weak predictors of teachers’ future performance. Results indicate that either (a) teachers are less influential in high-school than in elementary-school, or (b) test-scores are a poor metric to measure teacher quality at the high-school level.

Corporate education reformers need to begin to address the science that is refuting their policies; the sooner this happens, the less damage is likely to be wrought.

What Value-Added Research Does And Does Not Show

Worth reading in its entirety.

For example, the most prominent conclusion of this body of evidence is that teachers are very important, that there’s a big difference between effective and ineffective teachers, and that whatever is responsible for all this variation is very difficult to measure (see here, here, here and here). These analyses use test scores not as judge and jury, but as a reasonable substitute for “real learning,” with which one might draw inferences about the overall distribution of “real teacher effects.”

And then there are all the peripheral contributions to understanding that this line of work has made, including (but not limited to):

Prior to the proliferation of growth models, most of these conclusions were already known to teachers and to education researchers, but research in this field has helped to validate and elaborate on them. That’s what good social science is supposed to do.

Conversely, however, what this body of research does not show is that it’s a good idea to use value-added and other growth model estimates as heavily-weighted components in teacher evaluations or other personnel-related systems. There is, to my knowledge, not a shred of evidence that doing so will improve either teaching or learning, and anyone who says otherwise is misinformed.*

As has been discussed before, there is a big difference between demonstrating that teachers matter overall – that their test-based effects vary widely, and in a manner that is not just random – and being able to accurately identify the "good" and "bad" performers at the level of individual teachers. Frankly, to whatever degree the value-added literature provides tentative guidance on how these estimates might be used productively in actual policies, it suggests that, in most states and districts, it is being done in a disturbingly ill-advised manner.


Certainty And Good Policymaking Don’t Mix

Using value-added and other types of growth model estimates in teacher evaluations is probably the most controversial and oft-discussed issue in education policy over the past few years.

Many people (including a large proportion of teachers) are opposed to using student test scores in their evaluations, as they feel that the measures are not valid or reliable, and that they will incentivize perverse behavior, such as cheating or competition between teachers. Advocates, on the other hand, argue that student performance is a vital part of teachers’ performance evaluations, and that the growth model estimates, while imperfect, represent the best available option.

I am sympathetic to both views. In fact, in my opinion, there are only two unsupportable positions in this debate: Certainty that using these measures in evaluations will work; and certainty that it won’t. Unfortunately, that’s often how the debate has proceeded – two deeply-entrenched sides convinced of their absolutist positions, and resolved that any nuance in or compromise of their views will only preclude the success of their efforts. You’re with them or against them. The problem is that it’s the nuance – the details – that determine policy effects.

Let’s be clear about something: I’m not aware of a shred of evidence – not a shred – that the use of growth model estimates in teacher evaluations improves performance of either teachers or students.
