How Should Educators Interpret Value-Added Scores?



  • Each teacher, in principle, possesses one true value-added score each year, but we never see that "true" score. Instead, we see a single estimate within a range of plausible scores.
  • The range of plausible value-added scores (the confidence interval) can overlap considerably for many teachers. Consequently, we often cannot readily distinguish among teachers with respect to their true value-added scores.
  • Two conditions would enable us to achieve value-added estimates with high reliability: first, if teachers' value-added measurements were more precise, and second, if teachers’ true value-added scores varied more dramatically than they do.
  • Two kinds of errors of interpretation are possible when classifying teachers based on value-added: a) “false identifications” of teachers who are actually above a certain percentile but who are mistakenly classified as below it; and b) “false non-identifications” of teachers who are actually below a certain percentile but who are classified as above it. Falsely identifying teachers as being below a threshold poses risk to teachers, but failing to identify teachers who are truly ineffective poses risks to students.
  • Districts can conduct a procedure to identify how uncertainty about true value-added scores contributes to potential errors of classification. First, specify the group of teachers you wish to identify. Then, specify the fraction of false identifications you are willing to tolerate. Finally, specify the likely correlation between value-added score this year and next year. In most real-world settings, the degree of uncertainty will lead to considerable rates of misclassification of teachers.
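
The link between reliability and classification errors described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the procedure the authors describe: it assumes true scores and measurement errors are both normally distributed, that every teacher is measured with the same precision, and that the year-to-year correlation of the estimates equals the reliability (which holds only if true scores are stable and errors are independent across years). The function name and its parameters are hypothetical. For simplicity it flags the same share of teachers as the target group rather than letting the user set a tolerated false-identification rate.

```python
import random


def simulate_misclassification(reliability, percentile=0.25,
                               n_teachers=10_000, seed=0):
    """Estimate classification error rates for a given reliability.

    reliability: assumed share of the variance in estimated scores that is
        due to true differences between teachers (0 < reliability <= 1).
    percentile: teachers whose score falls below this quantile are the
        group we want to identify.
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")

    rng = random.Random(seed)
    # Scale so that the estimated scores have total variance 1.
    true_sd = reliability ** 0.5
    noise_sd = (1.0 - reliability) ** 0.5

    true_scores = [rng.gauss(0.0, true_sd) for _ in range(n_teachers)]
    estimates = [t + rng.gauss(0.0, noise_sd) for t in true_scores]

    # Cutoffs at the chosen quantile of each distribution.
    k = int(percentile * n_teachers)
    true_cut = sorted(true_scores)[k]
    est_cut = sorted(estimates)[k]

    flagged = [e < est_cut for e in estimates]        # classified as below
    truly_low = [t < true_cut for t in true_scores]   # actually below

    # Flagged, but the true score is above the cutoff.
    false_ids = sum(f and not t for f, t in zip(flagged, truly_low))
    # Truly below the cutoff, but not flagged.
    false_non_ids = sum(t and not f for f, t in zip(flagged, truly_low))

    return {
        "false_identification_rate": false_ids / sum(flagged),
        "false_non_identification_rate": false_non_ids / sum(truly_low),
    }


if __name__ == "__main__":
    for r in (0.3, 0.5, 0.7, 0.9):
        print(r, simulate_misclassification(r))
```

Running the loop shows both error rates falling as the assumed reliability rises. Because this sketch flags exactly as many teachers as are truly in the target group, the two rates come out equal; flagging a smaller group would trade fewer false identifications for more false non-identifications.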


A teacher's value-added score is intended to convey how much that teacher has contributed to student learning in a particular subject in a particular year. Different school districts define and compute value-added scores in different ways. But all of them share the idea that teachers who are particularly successful will help their students make large learning gains, that these gains can be measured by students' performance on achievement tests, and that the value-added score isolates the teacher's contribution to these gains.

A variety of people may see value-added estimates, and each group may use them for different purposes. Teachers themselves may want to compare their scores with those of others and use them to improve their work. Administrators may use them to make decisions about teaching assignments, professional development, pay, or promotion. Parents, if they see the scores, may use them to request particular teachers for their children. And, finally, researchers may use the estimates for studies on improving instruction.

Using value-added scores in any of these ways can be controversial. Some people doubt the validity of the achievement tests on which the scores are based, some question the emphasis on test scores to begin with, and others challenge the very idea that student learning gains reflect how well teachers do their jobs.

Our purpose is not to settle these controversies, but, rather, to answer a more limited, but essential, question: How might educators reasonably interpret value-added scores? Social science has yet to come up with a perfect measure of teacher effectiveness, so anyone who makes decisions on the basis of value-added estimates will be doing so in the midst of uncertainty. Making choices in the face of doubt is hardly unusual – we routinely contend with projected weather forecasts, financial predictions, medical diagnoses, and election polls. But as in these other areas, in order to sensibly interpret value-added scores, it is important to do two things: understand the sources of uncertainty and quantify its extent. Our aim is to identify possible errors of interpretation, to consider how likely these errors are to arise, and to help educators assess how consequential they are for different decisions.

We'll begin by asking how value-added scores are defined and computed. Next, we'll consider two sources of error: statistical bias and statistical imprecision.

Natural disaster based ed reform

Corporate education reformers will latch on to anything to portray their preferred policies as effective. Terry Ryan of the Fordham Foundation has made one of the most ridiculous efforts to date:

Is it time for urban school superintendents to move from being Reformers to Relinquishers? Yes, is the compelling case that Neerav Kingsland makes today over at Straight Up. Kingsland, chief strategy officer for New Schools for New Orleans, writes that reform-minded superintendents should embrace the lessons from New Orleans, a key one being that the academic achievement gains made in the Big Easy have not come from traditional reforms and tweaks to the system. Rather, the changes in New Orleans are the result of virtually replacing the traditional, centralized, bureaucratic system of one-size-fits-all command and control with a system of independent high-performing charter schools all held accountable by the center for their academic performance.

That's one heck of a claim, but the entire piece misses one astoundingly obvious and important fact. New Orleans suffered one of the worst natural disasters to ever afflict a US city, and as a consequence the demographics of the city changed dramatically.

The aftermath of the 2005 storm, which took 1,835 lives and caused an estimated $81 billion in property damage, has left the city with an older, wealthier and less diverse population, according to data recently released by the Nielsen company. If its findings are confirmed by the 2010 Census, that information could go a long way in helping the city attract businesses and outside capital to continue rebuilding.

According to Nielsen, New Orleans lost 595,205 people prior to and shortly after Katrina, dropping it from the country's 35th largest market in 2000 to the 49th largest market in 2006. Atlanta, Houston and Dallas received the bulk of Katrina refugees. Now in 2010, New Orleans ranks as the 46th largest market with 1,194,196 persons. Nielsen projects the city will have a population of 1,264,365 in 2015 and will likely remain ranked as the 46th largest market in the U.S.

"The city has become older (the median age rose from 34 to 38.8), less diverse (the white non-Hispanic population increased from 25.8% to 30.9%) and a bit wealthier (median income rose from $31,369 to $39,530)," says the Nielsen report. The challenge now for New Orleans is to find ways to use some of these changes to help attract the developers and corporations who could help the city rebound.

The population got smaller and richer. It's not a stretch to conclude that these factors, and not corporate education reform policies that have failed to work at scale anywhere, are the cause of any aggregate gains in student performance in New Orleans.

Unless, along with getting superintendents to relinquish control of their districts, corporate education reformers can also summon great floods and pestilence, we might be better off not throwing everything out the window just yet.

New Gates Study on teacher evaluations

A new Gates study released today finds effective teacher evaluations require high standards, with multiple measures.

ABOUT THIS REPORT: This report is intended for policymakers and practitioners wanting to understand the implications of the Measures of Effective Teaching (MET) project’s interim analysis of classroom observations. Those wanting to explore all the technical aspects of the study and analysis also should read the companion research report, available at

Together, these two documents on classroom observations represent the second pair of publications from the MET project. In December 2010, the project released its initial analysis of measures of student perceptions and student achievement in Learning about Teaching: Initial Findings from the Measures of Effective Teaching Project. Two more reports are planned for mid-2012: one on the implications of assigning weights to different measures; another using random assignment to study the extent to which student assignment may affect teacher effectiveness results. ABOUT THE MET PROJECT: The MET project is a research partnership of academics, teachers, and education organizations committed to investigating better ways to identify and develop effective teaching. Funding is provided by the Bill & Melinda Gates Foundation.

The report offers three takeaways.

High-quality classroom observations will require clear standards, certified raters, and multiple observations per teacher. Clear standards and high-quality training and certification of observers are fundamental to increasing inter-rater reliability. However, when measuring consistent aspects of a teacher’s practice, reliability will require more than inter-rater agreement on a single lesson. Because teaching practice varies from lesson to lesson, multiple observations will be necessary when high-stakes decisions are to be made. But how will school systems know when they have implemented a fair system? Ultimately, the most direct way is to periodically audit a representative sample of official observations, by having impartial observers perform additional observations. In our companion research report, we describe one approach to doing this.

Combining the three approaches (classroom observations, student feedback, and value-added student achievement gains) capitalizes on their strengths and offsets their weaknesses. For example, value-added is the best single predictor of a teacher’s student achievement gains in the future. But value-added is often not as reliable as some other measures and it does not point a teacher to specific areas needing improvement. Classroom observations provide a wealth of information that could support teachers in improving their practice. But, by themselves, these measures are not highly reliable, and they are only modestly related to student achievement gains. Student feedback promises greater reliability because it includes many more perspectives based on many more hours in the classroom, but not surprisingly, it is not as predictive of a teacher’s achievement gains with other students as value-added. Each shines in its own way, either in terms of predictive power, reliability, or diagnostic usefulness.

Combining new approaches to measuring effective teaching, while not perfect, significantly outperforms traditional measures. Providing better evidence should lead to better decisions. No measure is perfect. But if every personnel decision carries consequences, for teachers and students, then school systems should learn which measures are better aligned to the outcomes they value. Combining classroom observations with student feedback and student achievement gains on state tests did a better job than master’s degrees and years of experience in predicting which teachers would have large gains with another group of students. But the combined measure also predicted larger differences on a range of other outcomes, including more cognitively challenging assessments and student-reported effort and positive emotional attachment. We should refine these tools and continue to develop better ways to provide feedback to teachers. In the meantime, it makes sense to compare measures based on the criteria of predictive power, reliability, and diagnostic usefulness.

MET Gathering Feedback Practitioner Brief

Proving SB5 unnecessary, public schools show significant gains

The freshly released 2010-2011 state report card has some great news: public schools in Ohio are not in crisis, and radical, extreme reforms are not needed for our students to receive a quality education.

The percentage of students scoring proficient on state tests increased on 21 of 26 indicators, with the strongest gains in third-grade math, eighth-grade math and 10th-grade writing. Overall, students met the state goal on 17 out of 26 indicators, one less than last year. The statewide average for all students’ test scores, known as the Performance Index, jumped 1.7 points to 95, the biggest gain since 2004-2005.

For 2010-2011, the number of districts ranked Excellent with Distinction or Excellent increased by 56 to 352. The number of schools in those same categories grew by 186 to 1,769.

76% of traditional public schools statewide have a B or better this year.

Value-Added results show whether students in grades 3-8 achieved the expected one year of growth in reading and math. In 2010-2011, 79.5 percent of districts and 81.4 percent of schools met or exceeded expected Value-Added gains.

The Performance Index looks at the performance of every student, not just those who score proficient or higher. In 2010-11, 89.3 percent of districts and 71 percent of schools improved their Performance Index scores.

We'll be taking a closer look at these results and bringing you all the latest findings.

To Understand The Impact Of Teacher-Focused Reforms, Pay Attention To Teachers

You don’t need to be a policy analyst to know that huge changes in education are happening at the state and local levels right now: teacher performance pay, the restriction of teachers’ collective bargaining rights, the incorporation of heavily weighted growth model estimates in teacher evaluations, the elimination of tenure, etc. Like many, I am concerned about the possible consequences of some of these new policies (particularly about their details), as well as about the apparent lack of serious efforts to monitor them.

Our “traditional” gauge of “what works” – cross-sectional test score gains – is totally inadequate, even under ideal circumstances. Even assuming high quality tests that are closely aligned to what has been taught, raw test scores alone cannot account for changes in the student population over time and are subject to measurement error. There is also no way to know whether fluctuations in test scores (even fluctuations that are real) are the result of any particular policy (or lack thereof).
