On Teacher Evaluation: Slow Down And Get It Right

One of the primary policy levers now being employed in states and districts nationwide is teacher evaluation reform. Well-designed evaluations, which should include measures that capture both teacher practice and student learning, have great potential to inform and improve the performance of teachers and, thus, students. Furthermore, most everyone agrees that the previous systems were largely pro forma, failed to provide useful feedback, and needed replacement.

The attitude among many policymakers and advocates is that we must implement these systems and begin using them rapidly for decisions about teachers, while design flaws can be fixed later. Such urgency is undoubtedly influenced by the history of slow, incremental progress in education policy. However, we believe this attitude to be imprudent.

The risks to excessive haste are likely higher than whatever opportunity costs would be incurred by proceeding more cautiously. Moving too quickly gives policymakers and educators less time to devise and test the new systems, and to become familiar with how they work and the results they provide.

Moreover, careless rushing may result in avoidable erroneous high stakes decisions about individual teachers. Such decisions are harmful to the profession, they threaten the credibility of the evaluations, and they may well promote widespread backlash (such as the recent Florida lawsuits and the growing “opt-out” movement). Making things worse, the opposition will likely “spill over” into other promising policies, such as the already-fragile effort to enact the Common Core standards and aligned assessments.

[readon2 url=""]Continue reading...[/readon2]

How Stable are Value-Added Estimates



  • A teacher’s value-added score in one year is partially but not fully predictive of her performance in the next.
  • Value-added is unstable because true teacher performance varies and because value-added measures are subject to error.
  • Two years of data does a meaningfully better job at predicting value added than does just one. A teacher’s value added in one subject is only partially predictive of her value added in another, and a teacher’s value added for one group of students is only partially predictive of her valued added for others.
  • The variation of a teacher’s value added across time, subject, and student population depends in part on the model with which it is measured and the source of the data that is used.
  • Year-to-year instability suggests caution when using value-added measures to make decisions for which there are no mechanisms for re-evaluation and no other sources of information.


Value-added models measure teacher performance by the test score gains of their students, adjusted for a variety factors such as the performance of students when they enter the class. The measures are based on desired student outcomes such as math and reading scores, but they have a number of potential drawbacks. One of them is the inconsistency in estimates for the same teacher when value added is measured in a different year, or for different subjects, or for different groups of students.

Some of the differences in value added from year to year result from true differences in a teacher’s performance. Differences can also arise from classroom peer effects; the students themselves contribute to the quality of classroom life, and this contribution changes from year to year. Other differences come from the tests on which the value-added measures are based; because test scores are not perfectly accurate measures of student knowledge, it follows that they are not perfectly accurate gauges of teacher performance.

In this brief, we describe how value-added measures for individual teachers vary across time, subject, and student populations. We discuss how additional research could help educators use these measures more effectively, and we pose new questions, the answers to which depend not on empirical investigation but on human judgment. Finally, we consider how the current body of knowledge, and the gaps in that knowledge, can guide decisions about how to use value-added measures in evaluations of teacher effectiveness.

[readon2 url=""]Continue reading...[/readon2]

How Should Educators Interpret Value-Added Scores?



  • Each teacher, in principle, possesses one true value-added score each year, but we never see that "true" score. Instead, we see a single estimate within a range of plausible scores.
  • The range of plausible value-added scores -; the confidence interval -; can overlap considerably for many teachers. Consequently, for many teachers we cannot readily distinguish between them with respect to their true value-added scores.
  • Two conditions would enable us to achieve value-added estimates with high reliability: first, if teachers' value-added measurements were more precise, and second, if teachers’ true value-added scores varied more dramatically than they do.
  • Two kinds of errors of interpretation are possible when classifying teachers based on value-added: a) “false identifications” of teachers who are actually above a certain percentile but who are mistakenly classified as below it; and b) “false non-identifications” of teachers who are actually below a certain percentile but who are classified as above it. Falsely identifying teachers as being below a threshold poses risk to teachers, but failing to identify teachers who are truly ineffective poses risks to students.
  • Districts can conduct a procedure to identify how uncertainty about true value-added scores contributes to potential errors of classification. First, specify the group of teachers you wish to identify. Then, specify the fraction of false identifications you are willing to tolerate. Finally, specify the likely correlation between value-added score this year and next year. In most real-world settings, the degree of uncertainty will lead to considerable rates of misclassification of teachers.


A teacher's value-added score is intended to convey how much that teacher has contributed to student learning in a particular subject in a particular year. Different school districts define and compute value-added scores in different ways. But all of them share the idea that teachers who are particularly successful will help their students make large learning gains, that these gains can be measured by students' performance on achievement tests, and that the value-added score isolates the teacher's contribution to these gains.

A variety of people may see value-added estimates, and each group may use them for different purposes. Teachers themselves may want to compare their scores with those of others and use them to improve their work. Administrators may use them to make decisions about teaching assignments, professional development, pay, or promotion. Parents, if they see the scores, may use them to request particular teachers for their children. And, finally, researchers may use the estimates for studies on improving instruction.

Using value-added scores in any of these ways can be controversial. Some people doubt the validity of the achievement tests on which the scores are based, some question the emphasis on test scores to begin with, and others challenge the very idea that student learning gains reflect how well teachers do their jobs.

In order to sensibly interpret value-added scores, it is important to do two things: understand the sources of uncertainty and quantify its extent.

Our purpose is not to settle these controversies, but, rather, to answer a more limited, but essential, question: How might educators reasonably interpret value-added scores? Social science has yet to come up with a perfect measure of teacher effectiveness, so anyone who makes decisions on the basis of value-added estimates will be doing so in the midst of uncertainty. Making choices in the face of doubt is hardly unusual – we routinely contend with projected weather forecasts, financial predictions, medical diagnoses, and election polls. But as in these other areas, in order to sensibly interpret value-added scores, it is important to do two things: understand the sources of uncertainty and quantify its extent. Our aim is to identify possible errors of interpretation, to consider how likely these errors are to arise, and to help educators assess how consequential they are for different decisions.

We'll begin by asking how value-added scores are defined and computed. Next, we'll consider two sources of error: statistical bias and statistical imprecision.

[readon2 url=""]Continue reading...[/readon2]

Popular modes of evaluating teachers are fraught with inaccuracies

In conclusion
New approaches to teacher evaluation should take advantage of research on teacher effectiveness. While there are considerable challenges in using value-added test scores to evaluate individual teachers directly, using value-added methods in research can help validate measures that are productive for teacher evaluation.

Research indicates that value-added measures of student achievement tied to individual teachers should not be used for high-stakes, individual-level decisions, or comparisons across highly dissimilar schools or student populations. Valid interpretations require aggregate-level data and should ensure that background factors — including overall classroom composition — are as similar as possible across groups being compared. In general, such measures should be used only in a low-stakes fashion when they’re part of an integrated analysis of teachers’ practices.

Standards-based evaluation processes have also been found to be predictive of student learning gains and productive for teacher learning. These include systems like National Board certification and performance assessments for beginning teacher licensing as well as district and school-level instruments based on professional teaching standards. Effective systems have developed an integrated set of measures that show what teachers do and what happens as a result. These measures may include evidence of student work and learning, as well as evidence of teacher practices derived from observations, video- tapes, artifacts, and even student surveys.

These tools are most effective when embedded in systems that support evaluation expertise and well- grounded decisions, by ensuring that evaluators are trained, evaluation and feedback are frequent, mentoring and professional development are available, and processes are in place to support due process and timely decision making by an appropriate body.

With these features in place, evaluation can be- come a more useful part of a productive teaching and learning system, supporting accurate information about teachers, helpful feedback, and well-grounded personnel decisions.

Kappan magazine - Teacher evaluation

The Stability Of Ohio’s School Value-Added Ratings

The Albert Shanker Institute has an important analysis of Ohio's school report card data, and finds a large amount of instability in the results. This should cause some pause, especially as we move towards using teacher level value add data for high stakes decisions. To say it will be critical to have reliable, trustworthy, and stable data when making hiring/firing and salary decisions is an obvious understatement. If there are serious and genuine questions about building level data stability, then the rush to go further ought to at least have some brakes applied.

On the other hand, there’s a degree to which instability is to be expected and even welcomed (see here and here). For one thing, school performance can exhibit “real” improvement (or degradation). In addition, nobody expects perfect precision, and part of the year-to-year instability might simply be due to small, completely “tolerable” amounts of random error

Some people might look at these results, in which most schools got different ratings between years, and be very skeptical of Ohio’s value-added measures. Others will have faith in them. It’s important to bear in mind that measuring school “quality” is far from an exact science, and all attempts to do so – using test scores or other metrics – will necessarily entail imprecision, both within and between years. It is good practice to always keep this in mind, and to interpret the results with caution.

So I can’t say definitively whether the two-year instability in ratings among Ohio’s public schools is “high” or “low” by any absolute standard. But I can say that the data suggest that schools really shouldn’t be judged to any significant extent based on just one or two years of value-added ratings.

Unfortunately, that’s exactly what’s happening in Ohio. Starting this year, all schools that come in “above expectations” in any given year are automatically bumped up a full “report card grade,” while schools that receive a “below expectations” ratings for two consecutive years are knocked down a grade (there are six possible grades). In both cases, the rules were changed (effective this year) such that fewer years were required to trigger the bumps – previously, it took two consecutive years “above expectations” to get a higher report card grade, and three consecutive years “below expectations” to lose a grade (see the state’s guide to ratings). These final grades can carry serious consequences, including closure, if they remain persistently low.

As I’ve said before, value-added and other growth models can be useful tools, if used properly. This is especially true of school-level value-added, since the samples are larger, and issues such as non-random assignment are less severe due to pooling of data for an entire school. However, given the rather high instability of ratings between years, and the fact that accuracy improves with additional years of data, the prudent move, if any, would be to require that more years of ratings be required to affect report card grades, not fewer. The state is once again moving in the wrong direction.

Check out the entire article here.