On Teacher Evaluation: Slow Down And Get It Right

May 23, 2013 in Article

One of the primary policy levers now being employed in states and districts nationwide is teacher evaluation reform. Well-designed evaluations, which should include measures that capture both teacher practice and student learning, have great potential to inform and improve the performance of teachers and, thus, students. Furthermore, most everyone agrees that the previous systems were largely pro forma, failed to provide useful feedback, and needed replacement.

The attitude among many policymakers and advocates is that we must implement these systems and begin using them rapidly for decisions about teachers, while design flaws can be fixed later. Such urgency is undoubtedly influenced by the history of slow, incremental progress in education policy. However, we believe this attitude to be imprudent.

The risks to excessive haste are likely higher than whatever opportunity costs would be incurred by proceeding more cautiously. Moving too quickly gives policymakers and educators less time to devise and test the new systems, and to become familiar with how they work and the results they provide.

Moreover, careless rushing may result in avoidable erroneous high stakes decisions about individual teachers. Such decisions are harmful to the profession, they threaten the credibility of the evaluations, and they may well promote widespread backlash (such as the recent Florida lawsuits and the growing “opt-out” movement). Making things worse, the opposition will likely “spill over” into other promising policies, such as the already-fragile effort to enact the Common Core standards and aligned assessments.

[readon2 url="http://shankerblog.org/?p=8358"]Continue reading...[/readon2]

On the Issue of Value-add

January 29, 2013 in Article

There is a growing body of research demonstrating that "Value-Added" measures (VAM) are simply unreliable as a stand-alone measure of teacher effectiveness. When the legislature inserted language into HB 555 with no hearings or public input (or news coverage, for that matter) to eliminate the possibility of using multiple measures of student performance for teachers with value-added scores, it moved in a direction utterly lacking in scientific evidence. The new language calls on teachers to be evaluated based on a methodology that, by its very design, cannot measure the true quality of the interaction between teacher and students in the classroom. This has serious implications for students and teachers alike.

The Governor is advocating for expanded use of student test scores not only for teacher evaluation, but also for decisions involving teacher hiring, layoffs and pay. There simply is no credible expert testimony that supports such a move. Value-added measures are influenced by far too many variables beyond the control of the teacher to be used in such high-stakes decisions.

In other parts of the country where similar evaluation systems have been implemented, stories of great teachers who were branded as ineffective because of aberrations in student test data abound. (See, for example, the story of New York City 8th grade math teacher Carolyn Abbott or Washington, DC, 5th grade teacher Sarah Wysocki.) This isn't just a theoretical policy debate. Decisions made by our elected officials have real human consequences.

What follows is an accurate summation of the current scientific knowledge of the use of VAM in evaluations.

Value Added in Evaluation

Many policy makers are enthusiastic about using value added measures (VAM) for teacher evaluation. Many states have incorporated it into teacher evaluations. Its use, however, is problematic due to concerns about accuracy, fairness, and the incentives it would create for teachers that are potentially harmful for students.

VAM has serious limitations in determining teacher effectiveness

A teacher can be ranked in the top quartile one year and sink to the middle or even the bottom the next independent of any changes they made in their own instructional practice.

A paper written for the Carnegie Knowledge Network examining this issue cited a study that found that half of the teachers in the top fifth of performance remained there the following year while 20% of them fell to the lowest two quintiles. This defies reason – how could one fifth of teachers be identified as top performers in one year but among the worst in the next?

There are many reasons for this: VAM doesn’t account for school effects, students don’t grow at the linear pace assumed by the models, students aren’t randomly assigned to teachers, and VAM seems to be less accurate for teachers of students who have limited English proficiency. According to a RAND Corporation study, VAM scores also varied depending on which test was used.
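A rough simulation can illustrate how measurement noise alone produces this kind of churn between quintiles. The sketch below is not taken from any of the studies cited here. It assumes that each teacher has a fixed "true" effect and that each year's score adds independent random error; the function name, the number of teachers, and the noise level are all hypothetical choices made for illustration.

```python
import random

def top_quintile_persistence(n_teachers=5000, noise_sd=1.0, seed=0):
    """Share of year-1 top-quintile teachers who are also in the
    top quintile in year 2, under a simple noise model (an assumption):
    observed score = fixed true effect + independent yearly error.
    """
    rng = random.Random(seed)
    true_effect = [rng.gauss(0, 1) for _ in range(n_teachers)]
    year1 = [t + rng.gauss(0, noise_sd) for t in true_effect]
    year2 = [t + rng.gauss(0, noise_sd) for t in true_effect]

    def top_fifth(scores):
        # Indices of teachers at or above the 80th-percentile score.
        cutoff = sorted(scores)[int(0.8 * len(scores))]
        return {i for i, s in enumerate(scores) if s >= cutoff}

    top1, top2 = top_fifth(year1), top_fifth(year2)
    return len(top1 & top2) / len(top1)

# With no noise, everyone in the top fifth stays there.
no_noise = top_quintile_persistence(noise_sd=0.0)
# With noise as large as the spread in true effects, many teachers
# drop out of the top fifth even though nothing about them changed.
noisy = top_quintile_persistence(noise_sd=1.0)
```

In this toy model, the teachers who leave the top fifth have not changed their practice at all; only the error term differs between years.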

Many Researchers Caution Against Use of VAM in Teacher Evaluations as a Sole Measure

The Brookings Institution supports the use of VAM but cautions that the error ranges in measurement are so wide that one can’t precisely differentiate between levels of teacher effectiveness. The RAND study mentioned above made a similar recommendation.

Jesse Rothstein of UC Berkeley found that non-random assignment of students led the model to show teachers "causing" student growth in the year before those students entered their classrooms.

A synthesis of available research conducted by Marzano found that teachers account for only about 13 percent of the variance in student achievement.

Student variables (including home environment, student motivation, and prior knowledge) account for 80 percent of the variance. VAM does not necessarily isolate the teacher’s contribution to student achievement growth.

Eric Hanushek, whom the Ohio General Assembly relies on for policy advice, also cautions against over-reliance on value added for high-stakes decisions about teachers:

“The bigger set of issues, however, relates to the use of teacher value-added estimates in compensation, employment, promotion, or assignment decisions. The possibility of introducing performance pay based on value-added estimates motivates much of the prior analysis of the properties of these estimates, but movement in this direction has so far been limited.” “Despite the strength of the research findings, concerns about accuracy, fairness, and potential adverse effects of incentives based on a limited set of outcomes raise worries about the use of value added estimates in education personnel and policy decisions. Many of the possible drawbacks are related to the measurement and estimation issues discussed above, but there are also concerns about incentives to cheat, adopt teaching methods that teach narrowly to tests, and ignore non-tested subjects.”

And…

“Although researchers can mitigate the effects of sampling error on estimates of teacher quality, such error would inevitably lead some successful teachers to receive low ratings and some unsuccessful teachers to receive high ratings.”

And, finally, it may have an adverse effect on students:

“In terms of fairness, any failure to account for sorting on unobservable characteristics would potentially penalize teachers given … more difficult classrooms and reward teachers given … less difficult classrooms. This could discourage educationally beneficial decisions including the assignment of more difficult or disruptive students to higher quality teachers.”

Hanushek suggests that these problems could be mitigated by combining value-added with subjective observations. Hanushek’s paper may be found here.

HB 555 Magnifies the Problematic Nature of Over-reliance on VAM to Evaluate Teachers

HB 156 and SB 316 set forth the framework for the Ohio Teacher Evaluation System (OTES) by requiring that student achievement growth account for 50% of a teacher’s evaluation. The law mandated that VAM, when available, must be part of the student growth calculation but didn’t specify to what degree. The Ohio Department of Education, in creating the OTES framework, mandated that student growth be calculated using multiple measures and that VAM, when available, must account for at least 10% of the whole evaluation. Presumably ODE constructed the model in this way in recognition of the limitations of VAM as a primary determinant of teacher effectiveness.

HB 555 changes the framework to require that, if VAM is available for a teacher, it must be used in proportion to the amount they teach subjects covered by VAM in their schedule. In other words, a middle school math teacher who teaches an entire day of 7th and 8th grade math would have the 50% growth measure solely determined by VAM.
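The arithmetic behind this change can be sketched in a few lines of Python. This is an illustration only: the function name is ours, and the assumption that the VAM share scales linearly with the fraction of the schedule spent on VAM-covered subjects is based on the description above, not on the statutory text.

```python
def vam_share_of_evaluation(vam_schedule_fraction, growth_weight=0.50):
    """Share of the whole evaluation determined by VAM under the
    proportional rule described above (an illustrative assumption).

    vam_schedule_fraction: fraction of the teacher's schedule spent on
        subjects covered by VAM, between 0.0 and 1.0.
    growth_weight: weight of student growth in the evaluation (50%).
    """
    if not 0.0 <= vam_schedule_fraction <= 1.0:
        raise ValueError("vam_schedule_fraction must be between 0 and 1")
    return growth_weight * vam_schedule_fraction

# A full day of 7th and 8th grade math: VAM is the entire growth measure.
full_time_math = vam_share_of_evaluation(1.0)
# Hypothetical teacher with half of the schedule in VAM-covered subjects.
half_time_math = vam_share_of_evaluation(0.5)
# Minimum VAM share under the earlier ODE framework, for comparison.
ODE_MINIMUM_VAM_SHARE = 0.10
```

For the full-time math teacher, VAM determines half of the whole evaluation, five times the 10% minimum that ODE originally set.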

The OTES model has an embedded bias toward overvaluing student growth. For instance, a teacher with a poor student growth measure can be rated no higher than “Developing” (the second-lowest category), no matter how their evaluator rated their classroom performance. (Fig. below)

Because the OTES teacher rating matrix overvalues student growth, HB 555 magnifies the errors in VAM that stem from selection bias, non-school factors, and the effects of other teachers and the school itself, all of which are outside the teacher’s control. When VAM is fully 50% of a teacher’s evaluation and is weighted so heavily that it essentially trumps any rating from subjective observations, the inevitable errors will cause teachers to be unfairly rated in the lowest two categories, putting them at risk of dismissal or of being first in line to be laid off through a reduction in force.

Simply put, we don’t believe that teachers should have an element of randomness determine career risk.

Using VAM to De-Select Teachers May Have Adverse School and Labor Market Effects

If teachers believe that their VAM score can cause them to lose their jobs, they will be much more likely to hoard information and teaching methods from their colleagues. They will also resist assignment of difficult students to their class, believing that the very students who need the most help may cause them to suffer adverse career consequences.

If teachers are being asked to assume a greater amount of career risk without a commensurate rise in pay, it is less than clear that there will be a willing pool of candidates waiting to fill positions of deselected teachers. This is especially problematic in the mathematics field, where there are already shortages of willing and qualified candidates. This situation will likely be exacerbated if teachers believe that the evaluation system is inherently unfair.

There will likely be an adverse effect on students as well. Schools and teachers will choose to narrow the curriculum and in-class instruction to only that which will be tested. Such narrowing of the curriculum will strip away the enjoyable aspects of school from students’ lives.

Alternatives to the Current System

This is not to say that there is no place for VAM in a comprehensive teacher evaluation system. There are alternatives to the current system in which VAM is a prominent part of the teacher evaluation but not the primary determinant of quality, leaving a sufficient margin for error.

Several states have a student growth component that is lower than 50%. DC’s IMPACT system (the prototypical model for OTES) has recently been revised to reduce the role of VAM in response to concerns about its accuracy.

Teacher resistance to VAM is not monolithic – teachers would be much less likely to resist OTES if VAM were a much smaller component than the currently mandated level. Furthermore, there is evidence that multiple observations and VAM can work in concert to successfully identify top performers as well as laggards.

Policy Recommendations

Reverse the VAM requirement put forth in HB 555
Reduce the overall proportion of student growth required in the teacher evaluation
Maintain flexibility to refine the evaluation system as needed – this is mostly new and unproven
Systematically solicit and incorporate large scale teacher input – efforts in this area have been at best inadequate

Some Value Added Research Resources from ASCD:

Using Value-Added Measures to Evaluate Teachers

Use Caution with Value-Added Measures

How Stable are Value-Added Estimates

November 28, 2012 in Article

Highlights:

A teacher’s value-added score in one year is partially but not fully predictive of her performance in the next.
Value-added is unstable because true teacher performance varies and because value-added measures are subject to error.
Two years of data do a meaningfully better job at predicting value added than does just one.
A teacher’s value added in one subject is only partially predictive of her value added in another, and a teacher’s value added for one group of students is only partially predictive of her value added for others.
The variation of a teacher’s value added across time, subject, and student population depends in part on the model with which it is measured and the source of the data that is used.
Year-to-year instability suggests caution when using value-added measures to make decisions for which there are no mechanisms for re-evaluation and no other sources of information.
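The gain from pooling years can be illustrated with the Spearman-Brown formula from classical test theory. This is a sketch under a simplifying assumption that is ours, not the brief's: each year's score is treated as a stable teacher component plus independent error of equal size, so the reliability of a k-year average is k·r / (1 + (k − 1)·r), where r is the single-year reliability. The value r = 0.4 below is hypothetical.

```python
def reliability_of_average(single_year_reliability, n_years):
    """Spearman-Brown reliability of an average over n_years scores.

    Assumes a stable teacher effect plus independent, equally noisy
    yearly errors (a simplification; true performance may also vary).
    """
    r = single_year_reliability
    if not 0.0 <= r <= 1.0:
        raise ValueError("single_year_reliability must be between 0 and 1")
    if n_years < 1:
        raise ValueError("n_years must be at least 1")
    return n_years * r / (1 + (n_years - 1) * r)

# Hypothetical single-year reliability of 0.4.
one_year = reliability_of_average(0.4, 1)
two_years = reliability_of_average(0.4, 2)
three_years = reliability_of_average(0.4, 3)
```

Under these assumptions each added year helps, but with diminishing returns, and even several years of data leave the measure well short of perfectly reliable.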

Introduction

Value-added models measure teacher performance by the test score gains of their students, adjusted for a variety of factors such as the performance of students when they enter the class. The measures are based on desired student outcomes such as math and reading scores, but they have a number of potential drawbacks. One of them is the inconsistency in estimates for the same teacher when value added is measured in a different year, or for different subjects, or for different groups of students.

Some of the differences in value added from year to year result from true differences in a teacher’s performance. Differences can also arise from classroom peer effects; the students themselves contribute to the quality of classroom life, and this contribution changes from year to year. Other differences come from the tests on which the value-added measures are based; because test scores are not perfectly accurate measures of student knowledge, it follows that they are not perfectly accurate gauges of teacher performance.

In this brief, we describe how value-added measures for individual teachers vary across time, subject, and student populations. We discuss how additional research could help educators use these measures more effectively, and we pose new questions, the answers to which depend not on empirical investigation but on human judgment. Finally, we consider how the current body of knowledge, and the gaps in that knowledge, can guide decisions about how to use value-added measures in evaluations of teacher effectiveness.

[readon2 url="http://www.carnegieknowledgenetwork.org/briefs/value-added/value-added-stability/"]Continue reading...[/readon2]

How Should Educators Interpret Value-Added Scores?

November 26, 2012 in Article

Highlights

Each teacher, in principle, possesses one true value-added score each year, but we never see that "true" score. Instead, we see a single estimate within a range of plausible scores.
The range of plausible value-added scores – the confidence interval – can overlap considerably for many teachers. Consequently, for many teachers we cannot readily distinguish between them with respect to their true value-added scores.
Two conditions would enable us to achieve value-added estimates with high reliability: first, if teachers' value-added measurements were more precise, and second, if teachers’ true value-added scores varied more dramatically than they do.
Two kinds of errors of interpretation are possible when classifying teachers based on value-added: a) “false identifications” of teachers who are actually above a certain percentile but who are mistakenly classified as below it; and b) “false non-identifications” of teachers who are actually below a certain percentile but who are classified as above it. Falsely identifying teachers as being below a threshold poses risk to teachers, but failing to identify teachers who are truly ineffective poses risks to students.
Districts can conduct a procedure to identify how uncertainty about true value-added scores contributes to potential errors of classification. First, specify the group of teachers you wish to identify. Then, specify the fraction of false identifications you are willing to tolerate. Finally, specify the likely correlation between value-added score this year and next year. In most real-world settings, the degree of uncertainty will lead to considerable rates of misclassification of teachers.
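The point about overlapping confidence intervals can be made concrete with a small sketch. The estimates and standard errors below are hypothetical, and the interval uses the usual normal approximation (estimate ± 1.96 × standard error for roughly 95% coverage). Overlap is only a rough screen and is not equivalent to a formal test of the difference between two teachers.

```python
def confidence_interval(estimate, standard_error, z=1.96):
    """Approximate 95% interval around a value-added estimate
    (normal approximation; z=1.96 is the conventional multiplier)."""
    margin = z * standard_error
    return (estimate - margin, estimate + margin)

def intervals_overlap(a, b):
    """True if two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical estimates, in student test-score standard deviations.
teacher_a = confidence_interval(0.10, 0.08)
teacher_b = confidence_interval(-0.05, 0.08)
teacher_c = confidence_interval(0.40, 0.08)

# A and B have different point estimates, but their ranges of
# plausible scores overlap, so they cannot be readily distinguished.
a_vs_b = intervals_overlap(teacher_a, teacher_b)
# C's interval lies entirely above B's.
c_vs_b = intervals_overlap(teacher_c, teacher_b)
```

With these made-up numbers, teachers A and B would look different in a ranked list even though their plausible ranges overlap; only teacher C is clearly separated from B.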

Introduction

A teacher's value-added score is intended to convey how much that teacher has contributed to student learning in a particular subject in a particular year. Different school districts define and compute value-added scores in different ways. But all of them share the idea that teachers who are particularly successful will help their students make large learning gains, that these gains can be measured by students' performance on achievement tests, and that the value-added score isolates the teacher's contribution to these gains.

A variety of people may see value-added estimates, and each group may use them for different purposes. Teachers themselves may want to compare their scores with those of others and use them to improve their work. Administrators may use them to make decisions about teaching assignments, professional development, pay, or promotion. Parents, if they see the scores, may use them to request particular teachers for their children. And, finally, researchers may use the estimates for studies on improving instruction.

Using value-added scores in any of these ways can be controversial. Some people doubt the validity of the achievement tests on which the scores are based, some question the emphasis on test scores to begin with, and others challenge the very idea that student learning gains reflect how well teachers do their jobs.

Our purpose is not to settle these controversies, but, rather, to answer a more limited, but essential, question: How might educators reasonably interpret value-added scores? Social science has yet to come up with a perfect measure of teacher effectiveness, so anyone who makes decisions on the basis of value-added estimates will be doing so in the midst of uncertainty. Making choices in the face of doubt is hardly unusual – we routinely contend with projected weather forecasts, financial predictions, medical diagnoses, and election polls. But as in these other areas, in order to sensibly interpret value-added scores, it is important to do two things: understand the sources of uncertainty and quantify its extent. Our aim is to identify possible errors of interpretation, to consider how likely these errors are to arise, and to help educators assess how consequential they are for different decisions.

We'll begin by asking how value-added scores are defined and computed. Next, we'll consider two sources of error: statistical bias and statistical imprecision.

[readon2 url="http://www.carnegieknowledgenetwork.org/briefs/value-added/interpreting-value-added/"]Continue reading...[/readon2]

Popular modes of evaluating teachers are fraught with inaccuracies

February 22, 2012 in Article

In conclusion
New approaches to teacher evaluation should take advantage of research on teacher effectiveness. While there are considerable challenges in using value-added test scores to evaluate individual teachers directly, using value-added methods in research can help validate measures that are productive for teacher evaluation.

Research indicates that value-added measures of student achievement tied to individual teachers should not be used for high-stakes, individual-level decisions, or comparisons across highly dissimilar schools or student populations. Valid interpretations require aggregate-level data and should ensure that background factors — including overall classroom composition — are as similar as possible across groups being compared. In general, such measures should be used only in a low-stakes fashion when they’re part of an integrated analysis of teachers’ practices.

Standards-based evaluation processes have also been found to be predictive of student learning gains and productive for teacher learning. These include systems like National Board certification and performance assessments for beginning teacher licensing as well as district and school-level instruments based on professional teaching standards. Effective systems have developed an integrated set of measures that show what teachers do and what happens as a result. These measures may include evidence of student work and learning, as well as evidence of teacher practices derived from observations, videotapes, artifacts, and even student surveys.

These tools are most effective when embedded in systems that support evaluation expertise and well-grounded decisions, by ensuring that evaluators are trained, evaluation and feedback are frequent, mentoring and professional development are available, and processes are in place to support due process and timely decision making by an appropriate body.

With these features in place, evaluation can become a more useful part of a productive teaching and learning system, supporting accurate information about teachers, helpful feedback, and well-grounded personnel decisions.

Kappan magazine - Teacher evaluation

The Stability Of Ohio’s School Value-Added Ratings

September 28, 2011 in Article

The Albert Shanker Institute has an important analysis of Ohio's school report card data, and finds a large amount of instability in the results. This should give us pause, especially as we move towards using teacher-level value-added data for high-stakes decisions. To say it will be critical to have reliable, trustworthy, and stable data when making hiring, firing, and salary decisions is an obvious understatement. If there are serious and genuine questions about the stability of building-level data, then the rush to go further ought to at least have some brakes applied.

On the other hand, there’s a degree to which instability is to be expected and even welcomed (see here and here). For one thing, school performance can exhibit “real” improvement (or degradation). In addition, nobody expects perfect precision, and part of the year-to-year instability might simply be due to small, completely “tolerable” amounts of random error.

Some people might look at these results, in which most schools got different ratings between years, and be very skeptical of Ohio’s value-added measures. Others will have faith in them. It’s important to bear in mind that measuring school “quality” is far from an exact science, and all attempts to do so – using test scores or other metrics – will necessarily entail imprecision, both within and between years. It is good practice to always keep this in mind, and to interpret the results with caution.

So I can’t say definitively whether the two-year instability in ratings among Ohio’s public schools is “high” or “low” by any absolute standard. But I can say that the data suggest that schools really shouldn’t be judged to any significant extent based on just one or two years of value-added ratings.

Unfortunately, that’s exactly what’s happening in Ohio. Starting this year, all schools that come in “above expectations” in any given year are automatically bumped up a full “report card grade,” while schools that receive a “below expectations” rating for two consecutive years are knocked down a grade (there are six possible grades). In both cases, the rules were changed (effective this year) such that fewer years were required to trigger the bumps – previously, it took two consecutive years “above expectations” to get a higher report card grade, and three consecutive years “below expectations” to lose a grade (see the state’s guide to ratings). These final grades can carry serious consequences, including closure, if they remain persistently low.

As I’ve said before, value-added and other growth models can be useful tools, if used properly. This is especially true of school-level value-added, since the samples are larger, and issues such as non-random assignment are less severe due to pooling of data for an entire school. However, given the rather high instability of ratings between years, and the fact that accuracy improves with additional years of data, the prudent move, if any, would be to require more years of ratings to affect report card grades, not fewer. The state is once again moving in the wrong direction.

Check out the entire article here.