The Arbitrary Albatross: Standardized Testing and Teacher Evaluation

On Chicago's streets and Hollywood's silver screens, education reform has been cast as a false dilemma between students and teachers. Reputable actresses and liberal mayors have both fallen prey. At the center of this drama lie teacher evaluations. A linchpin of the debate, they weigh especially heavily around the necks of educators like me.

Think: Shaky Foundation

With the arrival of spring, testing season is now upon us: America's new national pastime. I believe student results from standardized tests should not be used to evaluate teachers because the data are imprecise and the effects are pernicious. Including such inaccurate measures is both unfair to teachers and detrimental to student learning.

As a large body of research suggests, standardized test data are imprecise for two main reasons. First, they do not account for individual and environmental factors affecting student performance, factors over which teachers have no control. (Think: commitment, social class, family.) Second, high-stakes, one-time tests increase the likelihood of random variation so that scores fluctuate in arbitrary ways not linked to teacher efficacy. (Think: sleep, allergies, the heartache of a recent breakup.)

High-stakes assessments are also ruinous to student learning. They encourage, at least, teaching to the test and, at most, outright cheating. This dynamic is captured by Campbell's law, which holds that the more a quantitative social indicator is used for decision-making, the more subject it becomes to corruption pressures, and the more apt it is to distort and corrupt the very processes it is intended to monitor. (Think: presidential campaigns.)

As a teacher, if my livelihood is based on test results, then I will do everything possible to ensure high marks, including narrowing the curriculum and prepping fiercely for the test. The choice between an interesting project and a paycheck is no choice at all. These are powerful disincentives to student learning. Tying teachers' careers to standardized tests does not foster creative, passionate, skillful young adults. It does exactly the opposite.


The Toxic Trifecta in Current Legislative Models for Teacher Evaluation

A relatively consistent legislative framework for teacher evaluation has evolved across states in the past few years. Many of the legal concerns that arise do so because of inflexible, arbitrary, and often ill-conceived yet standard components of this legislative template. The standard model has three basic features, each of which is problematic in its own right, and the problems are compounded when the features are used in combination.

First, the standard evaluation model proposed in legislation requires that objective measures of student achievement growth be considered in a weighting system of parallel components. Student achievement growth measures are assigned, for example, a 40 or 50% weight alongside observation and other evaluation measures. Placing the measures alongside one another in a weighting scheme assumes that all measures in the scheme are of equal validity and reliability but of varied importance (utility), hence the varied weights. Each measure must be included, and must be assigned the prescribed weight, with no opportunity to question the validity of any measure. [1]

Such a system also assumes that the various measures are each scaled so that they can vary to similar degrees: that the observational evaluations will be scaled to produce variation similar to the student growth measures, and that the variance in both measures is equally valid, not compromised by random error or bias. In fact, it remains highly likely that some components of the teacher evaluation model will vary far more than others, if for no other reason than that some measures contain more random noise than others, or that some of the variation is attributable to factors beyond the teacher's control. Regardless of the assigned weights, and regardless of the cause of the variation (true or false measure), the measure that varies more will carry more weight in the final classification of the teacher as effective or not. In a system that assigns differential weights but assumes equal validity across measures, even if the student achievement growth component carries only a minority share of the weight, it may easily become the primary tipping point in most high-stakes personnel decisions.
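The point about the noisier component dominating can be made concrete with a small simulation. This is a minimal sketch with hypothetical numbers: it assumes observation ratings cluster tightly (small spread) while growth estimates are noisy (large spread), and gives growth only a 40% nominal weight.

```python
import random
import statistics

random.seed(0)
N = 10_000

# Hypothetical scales: observation ratings cluster tightly (SD 0.2),
# growth scores vary widely (SD 1.0). Both centered at zero.
observation = [random.gauss(0, 0.2) for _ in range(N)]
growth = [random.gauss(0, 1.0) for _ in range(N)]

# Growth gets only 40% of the nominal weight in the composite.
composite = [0.6 * o + 0.4 * g for o, g in zip(observation, growth)]

def corr(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((len(xs) - 1) * sx * sy)

print("composite vs observation:", round(corr(composite, observation), 2))
print("composite vs growth:     ", round(corr(composite, growth), 2))
```

Despite its 60% weight, the tightly clustered observation component barely moves the composite; the high-variance growth measure effectively determines who lands above or below any cutoff. The specific spreads here are illustrative, not empirical estimates.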

Second, the standard evaluation model proposed in legislation requires that teachers be placed into effectiveness categories by applying arbitrary numerical cutoffs to the aggregated, weighted evaluation components. That is, a teacher at the 25%ile or lower when all evaluation components are combined might be assigned a rating of "ineffective," whereas the teacher at the 26%ile might be labeled effective. Further, a teacher's placement into these groupings may largely, if not entirely, hinge on the student achievement growth component of the evaluation. Teachers on either side of the arbitrary cutoff are, statistically, indistinguishable from one another. In many cases, as with the recently released teacher effectiveness estimates for New York City teachers, the error ranges for the teacher percentile ranks have been on the order of 35 percentile points on average, and up to 50 with only one year of data. Assuming any real difference between the teachers whose point estimates sit at the 25%ile and 26%ile is a huge, unwarranted stretch. Imposing an arbitrary, rigid cutoff score on such noisy measures draws distinctions that simply cannot be justified, especially when making high-stakes employment decisions.
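A short simulation illustrates how noisy a percentile cutoff is. This is a hedged sketch, not a model of any actual evaluation system: it assumes measurement noise comparable in size to the true spread of teacher effects, which is roughly consistent with the wide error bands reported above.

```python
import random

random.seed(1)
N = 1_000

# Hypothetical model: true teacher effects and measurement noise
# of comparable size (SD 1 each).
true_effects = [random.gauss(0, 1) for _ in range(N)]
measured = [t + random.gauss(0, 1) for t in true_effects]

def percentile_rank(x, values):
    """Percent of values strictly below x."""
    return 100 * sum(v < x for v in values) / len(values)

true_ranks = [percentile_rank(t, true_effects) for t in true_effects]
meas_ranks = [percentile_rank(m, measured) for m in measured]

CUTOFF = 25  # "ineffective" below the 25th percentile of measured scores

# Teachers whose TRUE rank sits just ABOVE the cutoff (25th-35th percentile):
# how many are pushed below it by measurement noise alone?
near = [i for i in range(N) if 25 <= true_ranks[i] < 35]
mislabeled = sum(meas_ranks[i] < CUTOFF for i in near) / len(near)
print(f"{mislabeled:.0%} of truly just-above-cutoff teachers rated ineffective")
```

Under these assumed noise levels, roughly a third of teachers whose true performance is safely above the cutoff would nonetheless be labeled ineffective in a given year, purely from measurement error.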

Third, the standard evaluation model proposed in legislation places exact timelines on the conditions for removal of tenure. Typical legislation dictates that tenure either can or must be revoked, and the teacher dismissed, after two consecutive years of being rated ineffective (where tenure can only be achieved after three consecutive years of being rated effective). [2] As such, whether a teacher rightly or wrongly falls just below or just above the arbitrary cutoffs that define the performance categories may have relatively inflexible consequences.

The Forced Choice between “Bad” Measures and “Wrong” Ones


Making (Up) The Grade In Ohio

Every year, most schools (and districts) in Ohio get one of six grades: Emergency, Watch, Continuous Improvement, Effective, Excellent, and Excellent with Distinction. Schools that receive poor grades over a period of years face a cascade of increasingly severe sanctions. This means that these report card grades are serious business.

The method for determining grades is a seemingly arbitrary step-by-step process outlined on page eight of this guidebook. I won't bore you with the details, but suffice it to say that a huge factor determining a school's grade is whether it meets certain benchmarks on one of two measures: the aforementioned "performance index" and the percentage of state standards that it meets. Both of these are "absolute" performance measures – they focus on how well students score on state tests (specifically, how many meet proficiency and other benchmarks), not on whether their scores improve. And neither accounts for differences in student characteristics, such as learning disabilities and income.

As I have discussed before, there is a growing consensus in education policy that, to the degree that schools and teachers should be judged on the basis of test results, the focus should be on whether students are improving (i.e., growth), not how highly they score (i.e., absolute performance). The reasoning is simple: Upon entry into the schooling system, poor kids (and those with disabilities, non-native English speakers, etc.) tend to score lower by absolute standards, and since schools have no control over this, they should be judged by the effect that they have on students, not on which students they happen to receive. That’s why high-profile schools like KIPP are considered effective, even though their overall scores are much lower than those in affluent suburbs.

The strong relationship between district poverty and one of these absolute performance measures – the state’s performance index that Fordham’s Terry Ryan discussed – is clear in the graph below, which I presented in a previous post.


Perhaps the people who designed the Ohio system made a good-faith effort to achieve “balance” between the various components – a very difficult endeavor to be sure. But what they ended up with was a somewhat arbitrary formula that produces troubling, implausible results based on contradictory notions of how to measure performance. The grades are as much a function of income and other student characteristics as anything else, and they’re more likely to change than stay the same between years. So, while I can’t say what the perfect system would look like, I can say that Ohio’s report card grades, without substantial changes, should be taken with a shaker full of salt.

Unfortunately, that’s easy for me to say, but parents, teachers, administrators, and other stakeholders have no such luxury. These grades are used in the highest-stakes decisions.
