VAM-Based Decisions Are Less Reliable Than Flipping a Coin

A new study of Value-Added Measures (VAM) once again finds this statistical tool inappropriate for measuring the effectiveness of teachers.

The question of stability [reliability/consistency] is not a question about whether average teacher performance rises, declines, or remains flat over time. The issue that concerns critics of VAM is whether individual teacher performance fluctuates over time in a way that invalidates inferences that an individual teacher is “low-” or “high-” performing. 

This distinction is crucial because VAM is increasingly being applied such that individual teachers who are identified as low-performing are to be terminated. From the perspective of individual teachers, it is inappropriate and invalid to fire a teacher whose performance is low this year but high the next year, and it is inappropriate to retain a teacher whose performance is high this year but low next year. 

Even if average teacher performance remains stable over time, individual teacher performance may fluctuate wildly from year to year.

After looking at numerous studies of VAM, the author concludes...

What this means is that value-added teacher rankings are insufficiently reliable for the purpose of high-stakes decisions regarding hiring and firing. High-stakes decisions are clearly unwarranted if this volatility in the rankings is due to unmeasured variables or random measurement error.

However, even in the unlikely event that there are no unmeasured variables and measurement error is zero, implying that all volatility is due to true variation in teacher performance, it would not be appropriate to hire or fire based on the ranking in a given year [“year t”] if the ranking changes in the following year (designated “year t+1”) by such an extent as to invalidate the year t ranking.

If VAM is used to identify and fire the bottom quartile (or quintile) of teachers, the results in Tables 1 and 2 indicate that this decision is incorrect, according to the year t+1 teacher rankings, between 59 and 70% of the time. If VAM-based culling is less reliable than flipping a coin, as these results suggest, then productive teachers would be culled more frequently than unproductive bottom quartile (or bottom quintile) teachers.
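
To get a feel for how a number like that can arise, here is a toy simulation. It is not the paper's method and uses none of its data. It assumes each teacher's year t and year t+1 scores are standard normal with a year-to-year correlation `rho`, a hypothetical parameter chosen only for illustration, and then counts how many teachers ranked in the bottom quartile in year t are no longer there in year t+1.

```python
import random


def bottom_quartile_flip_rate(rho, n_teachers=20000, seed=0):
    """Share of year-t bottom-quartile teachers who are NOT in the
    bottom quartile in year t+1, under an assumed correlation rho.

    Illustrative assumption: scores are bivariate normal. This is a
    sketch, not the model or data used in the study.
    """
    rng = random.Random(seed)
    year_t, year_t1 = [], []
    for _ in range(n_teachers):
        x = rng.gauss(0.0, 1.0)  # year t score
        z = rng.gauss(0.0, 1.0)  # independent year-to-year noise
        year_t.append(x)
        # Year t+1 score with correlation rho to the year t score.
        year_t1.append(rho * x + (1.0 - rho ** 2) ** 0.5 * z)

    k = n_teachers // 4  # size of the bottom quartile
    order_t = sorted(range(n_teachers), key=year_t.__getitem__)
    order_t1 = sorted(range(n_teachers), key=year_t1.__getitem__)
    bottom_t = set(order_t[:k])
    bottom_t1 = set(order_t1[:k])

    # Fraction of "fire" decisions that year t+1 rankings would contradict.
    return 1.0 - len(bottom_t & bottom_t1) / k


if __name__ == "__main__":
    for rho in (0.0, 0.3, 0.6, 1.0):
        rate = bottom_quartile_flip_rate(rho)
        print(f"rho={rho:.1f}: {rate:.0%} of bottom-quartile teachers move out")
```

In this sketch, perfectly stable scores (`rho` of 1) give a flip rate of zero, while pure noise (`rho` of 0) gives roughly three quarters, since a randomly chosen teacher has a 75% chance of landing outside the bottom quartile. Any flip rate above one half means a firing decision made on the year t ranking is contradicted by the year t+1 ranking more often than not.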

The whole notion that you could measure a teacher the same way a farmer measures a pig was always insulting and ridiculous. The ongoing science continues to prove that.

Here's the full paper