Policymakers and school administrators have embraced value-added models of teacher effectiveness as tools for educational improvement. Teacher value-added estimates may be viewed as complicated scores of a certain kind. This suggests using a test validation model to examine their reliability and validity. Validation begins with an interpretive argument for inferences or actions based on value-added scores. That argument addresses (a) the meaning of the scores themselves: whether they measure the intended construct; (b) their generalizability: whether the results are stable from year to year or using different student tests, for example; and (c) the relation of value-added scores to broader notions of teacher effectiveness: whether teachers’ effectiveness in raising test scores can serve as a proxy for other aspects of teaching quality. Next, the interpretive argument directs attention to rationales for the expected benefits of particular value-added score uses or interpretations, as well as plausible unintended consequences. This kind of systematic analysis raises serious questions about some popular policy prescriptions based on teacher value-added scores.
The whole report, included below, is worth a read, or at least a skip to the conclusion:
My first conclusion should come as no surprise: Teacher VAM scores should emphatically not be included as a substantial factor with a fixed weight in consequential teacher personnel decisions. The information they provide is simply not good enough to use in that way. It is not just that the information is noisy. Much more serious is the fact that the scores may be systematically biased for some teachers and against others, and major potential sources of bias stem from the way our school system is organized. No statistical manipulation can assure fair comparisons of teachers working in very different schools, with very different students, under very different conditions. One cannot do a good enough job of isolating the signal of teacher effects from the massive influences of students’ individual aptitudes, prior educational histories, out-of-school experiences, peer influences, and differential summer learning loss, nor can one adequately adjust away the varying academic climates of different schools. Even if acceptably small bias from all these factors could be assured, the resulting scores would still be highly unreliable and overly sensitive to the particular achievement test employed. Some of these concerns can be addressed, by using teacher scores averaged across several years of data, for example. But the interpretive argument is a chain of reasoning, and every proposition in the chain must be supported. Fixing one problem or another is not enough to make the case.
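To make the distinction between noise and bias concrete, here is a minimal sketch of why averaging across years helps with one and not the other. It is not taken from the report. It assumes each year's score is the teacher's true effect plus a bias that stays the same every year plus independent yearly noise, and it uses the standard Spearman-Brown formula for the reliability of an average of parallel measurements. The function names and the input numbers are hypothetical, chosen only for illustration.

```python
def reliability_of_average(single_year_reliability: float, n_years: int) -> float:
    """Spearman-Brown formula: reliability of the mean of n parallel measurements.

    Assumes every year measures the same true score with independent,
    equal-variance errors (an idealization, not a claim about real VAM data).
    """
    r = single_year_reliability
    return n_years * r / (1 + (n_years - 1) * r)


def error_variance_of_average(bias_var: float, noise_var: float, n_years: int) -> float:
    """Error variance of an n-year average under the assumed model.

    The persistent bias is identical in every year, so averaging leaves its
    variance untouched; only the independent yearly noise shrinks by 1/n.
    """
    return bias_var + noise_var / n_years


if __name__ == "__main__":
    # Illustrative inputs only, not estimates from any study.
    for years in (1, 3, 5):
        rel = reliability_of_average(0.25, years)
        err = error_variance_of_average(bias_var=1.0, noise_var=3.0, n_years=years)
        print(f"{years} year(s): reliability={rel:.3f}, error variance={err:.3f}")
```

Under these assumptions, more years of data raise reliability, but the error variance never drops below the bias term, which is one way to read the point that fixing the noise problem alone does not make the case.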