Same Teachers, Similar Students, Similar Tests, Different Answers

Via Vamboozled

One of my favorite studies to date about VAMs was conducted by John Papay, an economist formerly at Harvard and now at Brown University. In the study titled “Different Tests, Different Answers: The Stability of Teacher Value-Added Estimates Across Outcome Measures,” published in 2009 in the American Educational Research Journal, the field’s third-ranked and highly reputable peer-reviewed journal, Papay presents evidence that different yet similar tests (i.e., similar in content and administered at similar times to similar sets of students) do not provide similar answers about teachers’ value-added performance. This is a validity issue: if two tests measure the same things, for the same students, at the same times, they should yield similar-to-the-same results. They do not. Instead, Papay found only moderate rank correlations, ranging from r = 0.15 to r = 0.58, among the value-added estimates derived from the different tests.
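To make concrete what a rank correlation between two sets of value-added estimates is, here is a minimal sketch, not Papay’s actual analysis or data: it computes a Spearman rank correlation for the same set of teachers scored on two similar tests. The numbers, variable names, and the use of scipy are my own illustrative assumptions.

```python
# A minimal, hypothetical sketch of comparing teacher value-added estimates
# across two similar tests using a rank correlation. Data are made up.
from scipy.stats import spearmanr

# Hypothetical value-added estimates for the same ten teachers on two tests
vam_test_a = [0.12, -0.30, 0.45, 0.05, -0.10, 0.33, -0.22, 0.18, 0.02, -0.05]
vam_test_b = [0.20, -0.05, 0.10, 0.40, -0.25, 0.15, 0.05, -0.18, 0.30, -0.02]

rho, p_value = spearmanr(vam_test_a, vam_test_b)
print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.2f})")
# If the two tests supported the same inferences about the same teachers,
# rho would sit near +1.0; the roughly 0.15 to 0.58 range Papay reports
# falls well short of that.
```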

Yet another, recently released study (albeit not yet peer-reviewed) has found similar results, potentially solidifying this finding further into our understandings about VAMs and their issues, particularly in terms of validity (or the truth of VAM-based results). This study, “Comparing Estimates of Teacher Value-Added Based on Criterion- and Norm-Referenced Tests,” was released by the U.S. Department of Education and conducted by four researchers representing the University of Notre Dame, Basis Policy Research, and the American Institutes for Research. It provides evidence, once again, that estimates of teacher value-added based on different yet similar tests (in this case, a criterion-referenced state assessment and a widely used norm-referenced test given in the same subject at around the same time) are only moderately correlated at the teacher level.

If we had confidence in the validity of inferences based on value-added measures, these correlations (or, more simply put, “relationships”) should be much higher than what these researchers found which, similar to Papay’s results, ranged from 0.44 to 0.65. While the ideal correlation coefficient in this case would be r = +1.0, that is very rarely achieved. But for the purposes for which teacher-level value-added is currently being used, correlations above r = +0.70 or r = +0.80 would (and should) be desired, and possibly required, before high-stakes decisions about teachers are made based on these data.
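For readers who want the formal definition behind the r values cited here, the standard Pearson correlation between two sets of estimates x and y for n teachers is given below (a rank correlation is the same formula applied to the teachers’ ranks rather than their raw estimates); it is bounded between -1 and +1, with +1 meaning the two tests order and scale teachers identically.

$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} $$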

In addition, the researchers in this study found that, on average, only 33.3% of teachers were placed in the same range of scores (i.e., quintiles, or bands 20 percentage points wide) by both sets of value-added estimates in the same school year. This, too, has implications for validity: teachers’ value-added estimates should fall in the same ranges when similar tests are used if any valid inferences are to be made from them.
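To illustrate what that same-quintile figure means, here is a minimal sketch, not the study’s actual method or data: it simulates two moderately correlated sets of estimates for the same teachers, bins each set into quintiles, and reports the share of teachers who land in the same quintile both times. The simulated data, the 0.6 correlation, and the numpy/pandas calls are my own illustrative assumptions.

```python
# A hypothetical sketch of computing same-quintile agreement between two
# sets of teacher value-added estimates. All data here are simulated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_teachers = 500
vam_test_a = rng.normal(size=n_teachers)
# A correlated-but-noisy second estimate (correlation of about 0.6)
vam_test_b = 0.6 * vam_test_a + 0.8 * rng.normal(size=n_teachers)

quintile_a = pd.qcut(vam_test_a, 5, labels=False)  # 0 = bottom 20%, 4 = top 20%
quintile_b = pd.qcut(vam_test_b, 5, labels=False)

same_quintile = np.mean(quintile_a == quintile_b)
print(f"Share of teachers in the same quintile on both tests: {same_quintile:.1%}")
# Perfectly consistent estimates would yield 100% agreement; pure chance
# alone would yield about 20%.
```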