Suspicious test scores

Here at JTF, we've been very quick to point out instances of cheating, whether isolated or systemic, as a quick search of our archives or Twitter feed will show. As public education is driven ever further toward corporate styles of management and measurement, coupled with high stakes tied to test scores, it should surprise no one that corporate styles of behavior emerge - think Enron, Arthur Andersen, WorldCom, MF Global Holdings.

It is against that backdrop that we turn to an investigative piece by the Dayton Daily News (DDN), in conjunction with the Atlanta Journal-Constitution (AJC), titled "Suspect test scores found across Ohio schools".

Steep spikes and drops on standardized test scores, a pattern that has indicated cheating in Atlanta and other cities across the nation, have occurred in hundreds of school districts and charter schools across Ohio in the past seven years, a Dayton Daily News analysis found.

The analysis does not prove cheating has occurred in Ohio. But interviews and documents show that state officials do not employ vigorous statistical analyses to catch possible cheating, discipline only about a dozen teachers a year and direct Ohio’s test vendor to spend just $17,540 on analyzing suspicious scores out of its $39 million annual testing contract.

It's a weak piece that could easily be seized upon and sensationalized, and the paper has come under almost instant withering criticism for its approach.

One of the researchers who analyzed data for USA Today's groundbreaking cheating series took a look at the DDN analysis:

Given my past role in reviewing data and methods used for detecting systematic cheating, I was delighted to have the opportunity a week ago to review Ohio assessment data that was being used as part of a national study released today by The Atlanta Journal-Constitution and affiliated Cox newspapers. My review, however, yielded serious concerns about the data used, the methods of analysis employed, and the conclusions drawn.
In short, here are some of my concerns about the methods:
  • As noted, the analysis is based on school-level data and not individual student-level data. Accordingly, it was not possible to ensure that the same students were in the group in both years.
  • The analysis of irregular jumps in test scores should have been coupled with irregularities in erasure data where this data was available.
  • The analysis by Cox generates predicted values for schools, but this does not incorporate demographic characteristics of the student population.
  • The limited details available on the study methods made it impossible to replicate and verify what the journalists were doing. Further, the rationale was unclear for some of the steps they took.
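To make those critiques concrete, here is a minimal, hypothetical sketch of the kind of school-level screen the papers appear to have run: flag schools whose year-over-year change in average score is an extreme outlier. Everything here - the invented data, the robust z-score (median/MAD) method, and the threshold - is for illustration only; the papers did not publish their actual method, which is precisely one of the complaints above.

```python
# Hypothetical sketch: flag schools whose year-over-year change in mean
# score is an outlier relative to all schools' changes. Data, method, and
# threshold are invented; the DDN/AJC method was not published.
from statistics import median

def flag_outlier_gains(scores_y1, scores_y2, threshold=3.0):
    """Return school names whose score change is an extreme outlier.

    scores_y1, scores_y2: dicts mapping school name -> mean scale score.
    Note: this uses school-level averages only, so (as the critics note)
    it cannot tell whether the same students were tested in both years.
    """
    gains = {s: scores_y2[s] - scores_y1[s] for s in scores_y1}
    med = median(gains.values())
    mad = median(abs(g - med) for g in gains.values())
    scale = 1.4826 * mad  # converts MAD to a sigma-equivalent for normal data
    # Flags both spikes and drops, as the DDN "improbable changes" count did.
    return sorted(s for s, g in gains.items() if abs(g - med) > threshold * scale)

y1 = {"A": 400, "B": 398, "C": 401, "D": 400, "E": 402,
      "F": 400, "G": 399, "H": 400, "I": 401, "J": 400}
y2 = {"A": 404, "B": 403, "C": 407, "D": 460, "E": 407,
      "F": 340, "G": 403, "H": 406, "I": 406, "J": 405}
print(flag_outlier_gains(y1, y2))  # D spiked 60 points, F dropped 60
```

Even this toy version shows why the critics wanted student-level and erasure data: a flagged school tells you only that its average moved sharply, not why.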

He wasn't the only expert to consider the DDN findings. Stephen Dyer - former newspaper reporter, architect of Ohio's prematurely abandoned evidence-based model, and think tank fellow - had this to say, after discussing analytical shortcomings similar to those pointed out above:

If you're going to write a story that suggests massive, statewide (and in AJC's case, national) cheating on standardized tests, you'd better be prepared to name the offenders and feel solid enough in your methodology to refute the state's education agency and largest teachers union, both of whom knocked the papers' methods. If you have to spend a large chunk of your story having competing experts defend and knock your statistical analysis, you need to re-do the analysis. Though it showed integrity for the paper to allow those critical comments in the story.

As a former reporter, I can say these issues would invariably pop up before big stories ran. Sometimes that meant delaying a story for a day or two or, in a few cases, never running it at all. As a journalist, you, as a general rule, cannot spend any time in your story defending your story. If you have to, it means you don't have it nailed down yet; it needs more time in the oven.

The DDN spent almost the entirety of their story defending their story.

Greg Mild, over at Plunderbund, has an even harsher response, and points out some glaring absurdities in the DDN analysis:

Furthermore, note that the “2,600 improbable changes” include spikes and drops in test results. These journalists are putting out this theory of irregularities and cheating by schools based on numbers that include falling scores! Right, because so many educators are interested in risking their careers by encouraging children to change their scores to incorrect answers to suffer a significant DROP in their test scores. Yet those numbers are touted by these “journalists” in their sweeping accusations of improbable scores and cheating.

We continue to believe that cheating is totally unacceptable and ought to be exposed when and where found, but the Dayton Daily News story, as they point out themselves, does not come close to demonstrating what they seem to want to sensationalize - widespread cheating, Atlanta style.

As we begin to rely more and more upon student test scores to measure schools and teachers, suspicions are going to grow. A few might be borne out, but many will be baseless - and each accusation serves to undermine public education and people's trust in it. It's another unintended failing of the corporate education reform schemes we're currently pursuing.

Like an untested drug?

If there were a new drug that had shown some promise in curing the flu in lab trials, but there were also indications of some nasty, in some cases fatal, side effects, do you think that drug should undergo more testing and trials, or be rushed into production and given out as widely as possible?

That's basically the scenario we have with using value-added scores for high-stakes decision making when it comes to teachers. Sure, no one is actually going to die, but if corporate education reformers have their way, many teachers might wrongly lose their jobs, the money wasted will never be used to actually educate a student, and there is the opportunity cost of failing to get effective reforms into the classroom.

Given the context-dependency of the estimators’ ability to produce accurate results, however, and our current lack of knowledge regarding prevailing assignment practices, VAM-based measures of teacher performance, as currently applied in practice and research, must be subjected to close scrutiny regarding the methods used and interpreted with a high degree of caution.

Methods of constructing estimates of teacher effects that we can trust for high-stakes evaluative purposes must be further studied, and there is much left to investigate. In future research, we will explore the extent to which various estimation methods, including more sophisticated dynamic treatment effects estimators, can handle further complexity in the DGPs [data generating processes].

The addition of test measurement error, school effects, time-varying teacher effects, and different types of interactions among teachers and students are a few of many possible dimensions of complexity that must be studied. Finally, diagnostics are needed to identify the structure of decay and prevailing teacher assignment mechanisms. If contextual norms with regard to grouping and assignment mechanisms can be deduced from available data, then it may be possible to determine which estimators should be applied in a given context.
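For readers unfamiliar with what a value-added estimate even is, here is a deliberately oversimplified sketch: a teacher's "effect" is her students' average test-score gain minus the average gain across all students. Real VAM estimators are far more elaborate (covariates, shrinkage, multiple years of data), so treat this, with its invented scores, only as an illustration of the basic idea, and of why small class sizes make the estimates noisy.

```python
# Toy "value-added" estimate: a teacher's effect is her students' mean
# score gain minus the overall mean gain. Invented data; real VAMs add
# covariates, shrinkage, and multiple years. A class of two or three
# students makes each estimate hostage to a handful of test days.
from statistics import mean

def value_added(classes):
    """classes: dict mapping teacher -> list of (pre, post) student scores."""
    all_gains = [post - pre for roster in classes.values() for pre, post in roster]
    grand = mean(all_gains)  # average gain across every student
    return {t: round(mean(post - pre for pre, post in roster) - grand, 2)
            for t, roster in classes.items()}

classes = {
    "Smith": [(400, 410), (390, 402), (410, 418)],   # gains: 10, 12, 8
    "Jones": [(405, 411), (395, 403)],               # gains: 6, 8
}
print(value_added(classes))  # {'Smith': 1.2, 'Jones': -1.8}
```

Note how one unusually good or bad test day for a single student in Jones's two-student roster would swing her entire "effect" - the instability the quoted researchers are warning about.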

We must be able to prove that evaluations and the metrics that make them up are fair, accurate and stable, and if they are to have any real benefit they must ultimately demonstrate a cost effective way to improve student achievement and education quality. We're simply not there yet and pretending we are is dangerous and carries some very real risks.