The Science of Value-Added Evaluation

"A value-added analysis constitutes a series of personal, high-stakes experiments conducted under extremely uncontrolled conditions."

If drug experiments were conducted the way VAM is, we might all have three legs, or worse.

Value-added teacher evaluation has been extensively criticized and strongly defended, but less frequently examined from a dispassionate scientific perspective. Among the value-added movement's most fervent advocates is a respected scientific school of thought that believes reliable causal conclusions can be teased out of huge data sets by economists or statisticians using sophisticated statistical models that control for extraneous factors.

Another scientific school of thought, especially prevalent in medical research, holds that the most reliable method for arriving at defensible causal conclusions involves conducting randomized controlled trials, or RCTs, in which (a) individuals are premeasured on an outcome, (b) randomly assigned to receive different treatments, and (c) measured again to ascertain if changes in the outcome differed based upon the treatments received.

The purpose of this brief essay is not to argue the pros and cons of the two approaches, but to frame value-added teacher evaluation from the latter, experimental perspective. For conceptually, what else is an evaluation of perhaps 500 4th grade teachers in a moderate-size urban school district but 500 high-stakes individual experiments? Are not students premeasured, assigned to receive a particular intervention (the teacher), and measured again to see which teachers were the more (or less) efficacious?

Granted, a number of structural differences exist between a medical randomized controlled trial and a districtwide value-added teacher evaluation. Medical trials normally employ only one intervention instead of 500, but the basic logic is the same. Each medical RCT is also privy to its own comparison group, while individual teachers share a common one (consisting of the entire district's average 4th grade results).

From a methodological perspective, however, both medical and teacher-evaluation trials are designed to generate causal conclusions: namely, that the intervention was statistically superior to the comparison group, statistically inferior, or just the same. But a degree in statistics shouldn't be required to recognize that an individual medical experiment is designed to produce a more defensible causal conclusion than the collected assortment of 500 teacher-evaluation experiments.

How? Let us count the ways:

  • Random assignment is considered the gold standard in medical research because it helps to ensure that the participants in different experimental groups are initially equivalent and therefore have the same propensity to change relative to a specified variable. In controlled clinical trials, the process involves a rigidly prescribed computerized procedure whereby every participant is afforded an equal chance of receiving any given treatment. Public school students cannot be randomly assigned to teachers between schools for logistical reasons and are seldom if ever truly randomly assigned within schools because of (a) individual parent requests for a given teacher; (b) professional judgments regarding which teachers might benefit certain types of students; (c) grouping of classrooms by ability level; and (d) other, often unknown, possibly idiosyncratic reasons. Suffice it to say that no medical trial would ever be published in any reputable journal (or reputable newspaper) which assigned its patients in the haphazard manner in which students are assigned to teachers at the beginning of a school year.
  • Medical experiments are designed to purposefully minimize the occurrence of extraneous events that might potentially influence changes on the outcome variable. (In drug trials, for example, it is customary to ensure that only the experimental drug is received by the intervention group, only the placebo is received by the comparison group, and no auxiliary treatments are received by either.) However, no comparable procedural control is attempted in a value-added teacher-evaluation experiment (either for the current year or for prior student performance), so any student assigned to any teacher can receive auxiliary tutoring, be helped at home, be team-taught, or be subjected to any number of naturally occurring positive or disruptive learning experiences.
  • When medical trials are reported in the scientific literature, their statistical analysis involves only the patients assigned to an intervention and its comparison group (which could quite conceivably constitute a comparison between two groups of 30 individuals). This means that statistical significance is computed to facilitate a single causal conclusion based upon a total of 60 observations. The statistical analyses reported for a teacher evaluation, on the other hand, would be reported in terms of all 500 combined experiments, which in this example would constitute a total of 15,000 observations (or 30 students times 500 teachers). The 500 causal conclusions published in the newspaper (or on a school district website), on the other hand, are based upon separate contrasts of 500 "treatment groups" (each composed of changes in outcomes for a single teacher's 30 students) versus essentially the same "comparison group."
  • Explicit guidelines exist for the reporting of medical experiments, such as the (a) specification of how many observations were lost between the beginning and the end of the experiment (which is seldom done in value-added experiments, but would entail reporting student transfers, dropouts, missing test data, scoring errors, improperly marked test sheets, clerical errors resulting in incorrect class lists, and so forth for each teacher); and (b) whether statistical significance was obtained—which is impractical for each teacher in a value-added experiment since the reporting of so many individual results would violate multiple statistical principles.
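The multiple-comparisons problem raised in the last bullet can be made concrete with a little arithmetic. The sketch below is illustrative only: the 500 teachers and the conventional 0.05 significance threshold come from the essay's running example, and the Bonferroni correction shown at the end is just one standard textbook remedy, not something the essay prescribes.

```python
# Illustrative sketch of the multiple-comparisons problem behind
# running 500 separate teacher-level significance tests.
alpha = 0.05     # conventional per-test significance threshold
n_tests = 500    # one test per teacher, as in the essay's example

# Expected number of "significant" results even if every teacher were
# exactly average (i.e., all 500 null hypotheses were true):
expected_false_positives = alpha * n_tests        # about 25 teachers

# Probability that at least one teacher is flagged by chance alone:
p_at_least_one = 1 - (1 - alpha) ** n_tests       # essentially 1.0

# A Bonferroni correction holds the family-wise error rate at alpha,
# but at the cost of a far stricter per-test threshold:
bonferroni_threshold = alpha / n_tests            # 0.0001 per teacher

print(expected_false_positives, p_at_least_one, bonferroni_threshold)
```

In other words, publishing 500 uncorrected significance tests all but guarantees that some teachers are labeled exceptional or deficient by chance alone, which is why the essay says such reporting "would violate multiple statistical principles."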

[readon2 url="http://www.edweek.org/ew/articles/2013/01/16/17bausell.h32.html"]Continue reading...[/readon2]

Merit pay and the candle problem

Let us pretend for a moment that many of the corporate education reforms being proposed offered more than just a metaphorical big stick with which to fire teachers more easily, and also dangled a few carrots in the form of extra money for high performers, as determined by their students' test scores. Yes, yes, we know.

Let's go even further and pretend that student test scores were the perfect means with which to judge the effectiveness of any teacher. What do we know about financial incentives? From Nature magazine:

Here's a simple fact we've known since 1962: using money as a motivator makes us less capable at problem-solving. It actually makes us dumber.

In the 1940s, an experiment was carried out that is now referred to as the "Candle Problem." The experiment asks the participant to fix a lit candle to a wall in such a way that the candle wax won't drip onto the floor. The participant can use only the candle, a book of matches, and a box of thumbtacks.

Let's go back to that Nature article to explain the rest of the experiment and its counterintuitive results:

The only answer that really works is this: (1) dump the tacks out of the box, (2) tack the box to the wall, and (3) light the candle and affix it atop the box as if it were a candle-holder. Incidentally, the problem was much easier to solve if the tacks weren't in the box at the beginning. When the tacks were in the box, the participant saw it only as a tack-box, not as something they could use to solve the problem. This phenomenon is called "functional fixedness."

Sam Glucksberg added a fascinating twist to this finding in his 1962 paper, "Influence of strength of drive on functional fixedness and perceptual recognition" (Journal of Experimental Psychology, 1962, Vol. 63, No. 1, 36-41). He studied the effect of financial incentives on solving the candle problem. To one group he offered no money. To the other group he offered money for solving the problem quickly.

Remember, there are two candle problems. Let the "Simple Candle Problem" be the one where the tacks are outside the box -- no functional fixedness. The solution is straightforward. Here are the results for those who solved it:

Simple Candle Problem Mean Times:
WITHOUT a financial incentive: 4.99 min
WITH a financial incentive: 3.67 min

Nothing unexpected here. This is the classic incentive effect anybody would intuitively expect.

Now, let "In-Box Candle Problem" refer to the original description where the tacks start off in the box.

In-Box Candle Problem Mean Times:
WITHOUT a financial incentive: 7.41 min
WITH a financial incentive: 11.08 min

How could this be? The financial incentive made people slower? It gets worse: the slowness increases with the incentive. The higher the monetary reward, the worse the performance. This result has been replicated many times since the original experiment.

We've published a video on this phenomenon before, titled "As Teacher Merit Pay Spreads, One Noted Voice Cries, ‘It Doesn’t Work’", and an article from the Harvard Business Review, titled "Stop Tying Pay To Performance".

Here's another video - The surprising truth about what motivates us

Knowing all this raises the question: why are we going down the path of some of these corporate education reforms when we have known for over half a century that many of them are flawed concepts, demonstrated to fail time and time again?

A $715 million experiment

The Youngstown Vindicator echoes some of the issues we highlighted in a weekend guest column - CHARTER SCHOOLS AND OUR TAX DOLLARS.

In an article titled "State continues to blindly shift funding to charter schools," the paper writes:

The salaries of public school teachers and administrators are readily available, most notably on the website of the Buckeye Institute, which lists every Ohio public school employee by name and salary. But good luck finding a database that provides the same insight into charter operations. And Ohio taxpayers can only dream of knowing how much of their $74 million that White Hat collects will end up as profit for owner David Brennan, a longtime proponent of charter schools and a financial supporter of politicians who share his view.

Atty. Charles R. Saxbe, who represents White Hat in a lawsuit brought by some of its charter school boards, said public funds become private once they enter White Hat’s accounts.

Charter schools were first sold to Ohio voters as an experiment. The results of that experiment are not in, but the General Assembly continues to increase funding for charter schools. What was a $51 million experiment in 2000 has ballooned to a $715 million experiment in 2011. While charter schools take an ever bigger bite out of the education pie in Ohio, funding for public schools, adjusted for inflation, has flat-lined.

Millions of dollars are invested in promoting this failed experiment, because many millions more dollars are at stake in profits.