Observations

$50 million. 3 years. No clue.

More on that awful Gates study

Though science does sometimes prove things that are not intuitive, science also depends on accurate premises. So if the conclusion here is that “you can’t believe your eyes” in teacher evaluation (just because you watch a teacher doing a great job, it could be a mirage, since that teacher doesn’t necessarily get the same ‘gains’ as another teacher you thought was terrible based on your observation), it could also mean that one of the initial premises was incorrect. To me, the faulty premise behind this counter-intuitive conclusion is value-added itself: the idea that teacher quality can be determined by comparing student test scores to what a computer predicts those same students would have gotten with an ‘average’ teacher. Would we accept it if a new computer program designed to evaluate music told us that The Beatles’ ‘Yesterday’ is a bad song?

One thing that struck me right away with this report is that student surveys, something that isn’t realistically ever going to be a significant part of high-stakes teacher evaluations, are given such a large percentage in each of the three main weightings they consider (these three scenarios, expressed as test scores-classroom observations-student surveys, are 50-25-25, 33-33-33, and 25-50-25).
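To make the weighting concrete, here is a small illustrative sketch (not from the MET report) of how a composite teacher score would be computed under each of the three scenarios. The component scores and the common 0-1 scale are invented placeholders, purely for illustration.

```python
# Illustrative sketch of the MET composite weightings discussed above.
# Weight order: (test-score value-added, classroom observations, student surveys).
# The component scores below are made-up placeholders on an assumed 0-1 scale.

scenarios = {
    "50-25-25": (0.50, 0.25, 0.25),
    "33-33-33": (1 / 3, 1 / 3, 1 / 3),
    "25-50-25": (0.25, 0.50, 0.25),
}

# Hypothetical teacher: weak value-added, strong observations, middling surveys.
value_added, observations, surveys = 0.30, 0.80, 0.55

for name, (w_va, w_obs, w_survey) in scenarios.items():
    composite = w_va * value_added + w_obs * observations + w_survey * surveys
    print(f"{name}: composite = {composite:.2f}")
```

Even this toy example shows how much the choice of weights matters: the same hypothetical teacher scores 0.49 under the 50-25-25 scenario and 0.61 under 25-50-25.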

Conspicuously missing from the various weighting schemes they compare is one with 100% classroom observations. As this is what many districts currently do and since this report is supposed to guide those who are designing new systems, wouldn’t it be scientifically necessary to include the existing system as the ‘control’ group? As implementing a change is a costly and difficult process, shouldn’t we know what we could expect to gain over the already existing system?

[readon2 url="http://garyrubinstein.teachforus.org/2013/01/13/50-million-3-years-no-clue/"]Read the whole piece[/readon2]

Catastrophic failure of teacher evaluations in TN

If the ongoing teacher evaluation failure in Tennessee is any guide, Ohio has some rough waters ahead. Tennessee's recently passed system is very similar to Ohio's.

It requires 50 percent of the evaluation to be comprised of student achievement data—35 percent based on student growth as represented by the Tennessee Value-Added Assessment System (TVAAS) or a comparable measure and the other 15 percent based on additional measures of student achievement adopted by the State Board of Education and chosen through mutual agreement by the educator and evaluator. The remaining 50 percent of the evaluation is determined through qualitative measures such as teacher observations, personal conferences and review of prior evaluations and work.

Tennessee’s new way of evaluating classrooms “systematically failed” to identify bad teachers and provide them more training, according to a state report published Monday.

The Tennessee Department of Education found that instructors who got failing grades when measured by their students’ test scores tended to get much higher marks from principals who watched them in classrooms. State officials expected to see similar scores from both methods.

“Evaluators are telling teachers they exceed expectations in their observation feedback when in fact student outcomes paint a very different picture,” the report states. “This behavior skirts managerial responsibility.”

The education administration in Tennessee is pointing the finger at the in-class observations, but as one commenter on the article notes,

Perhaps what we are seeing with these disparities is not a need to retrain the evaluators, but rather confirmation of what many know but the Commissioner and other proponents of this hastily conceived evaluation system refuse to see -- the evaluation criteria mistakenly relies too much on TVAAS scores when they do not in fact accurately measure teacher effectiveness.

It has been stated over and over that the use of value-added at the teacher level is not appropriate: it is subject to too much variation, instability, and error. Yet when these oft-warned-about problems continue to manifest, they are ignored or excused, and other factors are scapegoated instead.

To make matters worse, the report (read it below) suggests that "school-wide value-added scores should be based on a one-year score rather than a three-year score. While it makes sense, where possible, to use three-year averages for individuals because of smaller sample sizes, school-wide scores can and should be based on one-year data."

So how did the value-added scores stack up against observations (with 1 being the lowest grade and 5 the highest)?

Are we really supposed to believe that a highly educated and trained workforce such as teachers is failing at a 24.6% rate (grades of 1 or 2)? Not even the most ardent corporate education reformer has claimed that kind of number! It becomes even more absurd when one looks at student achievement. It's hard to argue that a quarter of the workforce is substandard when student achievement is at record highs.

Instead, it seems more reasonable that a more modest 2%-3% of teachers are ineffective, and that observations by professional, experienced evaluators are accurately capturing that situation.

Sadly, nowhere in the Tennessee report is there a call for further analysis of its value-added calculations.

Teacher Evaluation in Tennessee: A Report on Year 1 Implementation

Value-Added Versus Observations

Value-Added Versus Observations, Part One: Reliability

Although most new teacher evaluations are still in various phases of pre-implementation, it’s safe to say that classroom observations and/or value-added (VA) scores will be the most heavily-weighted components toward teachers’ final scores, depending on whether teachers are in tested grades and subjects. One gets the general sense that many – perhaps most – teachers strongly prefer the former (observations, especially peer observations) over the latter (VA).

One of the most common arguments against VA is that the scores are error-prone and unstable over time – i.e., that they are unreliable. And it’s true that the scores fluctuate between years (also see here), with much of this instability due to measurement error, rather than “real” performance changes. On a related note, different model specifications and different tests can yield very different results for the same teacher/class.

These findings are very important, and often too casually dismissed by VA supporters, but the issue of reliability is, to varying degrees, endemic to all performance measurement. Actually, many of the standard reliability-based criticisms of value-added could also be leveled against observations. Since we cannot observe “true” teacher performance, it’s tough to say which is “better” or “worse,” despite the certainty with which both “sides” often present their respective cases. And, the fact that both entail some level of measurement error doesn’t by itself speak to whether they should be part of evaluations.*

Nevertheless, many states and districts have already made the choice to use both measures, and in these places, the existence of imprecision is less important than how to deal with it. Viewed from this perspective, VA and observations are in many respects more alike than different.

[readon2 url="http://shankerblog.org/?p=5621"]Continue reading part I[/readon2]

Value-Added Versus Observations, Part Two: Validity

In a previous post, I compared value-added (VA) and classroom observations in terms of reliability – the degree to which they are free of error and stable over repeated measurements. But even the most reliable measures aren’t useful unless they are valid – that is, unless they’re measuring what we want them to measure.

Arguments over the validity of teacher performance measures, especially value-added, dominate our discourse on evaluations. There are, in my view, three interrelated issues to keep in mind when discussing the validity of VA and observations. The first is definitional – in a research context, validity is less about a measure itself than the inferences one draws from it. The second point might follow from the first: The validity of VA and observations should be assessed in the context of how they’re being used.

Third and finally, given the difficulties in determining whether either measure is valid in and of itself, as well as the fact that so many states and districts are already moving ahead with new systems, the best approach at this point may be to judge validity in terms of whether the evaluations are improving outcomes. And, unfortunately, there is little indication that this is happening in most places.

Let’s start by quickly defining what is usually meant by validity. Put simply, whereas reliability is about the precision of the answers, validity addresses whether we’re using them to answer the correct questions. For example, a person’s weight is a reliable measure, but this doesn’t necessarily mean it’s valid for gauging the risk of heart disease. Similarly, in the context of VA and observations, the question is: Are these indicators, even if they can be precisely estimated (i.e., they are reliable), measuring teacher performance in a manner that is meaningful for student learning?

[readon2 url="http://shankerblog.org/?p=5670"]Continue reading part II[/readon2]

School Principals Swamped by Teacher Evaluations

"School Principals Swamped by Teacher Evaluations", that's the title of an article on an ABC News report this past weekend.

Sharon McNary believes in having tough teacher evaluations.

But these days, the Memphis principal finds herself rushing to cram in what amounts to 20 times the number of observations previously required for veteran teachers – including those she knows are excellent – sometimes to the detriment of her other duties.

"I don't think there's a principal that would say they don't agree we don't need a more rigorous evaluation system," says Ms. McNary, who is president of the Tennessee Principals Association as well as principal at Richland Elementary. "But now it seems that we've gone to [the opposite] extreme."
[...]
"There is no evidence that any of this works," says Carol Burris, a Long Island principal who co-authored an open letter of concern with more than 1,200 other principals in the state. "Our worry is that over time these practices are going to hurt kids and destroy the positive culture of our schools."
[...]
In Tennessee, the biggest complaint from many principals is simply the amount of time required from them for the new observation system. Veteran teachers, who in the past only needed to be evaluated every five years, now get four observations a year. Untenured teachers need six.

Each observation involves a complicated rubric and scoring system, discussions with the teacher before and afterward, and a written report – a total of perhaps two to four hours for each one, Ms. McNary estimates.

This last observation is one JTF talked about in one of our most popular articles.

Let's just think for a minute about these observations.

There must be 2 per year per teacher of at least 30 minutes each. 30 minutes + 30 minutes = 1 hour. 1 hour x 146,000 teachers = 146,000 hours of observation per year.

But these observers aren't just going to magically appear. They will need time to organize the observations, to get to the classes, to record their findings and to issue a report. Conservatively this adds another hour per year per teacher to the effort.

Now we are at 292,000 hours per year just for this provision alone.

If someone were to work 8 hours a day, 5 days a week, 52 weeks a year, it would take them over 140 years to complete this task. Since these observations have to be completed annually, that means we're going to need at least 140 more administrators just for this provision alone!
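For readers who want to check the arithmetic, here is a minimal sketch of the back-of-the-envelope estimate above, using the same figures from the post: 146,000 teachers, two 30-minute observations per teacher per year, roughly one additional hour of overhead per teacher, and a 2,080-hour working year.

```python
# Back-of-the-envelope sketch of the observation workload estimate above.
# All figures come from the post; the one-hour overhead is its stated
# conservative assumption for scheduling, travel, write-ups, and reporting.

TEACHERS = 146_000
OBSERVATION_HOURS_PER_TEACHER = 2 * 0.5   # two 30-minute observations per year
OVERHEAD_HOURS_PER_TEACHER = 1.0          # conservative admin overhead per teacher
WORK_HOURS_PER_YEAR = 8 * 5 * 52          # 8 hrs/day, 5 days/week, 52 weeks

total_hours = TEACHERS * (OBSERVATION_HOURS_PER_TEACHER + OVERHEAD_HOURS_PER_TEACHER)
full_time_evaluators = total_hours / WORK_HOURS_PER_YEAR

print(f"Total hours per year: {total_hours:,.0f}")                 # 292,000
print(f"Full-time evaluators needed: {full_time_evaluators:.0f}")  # ~140
```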

This realization is now dawning in Ohio too,

Nordonia Hills is one of dozens of school districts across the state that are piloting the new evaluation program -- which state education officials have been working on for the past several years.

Superintendent Joe Clark said the district has been involved in the state's move to revamp the teacher evaluation process since he came on board in 2009 as assistant superintendent. Charged with performing human resource and personnel management for the district, Clark said he felt the teacher evaluation system needed a drastic upgrade.

This year, pilot evaluations are being conducted on six teachers -- three each at Nordonia High School and Ledgeview Elementary.

Nordonia Hills has 236 teachers, according to the Department of Education. It has taken the district three years to get to the point of observing six of them.

Clark said many aspects of the program remain to be worked out. He said "student growth," one factor in the process, has yet to be specified, for example.

That student growth measure is 50% of the mandated evaluation. You can begin to see that when we say teacher evaluations are years away from completion, we're not exaggerating.

The Nordonia Hills superintendent did his own math

Clark said the process requires an evaluator -- typically the building principal or assistant principal -- to observe teachers in class twice for at least 30 minutes each time. The process also involves meetings prior to, and after each observation session.

Likewise, the new process is much more time consuming. Clark said evaluating 80 teachers at Nordonia High School would require 480 meetings.

"And that's not counting the time to write up the evaluations," Clark said, adding "How is that possible? There's only 180 school days in the year."

Teacher observations are an important and valuable tool for professional development and evaluation; few would argue otherwise. The problem is one of time and resources. HB153 was passed without any consideration of the mammoth amount of work needed to implement these corporate education reforms. Indeed, rather than adding resources, HB153 cuts almost $2 billion from public education.

It's going to be very convenient indeed for corporate education reformers to look upon this impending failure and blame everyone but themselves for not getting results. Why, it might even let them engage in more teacher and union bashing, and argue that their reforms failed because the status quo stood in the way.

New Gates Study on teacher evaluations

A new Gates study released today finds that effective teacher evaluations require high standards and multiple measures.

ABOUT THIS REPORT: This report is intended for policymakers and practitioners wanting to understand the implications of the Measures of Effective Teaching (MET) project’s interim analysis of classroom observations. Those wanting to explore all the technical aspects of the study and analysis also should read the companion research report, available at www.metproject.org.

Together, these two documents on classroom observations represent the second pair of publications from the MET project. In December 2010, the project released its initial analysis of measures of student perceptions and student achievement in Learning about Teaching: Initial Findings from the Measures of Effective Teaching Project. Two more reports are planned for mid-2012: one on the implications of assigning weights to different measures; another using random assignment to study the extent to which student assignment may affect teacher effectiveness results.

ABOUT THE MET PROJECT: The MET project is a research partnership of academics, teachers, and education organizations committed to investigating better ways to identify and develop effective teaching. Funding is provided by the Bill & Melinda Gates Foundation.

The report offers three takeaways.

High-quality classroom observations will require clear standards, certified raters, and multiple observations per teacher. Clear standards and high-quality training and certification of observers are fundamental to increasing inter-rater reliability. However, when measuring consistent aspects of a teacher’s practice, reliability will require more than inter-rater agreement on a single lesson. Because teaching practice varies from lesson to lesson, multiple observations will be necessary when high-stakes decisions are to be made. But how will school systems know when they have implemented a fair system? Ultimately, the most direct way is to periodically audit a representative sample of official observations, by having impartial observers perform additional observations. In our companion research report, we describe one approach to doing this.

Combining the three approaches (classroom observations, student feedback, and value-added student achievement gains) capitalizes on their strengths and offsets their weaknesses. For example, value-added is the best single predictor of a teacher’s student achievement gains in the future. But value-added is often not as reliable as some other measures and it does not point a teacher to specific areas needing improvement. Classroom observations provide a wealth of information that could support teachers in improving their practice. But, by themselves, these measures are not highly reliable, and they are only modestly related to student achievement gains. Student feedback promises greater reliability because it includes many more perspectives based on many more hours in the classroom, but not surprisingly, it is not as predictive of a teacher’s achievement gains with other students as value-added. Each shines in its own way, either in terms of predictive power, reliability, or diagnostic usefulness.

Combining new approaches to measuring effective teaching—while not perfect—significantly outperforms traditional measures. Providing better evidence should lead to better decisions. No measure is perfect. But if every personnel decision carries consequences—for teachers and students—then school systems should learn which measures are better aligned to the outcomes they value. Combining classroom observations with student feedback and student achievement gains on state tests did a better job than master’s degrees and years of experience in predicting which teachers would have large gains with another group of students. But the combined measure also predicted larger differences on a range of other outcomes, including more cognitively challenging assessments and student-reported effort and positive emotional attachment. We should refine these tools and continue to develop better ways to provide feedback to teachers. In the meantime, it makes sense to compare measures based on the criteria of predictive power, reliability, and diagnostic usefulness.

MET Gathering Feedback Practitioner Brief

Teacher Grades: Pass or Be Fired

We're stealing the headline from this NYT article to bring to your attention a report on the IMPACT rubric for teacher evaluation in Washington, DC. Ohio's new evaluation system, passed in the state budget, draws some of its heritage from IMPACT, so we thought it would be valuable to consider it for a moment.

Emily Strzelecki, a first-year science teacher here, was about as eager for a classroom visit by one of the city’s roving teacher evaluators as she would be to get a tooth drilled. “It really stressed me out because, oh my gosh, I could lose my job,” Ms. Strzelecki said.

Her fears were not unfounded: 165 Washington teachers were fired last year based on a pioneering evaluation system that places significant emphasis on classroom observations; next month, 200 to 600 of the city’s 4,200 educators are expected to get similar bad news, in the nation’s highest rate of dismissal for poor performance.

The evaluation system, known as Impact, is disliked by many unionized teachers but has become a model for many educators. Spurred by President Obama and his $5 billion Race to the Top grant competition, some 20 states, including New York, and thousands of school districts are overhauling the way they grade teachers, and many have sent people to study Impact.

Ohio's new system also involves each teacher receiving two 30-minute in-class observations. Education Sector, a non-profit think tank, recently produced a paper on IMPACT and took a look at some of the ways this new system has affected Washington, DC teachers. We urge you to read the paper in full, below, but we've also pulled out some of the interesting pieces to entice you.

The observations take 30 minutes—usually no more and never any less—and all but one of the administrator visits are unannounced. Based on these observations, teachers are assigned a crucial ranking, from 1 to 4. Combined with other factors, they produce an overall IMPACT score from 100 to 400, which translates into “highly effective,” “effective,” “minimally effective,” or “ineffective.” A rating of ineffective means the teacher is immediately subject to dismissal; a rating of minimally effective gives him one year to improve or be fired; effective gets him a standard contract raise; and highly effective qualifies him for a bonus and an invitation to a fancy award ceremony at the Kennedy Center.
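To make the mechanics of the rubric concrete, here is a minimal sketch of how an overall IMPACT score might map to a rating and its consequence. The consequences come straight from the excerpt above; the 100-400 band cutoffs are illustrative placeholders we made up, since the excerpt does not give the official thresholds.

```python
# Minimal sketch of the IMPACT rating logic described in the excerpt.
# The consequences are taken from the excerpt; the score cutoffs below
# are illustrative placeholders, NOT the official DCPS thresholds.

CONSEQUENCES = {
    "highly effective": "bonus and invitation to the Kennedy Center award ceremony",
    "effective": "standard contract raise",
    "minimally effective": "one year to improve or be fired",
    "ineffective": "immediately subject to dismissal",
}

def rate(overall_score: float) -> str:
    """Translate a 100-400 overall IMPACT score into a rating (placeholder cutoffs)."""
    if overall_score >= 350:
        return "highly effective"
    if overall_score >= 250:
        return "effective"
    if overall_score >= 175:
        return "minimally effective"
    return "ineffective"

score = 278
print(rate(score), "->", CONSEQUENCES[rate(score)])
```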

It is a measure of how weak and meaningless observations used to be that these pop visits can fill teachers, especially the less experienced ones, with the anxiety of a 10th-grader assigned an impromptu essay on this week’s history unit for a letter grade. The stress can show up in two ways—the teacher chokes under the pressure, thereby earning a poor score, or she changes her lesson in a way that can stifle creativity and does not always serve students. Describing these observations, IMPACT detractors use words like “humiliating,” “infantilizing,” “paternalistic,” and “punitive.” “It’s like somebody is always looking over your shoulder,” said a high school teacher who, like most, did not wish to be named publicly for fear of hurting her career.

[…]

“Out of 22 students, I have five non-readers, eight with IEPs [individual educational plans, which are required by federal law for students with disabilities], and no co-teacher,” says the middle school teacher. “The observers don’t know that going in, and there is no way of equalizing those variables.”

[…]

Bill Rope is not young, or particularly bubbly, but he is a respected teacher who sees this unusual relationship from the confident perspective of an older man who went into education after a 30-year career in the foreign service. Rope, who now teaches third grade at Hearst Elementary School in an affluent neighborhood of Northwest D.C., was rated “highly effective” last year and awarded a bonus that he refused to accept in a show of union solidarity.

But a more recent evaluation served to undermine whatever validation the first one may have offered. In the later one, a different master educator gave him an overall score of 2.78—toward the low end of “effective.”

[…]

So how did it all shake out? At the end of IMPACT’s first year, 15 percent of teachers were rated highly effective, 67 percent were judged effective, 16 percent were deemed minimally effective, and 2 percent were rated ineffective and fired.

[…]

Theoretically, a teacher’s value-added score should show a high correlation with his rating from classroom observations. In other words, a teacher who got high marks on performance should also see his students making big gains. And yet DCPS has found the correlation between these two measures to be only modest, with master educators’ evaluations only slightly more aligned with test scores than those of principals.

Impact Report Release