New Gates Study on Teacher Evaluations

A new Gates study released today finds that effective teacher evaluations require high standards and multiple measures.

ABOUT THIS REPORT: This report is intended for policymakers and practitioners wanting to understand the implications of the Measures of Effective Teaching (MET) project’s interim analysis of classroom observations. Those wanting to explore all the technical aspects of the study and analysis also should read the companion research report, available at

Together, these two documents on classroom observations represent the second pair of publications from the MET project. In December 2010, the project released its initial analysis of measures of student perceptions and student achievement in Learning about Teaching: Initial Findings from the Measures of Effective Teaching Project. Two more reports are planned for mid-2012: one on the implications of assigning weights to different measures; another using random assignment to study the extent to which student assignment may affect teacher effectiveness results.

ABOUT THE MET PROJECT: The MET project is a research partnership of academics, teachers, and education organizations committed to investigating better ways to identify and develop effective teaching. Funding is provided by the Bill & Melinda Gates Foundation.

The report offers three takeaways.

High-quality classroom observations will require clear standards, certified raters, and multiple observations per teacher. Clear standards and high-quality training and certification of observers are fundamental to increasing inter-rater reliability. However, when measuring consistent aspects of a teacher’s practice, reliability will require more than inter-rater agreement on a single lesson. Because teaching practice varies from lesson to lesson, multiple observations will be necessary when high-stakes decisions are to be made. But how will school systems know when they have implemented a fair system? Ultimately, the most direct way is to periodically audit a representative sample of official observations, by having impartial observers perform additional observations. In our companion research report, we describe one approach to doing this.
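
To make the point about multiple observations concrete, here is a rough sketch using the Spearman-Brown formula from classical test theory, which predicts the reliability of an average of several equally reliable observations. This is only an illustration, not the method used in the MET report: the function name is hypothetical, the 0.35 single-lesson reliability is a made-up figure, and the formula assumes every observation is equally reliable with independent errors, which real classroom observations only approximate.

```python
def reliability_of_average(single_obs_reliability: float, n_observations: int) -> float:
    """Spearman-Brown prediction for the reliability of the average of
    n_observations equally reliable observations (an idealized assumption)."""
    if not 0.0 <= single_obs_reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if n_observations < 1:
        raise ValueError("need at least one observation")
    r = single_obs_reliability
    return n_observations * r / (1 + (n_observations - 1) * r)


if __name__ == "__main__":
    # 0.35 is a hypothetical single-lesson reliability, chosen only for illustration.
    for n in (1, 2, 4):
        print(n, round(reliability_of_average(0.35, n), 2))
```

Under these assumptions, each added observation raises the reliability of the averaged score, but by less each time, which is one reason a system might settle on a handful of observations per teacher rather than dozens.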

Combining the three approaches (classroom observations, student feedback, and value-added student achievement gains) capitalizes on their strengths and offsets their weaknesses. For example, value-added is the best single predictor of a teacher’s student achievement gains in the future. But value-added is often not as reliable as some other measures and it does not point a teacher to specific areas needing improvement. Classroom observations provide a wealth of information that could support teachers in improving their practice. But, by themselves, these measures are not highly reliable, and they are only modestly related to student achievement gains. Student feedback promises greater reliability because it includes many more perspectives based on many more hours in the classroom, but not surprisingly, it is not as predictive of a teacher’s achievement gains with other students as value-added. Each shines in its own way, either in terms of predictive power, reliability, or diagnostic usefulness.

Combining new approaches to measuring effective teaching, while not perfect, significantly outperforms traditional measures. Providing better evidence should lead to better decisions. No measure is perfect. But if every personnel decision carries consequences for teachers and students, then school systems should learn which measures are better aligned to the outcomes they value. Combining classroom observations with student feedback and student achievement gains on state tests did a better job than master’s degrees and years of experience in predicting which teachers would have large gains with another group of students. But the combined measure also predicted larger differences on a range of other outcomes, including more cognitively challenging assessments and student-reported effort and positive emotional attachment. We should refine these tools and continue to develop better ways to provide feedback to teachers. In the meantime, it makes sense to compare measures based on the criteria of predictive power, reliability, and diagnostic usefulness.

MET Gathering Feedback Practitioner Brief