The first in a new series of two-page briefs summarizing the state of play in education policy research offers suggestions for policymakers designing teacher evaluation systems.

The paper is written by Dr. William Mathis, managing director of the National Education Policy Center, housed at the University of Colorado Boulder School of Education.

Teachers are important, and policies mandating high-stakes evaluations of teachers are at the forefront of popular school reforms. Today’s dominant approach labels teachers as effective or ineffective based in large part on a statistical analysis of students’ test-score performance. Teachers judged effective are rewarded, and those found ineffective are sanctioned.

While such summative evaluations can be useful, lawmakers should be wary of approaches based in large part on test scores: the error in the measurements is large, which results in many teachers being incorrectly labeled as effective or ineffective; relevant test scores are not available for the students taught by most teachers, given that only certain grade levels and subject areas are tested; and the incentives created by high-stakes use of test scores drive undesirable teaching practices such as curriculum narrowing and teaching to the test.

Summative initiatives should also be balanced with formative approaches, which identify strengths and weaknesses of teachers and directly focus on developing and improving their teaching. Measures that de-emphasize test scores are more labor intensive but have far greater potential to enrich instruction and improve education.

The paper goes on to give some key research points and advice for policymakers:

If the objective is improving educational practice, formative evaluations that guide a teacher’s improvement provide greater benefits than summative evaluations.
If the objective is to improve educational performance, outside-school factors must also be addressed. Teacher evaluation cannot replace or compensate for these much stronger determinants of student learning. The importance of these outside-school factors should also caution against policies that simplistically attribute student test scores to teachers.
The results produced by value-added (test-score growth) models alone are highly unstable. They vary from year to year, from classroom to classroom, and from one test to another. Substantial reliance on these models can lead to practical, ethical and legal problems.
High-stakes evaluations based in substantial part on students’ test scores narrow the curriculum by diminishing or pushing out non-tested subjects, knowledge, and skills.
Teacher evaluation systems necessarily involve trade-offs, and specific design choices are controversial, so it is important to involve all key stakeholders in system design or selection.
To be successful, schools must invest in their teacher evaluation systems. An adequate number of highly trained evaluators must be available.
Given the wide variety of teacher roles and the many factors that influence learning that are outside the control of the teacher, a wide variety of measures of teacher effectiveness is also indicated. By diversifying, the weakness of any single measure is offset by the strengths of another.
High-quality research on existing evaluative programs and tools should inform the design of teacher evaluation systems. States and districts should investigate balanced models such as PAR and the Danielson Framework, closely examine the evidence concerning strengths and weaknesses of each model, and never attach high-stakes consequences to teachers which the evidence cannot validly support.

The paper can be read in full below:

Research-Based Options for Education Policy Making

Last month the state released a preliminary look at its new school rankings list. After digesting this list and its construction, people are asking interesting questions and observing uncomfortable patterns.

Former state legislator and former State Board of Education member Colleen Grady actually calls these performance index rankings “the most confusing and least useful of the accountability ratings, lists and rankings” because:

The PI calculation is based on passage rates of Ohio Achievement Assessments (grades 3–8) and the Ohio Graduation Test (grades 10 and 11). The proficiency “cut scores” are so low that students can be determined “proficient” even when they answer less than 50% of test questions correctly.
The PI calculation gives schools and districts “partial” credit for students who fail to meet the proficient standard.
The PI calculation does not include a growth component. Districts and schools can be highly ranked even if students are learning little from year to year.
The PI is a clumsy instrument that does not allow the average person to distinguish the true performance of districts. For example, 50 districts have PI scores of 100.XXXX [with the X’s representing the digits after the decimal point]. Is there any real difference in performance between the district ranked 210 of 611 or 260 of 611 districts?

Indeed, given the somewhat arbitrary nature of the weightings in the PI calculation, how much of the variation in these scores is a consequence of those design choices?
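To make that question concrete, here is a small hypothetical sketch in Python. The two districts, their student shares, and both sets of weights are invented for illustration; none of them are the state's actual figures or its actual formula. The sketch only assumes that an index of this kind is a weighted sum of the share of students at each achievement level, and shows that the choice of weights alone can reverse which district ranks higher.

```python
# Hypothetical illustration: how the weighting scheme alone can reorder districts.
# All numbers below are invented; they are NOT actual state data or weights.

LEVELS = ["limited", "basic", "proficient", "accelerated", "advanced"]

# Share of tested students at each achievement level (each set sums to 1.0).
district_a = {"limited": 0.20, "basic": 0.00, "proficient": 0.60,
              "accelerated": 0.10, "advanced": 0.10}
district_b = {"limited": 0.00, "basic": 0.25, "proficient": 0.45,
              "accelerated": 0.10, "advanced": 0.20}

# Scheme 1 (assumed): strict pass/fail, no partial credit below proficient.
pass_fail_weights = {"limited": 0.0, "basic": 0.0, "proficient": 1.0,
                     "accelerated": 1.0, "advanced": 1.0}

# Scheme 2 (assumed): partial credit below proficient, bonus credit above it.
partial_credit_weights = {"limited": 0.3, "basic": 0.6, "proficient": 1.0,
                          "accelerated": 1.1, "advanced": 1.2}


def performance_index(shares, weights):
    """Weighted sum of student shares, scaled so all-proficient scores 100."""
    return 100 * sum(shares[level] * weights[level] for level in LEVELS)


def rank_districts(districts, weights):
    """Return district names ordered from highest to lowest index."""
    return sorted(districts,
                  key=lambda name: performance_index(districts[name], weights),
                  reverse=True)


if __name__ == "__main__":
    districts = {"District A": district_a, "District B": district_b}
    for label, weights in [("pass/fail", pass_fail_weights),
                           ("partial credit", partial_credit_weights)]:
        print(label, rank_districts(districts, weights))
```

Under the pass/fail weights, the invented District A comes out ahead because more of its students clear the proficient bar. Under the partial-credit weights, District B comes out ahead because its non-proficient students are near misses and it has more students at the top level. The students are identical in both cases; only the design choice changed.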

The most disturbing result, however, is this:

Shocker: Poverty Hurts Ranking

In general, districts’ rankings are directly related to how many low-income students they enroll. Even just looking at the rankings of urban school districts, for most (but not all) of the districts in the top 25 percent, less than half of their students are from low-income families.

There are about twelve months before these preliminary results become real ones. One can hope that some of these design problems and errata are resolved by then, but we're not optimistic.