The Rise and Fall of the Teacher Evaluation Reform Empire

Teacher evaluation reform during the late 2000s and 2010s was one of the fastest and most widespread education policy changes in recent history. Thanks mostly to Race to the Top and ESEA “waivers,” over a period of about 10 years, the vast majority of the nation’s school districts installed new teacher evaluations. These new systems were quite different from their predecessors in terms of design, with 3-5 (rather than dichotomous) rating categories incorporating multiple measures (including some based on student testing results). And, in many states, there were varying degrees of rewards and/or consequences tied to the ratings (Steinberg and Donaldson 2016).

A recent working paper offers what is to date the most sweeping assessment of the impact of teacher evaluation reform on student outcomes, with data from 44 states and D.C. As usual, I would encourage you to read the whole paper (here's an earlier ungated version released in late 2021). It is terrific work by a great team of researchers (Joshua Bleiberg, Eric Brunner, Erika Harbatkin, Matthew Kraft, and Matthew Springer), and I’m going to describe the findings only superficially. We’ll get into a little more detail below, but the long and short of it is that evaluation reform had no statistically detectable aggregate effect on student test scores or attainment (i.e., graduation or college enrollment).

This timely analysis, in combination with the research on evaluations over the past few years, provides an opportunity to look back on this enormous reform effort, and whether and how states and districts might move forward.

What's the verdict on evaluation reform?

I would be uncomfortable slapping a label like “success” or “failure” on most any recent national reform, especially one in which the design and implementation of the policy varied so much, within and between states, among thousands of heterogeneous districts. As with charter schools, I am just as interested in why evaluations worked in some places but not others. And this is where we will end up in this post. But, first, we should be realistic about the reductive political environment in education, and what that means.

National evaluation reform has not been successful. This Bleiberg et al. paper is a huge contribution, and it is not the first strong analysis of multiple locations that finds no aggregate effect of the new systems (e.g., Stecher et al. 2016). So, just to be clear, evaluation reform, on average, has not been successful in improving measurable student outcomes. 

The evidence on teacher evaluations overall is quite mixed… On the other hand, there are several other analyses that find mixed or positive effects of newer and older teacher evaluation systems on various outcomes (and that includes, of course, this Bleiberg et al. paper, in which there were some states and districts that showed results). Some of these systems that have fared better are the higher-stakes models, whereas others are lower-stakes, largely observation- and feedback-based systems (Taylor and Tyler 2012; Dee and Wyckoff 2015; Steinberg and Sartain 2015; Adnot et al. 2016; Kraft et al. 2020; Cullen et al. 2021; Phipps and Wiseman 2021; Song et al. 2021). There’s also potentially relevant research on other forms of accountability systems for teachers (Pham et al. 2020), as well as those for schools (Figlio and Loeb 2011). So, there is some good evidence out there, but it is far from perfectly consistent, and it is still outweighed by what we don’t know about teacher evaluations (including, most crucially, why systems do or do not work).

…but implicit overpromising, combined with our short-sighted policy analysis environment, meant that evaluations needed to work quickly or be judged a failure. The rush and rigidity of reform were justified, at least implicitly, by suggestions that the systems would generate large, positive effects on short-term testing results. We have the benefit of hindsight here, but even at the time it seemed questionable that a goal like improving the teacher effectiveness distribution at scale could be accomplished in just a few years (and even positive initial results would have to be sustained). Bleiberg et al.’s results are based on no more than 6 years of post-implementation data in virtually all states, and 4 or fewer years in about three-quarters of states. This is long enough to be a very important “checkpoint” but not long enough to render final judgment and forget the whole thing. Right now, we should be interpreting this important paper in the context of the research literature, trying to understand the variation in results, asking educators about their experiences and suggestions, and figuring out how to adjust the systems to move forward. Instead, thanks in no small part to the inflation of expectations and our “produce short-term gains or perish” incentive structure, this massive education reform effort almost feels like a very expensive bet (and for educators yet another fad solution that was all the rage until it wasn't).

No matter your view of evaluation reform, this is not a healthy environment, particularly in a field where even the most effective policy changes have smaller, slower actual effects than is generally acknowledged. So long as every policy needs to harvest quick testing gains to be considered successful, there won’t be many acknowledged successes, many potentially successful policies won’t be tried, and those that are tried will be in danger of being shut down prematurely.

Why is it that evaluations can work, but didn’t work?

This, of course, is a key question. One will find no shortage of speculation (including some below from me). These are just (over)generalizations, but opponents of the new systems say they failed because they focused on firing teachers over helping them improve, or because they relied too heavily on test-based teacher performance measures. Supporters, in contrast, point to the lack of high-stakes decisions attached to the ratings, or to the trivial amount of differentiation in those ratings.

In retrospect, the cost of rushing wasn’t worth it. National reform on any timeline is going to be difficult. Yet the push for remarkably rapid change in teacher evaluations—the strong incentives to pass laws and get the systems up and running within a couple of years—proceeded despite scarce evidence on how to design and implement those systems. This left district administrators and educators scrambling to erect new data infrastructure, design new measures, and build systems to comply with state laws (while also dealing with lingering budget problems, Common Core adoption/implementation, and, you know, their day jobs of running schools). States and districts also had to decide whether and how to use these new evaluation ratings in high-stakes decisions, in most cases before they even got the first couple of rounds of results.1

Maybe a more gradual approach—with multi-year pilot programs, greater efforts to train administrators and get educators on board, phase-in, more funding, realistic expectations, etc.—would have worked better, maybe not. All reforms proceed with imperfect information. But I think it’s at least fair to say that the process was not the kind of slow, careful policymaking and testing that characterizes effective interventions at scale. And the warning signs were there. Good policymaking (and good research) takes time.

The underlying disappointment in this paper is its finding no association between (state requirements of) system features and estimated effects. This is a bit of an early gut punch to those of us who, while skeptical of how evaluation reform went down in most places, were at least looking forward to results exploiting the enormous variation in system design within and between states. This might have helped to identify, even descriptively, the features driving variation in the results. Bleiberg and colleagues looked at 10 system features sorted into three (non-mutually exclusive) groups focused on measurement (e.g., do ratings weight test-based measures 20-50 percent?), incentives (e.g., are ratings tied to tenure decisions?), and feedback/PD (e.g., do ratings help trigger coaching or PD?). States that required more of these 10 features didn't do any better than states that required fewer, and estimated effects were not different among states that required a minimum number of features in each group versus those that did not (see Appendix Table B3 in the paper). They did, however, find positive results across a set of six locations identified beforehand as “exemplary” systems.2

There may be more evidence on these associations going forward (it's really tough to unpack these effects). In the meantime, the authors, to their credit, discuss several strong alternative possibilities for the failure of evaluation reform. These include: limitations of federal incentives to shape system design; financial and time constraints; lack of buy-in from districts and educators; variation in the labor supply (to replace leaving teachers); and the failure to offset greater risk with increased rewards. I’m very receptive to all these possibilities. I wrote about most of them at the time, but I also wrote far more about system design details that seem far less important in retrospect.

What about D.C.’s IMPACT system? The most common basis for the argument that the new evaluations should have been “tougher” is the evidence of the positive effects of the IMPACT evaluation system in the District of Columbia Public Schools (Dee and Wyckoff 2015; Adnot et al. 2016; Dee et al. 2021; Dotter et al. 2021; Phipps and Wiseman 2021). This system was installed rather quickly, with ratings attached to high-stakes decisions, including large salary bonuses and dismissals. I think the evidence on IMPACT is really interesting and important, particularly since virtually all of the analyses focus on the mechanisms underlying improvement. But I disagree with the idea that all states should have adopted this system, and this discussion needs its own space in a future post. In the meantime, let’s just frame the issue here in terms of whether a stronger version of the higher-stakes approach would have worked nationally.

There’s shaky ground for strong arguments that one particular approach to new evaluations would have changed the results at scale. This includes the lower-stakes, feedback-reliant approach as well as the higher-stakes model with more differentiation, firing, and monetary incentives. Many advocates of the latter approach think the distributions of ratings in most states were unrealistic, and that the stakes in all states should have been higher. I get this viewpoint (particularly the differentiation part). It’s plausible that ramping things up might have produced better early results in some districts. But there were many locations where these things happened without any detectable aggregate impact. 

And I am also unsatisfied by the idea that we can judge evaluation systems a failure based solely on the proportion of teachers rated below satisfactory. First, to reiterate, there is ample evidence that evaluations (observations) can work with no ratings categories at all. Second, teachers may respond to receiving the second highest (e.g., “effective”) rather than the highest rating (e.g., “highly effective”). Third, given the rush to design and install the new systems, and the uncertainty as to their quality, the hesitation of principals to give their teachers low observation scores during the first few years is hardly surprising; it might even seem prudent, depending on your perspective. Fourth, and most importantly, none of this matters if incentives aren't accompanied by actionable feedback as to how teachers can improve (both elements matter), and we really have no idea how common this is (for example, only 14 states require teachers to be observed more than once per year).

On the flip side, there are also those who dismiss the importance of incentives and differentiation, and that too is not supportable. Most likely, the optimal model varies within and between states, perhaps in counterintuitive ways (e.g., in some respects, the higher stakes model seems better-suited for more affluent districts, where, for example, turnover tends to be lower and higher salaries can offset the risks). In any case, I hope we have these sorts of discussions going forward, which brings me to my final question.

Can teacher evaluation reform be salvaged?

The large-scale reform ship has likely sailed, possibly for a while. But I certainly don’t think that means states and districts necessarily need to cut their losses, at least not completely. Improving teacher performance is still the most promising end goal for improving student outcomes. Whether your preferred approach is paying teachers more, giving them feedback and coaching to improve their practice, deselection, or something else, evaluations are an important tool. So one needn’t go down the baby/bathwater route here.

One potentially useful key going forward is the importance of credibility and voluntary choices. It bears remembering that teacher (and other employee) evaluations affect outcomes as part of larger accountability systems, and these systems work by changing behavior. This takes one of two forms: 1) changing who is a teacher through recruitment, retention, and dismissals (i.e., composition); and 2) improving the performance of existing teachers (e.g., via feedback, effort, etc.). Note, first, that this menu of possible avenues to improvement requires the system to be credible. If employees don’t “buy in” to the system (if they view dismissals and bonuses as a kind of lottery and the feedback they receive as useless), everything can fall apart (Feeney 2007; Cherasaro et al. 2016). Second, whereas the debate about evaluation reform was so often dominated by the firing option, virtually all of the improvement mechanisms above (and, I would argue, the most potentially powerful ones) are actually voluntary behavioral changes.

We can’t redo national evaluation reform, but states and districts can improve the systems they do have. [Optimism=on.] Perhaps now is a good time to take things slow and steady and learn how to build the foundation for teacher accountability systems that have a better shot of leading to (realistic) improvement in student outcomes. In many districts, this will entail relying on largely informational mechanisms to take hold with limited formal rewards and consequences. I understand that some find this approach too “soft,” and, again, there is evidence that feedback and stakes can work in tandem (Phipps and Wiseman 2021). That said, the informational factors are at least necessary (if not always sufficient), and everything, including the potential impact of formal accountability pressures, depends on them; if they are established first, formal pressures may even fare better. And I think this context may also be good for achieving greater differentiation in ratings.

And there is a growing evidence base here upon which to build. For instance, just to name a couple of areas, there is consistent evidence that principals (their training, the time and resources they have to conduct observations, the culture they create) are vital to the success of evaluations and accountability systems (Herlihy et al. 2014; Steinberg and Sartain 2015; Kraft and Gilmour 2016; Fryer 2017; Donaldson and Woulfin 2018; Cohen et al. 2019; Kraft and Christian 2022). There is also some indication that evaluations are effective when teachers are observing their colleagues (Taylor and Tyler 2012; Papay et al. 2020; Burgess et al. 2021).

On the other hand, there is clearly much work to be done on pretty much all key fronts, and a lot of open questions. I acknowledge the long odds of many districts taking on formal efforts to improve their evaluations (especially given the overall political environment right now). And it's easier to resort to tribalism when there's so little evidence as to how and why evaluations work. But the new systems are still in force, affecting teachers and students, even if we stop talking about them, and research is going to keep coming, probably for quite some time. We talked so much about teacher quality for 15 years; surely we're not going to walk away now that we're getting good evidence about how (or how not) to improve it.


1 Value-added and other growth models—i.e., test-based teacher performance measures derived from complicated statistical models that attempt to isolate teachers’ causal effect on their students’ testing progress—dominated the debate about and process of evaluation reform. But this is another issue that I think deserves its own post at some point in the future.

2 In supplemental models, Bleiberg et al. compare the results of six “exemplary” evaluation systems with the rest of the nation. These locations include four districts (Dallas, Denver, D.C., and Newark) and two states (New Mexico and Tennessee) chosen in 2018 by the National Council on Teacher Quality (NCTQ) based on their: 1) having a host of different features NCTQ deemed desirable (with something for everyone); and 2) being judged as having already produced early results on several different preliminary descriptive measures, including teacher surveys and attrition by rating distributions (Putman et al. 2018). In these locations, Bleiberg and co-authors find positive estimated effects in year two and beyond, but they are very careful to avoid interpreting these estimates as causal given that the locations were identified beforehand based on performance.
