The Rise and Fall of the Teacher Evaluation Reform Empire

Teacher evaluation reform during the late 2000s and 2010s was one of the fastest and most widespread education policy changes in recent history. Thanks mostly to Race to the Top and ESEA “waivers,” over a period of about 10 years, the vast majority of the nation’s school districts installed new teacher evaluation systems. These new systems were quite different from their predecessors in design, with three to five rating categories (rather than dichotomous ratings) incorporating multiple measures (including some based on student testing results). And, in many states, there were varying degrees of rewards and/or consequences tied to the ratings (Steinberg and Donaldson 2016).

A recent working paper offers what is to date the most sweeping assessment of the impact of teacher evaluation reform on student outcomes, with data from 44 states and D.C. As usual, I would encourage you to read the whole paper (here's an earlier ungated version released in late 2021). It is terrific work by a great team of researchers (Joshua Bleiberg, Eric Brunner, Erika Harbatkin, Matthew Kraft, and Matthew Springer), and I’m going to describe the findings only superficially. We’ll get into a little more detail below, but the long and short of it is that evaluation reform had no statistically detectable aggregate effect on student test scores or attainment (i.e., graduation or college enrollment).

This timely analysis, in combination with the research on evaluations over the past few years, provides an opportunity to look back on this enormous reform effort and to consider whether and how states and districts might move forward.

Why Teacher Evaluation Reform Is Not A Failure

The RAND Corporation recently released an important report on the impact of the Gates Foundation’s “Intensive Partnerships for Effective Teaching” (IPET) initiative. IPET was a very thorough and well-funded attempt to improve teaching quality in schools in three districts and four charter management organizations (CMOs). The initiative was multi-faceted, but its centerpiece was the implementation of multi-measure teacher evaluation systems and the linking of ratings from those systems to professional development and high-stakes personnel decisions, including compensation, tenure, and dismissal. This policy, particularly the inclusion of test-based productivity measures (e.g., value-added scores) in teacher evaluations, has been among the most controversial issues in education policy throughout the past 10 years.

The report is extremely rich and there are a lot of interesting findings in it, so I would encourage everyone to read it themselves (at least the executive summary), but the headline finding was that IPET had no discernible effect on student outcomes, namely test scores and graduation rates, in the districts that participated, vis-à-vis similar districts that did not. Given that IPET was so thoroughly designed and implemented, and that it was well-funded, it can potentially be viewed as a “best case scenario” test of the type of evaluation reform that most states have enacted. Accordingly, critics of these reforms, who typically focus their opposition on the high-stakes use of evaluation measures, particularly value-added and other test-based measures, have portrayed the findings as vindication of their opposition.

This reaction has merit. The most important reason is that evaluation reform was portrayed by advocates as a means to immediate and drastic improvements in student outcomes. This promise was misguided from the outset, and evaluation reform opponents are (and were) correct to point this out. At the same time, however, it would be wise not to dismiss evaluation reform as a whole, for several reasons, a few of which are discussed below.

What Happened To Teacher Quality?

Starting around 2005 and up until a few years ago, education policy discourse and policymaking were dominated by the issue of improving “teacher quality.” We haven’t heard much about it over the past couple of years, or at least not nearly as much. One of the major reasons is that the vast majority of states have enacted policies ostensibly designed to improve teacher quality.

Thanks in no small part to the Race to the Top grant program, and the subsequent ESEA waiver program, virtually all states reformed their teacher evaluation systems, the “flagship” policy of the teacher quality push. Many of these states also tied their new evaluation results to high-stakes personnel decisions, such as tenure, dismissal, layoffs, and compensation. Predictably, the details of these new systems vary quite a bit, both within and between states. Many advocates are dissatisfied with how the new policies were designed, and one could write a book on all the different issues. Yet it would be tough to deny that this national policy effort was among the fastest shifts in recent educational history, particularly given the controversy surrounding it.

So, what happened to all the attention to teacher quality? It was put into practice. The evidence on its effects is already emerging, but this will take a while, and so it is still a quiet time in teacher quality land, at least compared with the previous 5-7 years. Even so, there are already many lessons out there, too many for one post. Looking back, though, one big-picture lesson – and definitely not a new one – is how the evaluation reform effort stands out (in a very competitive field) for the degree to which it was driven by the promise of immediate, large results.

Teacher Evaluations And Turnover In Houston

We are now entering a period in which we may start to see many studies released about the impact of new teacher evaluations. This incredibly rapid policy shift, perhaps the centerpiece of the Obama Administration’s education efforts, was sold largely on the basis of illustrations of the importance of teacher quality.

The basic argument was that teacher effectiveness is perhaps the most important factor under schools’ control, and that the best way to improve that effectiveness was to identify and remove ineffective teachers via new teacher evaluations. Without question, there was a logic to this approach, but dismissing or compelling the exits of low-performing teachers does not occur in a vacuum. Even if a given policy causes more low performers to exit, the effects of this shift can be attenuated by turnover among higher performers, not to mention other important factors, such as the quality of applicants (Adnot et al. 2016).
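
To make this attenuation dynamic concrete, here is a minimal stylized sketch in Python. All of the effectiveness scores and headcounts are invented for illustration; they do not come from the Houston study or from any other paper discussed here.

```python
# Stylized illustration of how the composition of turnover shapes the
# net effect of a dismissal policy on average teacher effectiveness.
# All scores and headcounts here are hypothetical.

def avg(scores):
    """Mean effectiveness score of a workforce."""
    return sum(scores) / len(scores)

# Baseline workforce: 20 low performers (score 40), 80 high performers (score 60).
baseline = [40] * 20 + [60] * 80
print(f"baseline average:         {avg(baseline):.1f}")        # 56.0

# The policy removes 10 low performers; replacements are average hires (score 50).
low_exits_only = [40] * 10 + [50] * 10 + [60] * 80
print(f"low-performer exits only: {avg(low_exits_only):.1f}")  # 57.0

# But if 5 high performers also leave and are replaced by average hires,
# half of the gain disappears.
exits_in_both = [40] * 10 + [50] * 15 + [60] * 75
print(f"exits among both groups:  {avg(exits_in_both):.1f}")   # 56.5
```

In this toy example, concurrent exits among high performers erase half of the improvement from dismissing low performers, before even considering applicant quality.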

A new NBER working paper by Julie Berry Cullen, Cory Koedel, and Eric Parsons addresses this dynamic directly by looking at the impact on turnover of a new evaluation system in Houston, Texas. It is an important piece of early evidence on one new evaluation system, but the results also speak more broadly to how these systems work.

New Teacher Evaluations And Teacher Job Satisfaction

Job satisfaction among teachers is a perennially popular topic of conversation in education policy circles. There is good reason for this. For example, whether or not teachers are satisfied with their work has been linked to their likelihood of changing schools or professions (e.g., Ingersoll 2001).

Yet much of the discussion of teacher satisfaction consists of advocates’ speculation that their policy preferences will make for a more rewarding profession, whereas opponents’ policies are sure to disillusion masses of educators. This was certainly true of the debate surrounding the rapid wave of teacher evaluation reform over the past ten or so years.

A paper just published in the American Educational Research Journal directly addresses the impact of new evaluation systems on teacher job satisfaction. It is, therefore, not only among the first analyses to examine the impact of these systems, but also the first to look at their effect on teachers’ attitudes.

Social And Emotional Skills In School: Pivoting From Accountability To Development

Our guest authors today are David Blazar and Matthew A. Kraft. Blazar is a Lecturer on Education and Postdoctoral Research Fellow at Harvard Graduate School of Education and Kraft is an Assistant Professor of Education and Economics at Brown University.

With the passage of the Every Student Succeeds Act (ESSA) in December 2015, Congress required that states select a nonacademic indicator with which to assess students’ success in school and, in turn, hold schools accountable. We believe that broadening what it means to be a successful student and school is good policy. Students learn and grow in multifaceted ways, only some of which are captured by standardized achievement tests. Measures of students’ effort, initiative, and behavior also are key indicators of their long-term success (see here). Thus, by gathering data on students’ progress on a range of measures, both academic and what we refer to as “social and emotional” development, teachers and school leaders may be better equipped to help students improve in these areas.

In the months following the passage of ESSA, questions about the use of social and emotional skills in accountability systems have dominated the debate. What measures should districts use? Is it appropriate to use these measures in high-stakes settings if they are susceptible to potential biases and can be easily coached or manipulated? Many others have written about this important topic before us (see, for example, here, here, here, and here). Like some of them, we agree that including measures of students’ social and emotional development in accountability systems, even with very small associated weights, could serve as a strong signal that schools and educators should value and attend to developing these skills in the classroom. We also recognize concerns about the use of measures that really were developed for research purposes rather than for large-scale, high-stakes testing with repeated administrations.

The Details Matter In Teacher Evaluations

Throughout the process of reforming teacher evaluation systems over the past 5-10 years, perhaps the most contentious and widely discussed issue was the importance, or weight, assigned to different components. Specifically, there was a great deal of debate about the proper weight to assign to test-based teacher productivity measures, such as estimates from value-added and other growth models.

Some commentators, particularly those more enthusiastic about test-based accountability, argued that the new teacher evaluations somehow were not meaningful unless value-added or growth model estimates constituted a substantial proportion of teachers’ final evaluation ratings. Skeptics of test-based accountability, on the other hand, tended toward a rather different viewpoint – that test-based teacher performance measures should play little or no role in the new evaluation systems. Moreover, virtually all of the discussion of these systems’ results, once they were finally implemented, focused on the distribution of final ratings, particularly the proportions of teachers rated “ineffective.”

A recent working paper by Matthew Steinberg and Matthew Kraft directly addresses and informs this debate. Their very straightforward analysis shows just how consequential these weighting decisions are for the distribution of final ratings, as are choices of where to set the cutpoints for final rating categories (e.g., how many points a teacher needs to receive an “effective” versus an “ineffective” rating).
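
To get a feel for why these design choices matter so much, here is a brief illustrative sketch in Python. The simulated scores, weights, and cutpoints are all hypothetical; they are not taken from the Steinberg and Kraft paper or from any actual evaluation system.

```python
# Illustrative sketch of how component weights and rating cutpoints shape
# the distribution of final evaluation ratings. All scores, weights, and
# cutpoints are hypothetical, not taken from any actual system.
import random

random.seed(1)

# Simulate 1,000 teachers on a 0-100 scale: observation scores cluster
# high (as they typically do in practice), while value-added estimates
# are more dispersed.
teachers = [
    (min(100, max(0, random.gauss(80, 8))),   # observation score
     min(100, max(0, random.gauss(50, 20))))  # value-added score
    for _ in range(1000)
]

def pct_effective(va_weight, cutpoint):
    """Share of teachers whose weighted composite clears the cutpoint."""
    composites = [(1 - va_weight) * obs + va_weight * va
                  for obs, va in teachers]
    return 100 * sum(c >= cutpoint for c in composites) / len(composites)

for va_weight in (0.2, 0.5):
    for cutpoint in (60, 70):
        share = pct_effective(va_weight, cutpoint)
        print(f"VA weight {va_weight:.0%}, cutpoint {cutpoint}: "
              f"{share:.0f}% rated 'effective'")
```

Holding the simulated teachers fixed, raising the value-added weight or moving the cutpoint by ten points swings the share rated “effective” from nearly all teachers to only a minority, which is essentially the kind of sensitivity Steinberg and Kraft document using real systems.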

Teachers' Opinions Of Teacher Evaluation Systems

The primary test of the new teacher evaluation systems implemented throughout the nation over the past 5-10 years is whether they improve teacher and ultimately student performance. Although the kinds of policy evaluations that will address these critical questions are just beginning to surface (e.g., Dee and Wyckoff 2015), among the most important early indicators of how well the new systems are working is their credibility among educators. Put simply, if teachers and administrators don’t believe in the systems, they are unlikely to respond productively to them.

A new report from the Institute of Education Sciences (IES) provides a useful little snapshot of teachers’ opinions of their evaluation systems using a nationally representative survey. It is important to bear in mind that the data are from the 2011-12 Schools and Staffing Survey (SASS) and the 2012-13 Teacher Follow-up Survey, a time in which most of the new evaluations in force today were either still on the drawing board or in their first year or two of implementation. But the results reported by IES might still serve as a useful baseline going forward.

The primary outcome in this particular analysis is a survey item querying whether teachers were “satisfied” with their evaluation process. And almost four in five respondents either strongly or somewhat agreed that they were satisfied with their evaluation. Of course, satisfaction with an evaluation system does not necessarily signal anything about its potential to improve or capture teacher performance, but it certainly tells us something about teachers’ overall views of how they are evaluated.

Getting Serious About Measuring Collaborative Teacher Practice

Our guest author today is Nathan D. Jones, an assistant professor of special education at Boston University. His research focuses on teacher quality, teacher development, and school improvement. Dr. Jones previously worked as a middle school special education teacher in the Mississippi Delta. In this column, he introduces a new Albert Shanker Institute publication, which was written with colleagues Elizabeth Bettini and Mary Brownell.

The current policy landscape presents a dilemma. Teacher evaluation has dominated recent state and local reform efforts, resulting in broad changes in teacher evaluation systems nationwide. The reforms have spawned countless research studies on whether emerging evaluation systems use measures that are reliable and valid, whether they result in changes in how teachers are rated, what happens to teachers who receive particularly high or low ratings, and whether the net results of these changes have had an effect on student learning.

At the same time, there has been increasing enthusiasm about the promise of teacher collaboration (see here and here), spurred in part by new empirical evidence linking teacher collaboration to student outcomes (see Goddard et al., 2007; Ronfeldt, 2015; Sun, Grissom, & Loeb, 2016). When teachers work together, such as when they jointly analyze student achievement data (Gallimore et al., 2009; Saunders, Goldenberg, & Gallimore, 2009) or when high-performing teachers are matched with low-performing peers (Papay, Taylor, Tyler, & Laski, 2016), students have shown substantially better growth on standardized tests.

This new work adds to a long line of descriptive research on the importance of colleagues and other social aspects of the school organization. Research has documented that informal relationships with colleagues play an important role in promoting positive teacher outcomes, such as planned and actual retention decisions (e.g., Bryk & Schneider, 2002; Pogodzinski, Youngs, & Frank, 2013; Youngs, Pogodzinski, Grogan, & Perrone, 2015). Further, a number of initiatives aimed at improving teacher learning – e.g., professional learning communities (Giles & Hargreaves, 2006) and lesson study (Lewis, Perry, & Murata, 2006) – rely on teachers planning instruction collaboratively.

Evaluating The Results Of New Teacher Evaluation Systems

A new working paper by researchers Matthew Kraft and Allison Gilmour presents a useful summary of teacher evaluation results in 19 states, all of which designed and implemented new evaluation systems at some point over the past five years. As with previous evaluation results, the headline finding of this paper is that only a small proportion of teachers (2-5 percent) received low, “below proficiency” ratings under the new systems, and the vast majority of teachers continue to be rated as satisfactory or better.

Kraft and Gilmour present their results in the context of the “Widget Effect,” a well-known 2009 report by the New Teacher Project showing that the overwhelming majority of teachers in the 12 districts for which they had data received “satisfactory” ratings. The more recent results from Kraft and Gilmour indicate that this hasn’t changed much due to the adoption of new evaluation systems, or, at least, not enough to satisfy some policymakers and commentators who read the paper.

The paper also presents a set of findings from surveys of and interviews with observers (e.g., principals). These are, in many respects, the more interesting and important results from a research and policy perspective, but let’s nevertheless focus on the findings on the distribution of teachers across rating categories, as they caused a bit of a stir. I have several comments to make about them, but will concentrate on three in particular (all of which, by the way, pertain not to the paper’s discussion, which is cautious and thorough, but rather to some of the reaction to it in our education policy discourse).