The RAND Corporation recently released an important report on the impact of the Gates Foundation’s “Intensive Partnerships for Effective Teaching” (IPET) initiative. IPET was a very thorough and well-funded attempt to improve teaching quality in schools in three districts and four charter management organizations (CMOs). The initiative was multi-faceted, but its centerpiece was the implementation of multi-measure teacher evaluation systems and the linking of ratings from those systems to professional development and high stakes personnel decisions, including compensation, tenure, and dismissal. This policy, particularly the inclusion in teacher evaluations of test-based productivity measures (e.g., value-added scores), has been among the most controversial issues in education policy throughout the past 10 years.
The report is extremely rich, and there are a lot of interesting findings in it, so I would encourage everyone to read it themselves (at least the executive summary). The headline finding, however, was that IPET had no discernible effect on student outcomes, namely test scores and graduation rates, in the districts that participated, vis-à-vis similar districts that did not. Given that IPET was so thoroughly designed and implemented, and that it was well-funded, it can potentially be viewed as a "best case scenario" test of the type of evaluation reform that most states have enacted. Accordingly, critics of these reforms, who typically focus their opposition on the high stakes use of evaluation measures, particularly value-added and other test-based measures, have portrayed the findings as vindication of their opposition.
This reaction has merit. The most important reason is that evaluation reform was portrayed by advocates as a means to immediate and drastic improvements in student outcomes. This promise was misguided from the outset, and evaluation reform opponents are (and were) correct in pointing this out. At the same time, however, it would be wise not to dismiss evaluation reform as a whole, for several reasons, a few of which are discussed below.
First, and most basically, it’s too soon to proclaim failure (or success, really). That is, we are still in the intermediate phases of understanding the impact of evaluation reform (and there really isn’t evidence from more than a few places, some of it positive [e.g., Taylor and Tyler 2012, Adnot et al. 2017]). Supporters’ promises of immediate, drastic improvements belied the fact that education policy changes, particularly on a scale comparable to that of evaluation reform, take time to show effects. Real improvement is most often slow and gradual.
By the same token, unrealistic expectations are a poor basis for policy evaluation. Reforming teacher personnel systems, even done correctly, could take many years to have a discernible effect on the recruitment, performance, and retention of teachers. Any effect on aggregate outcomes, to the degree it can be measured (and hopefully such measurement will include different types of outcomes), is unlikely to be massive in the short or even medium term. It might, however, be meaningful and positive in some places, and that would be a good thing, both in itself and because it could be used to improve systems in places where such effects are not found.
On a second and related note, there is not going to be one “success/failure” verdict here. States varied in their requirements for the new systems, and districts within those states had varying degrees of autonomy in and capacity for design and implementation. This is a massive endeavor. Evaluation reform, in my view, proceeded too quickly and with insufficient flexibility, but there are still thousands of different combinations of designs and outcomes to look at, and this kind of variation is the best way to identify what works and what doesn’t.
Remember also that part of the variation in outcomes will be a result of “external” factors. For example, the impact of dismissing low performing teachers depends a great deal on the supply of candidates to replace them (this, in my view, was the big factor in D.C., where researchers found positive effects of teacher turnover in the first few years after evaluation reform).
Evaluation reform, even without a definitive “up or down” verdict, has also provided a foundation for adjusting the new systems going forward. There is now some infrastructure in place (e.g., observation systems, data systems) that can, at least in many states, support different types of evaluation designs. Remember that most of the states that went ahead with evaluation reform did so before there was even preliminary evidence about how the new systems should be designed. It would be crazy to think that they would get it right on the first try (or even the second or third).
Third, it’s fair to say that the new evaluation systems in many districts, for all their shortcomings, are better than the previous systems, under which evaluations were in too many cases infrequent and functioned more as formalities than as actual performance assessments geared toward improvement. For example, a majority of U.S. public school districts are now using high quality classroom observation systems that are comparable across schools and districts. And most teachers are now evaluated annually instead of occasionally.
This may be somewhat cold comfort for critics of how evaluation reform went down, but there is something to be said for assessing policies vis-à-vis their predecessors, and it’s not easy to argue that the new systems are inferior to the old ones.
The fourth and final reason why evaluation reform should not be considered a failure (at least not yet) is a bit abstract and not at all original, but it bears mention nevertheless. There are many lessons that can be drawn from this process, but one of the big ones for me is just how much people and their behavior matter in education reform.
During the design process, so much attention (including mine) was focused on the proper weights for each measure, the distributions of final ratings, and other somewhat technical concerns. These questions were too often approached as if evaluations would work if we could just get the technical details correct.
To be clear, details are important. They are very important. But they are insufficient for policy success, and it's too easy to lose sight of that. No performance evaluation system will have the desired effects without buy-in from the evaluators and the evaluated. For example, teachers won’t use systems to improve their practice if they don’t consider those systems credible. Principals will hesitate to give bad ratings to their teachers if they believe improvement is the preferable approach, if they fear that doing so will hurt morale, or if they are skeptical of the availability of superior replacements.
This can be a frustrating reality for non-practitioners involved in the policymaking process, because its clear implication is that there is a human factor here that cannot always be predicted or accounted for by even the most carefully designed and implemented policies. Stakeholders' responses can in fact contradict decisions that seem straightforward from a policy design perspective, particularly when it comes to accountability policy (e.g., educators and technocrats do not always agree on what constitutes good measurement).
Given the magnitude of investment in evaluation reform in most states, it’s a stretch to say that learning this lesson, which is almost a cliché, means that evaluation reform was a success. But it should certainly be a lasting legacy of this endeavor, if enough people internalize it.
In any case, there is still ample hope for evaluation reform’s fate, given a little patience, a willingness to adapt and improve the systems, and an acknowledgment that the promises of miracles were empty and should not play any role going forward.