Evaluating The Results Of New Teacher Evaluation Systems

A new working paper by researchers Matthew Kraft and Allison Gilmour presents a useful summary of teacher evaluation results in 19 states, all of which designed and implemented new evaluation systems at some point over the past five years. As with previous rounds of evaluation results, the headline finding of this paper is that only a small proportion of teachers (2-5 percent) were given the low, “below proficiency” ratings under the new systems, while the vast majority of teachers continue to be rated as satisfactory or better.

Kraft and Gilmour present their results in the context of the “Widget Effect,” a well-known 2009 report by the New Teacher Project showing that the overwhelming majority of teachers in the 12 districts for which they had data received “satisfactory” ratings. The more recent results from Kraft and Gilmour indicate that this hasn’t changed much due to the adoption of new evaluation systems, or, at least, not enough to satisfy some policymakers and commentators who read the paper.

The paper also presents a set of findings from surveys of and interviews with observers (e.g., principals). These are in many respects the more interesting and important results from a research and policy perspective, but let’s nevertheless focus on the findings regarding the distribution of teachers across rating categories, as they have caused a bit of a stir. I have several comments to make about them, but will concentrate on three in particular (all of which, by the way, pertain not to the paper’s discussion, which is cautious and thorough, but rather to some of the reaction to it in our education policy discourse).

Differentiation is important, but must be kept in perspective. Let’s be clear: It’s important that teacher evaluation systems not rate all teaching performance as the same. For one thing, a failure to differentiate might serve to discourage the development of targeted improvement efforts, the identification of effective teaching practice, the voluntary exit of low performers, and other desirable outcomes. It also precludes the use of any meaningful concrete incentives, whether rewards or punishments. It follows, then, that if the results of these new systems are implausible, they should be adjusted. Period.

That said, the end goal of these systems must be to shift the distribution of teacher performance, not simply the distribution of ratings. It is not necessarily the case that a teacher evaluation system will fail without a given spread of teachers across rating categories, and there is certainly no reason to believe that such a spread guarantees the system will be successful. The distribution of teachers across categories is only one of many possible tools for assessing an evaluation system, and it’s not really the sharpest tool at that.

To the degree these systems achieve their purpose of improving teacher performance (and I think that many people have vastly inflated expectations regarding the magnitude and speed of this impact), the bulk of this effect may very well stem from teachers who adjust their practice based on feedback from principals and colleagues, an impact which may not rely heavily upon the designation teachers receive at the end of the year. And we should also bear in mind that the credibility of these systems is of vital importance. If teachers don’t believe in the ratings, they won’t matter anyway (I personally am as interested in teachers’ and principals’ opinions of the systems in these initial stages of implementation as I am in how the ratings turn out).

One big problem here is that it is way too early to determine the effects of these new systems, yet these preliminary results are all we have to go on in many places. And after all the contentious debates about teacher evaluation over the past 5-10 years, opinions can be heated.

The endgame, of course, is whether the systems generate improvement, which will take a few years to assess. As we wait for better evidence, it makes perfect sense to monitor the distributions, and perhaps even to tweak systems based on those results (though I would wait for at least a couple of years of results before making any big decisions). But states and districts really must do so cautiously, and avoid the temptation to adjust their systems solely to produce a better spread across categories (New York is a case study in how not to react).

Expectations for results should depend on incentives. There is a tendency to judge the distribution of teachers across rating categories using our gut feelings about how the spread should look, or in comparison with preconceptions as to the distribution of “true” teacher effectiveness. But teacher evaluation systems, like all accountability systems, are not so much about capturing “reality” (which we cannot observe) as about changing behavior (whether voluntary or involuntary).

One of the key means of encouraging that behavioral change is attaching rewards and punishments to the results. And it makes very little sense to assess the “appropriate” distribution of teachers across categories without paying close attention to the incentives attached to those ratings, especially given that those incentives, particularly the explicit rewards and consequences, are the primary reason why we would sort teachers into categories in the first place.

For example, we would probably have very different expectations (and desires) for how many teachers “should” receive the lowest rating in a system where that rating puts teachers at risk of dismissal (or results in immediate dismissal), compared with a situation in which the stakes are considerably lower, such as being assigned additional professional development. In the former case, the one with high stakes, is 2-5 percent of teachers across an entire state too low?

If so, how high should it be? In other words, putting aside the (very important) fact that there is, as yet, only moderate room for confidence in the degree to which these brand new systems actually capture teacher performance, precisely what percentage of each state’s teachers do we believe are: 1) sufficiently low performing that we are confident they should be placed at significant risk of dismissal; and 2) likely to be succeeded by a better replacement?

Is 2-5 percent statewide, on an annual basis, really that far outside the realm of plausibility? Perhaps, but I'm not so sure. Now, of course, the stakes are much lower in many states. In these contexts, I can see the argument either way, but I don’t think that 2-5 percent statewide is necessarily implausible. It depends on how each state and district conceives of the category’s “meaning” (see here for more).*

In any case, the point here is that the distribution of ratings cannot be assessed independent of the incentives embedded in each system. And, more generally, evaluation systems’ results don’t have to match our preconceived notions of the “real” teacher performance distribution in order to improve that distribution.

Within-state variation is very important here. Finally, this Kraft and Gilmour paper is comparable to the “Widget” report in form, but the data are a bit different. The “Widget” report included only 12 districts in four states, most of them larger, primarily urban districts, whereas the results in this recent working paper are statewide (overall state-level results in 19 states). The distribution of evaluation ratings varies widely within many of these states. Moreover, in many cases, the ratings spread for large urban districts is quite different from what it is statewide – i.e., a much larger proportion of teachers receive lower ratings in urban districts than do teachers across the entire state (see here, for example).

This matters for two reasons. First, when we think about the need to improve teaching quality, however defined, we are usually focused on districts serving relatively large proportions of impoverished students. In these districts, for several reasons (e.g., recruitment and retention are more challenging), teaching quality is a far bigger concern than it is in the typical middle-class school within a given state. In this sense, then, the statewide results may give a distorted picture of the systems’ results in the places on which we want to focus.

The second, more important reason why intra-state variation in results matters is that it represents a wonderful opportunity for states to examine that variation, why it occurs (e.g., system design), and whether the designs or ratings distributions are associated with different results (e.g., student achievement, teacher retention). To me, this should be the most urgent project of this new teacher evaluation enterprise – exploiting that intra-state variation to see what works and why it works, and then replicating it. The statewide distributions are an important initial diagnostic tool, but they receive so much attention that I often worry we are missing the efficacy forest for the “ineffective” trees.

*****

* For their part, Kraft and Gilmour actually present results showing that evaluators in the third year of a new evaluation system thought that roughly four percent of teachers deserved “below proficient” ratings, but predicted that only 2.4 percent would actually receive them (the actual proportion in these evaluators’ schools ended up at 1.1 percent). These discrepancies are meaningful in size, as well as very interesting, and the figures are not directly comparable to the statewide results (the schools of these evaluators may not be representative). Still, they hardly suggest that 2-5 percent is completely out of the ballpark, particularly in a situation where high stakes are attached to the lowest ratings.
