Success Via The Presumption Of Accuracy

In our previous post, Professor David K. Cohen argued that reforms such as D.C.’s new teacher evaluation system (IMPACT) will not by themselves lead to real educational improvement, because they focus on individual rather than systemic causes of low performance. He framed this argument in terms of the new round of IMPACT results, which were released two weeks ago. While the preliminary information was limited, it seems that the distribution of teachers across the four rating categories (highly effective, effective, minimally effective, and ineffective) was roughly similar to last year’s - including a small group of teachers fired for receiving the lowest “ineffective” rating, and a somewhat larger group (roughly 200) fired for having received the “minimally effective” label for two consecutive years.

Cohen’s argument on the importance of infrastructure does not necessarily mean that we should abandon the testing of new evaluation systems, only that we should be very careful about how we interpret their results and the policy conclusions we draw from them (which is good advice at all times). Unfortunately, however, caution seems to be in short supply. For instance, shortly after the IMPACT results were announced, the Washington Post ran an editorial, entitled “DC Teacher Performance Evaluations Are Working,” in which a couple of pieces of “powerful evidence” were put forward to support this bold claim. The first was that 58 percent of the teachers who received a “minimally effective” rating last year and remained in the district were rated either “effective” or “highly effective” this year. The second was that around 16 percent of DC teachers were rated “highly effective” this year and will be offered bonuses, which, the editorial writers argued, shows that most teachers “are doing a good job” and being rewarded for it.

The Post’s claim that these facts represent evidence - much less “powerful evidence” - of IMPACT’s success is a picture-perfect example of the flawed evidentiary standards that too often drive our education debate. The unfortunate reality is that we have virtually no idea whether IMPACT is actually “working,” and we won’t have even a preliminary grasp for some time. Let’s quickly review the Post’s evidence.

The editorial writers’ first datum - that 58 percent of last year’s “minimally effective” teachers received a higher rating this year - relies on the unfounded presumption that these ratings are accurate, and then stretches it even further. Their rationale is that over half of last year’s “minimally effective” teachers (or at least of the two-thirds who didn’t leave the district) received a higher IMPACT rating this year, which supposedly shows that they improved. It is difficult to overstate the emptiness of this argument.

In any evaluation system, there will be some instability between years (i.e., teachers getting a different rating than the year before). Some of these changes will be “real,” such as teachers improving. But, even in a well-functioning system, there will be “noise.” Evaluations are never perfect, so it is entirely reasonable to tolerate a degree of fluctuation between years. But the Post’s decision to chalk up to true improvement the fact that most remaining “minimally effective” teachers from last year received a higher rating this year is little more than speculation.

We have no idea of the degree to which this instability is due to actual performance changes versus measurement error of various types. Again, there will always be some year-to-year fluctuation, both real and error-generated, perhaps even a good amount of it. But any time a majority of teachers who receive a given rating in the first year receive a different rating the next year, that might actually be considered grounds for further investigation. At the very least, it is not a cause for celebration (especially since a roughly equal number of teachers moved from “effective” or “highly effective” down to “minimally effective”).
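As a rough illustration (a minimal simulation with invented parameters, not an analysis of actual IMPACT data), here is how purely random measurement noise - with every teacher’s true performance held constant - can produce exactly this kind of apparent “improvement”:

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 4000

# Each teacher has a fixed "true" performance level; the observed score
# each year is that level plus independent measurement noise.
true_perf = rng.normal(0.0, 1.0, n_teachers)
noise_sd = 0.8  # assumed noise level - larger means less reliable ratings
year1 = true_perf + rng.normal(0.0, noise_sd, n_teachers)
year2 = true_perf + rng.normal(0.0, noise_sd, n_teachers)

# Four rating categories via invented score cutoffs (set here so that
# roughly 5% are "ineffective" and 15% "minimally effective" in year 1;
# the actual IMPACT cutoffs differ).
cutoffs = np.quantile(year1, [0.05, 0.20, 0.85])

def rate(scores):
    # 0 = ineffective, 1 = minimally effective,
    # 2 = effective, 3 = highly effective
    return np.digitize(scores, cutoffs)

r1, r2 = rate(year1), rate(year2)

# Share of year-1 "minimally effective" teachers (category 1) who get a
# higher rating in year 2, even though no one actually improved.
min_eff = r1 == 1
print(f"{np.mean(r2[min_eff] > r1[min_eff]):.0%} rated higher next year")
```

Under these made-up settings, a majority of the simulated “minimally effective” teachers typically receive a higher rating the following year - a textbook case of regression to the mean, with no real improvement anywhere in the data.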

The editorial’s second piece of “evidence” - that IMPACT is working because 16 percent of DC teachers received a “highly effective” rating and will be rewarded for it - is more of a giant non sequitur than anything else. Like the first argument, it relies almost entirely on the assumption that the ratings are accurate, and that bonuses are being used to help retain the “best” teachers. (The same criticism, by the way, applies to the Post’s view, also put forth in the editorial, that one-third of last year’s “minimally effective” teachers leaving the district represents a “good outcome.”)

And, again, this is the most important question about IMPACT - whether the ratings are valid and reliable. That the Post is willing to stare this important question - an empirical question - in the face, and not only miss its significance but also presume to have answered it, is at best incompetent and at worst irresponsible.

Look, whether you love or hate IMPACT, it is a brand-new system. We know almost nothing about whether it is “working.” All we know is that its results are already being used to fire some teachers and give others bonuses. If the Post and those making similar arguments believe that firings and bonuses are the entire purpose of IMPACT, with no regard for accuracy or the effect on educational quality, then sure - the system is working perfectly.

If, on the other hand, they believe that evaluation systems should be used to assess performance accurately, and subsequently improve it, then the system’s efficacy remains to be demonstrated. It will take time to build this body of evidence, which will include, for instance, inter-outcome correlations (both within and between years), copious feedback from teachers and administrators, and an examination of variation in teachers’ scores by individual-, class-, and school-level characteristics. We will obviously never achieve absolute certainty about any evaluation system’s quality, but it will take a lot more than descriptive tallies over two years to even begin to draw conclusions.
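To give a concrete sense of what building that evidence looks like, here is a minimal sketch of one such check - a between-year reliability analysis - assuming a hypothetical file impact_scores.csv with one row per teacher and invented column names (it is not an analysis of any actual IMPACT data):

```python
import pandas as pd

# Hypothetical input: one row per teacher, with overall evaluation scores
# for two consecutive years (file name and columns are invented).
df = pd.read_csv("impact_scores.csv")
# assumed columns: teacher_id, school_id, score_y1, score_y2

# Between-year correlation of overall scores: a rough test-retest check.
# Very low values suggest noisy measurement, though even this statistic
# mixes true performance change with measurement error.
r = df["score_y1"].corr(df["score_y2"])
print(f"Year-to-year score correlation: {r:.2f}")

# Variation by school: large systematic gaps between schools may signal
# that ratings reflect school-level factors rather than teacher quality.
print(df.groupby("school_id")["score_y2"].mean().describe())
```

Descriptive checks like these would be only a starting point; they would need to be combined with the feedback and subgroup analyses mentioned above before anyone could credibly say the system is “working.”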

In the meantime, we need to be very careful about declaring IMPACT - or any system - a success based purely on blind faith. Doing so represents either a cynical political gambit or a terrible misinterpretation of data. We don’t need either.

- Matt Di Carlo

