## The War On Error

The debate on the use of value-added models (VAM) in teacher evaluations has reached an impasse of sorts. Opponents of VAM use [1] contend that the estimates are too imprecise to be used in evaluation; supporters [2] argue that current systems are inadequate, and that all measures entail error, which by itself doesn’t preclude using the estimates.

This back-and-forth may be missing the mark, and it is not particularly useful in the states and districts that are already moving ahead. The more salient issue, in my view, is less about the *amount* of error than about *how it is dealt with* when the estimates are used (along with other measures [3]) in evaluation systems.

Teachers certainly understand that some level of imprecision is inherent in any evaluation method—indeed, many will tell you about colleagues who shouldn’t be in the classroom, but receive good evaluation ratings from principals year after year. Proponents of VAM often point to this tendency of current evaluation systems [4] to give “false positive” ratings as a reason to push forward quickly. But moving so carelessly that we disregard the error in current VAM estimates—and possible methods to reduce its negative impacts—is no different than ignoring false positives in existing systems.

Mostly as a result of random statistical error [5], value-added estimates can identify with acceptable precision (sometimes gauged using the convention of statistical significance) only those teachers at the extremes (or “tails”) of the performance distribution. Depending on the amount of data available, this means we get strong results (results in which we can have confidence) for only about 10-30 percent of teachers: the top 5-15 percent and the bottom 5-15 percent. All other teachers should be regarded statistically as no different from average. Any credible researcher, including staunch VAM advocates like William Sanders [6], will acknowledge this limitation.

Interpreting a teacher’s VAM score without examining the error margin is, in many respects, meaningless. For instance, a recent analysis [7] of VAM scores in New York City shows that the *average* error margin is plus or minus 30 percentile points. That puts the “true score” (which we can’t know) of a 50th percentile teacher at somewhere between the 20th and 80th percentile, an incredible 60-point spread (though, to be fair, the “true score” is much more likely to be near the 50th percentile than the 20th or 80th, and many individual teachers’ error margins are narrower than the average). If evaluation systems don’t pay any attention to the margin of error, the estimate is little more than a good guess (and often not a very good one at that).
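To make the arithmetic concrete, here is a minimal sketch of how an error margin turns a single percentile rank into a range. The function name `percentile_range` is hypothetical, and the sketch simply assumes a symmetric margin clipped to the 0-100 scale; actual VAM reports may construct their intervals differently.

```python
from typing import Tuple


def percentile_range(rank: float, margin: float) -> Tuple[float, float]:
    """Return the (low, high) percentile range implied by an error margin.

    Assumption: the margin is symmetric and the result is clipped to the
    0-100 percentile scale.
    """
    low = max(0.0, rank - margin)
    high = min(100.0, rank + margin)
    return low, high


# The New York City example from the text: 50th percentile, +/- 30 points.
low, high = percentile_range(50, 30)
print(f"plausible range: {low:.0f}th to {high:.0f}th percentile")
```

Note that the range says nothing about where in it the “true score” is most likely to fall; it only marks the bounds.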

Now, here’s the problem: Many, if not most, teacher evaluation systems that include VAM (current, enacted or under consideration) *completely ignore this*. Many of the systems with which I’m familiar just take VAM estimates at face value, often using them to assign teachers to categories (there are exceptions, such as the plan in Hillsborough, FL [8] to use three-year cumulative estimates, and most systems are still on the drawing board [please comment if you know of others]).

While the vast majority of the teachers categorized in this way, including many in the top and bottom categories, are actually indistinguishable from average, their scores are being accorded an unwarranted legitimacy, especially when they count for 40-50 percent of teachers’ final evaluations, as is the case in an increasing number of places. Some teachers have even been fired [9] based on evaluations that include heavily weighted VAM estimates from only one year of data.

To knowingly build this level of imprecision into a system makes no sense, especially when it is unnecessary. VAM estimates can be incorporated into evaluations in a more responsible fashion, one which pays attention to error.

One very simple idea, for example, would be to employ a three-category scheme (above average, average, below average) that very directly accounts for error margins, perhaps with the threshold for statistical significance relaxed a bit. The model [10] used in Tennessee, Ohio, and elsewhere (designed by William Sanders) reports results to teachers and schools in this fashion, but it’s not yet clear whether the same scheme will be used in actual evaluations.
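A three-category scheme of this kind might be sketched as follows. This is only an illustration under stated assumptions, not the Sanders model or any district’s actual rule: it assumes each estimate is on a scale where 0 means average, that it comes with a standard error, and that a normal critical value (`z`) defines the error margin. The names `Estimate` and `classify` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    teacher: str
    score: float      # value-added estimate; 0.0 = average (assumed scale)
    std_error: float  # standard error of the estimate


def classify(est: Estimate, z: float = 1.96) -> str:
    """Place a teacher in one of three categories using the error margin.

    A teacher is labeled above or below average only when the whole
    interval (score +/- z * std_error) lies on one side of the average.
    """
    lower = est.score - z * est.std_error
    upper = est.score + z * est.std_error
    if lower > 0:
        return "above average"
    if upper < 0:
        return "below average"
    return "average"


teacher = Estimate("Teacher B", score=0.15, std_error=0.10)
print(classify(teacher))          # conventional 95 percent threshold
print(classify(teacher, z=1.28))  # a relaxed threshold (roughly 80 percent)
```

Lowering `z` is one way to relax the significance threshold: more teachers receive a non-average label, at the cost of more misclassifications.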

Another, equally simple idea is to set a minimum sample size (i.e., number of students or years) that must be available for a given teacher before the estimates can be incorporated into his or her evaluation. This is essentially what’s happening in Hillsborough.
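A minimum-sample rule could look something like the sketch below. The thresholds are hypothetical placeholders (the three-year figure echoes the Hillsborough example, but the exact rule there is not described here). This version requires both conditions to be met, which is itself an assumption; a system could just as well require only one.

```python
from typing import Optional

MIN_STUDENTS = 50  # hypothetical threshold, for illustration only
MIN_YEARS = 3      # hypothetical; echoes the three-year cumulative idea


def actionable_estimate(
    score: float,
    n_students: int,
    n_years: int,
    min_students: int = MIN_STUDENTS,
    min_years: int = MIN_YEARS,
) -> Optional[float]:
    """Return the estimate only if it rests on enough data, else None.

    None signals that the estimate should be left out of the evaluation
    rather than treated as a score of zero.
    """
    if n_students < min_students or n_years < min_years:
        return None
    return score


print(actionable_estimate(0.22, n_students=24, n_years=1))  # first-year teacher
print(actionable_estimate(0.22, n_students=78, n_years=3))  # enough data
```

Returning `None` instead of a default number keeps a missing estimate from being silently averaged into a teacher’s final rating.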

These and similar methods would, obviously, reduce the number of teachers who get “actionable” estimates, particularly during their first years of teaching. But they would also go a long way towards reducing (though not eliminating) the alarming degree of imprecision in VAM, much of which stems from little more than random variation. Also, other problems, such as bias [11] from non-random classroom assignment, get better with larger sample sizes [12].

And researchers who support using VAM agree. For example, a recent paper from the Brookings Institution [2] argued that “any practical application of value-added measures should make use of confidence intervals in order to avoid false precision” (the above-mentioned 60-point spread is a confidence interval). A recent RAND/CAP report [13] provides similar recommendations.

Regardless of how it is done, accounting for the error in VAM estimates is a critical issue in those states and districts that have already decided to use these estimates as a factor in teacher evaluation. The fact that it is rarely discussed, and may not be part of the design of many systems, new and existing, is very troubling. After years of effort and millions of dollars in investment, we might end up with almost as many false positives and many, many more false negatives: excellent or average teachers who are erroneously identified as subpar. If we’re going to do this, we should at least do it correctly.