The Categorical Imperative In New Teacher Evaluations
There is a push among many individuals and groups advocating new teacher evaluations to predetermine the number of outcome categories (e.g., highly effective, effective, developing, ineffective) that these new systems will include. For instance, a "statement of principles" signed by 25 education advocacy organizations recommends that the reauthorized ESEA law require "four or more levels of teacher performance." The New Teacher Project’s primary report on redesigning evaluations made the same suggestion.* For their part, many states have followed suit, mandating new systems with a minimum of four or five categories.
The rationale here is pretty simple on the surface: Those pushing for a minimum number of outcome categories believe that teacher performance must be adequately differentiated, a goal on which prior systems, most of which relied on dichotomous satisfactory/unsatisfactory schemes, fell short. In other words, the categories in new evaluation systems must reflect the variation in teacher performance, and that cannot be accomplished when there are only a couple of categories.
It’s certainly true that the number of categories matters – it is an implicit statement as to the system’s ability to tease out the “true” variation in teacher performance. The number of categories a teacher evaluation system employs should depend on how well it can differentiate teachers with a reasonable degree of accuracy. If a system is unable to pick up this “true” variation, then using several categories may end up doing more harm than good, because it will be providing faulty information. And, at this early stage, despite the appearance of certainty among some advocates, it remains unclear whether all new teacher evaluation systems should require four or more levels of “effectiveness.”
One body of research that may be relevant here is the literature on value-added and other growth models. The estimates from even the best of these models are subject to high degrees of random error, even when samples are relatively large. As a result, growth models are typically only useful in identifying teachers at the very “bottom” and “top” of the distribution (as even their most ardent proponents acknowledge). Most teachers’ estimates are too noisy to be distinguished from average.
It might therefore be said that these measures, which are of course being used for fairly large proportions of many teachers’ final evaluation scores, are most appropriately used to identify teachers who fall into one of three groups: those who are above average by a statistically meaningful margin; those who are below average by such a margin; and the larger group of teachers in between. Insofar as it is reasonable to speculate that other measures of teacher performance may be similarly imprecise mid-distribution (see below), the literature on test-based teacher productivity suggests 4-5 categories may not be the best way to go, at least initially. In addition, to whatever degree growth model scores drive the differentiation of teachers between the “middle two” categories of four-category systems that incorporate these scores, that differentiation reflects a misinterpretation of these estimates.
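To make the three-group logic concrete, here is a minimal sketch in Python. It is an illustration under assumed numbers, not a reconstruction of any actual growth model: the function names (classify, simulate), the 95 percent confidence level, and the choice of equal spread for "true" teacher effects and estimation noise are all hypothetical. Each simulated teacher is placed in the "above" or "below" group only if the confidence interval around his or her estimate excludes the average; everyone else lands in the "middle."

```python
import random

Z_95 = 1.96  # two-sided 95% critical value of the standard normal (assumed level)


def classify(estimate, std_error, average=0.0, z=Z_95):
    """Return 'above', 'below', or 'middle', depending on whether the
    confidence interval around the estimate excludes the average."""
    lower = estimate - z * std_error
    upper = estimate + z * std_error
    if lower > average:
        return "above"
    if upper < average:
        return "below"
    return "middle"


def simulate(n_teachers=1000, true_sd=1.0, noise_sd=1.0, seed=0):
    """Draw hypothetical 'true' effects, add random error, and count
    how many teachers fall into each of the three groups."""
    rng = random.Random(seed)
    counts = {"above": 0, "below": 0, "middle": 0}
    for _ in range(n_teachers):
        true_effect = rng.gauss(0.0, true_sd)                # assumed spread of true effects
        estimate = true_effect + rng.gauss(0.0, noise_sd)    # noisy estimate of that effect
        counts[classify(estimate, noise_sd)] += 1
    return counts


if __name__ == "__main__":
    print(simulate())
```

Under these assumed parameters, the large majority of simulated teachers end up in the "middle" group, which is the pattern described above. Less noise (a smaller noise_sd) shrinks that group; more noise expands it.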
Growth model estimates are, of course, far from the only measure being used in new evaluations. There is considerable variation within and between states, but the other very common measure being used is observations. It’s a bit more difficult to examine the "accuracy" of observation scores. One common (if not uncontroversial) way to assess observations and other components is to see how they match up with growth model estimates.
The available evidence based on this approach, though limited, also suggests that differentiation in the middle of the distribution is not especially precise. A paper published in 2008 found that principals were able to identify teachers who had the very highest and lowest value-added scores, but their ability to distinguish between those in the middle was far more limited. So, to whatever degree the relationship between observational and test-based measures of teacher productivity is meaningful, the research once again suggests that evaluation systems will have most of their success at the tails of the distribution (i.e., the “top” and “bottom” performers), with far less toward the middle. And, once again, 4-5-category systems necessarily require differentiating among teachers in the middle.
This is a particularly important issue given widespread plans to use the results of these evaluations to hire, fire, pay and make other high-stakes decisions. In some cases, there are even stakes being attached to receiving one of the “middle” ratings. For example, under DC’s new four-category evaluation system, teachers who receive the second-lowest rating for two consecutive years are automatically dismissed.
Based on the evidence discussed above, as well as the fact that these systems are brand new and will be used in high-stakes decisions, the "4-5 categories" directive at the very least deserves more research-based scrutiny. It is unclear why it is being pushed so aggressively, and why some states and districts shouldn't have the option of trying alternatives.
The fact that so many states have mandated 4-5 category schemes right out of the gate, seemingly based on little more than speculation among advocacy groups, is yet another instance of the rushed, ill-considered drive to overhaul teacher evaluations. It’s amazing how certain some people seem about what these new systems are supposed to look like, given the fact that there is barely a shred of evidence as to their optimal form. In these situations, it's often wise to encourage experimentation, see how different configurations turn out and learn from this variation.
As is usually the case in policymaking, it is the details – even seemingly minor issues such as the number of categories – that go a long way toward determining success and failure, and, in too many instances, attention to these details appears to have taken a backseat to presumption.
- Matt Di Carlo
* The Race to the Top regulations include the vague requirement that teacher evaluation systems “differentiate teacher effectiveness using multiple rating categories."