Trial And Error Is Fine, So Long As You Know The Difference

It’s fair to say that improved teacher evaluation is the cornerstone of most current education reform efforts. Although very few people have disagreed on the need to design and implement new evaluation systems, there has been a great deal of disagreement over how best to do so – specifically with regard to the incorporation of test-based measures of teacher productivity (i.e., value-added and other growth model estimates).

The use of these measures has become a polarizing issue. Opponents tend to adamantly object to any degree of incorporation, while many proponents do not consider new evaluations meaningful unless they include test-based measures as a major element (say, at least 40-50 percent). Despite the air of certainty on both sides, this debate has mostly been proceeding based on speculation. The new evaluations are just getting up and running, and there is virtually no evidence as to their effects under actual high-stakes implementation.

For my part, I’ve said many times that I'm receptive to trying value-added as a component in evaluations (see here and here), though I disagree strongly with the details of how it’s being done in most places. But there’s nothing necessarily wrong with divergent opinions over an untested policy intervention, or with trying one. There is, however, something wrong with fully implementing such a policy without adequate field testing, or at least ensuring that the costs and effects will be carefully evaluated post-implementation. To date, virtually no states/districts of which I'm aware have mandated large-scale, independent evaluations of their new systems.*

If this is indeed the case, the breathless, speculative debate happening now will only continue in perpetuity.

Since we cannot really observe “true” teaching quality, it’s difficult to test whether a particular teacher evaluation measures it accurately. In some cases, such as this recent, much-discussed value-added working paper, researchers attempt to examine the association between specific measures and future student outcomes. Most often, however, they tend to rely on a series of more “indirect” assessments, such as whether the ratings are stable (reliable) over time or the relationships between the various components (also here). States/districts can also conduct surveys of teachers and administrators to get their opinions of the new systems. In any case, these types of analyses, which are geared toward seeing whether new evaluations exhibit the properties of consistent, effective measurement instruments, are always possible (and important).

But the big policy question is less about the quality of the evaluations themselves than about how they are used, particularly in hiring, firing, compensation, and other high-stakes decisions. Human beings have a tendency to complicate actual policy implementation, and even if the new evaluations are fantastic on paper, there is no guarantee that they will yield benefits if they aren’t used properly. For example, applications that are perceived as unfair could exacerbate teacher turnover, or affect the supply of new teacher candidates. In addition, a poorly designed implementation could actually threaten the signal of the measures themselves if it alters the behavior of teachers and administrators (e.g., by compelling them to “teach to the test”).

Therefore, the key to assessing whether new evaluations are “working” is to formally, rigorously test them in practice, after they are actually put in place. This too is complicated, even if you predefine the desired outcome in terms of test scores (as will likely be the case). You need to isolate the effect of the new evaluation system (i.e., how it is used) from all the other factors that can influence performance and behavior. Within a given district, all teachers will be evaluated by the same system (though it will differ between teachers in tested and untested grades and subjects).

This means, of course, that it will be very difficult to tell whether the new policy is working – since it will apply to all teachers, there will be no contemporaneous reference group. In some instances, researchers can exploit variations in program design to test effects, as in this evaluation of Cincinnati’s system, which used the timing of evaluations (once every four years) to see whether changes in outcomes coincided with the years in which teachers were evaluated.

Alternatively, it may be possible to test whether districts’ varying configurations (different weights assigned to different components) are associated with better or worse outcomes, but this will be complicated by variations in district-level policies, student demographics, and other factors. In addition, most states with new systems have required that test-based measures comprise very high proportions (usually 40-50 percent) of total evaluation scores, which means that this particular component's weights are unlikely to vary much between districts – districts can’t go lower, and are unlikely to go higher.

The best way to meaningfully address most of these issues – and rigorously test the effect of how new evaluations are used – is to bake this assessment into the implementation, preferably in a manner that exploits the power of random assignment. There are any number of possibilities here, which could be devised by researchers far more experienced in these designs than I. For instance, one idea would be to vary systematically the design of evaluations and how they are used between schools in a small but diverse set of districts (perhaps by letting them work together with teachers and evaluation experts to design a few viable plans of their own), while randomly assigning the plans to different schools within each district. One could then test whether, all else being equal, these schools’ outcomes were different, and why.
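To make the random assignment idea a little more concrete, here is a minimal sketch in Python of how plans might be assigned to schools within each district. It is only an illustration under assumptions of my own: the district names, school names, plan labels, and the function names assign_plans and plan_counts are all hypothetical, and a real design would need to account for school characteristics, consent, sample size, and much else.

```python
import random
from collections import Counter


def assign_plans(schools_by_district, plans, seed=0):
    """Randomly assign evaluation plans to schools, separately within each district.

    Randomizing within districts keeps district-level policies and
    demographics from being confounded with the plan a school receives.
    """
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    assignment = {}
    for district in sorted(schools_by_district):
        shuffled = list(schools_by_district[district])
        rng.shuffle(shuffled)
        # Start at a random plan so any leftover schools are not always
        # given the first plans in the list.
        offset = rng.randrange(len(plans))
        for i, school in enumerate(shuffled):
            assignment[school] = plans[(i + offset) % len(plans)]
    return assignment


def plan_counts(schools, assignment):
    """Count how many of the given schools received each plan."""
    return Counter(assignment[s] for s in schools)


# Hypothetical example data: two districts, two candidate plans.
schools = {
    "District A": ["A1", "A2", "A3", "A4", "A5", "A6"],
    "District B": ["B1", "B2", "B3", "B4"],
}
plans = ["plan_1", "plan_2"]

assignment = assign_plans(schools, plans, seed=42)
for district, district_schools in schools.items():
    print(district, dict(plan_counts(district_schools, assignment)))
```

Because the plans are dealt out in rotation over a shuffled list, each district ends up with a roughly equal number of schools under each plan, which is what allows outcomes to be compared between plans within the same district.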

Regardless of how it was done, it would have to be executed in a careful, coordinated manner, preferably right at the outset. It also would require significant resources, and multiple years of gathering data. In other words, it would be very difficult (and it's unlikely that the results would be uniformly interpreted). But, given the importance of getting these new systems right – and the amount of time, money, and political capital that have already been expended in putting these systems into place – it’s a good bet that the benefits of such research would have been substantial.

None of this, to my knowledge, is happening, even on a limited scale.

Many (but not all) states and districts are rushing ahead with full-blown implementation, in some cases without so much as a pilot year. Making things worse, many are ignoring crucial details, such as random error and the quality of the statistical models they employ. Indeed, many states and districts don’t even seem to be bothering to make sure that what they’re doing is working.

The result of this apparent lack of foresight is predictable. There will be some good research on the new evaluations, but it will be limited in the degree to which it can measure the actual causal effects of the new systems. It will also be impossible to tell whether and how the results might be generalized to other contexts (e.g., different states and districts). And there will be plenty of purely speculative analyses and commentary by advocates on both sides, probably amounting to little more than comparing changes in NAEP scores between states that did and did not implement new systems. We will be doomed to a future of dueling preexisting beliefs, dressed up in inconclusive empirical evidence.

And that’s the potential tragedy of this sea-change in how teachers are evaluated. It’s touted as the key to improving education quality across the nation, but few people are asking how we’ll know whether or not it’s working. There’s nothing wrong with trial and error – it is the foundation of good research and good policymaking. The problem is when you don’t know the difference and don’t much seem to care.

- Matt Di Carlo


* The one exception of which I’m aware is the District of Columbia (see here). But I should note that I have not carefully reviewed all states’ legislation pertaining to evaluations, and it is possible (even likely) that there are some places with concrete plans to evaluate their new systems. If any readers have heard of such plans, please leave a comment. It's also worth mentioning that some states/districts have not yet decided how to use their new evaluations in decisions about pay, employment status, etc.