Certainty And Good Policymaking Don't Mix

Using value-added and other types of growth model estimates in teacher evaluations has probably been the most controversial and oft-discussed issue in education policy over the past few years.

Many people (including a large proportion of teachers) are opposed to using student test scores in their evaluations, as they feel that the measures are not valid or reliable, and that they will incentivize perverse behavior, such as cheating or competition between teachers. Advocates, on the other hand, argue that student performance is a vital part of teachers’ performance evaluations, and that the growth model estimates, while imperfect, represent the best available option.

I am sympathetic to both views. In fact, in my opinion, there are only two unsupportable positions in this debate: certainty that using these measures in evaluations will work, and certainty that it won’t. Unfortunately, that’s often how the debate has proceeded – two deeply entrenched sides convinced of their absolutist positions, and resolved that any nuance in or compromise of their views will only preclude the success of their efforts. You’re with them or against them. The problem is that it’s the nuance – the details – that determines policy effects.

Let’s be clear about something: I'm not aware of a shred of evidence – not a shred – that the use of growth model estimates in teacher evaluations improves performance of either teachers or students.

Now, don’t get me wrong – there’s no direct evidence that using VA measures has a positive effect because there’s really no evidence at all. This stuff is all very new, and it will take time before researchers get some idea of the effects. There is some newer evidence that well-designed teacher evaluations can have positive effects on teacher performance (see here, for example), but these systems did not include state test-based measures.

This situation would seem to call not for simple “yes/no” answers, but rather for proceeding carefully, using established methods of policy evaluation and design. That is not what is happening. Thanks in large part to Race to the Top, almost half of public school students in the U.S. are now enrolled in states/districts that have already incorporated growth estimates into their evaluations or will soon do so. Most (but not all) of these states and districts are mandating that test-based productivity measures comprise incredibly high proportions of evaluation scores, and most have failed to address key issues such as random error and the accuracy of their data collection systems. Many refused to allow for a year or two of piloting these new systems, and few have commissioned independent evaluations of these systems’ effects on achievement and other outcomes, which means that, in most places, we’ll have no rigorous means of assessing the impact of these systems.

In my view, this failure to address basic issues reflects extreme polarization between the “sides” in this debate. When positions are black and white, details and implementation get the short end of the stick.

For example, addressing the critical issue of random error might be seen as a political liability, tantamount to admitting that these measures are inaccurate. Mandating a year or two for no-stakes testing of these systems only gives one’s adversaries more time to organize against the plans, while a rigorous policy evaluation by independent researchers entails the risk that the results will be used to shut the programs down.

On the other “side” of the divide, any admission that growth measures might play even a small, responsible role in evaluations risks the dreaded slippery slope, while a cautious acknowledgment that standardized testing data do provide “actionable” information somehow represents a foot in the door for an evil technocratic regime that will sap public education of all its humanity.

These are exactly the kinds of attitudes – on both “sides” of this debate – that enable bad policy. And that is, from my perspective, exactly what we’ve seen in many states and districts. The mere presence or absence of test-based productivity measures in evaluations is seen as a victory, with little or no regard for the fact that how you use them will determine whether they work.

- Matt Di Carlo

I'm afraid all this will be the new "Reading First." A massive, expensive uncontrolled experiment with mostly no significant impact that, most importantly, almost everyone forgets ever happened. There are lots of people out there who seem convinced that we've never seriously tried teaching phonics in the last 50 years.

I think much of the polarization arises from a sense of urgency and panic when it comes to the state of education in our nation. You can argue about how much of the "panic" is feigned for political gain, but either way there is a valid concern about the amount of time it takes to create reliable and valid assessments, and the number of students (mostly low-income minorities) who are "lost" during that time. It also takes time to convince states and districts to make significant changes to existing systems, which is why tying evaluation mandates to money seems to work so well. While I agree with your headline, what would you say to those who want change effective immediately? Why should we wait while researchers and evaluators take their time?