A Quick Look At The ASA Statement On Value-Added

Several months ago, the American Statistical Association (ASA) released a statement on the use of value-added models in education policy. I’m a little late getting to this (and might be repeating points that others made at the time), but I wanted to comment on the statement, not only because I think it's useful to have the ASA add its perspective to the debate on this issue, but also because the statement seems to have become one of the staple citations for those who oppose the use of these models in teacher evaluations and other policies.

Some of these folks claimed that the ASA supported their viewpoint – i.e., that value-added models should play no role in accountability policy. I don’t agree with this interpretation. To be sure, the ASA authors described the limitations of these estimates, and urged caution, but I think that the statement rather explicitly reaches a more nuanced conclusion: That value-added estimates might play a useful role in education policy, as one among several measures used in formal accountability systems, but this must be done carefully and appropriately.*

Much of the statement puts forth the standard, albeit important, points about value-added (e.g., moderate stability between years/models, potential for bias, etc.). But there are, from my reading, three important takeaways that bear on the public debate about the use of these measures, which are not always so widely acknowledged.

First, the authors state: "Estimates from VAMs should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model." The imprecision of value-added estimates gets a great deal of attention (not all of it well-informed), but, while there are exceptions, many states/districts haven't made much of an effort to report or candidly discuss this imprecision (or, even better, what can be done about it).

Moreover, states' and districts' discussion of even the most basic assumptions and issues hasn't always been particularly robust, and has in some cases become highly politicized, with officials (again, with exceptions) downplaying the concerns. In Florida, for instance, some value-added supporters (including state officials) have claimed that because the estimates from the state's model are not associated strongly with subsidized lunch eligibility, this means that factors associated with poverty don't influence individual teachers' scores. This is a painfully oversimplified conclusion: a weak overall association does not, by itself, rule out that such factors affect individual teachers' scores (the reality is far more nuanced).

It could be that states and districts are hesitant to lay it all out because these models are already controversial, and any statements in which states/districts are blunt about their limitations, or about addressing imprecision, would foster even more opposition. Perhaps so, but the fact remains that there is, understandably, an incredible amount of confusion surrounding the value-added/growth model approach, and any productive use of these estimates, whether in a high- or low-stakes context, requires not only that they be used appropriately, but also that stakeholders understand and find credible the information they may (or may not) transmit. You can't bluff your way to buy-in and effective implementation.
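To make the first takeaway a bit more concrete, here is a minimal sketch of what reporting an estimate alongside its precision could look like. It assumes that each estimate comes with a standard error and that a normal approximation is reasonable; the teacher labels and numbers are invented for illustration and do not come from any actual state or district model.

```python
from statistics import NormalDist


def confidence_interval(estimate, std_error, level=0.95):
    """Return (low, high) bounds, assuming a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error


def classify(estimate, std_error, average=0.0, level=0.95):
    """Label an estimate relative to the average, taking imprecision into account."""
    low, high = confidence_interval(estimate, std_error, level)
    if low > average:
        return "above average"
    if high < average:
        return "below average"
    return "not distinguishable from average"


# Hypothetical estimates (in test-score standard deviations) and standard errors.
hypothetical_teachers = [
    ("Teacher A", 0.15, 0.05),
    ("Teacher B", 0.15, 0.10),
    ("Teacher C", -0.30, 0.10),
]

for name, estimate, std_error in hypothetical_teachers:
    low, high = confidence_interval(estimate, std_error)
    label = classify(estimate, std_error)
    print(f"{name}: {estimate:+.2f} [{low:+.2f}, {high:+.2f}] -> {label}")
```

In this made-up example, Teachers A and B have the same point estimate, but only one of them can be distinguished from the average once the standard error is taken into account. That is the kind of information a bare score or ranking leaves out.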

The second, related point that the ASA statement’s authors make, which I would like to discuss, is their caution that, "Under some conditions, VAM scores and rankings can change substantially when a different model or test is used, and a thorough analysis should be undertaken to evaluate the sensitivity of estimates to different models."

The picture emerging from the research on this question seems to be that there is a rather high correlation between estimates from some (but not all) of the more common models used in teacher evaluations, but that the models can produce different results depending on the choice and use of control variables, as well as the composition of schools and teachers' classrooms (see, for example, Ehlert et al. 2013 and Goldhaber et al. 2014). Results from the same model can also vary quite a bit when different tests are used (e.g., Papay 2010; MET 2013).
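A rough sketch of the kind of comparison at issue here: given two sets of estimates for the same teachers, one can compute the rank correlation between them and also the share of teachers who land in a different quintile. Everything below is an assumption for illustration. The data are simulated, the two "models" are just the same underlying signal plus independent noise, and the helper names are hypothetical; none of it reproduces the analyses in the studies cited above.

```python
import random
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def quintile(rank, n):
    """Map a 1-based rank among n teachers to a quintile from 1 to 5."""
    return min(5, int((rank - 1) * 5 // n) + 1)


def share_changing_quintile(x, y):
    """Share of teachers placed in different quintiles by the two sets of estimates."""
    n = len(x)
    changed = sum(
        quintile(a, n) != quintile(b, n) for a, b in zip(ranks(x), ranks(y))
    )
    return changed / n


# Simulated data: one shared "signal" per teacher plus model-specific noise.
rng = random.Random(0)
signal = [rng.gauss(0, 1) for _ in range(200)]
model_a = [s + rng.gauss(0, 0.5) for s in signal]
model_b = [s + rng.gauss(0, 0.5) for s in signal]

print(f"Rank correlation: {spearman(model_a, model_b):.2f}")
print(f"Share changing quintile: {share_changing_quintile(model_a, model_b):.2f}")
```

The reason to look at both numbers is that a fairly high correlation can coexist with a non-trivial share of teachers moving between categories, and it is the categories that tend to matter when stakes are attached.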

The selection of a model is a tough issue. There are not necessarily “right” or “wrong” choices, but rather trade-offs (and the same basic point applies to other measures, including classroom observations). Still, relatively few of the states and districts that are implementing new evaluation systems have shown much interest in directly comparing alternative models and reporting their results publicly (which, again, is important). Moreover, many states are choosing models whose estimates are known to be associated with student characteristics such as free/reduced-price lunch eligibility, and in some cases doing so based on absurd arguments.

Checking whether different tests or, at the very least, different models produce different results may seem burdensome, particularly to state and local education agencies coping with scarce resources, but these costs are worthwhile given the importance of making an informed choice, and the potential for these choices to shape the impact of new evaluations and other policies that use value-added estimates. And this issue seems particularly relevant going forward, as states transition to new Common Core-aligned assessments. Some states, to their credit, are delaying or pausing on attaching stakes to these results (hopefully to give themselves time to examine the questions above).

The third and final takeaway from the ASA statement is the fact (which is perhaps obvious but is also sometimes underemphasized) that "it is unknown how full implementation of an accountability system incorporating test-based indicators, such as those derived from VAMs, will affect the actions and dispositions of teachers, principals and other educators."

I sometimes get the feeling, perhaps unfairly, that VAM proponents think this concern is trivial. If so, that is incorrect. As uncomfortable as this may be, to the degree jobs and compensation ride on growth model estimates (or, for that matter, any other measurable outcome), there will be incentives for teachers and administrators to alter their practices in ways that will influence what these measures are telling us. Moreover, the use of testing results to evaluate, hire and compensate teachers may affect, for better or worse, the labor market behavior of teachers (and those considering the profession).

It's still too early to get a good sense of the situation, but the potential for unintended behavioral consequences should at least be acknowledged and, hopefully, monitored. Going forward, states and districts that use these estimates in high-stakes decisions will have to keep a close eye on how various stakeholders respond, and consider possible ways to address any problems should they arise. This is perhaps the most important unanswered question regarding the role of value-added in educational accountability systems.

***

Overall, then, what the ASA has done is less a statement of “yes/no” policy preference regarding value-added than a recognition of these models’ strengths and weaknesses in the context of accountability systems, and the presentation of a set of sensible recommendations and cautions for how they might be used.

- Matt Di Carlo

*****

* This, of course, is not to say that I agreed with everything in the statement. For example, it contains the assertion that "VAMs typically measure correlation, not causation." With all due respect to these authors, I thought this was an overstatement.

Good stuff.

I agree with your #3 in particular. How can that be figured out?

Would you support implementing a teacher evaluation system as an RCT, precisely to measure "the actions and dispositions of teachers, principals and other educators"?

I.e., does teacher attrition rise or fall; what about morale via survey; what about total student achievement; what about quality of new teachers attracted (or not) to district?

Matt,
For me, this was the big takeaway. I am surprised you did not mention it.

"VAMs should be viewed within the context of quality improvement, which distinguishes aspects of quality that can be attributed to the system from those that can be attributed to individual teachers, teacher preparation programs, or schools. Most VAM studies find that teachers account for about 1% to 14% of the variability in test scores, and that the majority of opportunities for quality improvement are found in the system-level conditions. Ranking teachers by their VAM scores can have unintended consequences that reduce quality."

If you think VAM is potentially useful, isn't the burden of proof on VAM supporters, not critics? Have you seen _any_ district avoid the serious issues you caution against?