A Few Points About The Instability Of Value-Added Estimates

One of the most frequent criticisms of value-added and other growth models is that they are "unstable" (or, more accurately, modestly stable). For instance, a teacher who is rated highly in one year might very well score toward the middle of the distribution – or even lower – in the next year (see here, here and here, or this accessible review).

Some of this year-to-year variation is “real.” A teacher might get better over the course of a year, or might have a personal problem that impedes their job performance. In addition, there could be changes in educational circumstances that are not captured by the models – e.g., a change in school leadership, new instructional policies, etc. However, a great deal of the recorded variation is actually due to sampling error, or idiosyncrasies in student testing performance. In other words, there is a lot of “purely statistical” imprecision in any given year, and so the scores don’t always “match up” so well between years. As a result, value-added critics, including many teachers, argue that it’s not only unfair to use such error-prone measures for any decisions, but that it’s also bad policy, since we might reward or punish teachers based on estimates that could be completely different the next year.

The concerns underlying these arguments are well-founded (and, often, casually dismissed by supporters and policymakers). At the same time, however, there are a few points about the stability of value-added (or lack thereof) that are frequently ignored or downplayed in our public discourse. All of them are pretty basic and have been noted many times elsewhere, but it might be useful to discuss them very briefly. Three in particular stand out.

Even a “perfect” value-added measure would not be perfectly stable. Let’s say that all students were randomly assigned to schools and classes, or that some brilliant econometrician devised the “perfect” value-added model. In other words, suppose we could calculate unbiased estimates of teachers’ causal impact on their students’ test performances. Unfortunately, these scores could still be rather unstable between years, not only because "true" teacher performance does actually vary over time, but also because students, for a variety of reasons (e.g., distractions, sickness, etc.), can take the same test twice and get different scores. Larger samples and statistical adjustment techniques can help “smooth out” this imprecision, but they cannot eliminate it. One implication here is that there's a “ceiling” on the consistency of value-added estimates (and other measures). We should keep that in mind when interpreting measures of consistency, such as year-to-year correlations.

More stable does not always mean better. This is kind of the flip side of the first point. Just as unbiased estimates might still be unstable, more stable scores are not necessarily less biased. If a time-strapped principal gives the same rating to all her teachers over a two-year period, those scores will be perfectly stable. But they’re probably not very accurate. Similarly, if teachers are routinely assigned higher- or lower-performing students, and the model doesn't pick up on that, this can generate what appears to be "real" stability in their scores. In other words, the estimates are more consistent, but only because they are consistently biased (see Bruce Baker's discussion). Thus, we should be very careful not to base our opinions about any performance measure’s utility just on precision or stability. There is a difference between precise measurements (“reliability”) and measurements that are telling us what we want to know (“validity”). Both are necessary, neither is sufficient.

Alternative measures are also noisy and fluctuate between years. Measuring performance is a difficult endeavor in any context, but it is particularly tough when you’re dealing with something as complicated and elusive as good teaching. This is evident in the fact that classroom observation scores, even when done carefully and correctly, are also only modestly stable, and can vary quite a bit by evaluator. Needless to say, the fact that noise is everywhere doesn’t mean that we should just ignore it. It does, however, mean that any reliability-based criticism of value-added must also address the reliability of alternatives. If instability is a dealbreaker, you'll have a tough time finding anything suitable.

Again, the precision of value-added – or of any measure – is an extremely important consideration. You don’t want to reward or punish teachers based on scores that are completely off-base and very different the next year. However, one must keep in mind that some instability is to be expected (and it's not always bad), that each measure’s imprecision must be viewed in relation to the alternatives, and that there is a difference between reliability and validity.

- Matt Di Carlo


Is teacher evaluation an art or a science? Any VAM score will be subject to statistical "noise," including instability, and the traditional supervisory assessment may vary from rater to rater ... it is an imperfect system. Teachers are dismissed solely on supervisory assessments ... is adding VAM into a multiple measures mix a plus or a con?


One result of Obama's decision to circumvent legislation to change or reauthorize NCLB: there was never a public hearing on the use of VAM, which is one of the keystones of RTTT. I hope that at some point Congress will have hearings on the elements of RTTT so the public can be made aware of the instability of these measures. Absent a full public debate, the mainstream media is reporting that "teachers don't want to be measured using student test scores."