Does It Matter How We Measure Schools' Test-Based Performance?

In education policy debates, we like the "big picture." We love to say things like “hold schools accountable” and “set high expectations.” Much less frequent are substantive discussions about the details of accountability systems, but it’s these details that make or break policy. The technical specs just aren’t that sexy. But even the best ideas with the sexiest catchphrases won’t improve things a bit unless they’re designed and executed well.

In this vein, I want to recommend a very interesting CALDER working paper by Mark Ehlert, Cory Koedel, Eric Parsons and Michael Podgursky. The paper takes a quick look at one of these extremely important, yet frequently under-discussed details in school (and teacher) accountability systems: The choice of growth model.

When value-added or other growth models come up in our debates, they’re usually discussed en masse, as if they’re all the same. They’re not. It's well-known (though perhaps overstated) that different models can, in many cases, lead to different conclusions for the same school or teacher. This paper, which focuses on school-level models but might easily be extended to teacher evaluations as well, helps illustrate this point in a policy-relevant manner.

The analysis looks at three general types of growth models, and it’s worth describing them briefly.

  1. Student growth percentiles (SGPs): This is the simplest specification of the three. The model basically compares each student’s test score increases (in percentiles) to those of their peers who started out at similar percentiles, with no control variables for student characteristics (e.g., income).
  2. One-step value-added (VAM): Like the SGPs, this model controls for prior student performance (in a different manner, but that’s not worth elaborating for our purposes), but also incorporates additional controls for student and school characteristics, as well as performance in other subjects.
  3. Two-step VAM: This is a two-stage model. The best way to describe it quickly is to say that the data are first “purged” of all differences in performance by school and student characteristics before the actual school effects are estimated. In other words, the model attempts to “level the playing field” by design, such that any test score differences between students or schools that can be attributed to measured characteristics such as income are backed out of the data.

As might be expected, these models produce different results (the first two types are in common use, while the third is rare).
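
To make the "purging" idea concrete, here is a minimal sketch in Python. It is not the specification used in the paper: the data are made up, the only covariate is a hypothetical school poverty rate, and the outcome is a simple test-score gain rather than a model conditioned on prior scores. It compares an unadjusted growth measure (each school's mean gain relative to the overall mean) with a stylized two-step measure (regress gains on poverty first, then average the residuals by school).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (school, school poverty rate, student test-score gain).
# Schools A and C are high-poverty; B and D are low-poverty.
STUDENTS = [
    ("A", 0.9, 4), ("A", 0.9, 5), ("A", 0.9, 6),
    ("C", 0.9, 1), ("C", 0.9, 2), ("C", 0.9, 3),
    ("B", 0.1, 7), ("B", 0.1, 8), ("B", 0.1, 9),
    ("D", 0.1, 10), ("D", 0.1, 11), ("D", 0.1, 12),
]


def school_means(records):
    """Average (school, value) pairs by school."""
    grouped = defaultdict(list)
    for school, value in records:
        grouped[school].append(value)
    return {school: mean(values) for school, values in grouped.items()}


def unadjusted_growth(students):
    """Mean gain per school, relative to the overall mean gain (no controls)."""
    overall = mean([gain for _, _, gain in students])
    return school_means([(s, gain - overall) for s, _, gain in students])


def two_step_growth(students):
    """Step 1: regress gains on poverty. Step 2: average residuals by school."""
    xs = [poverty for _, poverty, _ in students]
    ys = [gain for _, _, gain in students]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    residuals = [(s, gain - (intercept + slope * poverty))
                 for s, poverty, gain in students]
    return school_means(residuals)


raw = unadjusted_growth(STUDENTS)
adjusted = two_step_growth(STUDENTS)
for school in sorted(raw):
    print(school, round(raw[school], 2), round(adjusted[school], 2))
```

In this toy data, high-poverty school A lands below average on the unadjusted measure but above average once poverty is purged, and low-poverty school B flips the other way. That is by construction; the point is only to show the mechanics, not to suggest how large such reversals are in real data.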

For example, the estimates from the SGPs, which do not control for student characteristics (though their omission is a choice, not a requirement), are most strongly associated with those characteristics – e.g., growth scores tend to be higher among higher-income schools. (You can find the same types of results in other states using SGPs, such as Colorado, as well as in places using cruder alternatives, such as Florida and Louisiana.)

Such associations are much less prevalent in the one-step VAMs, which directly control for measurable characteristics, and they are largely erased in the two-step VAMs, which, by design, purge them from the data before actually calculating school effects.
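
Why does the two-step approach erase the association "by design"? A small sketch, again with made-up numbers and a single hypothetical covariate (school poverty rate), shows the mechanics: residuals from a least-squares fit are uncorrelated with the variable that was fit, so whatever correlation the raw growth scores had with poverty disappears from the adjusted ones.

```python
from math import sqrt
from statistics import mean

# Hypothetical school-level data: poverty rate and unadjusted mean growth.
POVERTY = [0.1, 0.3, 0.5, 0.7, 0.9]
RAW_GROWTH = [9.0, 8.0, 5.0, 6.0, 2.0]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def residualize(xs, ys):
    """Return residuals from a simple least-squares regression of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


adjusted_growth = residualize(POVERTY, RAW_GROWTH)
corr_raw = pearson(POVERTY, RAW_GROWTH)            # strongly negative here
corr_adjusted = pearson(POVERTY, adjusted_growth)  # zero up to rounding error
print(round(corr_raw, 3), round(abs(corr_adjusted), 3))
```

Note what this does and does not show. The zero correlation is a mathematical property of the adjustment, not evidence that high- and low-poverty schools are equally effective.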

There are two basic ways to interpret these results.

The first is to ask which models provide the best approximation of “true” school performance. In other words, put simply, which model is “correct”?

This is an exceedingly difficult question to answer, since we can’t really measure “true” performance. For example, it’s possible that an association between school growth scores and, say, school poverty is actually due to the fact that higher-poverty schools are less effective, on average (and there are reasons, such as resources and teacher recruitment/turnover, to believe this may be the case). In this sense, the one- and especially the two-step VAMs could be “overcorrecting” for student characteristics.

The second, more useful way to view the results is to ask which model is most appropriate in the context of an accountability system (e.g., as part of a school rating system).

This paper addresses this question using a nice, accessible illustration, one which draws on the issues I’ve been discussing for a while (see here, for instance). The authors present an example of two schools – one of them is high-poverty, the other is low-poverty.

The high-poverty school receives below-average growth scores from the SGP and one-step VAM models, but gets relatively high scores from the two-step VAM model (because the latter thoroughly accounts for school-level poverty).

However, from a different angle, all three models yield one common interpretation: This school does very well compared with other schools that serve similar students (i.e., similar poverty rates).
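
The "compared with similar schools" reading can be sketched in a few lines. The schools, poverty bands and growth scores below are all hypothetical, and the two-band split is a crude stand-in for "similar poverty rates"; the sketch simply ranks one school statewide and then again among its peers only.

```python
# Hypothetical schools: (name, poverty band, growth score from an unadjusted model).
SCHOOLS = [
    ("X", "high", -0.5), ("H2", "high", -1.5),
    ("H3", "high", -2.0), ("H4", "high", -1.0),
    ("L1", "low", 1.0), ("L2", "low", 2.0),
    ("L3", "low", 0.5), ("L4", "low", 1.5),
]


def rank(target, schools):
    """1-based rank of `target` by growth score among `schools` (1 = highest)."""
    score = next(growth for name, _, growth in schools if name == target)
    return 1 + sum(1 for _, _, growth in schools if growth > score)


def peers(target, schools):
    """All schools in the same poverty band as `target`, including itself."""
    band = next(b for name, b, _ in schools if name == target)
    return [school for school in schools if school[1] == band]


statewide_rank = rank("X", SCHOOLS)
peer_rank = rank("X", peers("X", SCHOOLS))
print(f"School X: {statewide_rank} of {len(SCHOOLS)} statewide, "
      f"{peer_rank} of {len(peers('X', SCHOOLS))} among high-poverty schools")
```

The same below-average statewide score looks very different once the comparison group is restricted to schools facing similar circumstances.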

But you would of course never know this from a typical school rating system that used the SGP or one-step VAM (or even necessarily from correlations between the measures). These models would assign a low growth rating to this school. As a result, it might end up being closed or restructured. Or, it might respond to the low rating by making drastic, counterproductive changes to instruction or other policies. And, of course, it might be more difficult to recruit good teachers.

All these outcomes might come about despite the fact that whatever the school is doing actually seems to be working. That is, it is producing better-than-expected growth from its students vis-à-vis other schools in similar circumstances.

(The converse situation applies to the second, low-poverty school, which receives high marks from the SGP and one-step VAM, but relatively low scores from the two-step VAM.)

For these reasons, the authors of this paper argue that the two-step VAM model, which “purges” all these associations from the data, is the preferred choice.*

So, the takeaway here is that the choice of model is not some side issue. It is important, and, since we can't say which model is "right" or "wrong," we should assess them in terms of the signals they send and how people react to those signals, with particular attention to the fact that schools in different circumstances face different challenges.

And, while I fully acknowledge that I may be wrong here, it's not clear that states are considering these issues when choosing models (e.g., the SGPs proliferating throughout the nation). If they are not, this reflects a common problem in education reform – the tendency to rush toward implementation, ignoring crucial details in the process. The call for accountability is easy; the practice of accountability is much more difficult.

- Matt Di Carlo


* There are, as always, trade-offs here, and they relate to the issue of growth versus absolute performance, which I have discussed a couple of dozen times. For instance, as the authors discuss, the first-stage purging of impacts associated with school and student characteristics might actually “hide” lower performance at some schools. They explain that such concerns are misguided, and might be remedied by the fact that states can (and do) still report absolute performance (i.e., not growth, but levels, such as proficiency rates). I agree, and would actually take this a step further by saying that absolute performance might not only be reported, but even used in some kinds of decisions (e.g., resource allocation), whereas growth is to be strongly preferred for other, more drastic interventions (e.g., closure, restructuring).