The Uncertain Short-Term Future Of School Growth Models

Over the past 20 years, public schools in the U.S. have come to rely more and more on standardized tests, and the COVID-19 pandemic has halted the flow of these data. This is hardly among the most important disruptions that teachers, parents, and students have endured over the past year or so. But skipping a year (or more) of testing has implications for estimating growth models, which are statistical approaches for assessing the association between students' testing progress and their teachers, schools, or districts.

This type of information, used properly, is always potentially useful, but it may be particularly timely right now, as we seek to understand how the COVID-19 pandemic affected educational outcomes, and, perhaps, how those outcomes varied by different peri-pandemic approaches to schooling. This includes the extent to which there were meaningful differences by student subgroup (e.g., low-income students who may have had more issues with virtual schooling). 

To be clear, the question of when states should resume testing should be evaluated based on what’s best for schools and students, and in my view this decision should not include consideration of any impact on accountability systems (the latest development is that states will not be allowed to cancel testing entirely but may be allowed to curtail it). In either case, though, the fate of growth models over the next couple of years is highly uncertain. The models rely on tracking student test scores over time, and so skipping a year (and maybe even more) is obviously a potential problem. A new working paper takes a first step toward assessing the short-term feasibility of growth estimates (specifically school and district scores). But this analysis also provides a good context for a deeper discussion of how we use (and sometimes misuse) testing data in education policy.

The paper, which is written by Ishtiaque Fazlul, Cory Koedel, Eric Parsons, and Cheng Qian, was published last month by CALDER. (As always, there is a lot more in the paper/report than I'll be discussing here, so I encourage you to read it in full.)

The purpose of the analysis is to approximate the impact of a "gap year" in testing on growth scores. Before discussing the findings, let's clarify a couple of things. First, this paper focuses exclusively on growth scores for schools and districts, and not for individual teachers. Teacher scores are not really on the table here. In addition, to reiterate, this paper and the discussion below are not about when testing should resume, only what would happen under different scenarios (nor is the paper really about whether test results, whenever they resume, should be used in high-stakes accountability systems, an issue discussed below).

That said, the authors' approach to assessing the effect of a gap year on school and district scores is relatively simple. They put together a dataset of Missouri test scores from multiple pre-pandemic school years (e.g., the spring semesters of 2017, 2018, and 2019). First, they estimate school and district growth scores using all the years (the "all is normal" scenario). Then they perform a "simulation" of sorts by removing the middle year (2018) and calculating value-added scores for the two-year period between 2017 and 2019. Finally, they compare the "simulated" scores to the "all is normal" scores. The idea is to assess the degree to which foregoing the middle year of information changes the results.
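To make the logic concrete, here is a minimal sketch of that comparison. It is not the authors' code or their actual model; it uses synthetic data and a simple lagged-score regression with school fixed effects, estimating 2019 school growth once with the 2018 prior score ("all is normal") and once with the 2017 prior score (the simulated gap year), and then correlating the two sets of estimates. It illustrates the comparison procedure, not the paper's results.

```python
# Minimal sketch (synthetic data, not the paper's model) of the gap-year comparison.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical panel: 50 schools x 100 students, with scores for 2017-2019.
n_schools, n_students = 50, 100
school = np.repeat(np.arange(n_schools), n_students)
school_effect = np.repeat(rng.normal(0, 0.2, n_schools), n_students)
ability = rng.normal(0, 1, n_schools * n_students)

df = pd.DataFrame({
    "school": school,
    "y2017": ability + rng.normal(0, 0.3, ability.size),
    "y2018": ability + school_effect + rng.normal(0, 0.3, ability.size),
    "y2019": ability + 2 * school_effect + rng.normal(0, 0.3, ability.size),
})

def school_growth(data, prior_year):
    """School fixed effects from a simple lagged-score model: y2019 ~ prior score + school dummies."""
    fit = smf.ols(f"y2019 ~ {prior_year} + C(school) - 1", data=data).fit()
    return fit.params.filter(like="C(school)")

normal = school_growth(df, "y2018")     # one-year growth, no gap
simulated = school_growth(df, "y2017")  # two-year growth, 2018 skipped

print("Correlation of school growth estimates:",
      np.corrcoef(normal, simulated)[0, 1])
```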

What does the paper find?

In short, the missing year has some impact, but it’s not huge. The "simulated" and "normal" scores for the two-year period between 2017 and 2019 are highly correlated (about 0.90 for districts, and 0.85 or so for schools). These results are quite consistent across different model specifications, and for both math and ELA. (Note also that the district correlations are probably pushed down a bit by the fact that Missouri districts tend to be small compared to their counterparts in other states.) 

This suggests that, in theory, the inconsistency caused by the gap year does not appear to reach prohibitive proportions.

Unfortunately, as Fazlul and colleagues note, there is theory and there is practice. Even if testing resumes, this "simulation" may not resemble what actually ends up happening in many states and districts. For example, it's not clear that all students (in tested grades and subjects) will actually be tested, and there may be differences between the tested and non-tested student groups that affect results. In addition, states and districts may do some in-person testing and some remote testing, and there's not much research on what this does to growth model estimates. What this paper shows is that growth model estimates may be feasible with a gap year under normal conditions. The degree to which reality will conform is unknown (and states that use different models from those used in this paper may have different results as well).

And, of course, there's the question: what happens if testing doesn't resume until 2022 (or resumes such that the data are unfit for growth model estimation in 2021)? This would be a huge problem for the models in purely "logistical" terms, as an even larger proportion of the students who tested in the baseline year (2019) will be gone by 2022 (e.g., they will have moved on to middle or high school). This basically means that growth scores for schools would present prohibitive challenges even under "ideal" conditions.

District scores, in contrast, may be possible, since promotion-based mobility is less of an issue when pooling students across an entire district. Fazlul et al. simulate a two-year gap by expanding their dataset to four years (2016 to 2019) and removing two gap years (2017 and 2018). They find that the correlations between the "all is normal" and "simulated" scores (about 0.80) are still rather high by conventional standards, but lower than they are with a one-year gap. 
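Continuing the hypothetical sketch above (and with the same caveats), the two-year-gap variant is just a change of baseline: add a 2016 score to the synthetic panel and use it as the prior year, skipping both 2017 and 2018.

```python
# Two-year-gap variant of the earlier sketch (same synthetic setup and model).
df["y2016"] = ability - school_effect + rng.normal(0, 0.3, ability.size)
two_year_gap = school_growth(df, "y2016")   # skips both 2017 and 2018
print("Correlation, two-year gap:", np.corrcoef(normal, two_year_gap)[0, 1])
```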

What do these results mean?

Let's say for the sake of argument that all students are tested this year, at least in some states (and, even better, that further analyses in these states generally confirm the results of this paper). What might be the implications for education policy?

In the more contentious sub-arenas of our education debate, the proverbial brass tacks here are whether the school and district scores can be used in high-stakes accountability systems. At this point, for whatever it’s worth, I personally am a no. This is in part because of the potential theory/practice issues mentioned above. Perhaps more importantly, though, after the past year, my (atypically non-technocratic) inclination is that going back to pre-pandemic accountability systems in the reopening year is questionable under any circumstances. Even if test-based accountability systems were perfectly designed (they are not), there is too much uncertainty and too many other priorities to rush back to business as usual. Schools and districts have enough to deal with already.

But, agree or disagree, I would also say that this is a rather short-sighted and narrow takeaway. To the degree people react to these results by focusing reflexively on their implications for formal accountability systems, it's because the test-based accountability debate fosters and reflects our limited view of the potential role of growth models in education (and I include myself in that). 

Certainly, the accountability applications of growth models (and test scores in general) are important. And, for the purposes of gauging school/district performance, growth models, though necessarily imperfect, are far superior to the status measures, such as simple proficiency rates, that still play a huge role in virtually all states’ systems. But growth models have other uses that are arguably no less important. 

In an accountability context, we tend to view growth scores through a narrowly causal lens, but at their core growth models are descriptive. They can tell us, for example, whether students in a given school or district made strong progress relative to similar peers in similar schools and districts. This, in turn, can lead us to the equally important question: why? What did the high-growth schools and districts do differently, and can we test these policies further?

Unless you think that standardized tests tell us absolutely nothing about students' knowledge of a given content domain, growth model estimates, when used correctly, are probably the best available means of answering these questions at scale. That includes evaluating the pandemic's academic impact at any level of aggregation, as well as how that impact was mediated by state and district policy.

School and district accountability systems are necessary, and I personally think standardized tests should play an important role in those systems, even if I often disagree with the details of how that is done in most states. But I can also understand why some people, including a lot of teachers, oppose test-based accountability, and it’s even easier to fathom why many educators would not trust test-based accountability measures of any type after the profound upheaval caused by the COVID-19 pandemic. And, like it or not, that is important; accountability systems are unlikely to work if the measures aren’t viewed as credible. 

(Side note: I would also say that the extreme controversy about using teacher growth scores is often unproductively conflated with the debate about using school and district scores, which, used properly, are more precise and less problematic than their teacher-level analogues.)

The next year or two may be an opportunity to use growth models, and testing data more generally, in a productive manner that can be widely appreciated as such, perhaps even by testing skeptics. I wouldn't put money on it, but the opening is there.

There is a flip side here, though: it would be a terrible idea to rush or force the resumption of testing this year strictly to enable fitting growth models (let alone for inferior accountability measures, such as proficiency rates). Testing should be a servant of the policymaking process, not the other way around.
