Do Subgroup Accountability Measures Affect School Ratings Systems?

The school accountability provisions of No Child Left Behind (NCLB) institutionalized a focus on the (test-based) performance of student subgroups, such as English language learners, racial and ethnic groups, and students eligible for free- and reduced-price lunch (FRL). The idea was to shine a spotlight on achievement gaps in the U.S., and to hold schools accountable for serving all students.

This was a laudable goal, and disaggregating data by student subgroups is a wise policy, as there is much to learn from such comparisons. Unfortunately, however, NCLB also institutionalized the poor measurement of school performance, and so-called subgroup accountability was not immune. The problem, which we’ve discussed here many times, is that test-based accountability systems in the U.S. tend to interpret how highly students score as a measure of school performance, when it is largely a function of factors out of schools' control, such as student background. In other words, schools (or subgroups of students within them) may exhibit higher average scores or proficiency rates simply because their students entered the schools at higher levels, regardless of how effective the school may be in raising scores. Although NCLB’s successor, the Every Student Succeeds Act (ESSA), perpetuates many of these misinterpretations, it still represents some limited progress, as it encourages greater reliance on growth-based measures, which look at how quickly students progress while they attend a school, rather than how highly they score in any given year (see here for more on this).

Yet this evolution, slow though it may be, presents a somewhat unique challenge for the inclusion of subgroup-based measures in formal school accountability systems. That is, if we stipulate that growth model estimates are the best available test-based way to measure school (rather than student) performance, how should accountability systems apply these models to traditionally lower scoring student subgroups?

The easy answer, put simply, is to produce growth estimates for the traditional NCLB-style student subgroups. For example, one might assess whether low scoring students, racial and ethnic minority students, or FRL-eligible students are making strong progress, and include those measures as components of a school’s final rating.

Several states already do this (one is discussed below), and more are sure to follow under the new systems mandated by ESSA. So, this is an answer, but is it a good answer?

Needless to say, there are several important issues involved in addressing this question, including sample size (subgroup samples are often small, which leads to imprecise estimates), incentives (e.g., does focusing on, say, low scoring students come at the expense of higher scoring students?), and the choice of which groups to include.

Even more basic, though, is the very simple question of what these subgroup growth indicators look like in practice. For example, how much variation is there between schools’ overall growth scores (i.e., those that apply to all students) and their growth scores for subgroups? Put differently, how many schools are effective at boosting test scores of all students but not effective at doing so for particular groups of students, or vice-versa?

Let’s take a quick (but admittedly incomplete) look at this issue using data from Colorado’s school accountability system, which includes a component for schools’ overall growth score, as well as one for growth among students who need to catch up (i.e., students below the proficiency threshold).*

In the simple scatterplot below, each red dot is a Colorado school. On the vertical axis are schools’ growth scores for students who are below the proficiency line, while plotted on the horizontal axis are scores for all students at the school.

You can see that there is a rather tight relationship between these two measures. Schools that score highly on growth among all students tend to receive similarly high scores for growth among non-proficient students, whereas schools that receive low scores on one measure tend to receive low scores on the other. The correlation coefficient here is 0.86, which, given how noisy these estimates are, is not so far away from the maximum one could reasonably expect.
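For readers who want to see what a correlation coefficient like this summarizes, here is a minimal sketch of the calculation. The six pairs of growth scores are made up for illustration and are not Colorado data, and the function name `pearson_r` is simply a label chosen here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical growth scores for six schools (NOT actual Colorado data).
overall_growth = [42, 48, 50, 55, 61, 66]   # all students
catch_up_growth = [40, 50, 47, 58, 59, 68]  # students below proficiency

r = pearson_r(overall_growth, catch_up_growth)
print(round(r, 2))
```

A value near 1 means schools are ordered almost identically on the two measures; a value near 0 means that knowing one score tells you little about the other.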

So, while there are exceptions, in general, there are very few Colorado schools that compel strong testing growth from all their students but poor growth from their low scoring (i.e., non-proficient) students, and vice-versa.

And this is even more true when it comes to other subgroups, such as students eligible for free and reduced-price lunch (FRL). That relationship is presented in the second scatterplot, below.

As you can see, there is an even tighter relationship between schools’ growth scores for all students and their scores for FRL-eligible students (the correlation coefficient is 0.91). Schools that do well on one are highly likely to do well on the other.

It is, therefore, possible that the sometimes contentious arguments about whether to include subgroup growth measures in school accountability systems may not make a whole lot of difference in the final results.

Now, we need to be clear about a few things here. First, any subgroup measure is prone to being at least partially redundant with an overall (i.e., all students) measure of the same type, since the former are by definition part of the latter. Second, the fact that two measures in an accountability system are correlated is not by itself a cause for serious concern; indeed, we might expect to find relationships between components, since (we hope) they are picking up on underlying school performance. Third and finally, of course, this is just one state, and just one entirely test-based indicator.
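The first caveat, that a subgroup score is mechanically part of the overall score, can be illustrated with a rough simulation. The sketch below rests on several simplifying assumptions that are not drawn from Colorado's system: each school's overall score is a weighted mean of a subgroup score and a score for everyone else, the two are drawn independently (so a school's success with one group says nothing about its success with the other), and the subgroup makes up half of each school. All names and parameter values are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def simulate_part_whole(n_schools=5000, subgroup_share=0.5, seed=0):
    """Correlation between subgroup and overall scores when the subgroup's
    growth is unrelated to the growth of the school's other students."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    subgroup, overall = [], []
    for _ in range(n_schools):
        sub = rng.gauss(50, 10)   # hypothetical subgroup growth score
        rest = rng.gauss(50, 10)  # other students, drawn independently
        subgroup.append(sub)
        # Assumption: overall score is a share-weighted mean of the two.
        overall.append(subgroup_share * sub + (1 - subgroup_share) * rest)
    return pearson_r(subgroup, overall)

r_sim = simulate_part_whole()
print(round(r_sim, 2))
```

Under these assumptions the expected correlation is p / sqrt(p^2 + (1 - p)^2), where p is the subgroup's share of students, or about 0.71 when p is 0.5. In other words, a sizable correlation can arise from overlap alone, which is worth keeping in mind when reading the 0.86 and 0.91 figures above, even though the observed values are higher than this simple benchmark.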

All that said, the debates about the use of subgroup-based measures in accountability systems seem to assume the existence of schools that are, whether consciously or otherwise, underserving some groups of students but not others. This is unquestionably true in at least some cases, but it may in fact be quite rare, at least judging by the tools we have available for measuring schools’ test-based impact.

(And it is not entirely clear why or how schools would systematically be providing inferior instruction to some students but not others based on observable characteristics such as race and ethnicity or parental income.**)

In any case, this is not intended as an argument against using subgroup growth measures. At the very least, they are far superior to the NCLB-style subgroup status measures that currently dominate accountability systems, as well as to simple achievement gap or "gap closing" indicators. Moreover, including subgroup growth measures may alter the relationship between subgroup and overall growth estimates. But the "redundancy" illustrated above is certainly relevant when thinking about how the inclusion of subgroup growth will affect not only the final results of accountability systems, but also the behavior of teachers and administrators, as such behavioral changes are the primary purpose of these systems in the first place.


* Florida would be another possible choice here, as their grading system also includes a component for low scoring students. For the reasons described here, however, the design of this measure has been deeply flawed.

** This is more plausible in the case of lower scoring students. For example, under NCLB, schools have an incentive to boost proficiency rates.


Now wait a minute. I am struggling with the idea that growth measures are substantially better than standardized test scores as a measure of student achievement. If standardized test scores are confounded by out-of-school factors, then why are we to assume that growth measures aren't as well, especially when the correlation between the two types of assessment is so high?


NCLB hasn't been the mainstay and useful tactic Bush thought it would be. This DOES sound like an argument against using subgroup growth measures, and I can't comprehend the reasoning behind it. However, super interesting post. It's nice to read about research like this, and about other communities for us educators, such as educationweek and homeroom, that explore education and its undersides. Would like to get your feedback on technology and its effect on testing.