Improving Accountability Measurement Under ESSA
Despite the recent repeal of federal guidelines for states’ compliance with the Every Student Succeeds Act (ESSA), states are steadily submitting their proposals, and they are rightfully receiving some attention. The policies in these proposals will have far-reaching consequences for the future of school accountability (among many other types of policies), as well as, of course, for educators and students in U.S. public schools.
There are plenty of positive signs in these proposals, which are indicative of progress in the role of proper measurement in school accountability policy. It is important to recognize this progress, but impossible not to see that ESSA perpetuates long-standing measurement problems that were institutionalized under No Child Left Behind (NCLB). These issues, particularly the ongoing failure to distinguish between student and school performance, continue to dominate accountability policy to this day. Part of the confusion stems from the fact that school and student performance are not independent of each other. For example, a test score, by itself, gauges student performance, but it also reflects, at least in part, school effectiveness (i.e., the score might have been higher or lower had the student attended a different school).
Both student and school performance measures have an important role to play in accountability, but distinguishing between them is crucial. States’ ESSA proposals make the distinction in some respects but not in others. The result may end up being accountability systems that, while better than those under NCLB, are still severely hampered by improper inference and misaligned incentives. Let’s take a look at some of the key areas where we find these issues manifested.
Status versus growth. ESSA grants states the flexibility to include growth measures in their school rating/dashboard systems. This is a major sign of progress in accountability measurement. Status measures (how highly students score on tests, measured by average scores or proficiency rates) are mostly a function of student background. How much students progress while attending schools, on the other hand, can tell us about schools’ actual effectiveness (at least in raising test scores). That is, status measures primarily gauge student performance, whereas growth measures are best classified as school performance indicators.
If I were designing an accountability system, I would use both, but for different purposes. Status measures can help determine which schools are serving students most in need of catching up, and resources can be directed to those schools. Growth model estimates, on the other hand, can identify schools that are (and are not) serving their students well, so that interventions can be targeted accordingly, perhaps prioritizing schools whose students are most in need of help.
Among those states planning to calculate school ratings, the current trend is a roughly equal weighting of status and growth (along with subgroup-specific and a few additional measures). This seems to reflect a mindset of “mixing different things together to get a comprehensive rating.” This is a bit misguided, as the components of such a system don’t all measure the same underlying dimension. Mixing status and growth into a school performance measure is like rating the efficacy of health and fitness programs by mixing participants’ current weight and the weight they have lost. The former tells you something about people’s health, the latter about whether the program helped them lose weight. Both are important, but you probably wouldn’t combine them to evaluate the programs.
Again, this is better than the NCLB-style system, which was based almost entirely on status. Nevertheless, to the extent status is included in a single rating system, schools will continue to be rewarded or penalized based on the students they serve, not how well they serve them (see this recent article for more on this).
One can only cringe, for example, at how a highly effective inner-city school, where students make tremendous progress every year, would react to receiving a low rating simply because its students enter so far behind. This school would be ignored or perhaps even punished when it should be celebrated and copied.
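To make the point concrete, here is a minimal sketch with two made-up schools. The scores, the school labels, and the 50/50 weights are all hypothetical, and the simple average gain is only a crude stand-in for a real growth model estimate (actual rating systems would also put the components on a common scale before weighting them).

```python
from statistics import mean

# Hypothetical data: (prior-year score, current-year score) for each student.
schools = {
    "School A (high-scoring intake)": [(80, 82), (85, 86), (90, 91)],
    "School B (low-scoring intake)": [(40, 52), (45, 58), (50, 61)],
}


def status(pairs):
    """Average current-year score: largely reflects which students enroll."""
    return mean(current for _, current in pairs)


def growth(pairs):
    """Average one-year gain: a crude proxy for a growth model estimate."""
    return mean(current - prior for prior, current in pairs)


def composite(pairs, w_status=0.5, w_growth=0.5):
    """Blend status and growth into one rating (hypothetical equal weights)."""
    return w_status * status(pairs) + w_growth * growth(pairs)


for name, pairs in schools.items():
    print(f"{name}: status={status(pairs):.1f}, "
          f"growth={growth(pairs):.1f}, composite={composite(pairs):.1f}")
```

In this toy example, School B's students gain far more over the year, yet the blended rating still ranks it below School A, because the status component swamps the growth component.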
Another big problem here is high schools. Since many high schools serve only one or two tested grades, and many others serve none at all, ratings systems for these schools tend to rely on measures such as graduation rates, which are status measures with the same implications as proficiency rates or average scores. From this perspective, then, whatever progress there has been in accountability measurement under ESSA will not accrue as much to high schools, the ratings of which will continue to be driven almost entirely by which students they serve.
Achievement gaps and “gap closing.” States also have the option of choosing achievement gap or “gap closing” measures, which measure the size of differences in achievement between subgroups (e.g., racial and ethnic groups), or progress in narrowing these discrepancies, respectively. This is very important information and should be reported. In the context of high stakes accountability systems, however, there is a strong case against the use of these measures, which I will not repeat here. For our purposes, suffice it to say that gaps don’t really tell you anything about the performance of schools, and changes in the gaps can and often do occur for undesirable reasons (e.g., both subgroups decline, but one faster than the other).
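A quick bit of arithmetic shows how a gap can “close” for the wrong reasons. The subgroup labels and average scores below are invented purely for illustration.

```python
# Hypothetical average scores for two subgroups in consecutive years.
year1 = {"higher_scoring": 80.0, "lower_scoring": 60.0}
year2 = {"higher_scoring": 72.0, "lower_scoring": 58.0}


def gap(scores):
    """Difference in average scores between the two subgroups."""
    return scores["higher_scoring"] - scores["lower_scoring"]


gap_change = gap(year2) - gap(year1)  # negative means the gap narrowed
both_declined = all(year2[group] < year1[group] for group in year1)

print(f"Gap in year 1: {gap(year1)}")
print(f"Gap in year 2: {gap(year2)}")
print(f"Change in gap: {gap_change}")
print(f"Both subgroups declined: {both_declined}")
```

A crude “gap closing” measure would credit this school with a six-point improvement, even though both subgroups scored lower than they did the year before.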
So, while achievement gaps are enormously important, and must be monitored constantly, they are in their simplest form not well-suited to play a direct role in high stakes accountability systems. Moreover, there are alternatives. One example would be growth model estimates for subgroups, such as low-scoring students. This would serve the same purpose as crude “gap closing” measures, but do so in a manner that gauges actual school performance (though it is not clear how much schools’ overall effectiveness varies from their effectiveness with subgroups, nor why such variation would occur).
Achievement targets. Perhaps the most well known provision of NCLB was the “requirement” that most schools exhibit 100 percent proficiency within 10 or so years. ESSA allows states to choose their own targets, overall and by subgroup. This is a good thing, at least in theory, even though states that attempt to set realistic goals – those that reflect the fact that real progress is slow and sustained – will likely incur political costs, such as accusations of “low expectations” and “lowering the bar.”
In any case, the problem here is that achievement targets, like proficiency rates and average scores, fail to account for the fact that schools serving higher-achieving students will meet their targets far more easily, whereas other schools will have enormous difficulty doing so. Once again, the information pertains to the students a school serves, not the school’s effectiveness.
Note also that whether or not schools meet achievement targets seems like a growth measure, but it is not. It does not necessarily indicate whether students in a given state, district, or school are making progress. Since students enter and exit the sample every year (those at the lowest and highest tested grades, respectively), year-to-year changes compare different groups of students. Even over the long term, average scores and rates can be flat or decrease even when students are making progress (see the illustration here).
Imagine a middle school that, every year, enrolls a new cohort of seventh graders who are, on average, a year or two behind proficiency (to say nothing of losing a cohort of ninth graders who had been making progress for three years). This school could work miracles and still fail to meet targets. There is a lot to be said for setting goals, and for setting them high. But you can’t really gauge how far you should go if you don’t pay attention to where you started.
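The middle school scenario can be sketched as a small simulation. Everything here is an assumption made for illustration: a school serving grades 7 through 9, equal-sized cohorts, an entry score of 60, a gain of 5 points per student per year, and a target of 80.

```python
ENTRY_SCORE = 60   # hypothetical average score of each incoming 7th-grade cohort
ANNUAL_GAIN = 5    # hypothetical points every student gains each year
TARGET = 80        # hypothetical achievement target for the school average


def simulate(years):
    """Return the school-wide average score at the end of each year."""
    cohorts = []   # each cohort is a (grade, average score) pair
    averages = []
    for _ in range(years):
        # Last year's ninth graders leave; everyone else moves up a grade.
        cohorts = [(grade + 1, score) for grade, score in cohorts if grade < 9]
        # A new seventh-grade cohort enters, again well behind.
        cohorts.append((7, ENTRY_SCORE))
        # Every enrolled cohort makes progress during the year.
        cohorts = [(grade, score + ANNUAL_GAIN) for grade, score in cohorts]
        averages.append(sum(score for _, score in cohorts) / len(cohorts))
    return averages


averages = simulate(6)
steady_state = averages[2:]  # the first two years are start-up, before all three grades are filled
print(steady_state)
print(all(avg < TARGET for avg in steady_state))
```

Once all three grades are filled, the school-wide average does not move at all from one year to the next, and it never reaches the target, even though every student in the building gains five points every year.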
Non-test measures. The ESSA requirement that states include some kind of performance measure that does not rely on state tests is a good example of a policy that enjoys wide agreement on the ends but not the means. In short, everyone agrees that non-test measures are a priority, but there are massive questions about what they should be (this issue deserves a separate post, or, more likely, a book). The reality is that this ESSA provision basically represents a requirement for field testing, but potentially high stakes field testing.
Moreover, the student versus school performance distinction is not just important when it comes to test scores. For example, thus far, many states are opting to fulfill the non-test measure requirement with absenteeism/attendance. By themselves, these are also just status measures (i.e., they tell you more about the students a school serves than about how well it serves those students). School “climate” surveys are also a possibility, but they too may vary systematically, in part, by observed and unobserved factors outside of schools’ control.
There are no easy answers here, and non-test indicators are still something of a measurement frontier. But adopting non-test measures won’t be much of an accomplishment if we use them in the same misguided way we’ve been using test-based measures.
Perfect measurement is not necessarily a requirement for accountability policies to have some positive impact (NCLB being a perfect example). And designing these systems is very difficult. Even if there were a strong consensus on the “correct” measures and how to use them, which most certainly is not the case, measurement issues are far from the only factors at hand. There is plenty of politics in policymaking.
That said, while ESSA is an improvement in terms of accountability measurement, and that should not be dismissed or downplayed, it is still a long way from where we need to be. In many respects, more mistakes have been repeated than corrected.
The purpose of accountability policies is to change behavior productively, and measurement and incentives must be aligned to produce these behavioral changes. This is much harder when measures are misinterpreted.