We Should Only Hold Schools Accountable For Outcomes They Can Control

Let’s say we were trying to evaluate a teacher’s performance for this academic year, and part of that evaluation would use students’ test scores (if you object to using test scores this way, put that aside for a moment). We checked the data and reached two conclusions. First, we found that her students made fantastic progress this year. Second, we also saw that the students’ scores were still quite a bit lower than their peers’ in the district. Which measure should we use to evaluate this teacher?

Would we consider judging her even partially based on the latter – students’ average scores? Of course not. Those students made huge progress, and the only reason their absolute performance levels are relatively low is that they were low at the beginning of the year. This teacher could not control the fact that she was assigned lower-scoring students. All she can do is make sure that they improve. That’s why no teacher evaluation system places any importance on students’ absolute performance, instead focusing on growth (and, of course, non-test measures). In fact, growth models explicitly control for absolute performance (the prior year’s test scores) so that it does not bias the results.
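To make that control concrete, here is a minimal sketch in Python. Everything here is a hypothetical simplification – the data are invented and real growth models use far more elaborate specifications than a one-predictor regression – but the core logic is the same: judge teachers by residual gains, not by score levels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration (hypothetical numbers, not any state's actual model):
# Teacher B is assigned lower-scoring students than Teacher A, but
# produces much more growth during the year.
n = 100
prior_a = rng.normal(70, 8, n)                 # Teacher A: higher-scoring entrants
prior_b = rng.normal(50, 8, n)                 # Teacher B: lower-scoring entrants
current_a = prior_a + 2 + rng.normal(0, 4, n)  # ~2 points of true growth
current_b = prior_b + 8 + rng.normal(0, 4, n)  # ~8 points of true growth

prior = np.concatenate([prior_a, prior_b])
current = np.concatenate([current_a, current_b])

# A bare-bones growth model: regress current scores on prior scores and
# average the residuals by teacher. A residual is the part of a student's
# score not predicted by where that student started.
slope, intercept = np.polyfit(prior, current, deg=1)
residuals = current - (intercept + slope * prior)

growth_a = residuals[:n].mean()
growth_b = residuals[n:].mean()

print(f"Teacher A: avg score {current_a.mean():.1f}, growth estimate {growth_a:+.2f}")
print(f"Teacher B: avg score {current_b.mean():.1f}, growth estimate {growth_b:+.2f}")
# Teacher B comes out ahead on growth despite far lower average scores.
```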

If we would never judge teachers based on absolute performance, why are we judging schools that way? Why does virtually every school/district rating system place some emphasis – often the primary emphasis – on absolute performance?

The response to this question typically goes something like this: We need absolute performance measures (also called status measures) in accountability systems in order to set high expectations or standards for schools. This strikes me as a non sequitur. High expectations are great, but the measures reflecting those expectations still have to be valid.

The bedrock principle of accountability is to hold people and institutions responsible for outcomes they can control. Within an accountability system, rating systems have to measure whether schools and districts are doing a good job controlling what they can control – in a test-based context, that means providing high-quality instruction to their students.

Do absolute performance measures, such as simple proficiency rates or average scores, do this?

Here’s one way to think about it (mentioned briefly in our last post): Roughly speaking, in addition to the inevitable measurement error, a school’s absolute performance level reflects a combination of two factors:

  1. Students' performance levels upon entry into the school;
  2. Their improvement while attending the school.

Schools cannot control the former (which students they serve). It varies widely, and schools only serve students for a few years at most. They can, however, control the latter (whether students improve while enrolled).

So, why not just use the latter – growth – directly? Why would we hold schools accountable for an outcome that is largely out of their hands, when we have the option of isolating (at least approximately) the portion that they actually can control?
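A quick simulation makes the point. All of the numbers below are hypothetical – three invented schools whose outgoing scores are just entering scores plus growth – but they show how the school with the lowest absolute results can be the one producing the most improvement:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy decomposition (all numbers hypothetical): each school's outgoing
# scores = entering scores + growth while enrolled. Schools control only
# the second term.
schools = {
    # name: (mean entering score, mean growth while enrolled)
    "School X": (75, 1),  # high-entry students, weak instruction
    "School Y": (60, 4),
    "School Z": (45, 9),  # low-entry students, strong instruction
}
PROFICIENT = 70  # hypothetical proficiency cut score

results = {}
for name, (entry_mean, growth_mean) in schools.items():
    entry = rng.normal(entry_mean, 10, 500)           # factor 1: who enrolls
    final = entry + rng.normal(growth_mean, 3, 500)   # factor 2: improvement
    results[name] = ((final >= PROFICIENT).mean(), (final - entry).mean())

for name, (prof_rate, avg_growth) in results.items():
    print(f"{name}: proficiency rate {prof_rate:.0%}, avg growth {avg_growth:+.1f}")
# Ranked by proficiency rate, School X looks best; ranked by growth, School Z does.
```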

This is less a matter of fairness than one of good policy. The most effective schools – those that, year after year, produce large testing gains among their students – might be closed or subject to punishments simply because they happen to serve students who were lower-scoring upon entry. And, making things worse, they’re unlikely to be replaced by a superior alternative. That’s a poor outcome, and it cuts both ways: Schools that provide poor instruction to their students may be given a pass because their students were higher-scoring from the outset.

Yet that’s exactly what almost all of these school rating systems are doing. It would be one thing if absolute performance measures counted for only a very small proportion of school ratings (though I personally would still question it). But, with relatively few exceptions, some form of average scores or rates is at least a major, and often the primary, component.*

If we’re going to use testing data to gauge whether schools provide high-quality instruction, the proper way to do so is to use growth measures. Granted, isolating school effects is more difficult, and even the best growth measures are not without drawbacks – e.g., they are imprecise and unstable over time. However, done correctly – using high-quality models and, even better, multiple years of data on a rolling basis – the measures can provide some signal of schools’ test-based effectiveness, overall and perhaps by subgroup as well.
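The stabilizing effect of multiple years of data can be shown with a quick simulation (all values hypothetical): averaging three independent, equally noisy annual estimates of a school's stable growth effect cuts the noise by roughly a factor of √3.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch (hypothetical numbers): a school's true growth effect is stable,
# but each single-year estimate comes with noise. Pooling years helps.
true_effect = 3.0
noise_sd = 2.0
n_sims = 10_000

# One-year estimates vs. rolling three-year averages of such estimates.
one_year = true_effect + rng.normal(0, noise_sd, n_sims)
three_year = true_effect + rng.normal(0, noise_sd, (n_sims, 3)).mean(axis=1)

print(f"One-year estimate SD:  {one_year.std():.2f}")    # ≈ 2.0
print(f"Three-year average SD: {three_year.std():.2f}")  # ≈ 2.0 / sqrt(3) ≈ 1.15
```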

Rating systems that use these estimates combined with non-test measures (e.g., surveys) can help assess what we need them to assess – the quality of education schools provide. And they can do so in a manner that reflects high expectations for improvement. This is what we do for teachers, and the same should go for schools.

One final, very important note: This does not mean absolute performance measures are useless for school-based policy. Far from it. For example, they should be used to direct resources or (non-punitive) interventions at the schools with relatively low scores. In these instances, the scores are interpreted properly – as measures of student, not school, performance. And the manner in which they’re used reflects the fact that individual schools have limited control over the students they serve, but that we as a nation have a responsibility to see that the students attending those schools are given the opportunity to achieve at the highest levels, regardless of their backgrounds.**

- Matt Di Carlo


* It's worth noting that the same basic argument applies to other measures one finds in many states' rating systems, even though, on the surface, they do not appear to be absolute performance metrics in the "traditional" sense. In a couple of cases, growth measures are so poorly constructed that they are heavily biased by absolute performance (see here, for instance). In addition, a number of states are rating schools based in part on the size of their race- and income-based achievement gaps. As is the case with average scores across all students, these gaps reflect the distribution of students across schools. The alternative – one that maintains the laudable focus on subgroups – would be to overweight growth estimates for certain sub-samples (e.g., free/reduced lunch-eligible students).

** I have also pointed out previously that absolute performance measures might be useful in assessing schools outside of a formal state accountability system. For example, average scores or rates might help parents choose schools precisely because they reflect student performance – parents have an interest in their children attending schools with higher-performing peers.