The new breed of school rating systems, some of which are still getting off the ground, will co-exist with federal proficiency targets in many states, and they are (or will be) used for a variety of purposes, including closure, resource allocation and informing parents and the public (see our posts on the systems in IN, FL, OH, CO, NYC).*
The approach that most states are using, in part due to the "ESEA flexibility" guidelines set by the U.S. Department of Education, is to combine different types of measures, often very crudely, into a single grade or categorical rating for each school. Administrators and media coverage usually characterize these ratings as measures of school performance: low-rated schools are called "low-performing," while those receiving top ratings are characterized as "high-performing." That's not accurate, or, at best, it's only partially true.
Some of the indicators that make up the ratings, such as proficiency rates, are best interpreted as (imperfectly) describing student performance on tests, whereas other measures, such as growth model estimates, make some attempt to isolate schools’ contribution to that performance. Both might have a role to play in accountability systems, but each is more or less appropriate depending on how you’re trying to use it.
So, here’s my question: Why do we insist on throwing them all together into a single rating for each school? To illustrate why I think this question needs to be addressed, let’s take a quick look at four highly simplified situations in which one might use ratings.
Tutoring program: Say there’s a state that has set aside some funding for a tutoring program (or other type of program or grant) that is supposed to help students who are most in need of improvement, but the state can only implement the program in a limited number of schools. From a test-based angle, one simple approach here would be to identify the schools that have, on average, the lowest measured performance levels (e.g., lowest scores or proficiency rates). These scores or rates won’t help the state determine which schools are effective in compelling growth from their students, but, in this case, that’s not the goal. They’re trying to identify the schools serving the lowest-scoring students – i.e., they need measures that describe student performance rather than explain it. Thus, absolute performance measures (e.g., average scores) might serve as the primary test-based tool (though growth and other types of measures could, of course, play some role).
Closure: Closing a school is a massively important decision, one that should never be made without considering many different types of information, qualitative and quantitative. Test-based measures are, for better or worse, often a big part of that calculation. But it’s a much different situation from the tutoring program. For the tutors, you’re trying to identify low-performing students, whereas closure requires you to identify low-performing schools. Many schools with low average test scores are still quite effective, but their scores remain relatively depressed because so many of their students enter at lower levels, and they only attend for a limited number of years. You definitely don’t want to close these schools, since their replacements are unlikely to do any better. Instead, to the degree you use test-based indicators, you should probably focus primarily on growth-oriented measures, since they’re designed to identify, albeit imprecisely, the schools that aren’t compelling improvement, given the context in which they operate and the resources available to them.**
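The contrast between the tutoring and closure scenarios can be made concrete with a small sketch. All school names and numbers below are invented for illustration; "avg_score" stands in for an absolute performance measure and "growth" for a growth-model estimate:

```python
# Hypothetical illustration: the same five schools, ranked two ways.
# All names and numbers are invented; they are not real data.
schools = [
    {"name": "A", "avg_score": 62, "growth": 0.8},   # low scores, strong growth
    {"name": "B", "avg_score": 88, "growth": -0.2},  # high scores, weak growth
    {"name": "C", "avg_score": 55, "growth": 0.1},
    {"name": "D", "avg_score": 91, "growth": 0.5},
    {"name": "E", "avg_score": 70, "growth": -0.6},
]

# Tutoring program: target the lowest-SCORING schools,
# i.e., measures that describe student performance.
tutoring_targets = sorted(schools, key=lambda s: s["avg_score"])[:2]

# Closure review: flag the lowest-GROWTH schools,
# i.e., measures that attempt to isolate school effectiveness.
closure_candidates = sorted(schools, key=lambda s: s["growth"])[:2]

print([s["name"] for s in tutoring_targets])    # ['C', 'A']
print([s["name"] for s in closure_candidates])  # ['E', 'B']
```

Note that the two rules select entirely different schools: School A has low scores but strong growth (tutoring target, not a closure candidate), while School B has high scores but weak growth (the reverse). A single combined rating would blur exactly this distinction.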
Parents choosing a school: Parents who use school ratings to choose schools also have different informational needs. Obviously, they’re interested in finding schools that are effective – i.e., schools that provide high-quality instruction, which one might gauge partially in terms of whether they compel test-based growth from their students. However, unlike states, parents aren’t allocating resources or interventions; they are using them. In this context, aggregate scores or rates might actually serve as a school performance measure of sorts, since parents presumably want their children surrounded by higher-performing peers. In other words, parents are interested both in finding an effective school and in finding one serving large proportions of high-performing students. To the degree testing data are a factor in their determination, they might consider absolute performance and growth measures, without one or the other necessarily serving as the predominant criterion. Furthermore, while non-test measures are always potentially useful (when available), they would seem particularly relevant in this context (for example, parent and student surveys, etc.).
A concerned citizen: School ratings are not entirely about incentives or making decisions. They’re also supposed to keep the public informed. For example, even non-parents might be interested in how well the children in their neighborhood are performing; absolute performance would give them an idea about this. On the other hand, they might also want to keep an eye on the quality and cost-effectiveness of the schools that their tax dollars fund, for which purpose growth-oriented indicators might be helpful. The type of information that might prove most useful depends on what the individual is looking for.
Okay, so we have constructed a little matrix of sorts, in which each decision/stakeholder might be best served using different combinations of growth and absolute performance. Obviously, these are extremely simplified characterizations, but they reveal the uncontroversial, intuitive fact that different measures are more or less useful in different contexts.***
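That matrix might be summarized as follows. The pairings are my own shorthand reading of the four scenarios above, not categories drawn from any actual rating system:

```python
# Hypothetical sketch of the decision/measure "matrix": each use case
# paired with the test-based measure(s) that arguably should predominate.
# Labels are invented shorthand, not from any actual state system.
use_case_measures = {
    "tutoring program":  ["absolute performance"],
    "closure decision":  ["growth"],
    "parent choice":     ["absolute performance", "growth", "non-test measures"],
    "concerned citizen": ["absolute performance", "growth"],
}

for use_case, measures in use_case_measures.items():
    print(f"{use_case}: {', '.join(measures)}")
```

No single weighting of these columns serves all four rows well, which is the crux of the argument that follows.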
My point, accordingly, is to suggest that we stop trying to reduce schools to a single grade. Insofar as different people make different decisions with the data, it seems poor practice to risk cloaking the individual measures in one aggregate rating that cannot possibly serve well in diverse contexts (if any at all). In fact, it might even make sense to try constructing multiple formulas for specific purposes. There’s no question that this approach would also entail imprecision and human judgment, but it would at least reflect the reality of the situation, and it couldn't be any harder than the impossible task of designing one-size-fits-all ratings, when one size most definitely does not fit all.
Moreover, I believe that this alternative approach would enhance the incentives and informational power of the systems. Teachers and administrators would be less likely to be held accountable for outcomes, such as absolute performance, that are largely out of their control (and, in almost all states, that's precisely what's happening; see the reviews of individual systems linked in the first paragraph).
States would be better able to target interventions effectively, and less likely to make catastrophic mistakes such as closing or restructuring schools that are actually relatively effective, given the students they serve and the resources available to them. And, finally, there would be more appropriate interpretation of the data among reporters, parents and citizens, all of whom are perfectly capable of reviewing more than one piece of information.
The power of data is in the inferences we draw from them, and there’s a thin line between parsimony and oversimplification. The more we try to design ratings that work in every situation, the more risk there is that they won't work very well in any.
- Matt Di Carlo
* One of the primary purposes of ratings is to incentivize improvement (discussed here). From this perspective, even poorly-designed systems might have a positive effect. Even so, the properties of the ratings are important, since different measures have different incentive effects. And, increasingly, these systems are being used not just to compel improvement, but also to make consequential decisions (and inform the public).
** Insofar as states might also wish to focus closure and other drastic interventions (e.g., turnaround) on schools with relatively low-scoring students, growth measures might be used to assess a pool of “eligible” schools toward the bottom of the absolute score distribution. Similarly, as mentioned above, states choosing schools for a tutoring program (or other similar interventions) might also wish to factor in growth scores, since some schools with low average scores might be more in need of assistance than others. In either case, neither growth nor absolute performance (nor other measures) need be the sole criterion, but their relative role, I would argue, should be substantially different depending on the type of decision being made.
*** A defender of the current ratings might note that even though the measures are combined, each component is still (in most states) presented separately, usually in a school “report card” or similar format. That’s a fair point: any one of the hypothetical people using the report cards for the purposes above could find the information they needed somewhere in the reports. But the single rating does not help here. In fact, it might actually serve to deter people from seeking out the specific measures they need.