Ohio's New School Rating System: Different Results, Same Flawed Methods
Without question, designing school and district rating systems is a difficult task, and Ohio was somewhat ahead of the curve in attempting to do so (the state is also great about releasing a ton of data every year). As part of its application for ESEA waivers, the state recently announced a newly designed version of its long-standing system, with the changes slated to go into effect in 2014-15. State officials told reporters that the new scheme is a “more accurate reflection of … true [school and district] quality.”
In reality, however, despite its best intentions, what Ohio has done is perpetuate a troubled system by making less-than-substantive changes that seem to serve the primary purpose of giving lower grades to more schools in order for the results to square with preconceptions about the distribution of “true quality.” It’s not a better system in terms of measurement: both the new and old schemes consist of mostly the same inappropriate components, and the ratings differentiate schools based largely on student characteristics rather than school performance.
So, whether or not the aggregate results seem more plausible is not particularly important, since the manner in which they're calculated is still deeply flawed. And demonstrating this is very easy.
Rather than get bogged down in the details of the schemes, here is the quick and dirty version of the story: the old system assigned six possible ratings based mostly on four measures: AYP; the state’s performance index; the percent of state standards met; and a value-added growth model (see our post for more details on the old system). The new system essentially retains most of the components of the old, but the formula is a bit different and it incorporates a new “achievement and graduation gap” measure that is supposed to gauge whether student subgroups are making acceptable progress. The "gap" measure is really the only major substantive change to the system's components, but it basically just replaces one primitive measure (AYP) with another.*
Although the two systems yield different results overall, the major components of both – all but the value-added scores – are, directly or indirectly, "absolute performance" measures. They reflect how highly students score, not how quickly they improve. As a result, the measures are telling you more about the students that schools serve than the quality of instruction that they provide. Making high-stakes decisions based on this information is bad policy. For example, closing a school in a low-income neighborhood based on biased ratings not only means that one might very well be shutting down an effective school, but also that it’s unlikely it will be replaced by a more effective alternative.
Put differently, the most important step in measuring schools' effectiveness is controlling for confounding observable factors, most notably student characteristics. Ohio's ratings do the opposite: they are driven by those characteristics. And Ohio is not the only state where this is true.
(Important side note: With the exception of the state’s value-added model, which, despite the usual issues, such as instability, is pretty good, virtually every indicator used by the state is a cutpoint-based measure. These are severely limited and potentially very misleading in ways that are unrelated to the bias. I will not be discussing these issues in this post, but see the second footnote below this post, and here and here for some related work.)**
The components of the new system
The severe bias in the new system's constituent measures is unmistakable and easy to spot. To illustrate it in an accessible manner, I’ve identified the schools with free/reduced lunch rates that are among the highest 20 percent (highest quintile) of all non-charter schools in the state. This is an imperfect proxy for student background, but it's sufficient for our purposes. (Note: charter schools are excluded from all these figures.)
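For readers who want to replicate the grouping, here is a minimal sketch of how the highest-poverty quintile might be flagged. The function name and the ten rates are hypothetical, made up purely for illustration, and the tie-breaking rule is my own assumption rather than anything taken from the state's data.

```python
def flag_highest_quintile(frl_rates):
    """Return booleans marking the schools in the top 20 percent by FRL rate."""
    n = len(frl_rates)
    # Rank school positions from highest to lowest rate (ties keep input order).
    order = sorted(range(n), key=lambda i: frl_rates[i], reverse=True)
    top_count = round(n * 0.20)
    flags = [False] * n
    for i in order[:top_count]:
        flags[i] = True
    return flags


# Hypothetical free/reduced lunch rates (percent) for ten schools.
rates = [12.0, 85.5, 40.2, 97.1, 5.3, 63.8, 71.4, 22.9, 55.0, 33.3]
flags = flag_highest_quintile(rates)
print(flags)
```

With ten schools, the two with the highest rates are flagged; with the full set of non-charter schools, the same logic would mark the top fifth.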
The graph below breaks down schools in terms of how they scored (A-F) on each of the four components in the new system; these four grades are averaged to create the final grade. The bars represent the percent of schools (over 3,000 in total) receiving each grade that are in the highest poverty quintile. For example, looking at the last set of bars on the right (value-added), 17 percent of the schools that received the equivalent of an F (red bar) on the value-added component were high-poverty schools.
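The calculation behind each bar can be sketched as follows. The grades and poverty flags here are invented for illustration (they are not Ohio's data), and the function name is hypothetical.

```python
from collections import Counter


def high_poverty_share_by_grade(grades, is_high_poverty):
    """For each grade, the percent of schools earning it that are high-poverty."""
    totals = Counter(grades)
    high = Counter(g for g, hp in zip(grades, is_high_poverty) if hp)
    return {g: 100.0 * high[g] / totals[g] for g in sorted(totals)}


# Invented component grades and highest-quintile flags for ten schools.
grades = ["A", "A", "B", "C", "F", "F", "F", "D", "B", "F"]
high_pov = [False, False, False, True, True, True, False, True, False, True]
shares = high_poverty_share_by_grade(grades, high_pov)
print(shares)
```

A useful benchmark: because the high-poverty group is, by construction, about 20 percent of all schools, a measure unrelated to poverty would produce bars of roughly 20 percent for every grade.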
If you quickly scan the distribution, you can see that, to varying degrees, three of the four measures – all but value-added – are extremely biased "in favor of" lower-poverty schools. Almost none of the schools receiving A’s or B’s in these three categories are high-poverty, whereas the vast majority receiving F’s are in the highest free/reduced-price lunch (FRLP) quintile.
Poor schools can’t win, and the reason is simple and inevitable: All three indicators are either directly or indirectly “absolute performance” measures. Students from more advantaged backgrounds enter the schooling system with higher scores, so the schools they attend will inevitably do better, regardless of their effectiveness.
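To see why this is inevitable, consider a toy simulation. Everything in it is an assumption of mine: the score scale, the size of the link between poverty and entering scores, and above all the premise that schools' contributions to growth are unrelated to poverty. It is a sketch of the logic, not a model of Ohio's schools.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

schools = []
for _ in range(3000):
    poverty = random.random()                        # share of low-income students
    entry = 80 - 40 * poverty + random.gauss(0, 5)   # students' entering scores
    growth = random.gauss(5, 2)                      # school's contribution, unrelated to poverty
    schools.append({"poverty": poverty, "level": entry + growth, "growth": growth})

# Compare the lowest- and highest-poverty fifths of schools.
schools.sort(key=lambda s: s["poverty"])
fifth = len(schools) // 5
low_pov, high_pov = schools[:fifth], schools[-fifth:]


def mean(rows, key):
    return sum(r[key] for r in rows) / len(rows)


level_gap = mean(low_pov, "level") - mean(high_pov, "level")
growth_gap = mean(low_pov, "growth") - mean(high_pov, "growth")
print(f"gap in score levels: {level_gap:.1f}")
print(f"gap in growth: {growth_gap:.1f}")
```

Even though the simulated schools are, on average, equally effective across the poverty distribution, the low-poverty group comes out far ahead on score levels while the growth gap is close to zero. A rating built on levels would "find" that the low-poverty schools are better.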
To give another, back-of-the-envelope sense of the bias in all four components, the table below presents the correlation between poverty (free/reduced-price lunch) rates and grades for each component.
All of the correlations except the one for value-added are quite high and negative: as the poverty of schools increases, their grades on these components get worse.
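A correlation of this kind can be computed along the following lines. Coding the letter grades as A=4 through F=0 and using a plain Pearson coefficient are assumptions made for this sketch, and the six schools are invented.

```python
import math

# Assumed numeric coding of letter grades (not taken from the state's formula).
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented poverty rates (percent) and component grades for six schools.
frl = [5.0, 20.0, 35.0, 60.0, 80.0, 95.0]
grades = ["A", "A", "B", "C", "D", "F"]
r = pearson(frl, [GRADE_POINTS[g] for g in grades])
print(round(r, 2))
```

A value near -1 means grades fall almost in lockstep as poverty rises; a value near zero, which is what one would hope for from a measure of school effectiveness, means the two are largely unrelated.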
The final grades under the new system
You can easily guess how this plays out in the final scores. The graph below is slightly different from the previous figure. This one tells you the grades received by schools falling into each poverty quintile. For instance, in the top right of the graph, you can see that 26.5 percent of schools in the highest-poverty quintile received an F grade (value labels less than one percent are deleted for space reasons).
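The difference from the previous figure is the direction of the percentages: here they sum to 100 within each poverty quintile rather than within each grade. A minimal sketch, again with invented data and a hypothetical function name:

```python
from collections import Counter, defaultdict


def grade_distribution_by_quintile(quintiles, grades):
    """Percent of schools in each poverty quintile receiving each final grade."""
    counts = defaultdict(Counter)
    for q, g in zip(quintiles, grades):
        counts[q][g] += 1
    return {
        q: {g: 100.0 * c / sum(by_grade.values()) for g, c in by_grade.items()}
        for q, by_grade in counts.items()
    }


# Invented data: quintile 5 is the highest-poverty group, quintile 1 the lowest.
quintiles = [5, 5, 5, 5, 1, 1, 1, 1]
grades = ["F", "D", "D", "C", "A", "B", "B", "B"]
dist = grade_distribution_by_quintile(quintiles, grades)
print(dist)
```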
Consider that there are over 600 schools in the highest-poverty group (the top bar). A grand total of four receive an A grade. More than 75 percent receive a D or F. In contrast, among schools with the lowest free/reduced lunch rates, almost none receives lower than an A or B, and only around 10 schools are rated D or F.
According to Ohio's system, there are basically no high-performing schools serving large shares of poor students, and no low-performing schools serving small shares of poor students.
To once again provide a rough sense of this relationship, the correlation between schools’ final grades and their overall poverty rate is -0.73, which is rather high. Grades decrease as poverty rates increase, and vice-versa, and this relationship is especially pronounced at the extremes of the FRLP distribution (in no small part due to the limitations of the free/reduced lunch measure).
In fairness, due to differences in resources and other factors, it’s reasonable to believe that less effective schools are disproportionately located in poorer areas, while the better schools are concentrated in more affluent neighborhoods. But the results of Ohio's new system are simply not credible, especially since they're so obviously a function of the choice of measures.***
Is the new system better than the old?
Since Ohio released its waiver application, many news stories pointed out that, under the old system, over half of Ohio’s schools received the equivalent of an A or A+, while three out of four received the equivalent of a B or higher. One could make a compelling case that this doesn’t seem right, and that the new results, in which half of schools receive a B and very few (about 6-7 percent) receive an A, are more plausible.
This completely misses the point. Whether or not you think the distribution of results seems more correct, neither system provides valid estimates of school performance. Even the most invalid measures can be tweaked to produce a certain distribution of results.
Take a look at the graphs below, and contrast the new system's results (the first graph below) with those of the old scheme (the second graph below).
Under the old system (the bottom graph), virtually all (95 percent) of the lowest-poverty schools received “excellent” (A) or “excellent with distinction” (A+), whereas the new system (the top graph) gives most of them B’s.
The highest-poverty schools, on the other hand, were, in 2010-11, relatively evenly distributed between “emergency” (F), “watch” (D) and “continuous improvement” (C). Now, about half get D’s, about a quarter receive F’s, and 15 percent get C’s.
The new system is still failing to measure school performance; it just moves most schools down a peg or two. Higher-income schools are still virtually guaranteed to do well, but now they'll get B’s instead of A’s. Conversely, poor schools are still a pretty sure bet to get low ratings, but now they'll get D’s and F’s instead of C’s, D’s and F’s.****
The comparable levels of bias are also (roughly) evident in the correlation between poverty and final grades under the new and old systems, as in the table below.
They are virtually identical: around 0.72 in absolute value. In short, the new system allocates different grades in the same flawed manner.
But here’s my most important point: Ohio is absolutely not alone. Proving that no good deed goes unpunished, I focus on their system largely because they are so great about making public detailed performance data.
In fact, most of the states designing and implementing these grading systems are doing the same thing – i.e., judging schools as much or more by how high or low their outcomes are, rather than whether they’re improving those outcomes. This is just not defensible, as it fails to measure in any meaningful way whether schools are effective, given the resources available to them.
To repeat, designing school and district “grading systems” is a tough job – not only do they have to attempt to measure school performance (a massive task), but they should also be accessible to parents, administrators, teachers and other stakeholders. All of this is quite new, and disagreements like mine are inevitable (and motivated by the fact that I support these systems, though not necessarily the stakes attached to them). As states design and implement new systems, one should expect that there will be mistakes and course changes.
What Ohio teaches us is how not to adjust course. There isn't much substance to the changes in the state's system, but there are potentially heavy consequences. Lower grades mean a cascade of increasingly severe consequences, including closure. So, starting in 2014-15, more schools (almost all of them higher-poverty schools) will face serious punitive action because of changes to a system that didn't address the real issue: the validity of the measures. One can only hope the state does so over the next two years.
- Matt Di Carlo
* The “achievement and graduation gap” indicator, which the state’s waiver application hilariously calls “innovative,” is ostensibly a growth-based measure, since it is focused on the change in “achievement gaps” between subgroups in math, reading and graduation rates. However, the expectations for each school’s rate of “progress” (they’re actually cohort changes) are set based on statewide targets, and are therefore higher for schools with lower absolute performance levels (though, in math and reading, schools can receive a C grade if they fail to meet the expectations but still show subgroup growth on the value-added model). As a result, performance on this indicator is still strongly correlated with student characteristics, though a bit less so than the performance index and percent of standards met components (as shown in the first textual table above). In addition, regardless of the bias, the math and reading portion of the “achievement and graduation gap” is based on changes in proficiency rates, which are truly awful measures for any purpose, but especially for “achievement gap closing.”
** Just to name a couple of major issues, all of the indicators are some version of a cutpoint-based measure [e.g., proficiency rates], which are seriously limited and simply not appropriate for measuring school performance when alternatives are available (as they are). In addition, all of the measures (again, with the exception of value-added) are based on cross-sectional data, which means that year-to-year changes in them are as likely due to sampling variation – changes in the students taking the test – as to “real” improvement.
*** I suppose one might make the case that students in "higher performing" schools - i.e., those with high absolute performance levels - are already performing at an adequate level, and so the state's limited resources should be directed at "lower-performing" schools, even if their higher-scoring counterparts aren't demonstrating much growth. If that's the state's approach, it could have easily accomplished this in a manner that doesn't basically guarantee poor grades for "low-performing" schools and districts due largely to their students' characteristics. As it stands now, the grading system is almost completely unable to differentiate low-scoring schools by performance, which would be the most important priority if the state wanted to allocate resources efficiently.
**** There is at least a chance that these movements reflect some marginal improvement in the validity of the measures (although I am highly skeptical, for reasons that aren't worth discussing here), but that would only mean that a massively biased system was very slightly less biased.