A new report, commissioned by District of Columbia Mayor Vincent Gray and conducted by the Chicago-based consulting organization IFF, was supposed to provide guidance on how the District might act and invest strategically in school improvement, including optimizing the distribution of students across schools, many of which are either over- or under-enrolled.
Needless to say, this is a monumental task. It entails not only identifying high- and low-performing schools, but also developing plans for improving them. Even the most rigorous efforts to achieve these goals, especially in a large city like D.C., would be to some degree speculative and error-prone.
This is not a rigorous effort. IFF’s final report is polished and attractive, with lovely maps and color-coded tables presenting a lot of summary statistics. But there’s no emperor underneath those clothes. The report's data and analysis are so deeply flawed that its (rather non-specific) recommendations should not be taken seriously.
The authors' general approach, slightly simplified, is as follows: All schools (DCPS and charter) are ranked, based on four measures: their school- and grade-level current (2011) proficiency rates in math and reading; and their “projected” 2016 proficiency rates in both subjects. The latter is derived from a simple OLS regression that uses each school’s prior rate increases (between 2007 and 2011) to predict where it will be five years down the road.
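To make the projection step concrete, here is a minimal sketch of a straight-line extrapolation of the kind described above. It assumes the regression is simply a school's proficiency rate regressed on calendar year; the report's exact specification may differ, and the rates below are hypothetical numbers, not figures from the report.

```python
def ols_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def project_rate(years, rates, target_year):
    """Extend the fitted straight line out to target_year."""
    slope, intercept = ols_line(years, rates)
    return intercept + slope * target_year


# Hypothetical school: proficiency rate (percent) rising 4 points per year.
years = [2007, 2008, 2009, 2010, 2011]
rates = [40.0, 44.0, 48.0, 52.0, 56.0]

projected_2016 = project_rate(years, rates, 2016)
print(f"Projected 2016 rate: {projected_2016:.1f}")
```

The projection is just the 2011 level plus five more years of the fitted annual change, so it inherits whatever drove the 2007-2011 changes, whether real improvement or something else.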
Based on these ranks, schools are sorted into quartiles, with the top-ranked ("high-performing") schools classified as “Tier 1." The core results compare the number of seats in these “Tier 1” schools with the number of children within 39 neighborhood “clusters” (geographic subdivisions used by the District for other purposes). Reducing these discrepancies is the focus of the report's policy conclusions.
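The sorting step can be sketched roughly as follows. This is a simplified illustration, assuming each school has already been reduced to a single composite score; the school names, scores, and the `assign_tiers` helper are all hypothetical and are not taken from the report.

```python
def assign_tiers(scores, n_tiers=4):
    """Rank schools by score (highest first) and cut the ranking into equal-sized tiers."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    per_tier = len(ranked) / n_tiers
    return {
        school: int(position // per_tier) + 1
        for position, school in enumerate(ranked)
    }


# Hypothetical composite scores for eight schools.
scores = {
    "A": 92.0, "B": 81.0, "C": 80.5, "D": 70.0,
    "E": 66.0, "F": 55.0, "G": 41.0, "H": 30.0,
}

tiers = assign_tiers(scores)
for school in sorted(scores, key=scores.get, reverse=True):
    print(school, scores[school], "Tier", tiers[school])
```

In this made-up example, schools B and C differ by half a point but land in different tiers, while A and B differ by eleven points and share one. Once schools are reduced to tiers, the size of the gaps between them is no longer visible.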
IFF recommends that the District increase the number of students filling “high-performing seats” using a combination of (largely unspecified) investments, closures, moving students into schools under capacity and other strategies.
My first question in reading this report was: Why didn’t IFF use longitudinal student-level data and carry out a proper analysis? By relying exclusively on cross-sectional school- and grade-level proficiency rates, IFF essentially guarantees that its results will be tentative at best (also see this excellent discussion by economist Steve Glazerman, who makes some of these same points).
Proficiency rates can be useful, in that they may be easier to understand than average test scores. Nevertheless, they are not well-suited for serious analyses of school performance. They only tell you how many students scored above a certain cutoff point, and this hides a lot of the variation in actual performance. For example, two students – one scoring just above passing and the other scoring a year or two above grade level – would both be coded the same way, as “proficient (or above)." This is a large sacrifice of data.
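A small sketch shows how much a cutoff can hide. The scores, the cutoff of 50, and the two schools below are invented for illustration only.

```python
def proficiency_rate(scores, cutoff):
    """Percent of students scoring at or above the cutoff."""
    return 100.0 * sum(score >= cutoff for score in scores) / len(scores)


def mean(scores):
    return sum(scores) / len(scores)


CUTOFF = 50  # hypothetical "proficient" threshold

# Hypothetical scale scores for four tested students in each school.
school_x = [51, 52, 53, 40]  # proficient students barely clear the bar
school_y = [75, 80, 90, 40]  # proficient students are far above it

for name, scores in (("X", school_x), ("Y", school_y)):
    print(name, proficiency_rate(scores, CUTOFF), mean(scores))
```

Both hypothetical schools report a 75 percent proficiency rate, yet their average scores are more than 20 points apart. A rate-based analysis would treat them as identical.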
In addition, while only a small part of IFF’s school rankings consists of “growth” measures (a portion of the variation in the “projected” 2016 rates, discussed below), the data they use are cross-sectional, and are therefore not appropriate. They don’t follow students over time, which means that the changes are not “growth” at all, and may to a large degree reflect demographic and other shifts in the student population of each school. This problem is especially severe at the grade level (because samples are smaller), particularly in D.C., where student mobility is exceedingly high (in no small part due to rapid charter proliferation). In other words, proficiency rates may change simply because somewhat different sets of students are being tested every year, not because there has been any real progress.
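The following sketch illustrates the point with invented data: a hypothetical school whose proficiency rate jumps between two years even though no continuing student's score changes at all. The student IDs, scores, and cutoff are assumptions made purely for illustration.

```python
CUTOFF = 50  # hypothetical "proficient" threshold


def proficiency_rate(scores_by_student, cutoff):
    """Percent of tested students scoring at or above the cutoff."""
    scores = list(scores_by_student.values())
    return 100.0 * sum(score >= cutoff for score in scores) / len(scores)


# Hypothetical school: s1 and s2 leave after year 1; s5 and s6 enroll in year 2.
year1 = {"s1": 45, "s2": 48, "s3": 55, "s4": 60}
year2 = {"s3": 55, "s4": 60, "s5": 58, "s6": 62}

# Cross-sectional view: compare whoever was tested in each year.
cross_sectional_change = proficiency_rate(year2, CUTOFF) - proficiency_rate(year1, CUTOFF)

# Longitudinal view: follow only the students tested in both years.
stayers = year1.keys() & year2.keys()
mean_gain_for_stayers = sum(year2[s] - year1[s] for s in stayers) / len(stayers)

print("Change in proficiency rate:", cross_sectional_change)
print("Mean score gain among stayers:", mean_gain_for_stayers)
```

The rate rises by 50 percentage points, while the students who were actually there in both years gained nothing. Only student-level longitudinal data can separate the two.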
In short, IFF’s reliance on cross-sectional proficiency rates calls its entire analysis into question. These rates are terrible measures of performance, both in any given year and over time, and one can only wonder why a District consultant wouldn’t employ better data. Even a more rigorous analysis would have been suspect using these data.
But, in addition to the issue of school- and grade-level proficiency rates, there are serious problems with IFF’s primary analytical approach – i.e., the manner in which they sort schools into performance “tiers." As stated above, each school’s performance is assessed by an average of four measures: 2011 proficiency rates in math and reading; and projected 2016 proficiency rates in both subjects, which are calculated based on each school’s (assumed to be linear) trajectory between 2007 and 2011 (both over the whole school and for grade clusters [K-5, 6-8, 9-12]).
There are many small problems with this scheme (e.g., the loss of data in rankings and sorting into “tiers”), but they’re all somewhat beside the point, given the fact that the method IFF uses does not measure the actual effectiveness of each school.*
Even using the best data (which IFF does not), testing results, whether proficiency rates or scores, are in large part due to either random variation or factors outside of schools’ control (e.g., students’ backgrounds). Half of IFF’s school performance measure consists of 2011 proficiency rates in math and reading, which are essentially just measures of student background. Unless you control for these factors, these rates can tell you almost nothing about the actual quality of instruction going on in the school. Schools with high rates are not necessarily high-performing schools, and vice-versa.
The other half of IFF’s school rankings – the 2016 “projected” rates – aren’t much better, because they are simply added to the 2011 rate, and are therefore mostly just absolute performance measures (i.e., severely biased by student characteristics).
For example, let’s say we have two schools: One serves mostly higher-income students and has a 2011 proficiency rate of 80 percent; the other serves mostly lower-income students, and has a 2011 rate of 40 percent. Now, let’s say IFF projects that both will pick up 20 percentage points by 2016. Putting aside all the flaws in IFF’s methods (including the fact that their data are cross-sectional and therefore do not measure "growth" per se), this would imply that both schools are equally effective – students in both schools will make similar “progress” over the next five years.
But IFF’s methods don’t posit that 20 point increase as their measure. Instead, they add the projected “growth” (20 points) to the 2011 rates, which means that our higher-income school will have a projected 2016 rate of 100 percent, while our lower-income school will come in at 60 percent. The first school will appear to be far better, even though there was actually no difference in the schools’ effectiveness in boosting test scores (which is generally considered the appropriate gauge of school effectiveness).
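The arithmetic in this example can be written out in a few lines. The two schools and their numbers are the hypothetical ones from the text above, not real D.C. schools, and the dictionary keys are just illustrative labels.

```python
# The two hypothetical schools from the example above (rates in percent).
schools = {
    "higher_income": {"rate_2011": 80, "projected_gain": 20},
    "lower_income": {"rate_2011": 40, "projected_gain": 20},
}

# Projected 2016 rate as described: the 2011 level plus the projected change.
projected_2016 = {
    name: data["rate_2011"] + data["projected_gain"]
    for name, data in schools.items()
}

gap_2011 = schools["higher_income"]["rate_2011"] - schools["lower_income"]["rate_2011"]
gap_2016 = projected_2016["higher_income"] - projected_2016["lower_income"]
gap_in_gain = (
    schools["higher_income"]["projected_gain"]
    - schools["lower_income"]["projected_gain"]
)

print("Projected 2016 rates:", projected_2016)
print("Gap in 2011 rates:", gap_2011)
print("Gap in projected 2016 rates:", gap_2016)
print("Gap in projected gains:", gap_in_gain)
```

The 40-point gap in projected 2016 rates is exactly the 40-point gap in 2011 rates. The projected change, the only part that resembles a measure of effectiveness, contributes nothing to the difference between the two schools.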
As a result, even if their projection methods were appropriate (and they're not), the manner in which IFF uses its “growth” measure – adding it to the 2011 rates – essentially ensures that the vast majority of the variation between schools in their final ratings will be due to factors outside of schools’ control.**
This analysis is therefore recommending the closure and expansion of schools, among other actions, based on a criterion that has little to do with how they actually perform, and is largely a function of the backgrounds of the students that attend them (e.g., income, whether or not they are native English speakers, etc.).
Overall, IFF had a very difficult task – identifying low- and high-performing schools – and, for whatever reason, they do not appear to have been up to the challenge. Their data are inappropriate and their methods too simplistic and flawed to achieve the goals they set out to accomplish.
Even if IFF’s policy recommendations were sound – and they largely boil down to closing or improving low-performing schools, opening more high-performing schools, and monitoring schools in the middle – their methods for classifying these schools are not credible. As Bruce Baker puts it, IFF is essentially saying that we should “close poor schools and replace them with less poor ones” (also check out Glazerman’s above-mentioned article for more on the recommendations).
This report, though attractive and full of interesting summary data, provides little of value in terms of informing sound policy decisions.
- Matt Di Carlo
* There really is a striking progression of data loss in this analysis. Most generally, as discussed above, IFF uses proficiency rates (how many students above or below the line), which ignores underlying variation in actual scores. In addition, they’re using cross-sectional grade- and school-level data, which masks differences between students in any given year, and over time. Then they use the rates (and projected rates) to calculate rankings, which ignore the extent of differences between schools. And, finally, the rankings are averaged and schools are sorted in quartiles (performance “tiers”), losing even more data – for example, schools at the “top” of “Tier 2” may have essentially the same scores as schools at the “bottom” of “Tier 1." At each "step," a significant chunk of the variation between schools in their students’ testing performance is forfeited.
** I cannot illustrate this bias directly – i.e., using real data from the report and the District’s schools – without a somewhat onerous manual data entry effort (neither D.C. nor IFF provides its data in a format that is convenient for analysts). But it’s really not necessary – it’s beyond dispute that absolute proficiency rates are largely a function of student characteristics, and IFF’s school rankings are based predominantly on those rates. See this discussion, as well as this example from Florida and this one from Ohio.