Lessons And Directions From The CREDO Urban Charter School Study
Last week, CREDO, a Stanford University research organization that focuses mostly on charter schools, released an analysis of the test-based effectiveness of charter schools in “urban areas” – that is, charters located in cities within 42 urban areas throughout 22 states. The math and reading testing data used in the analysis are from the 2006-07 to 2010-11 school years.
In short, the researchers find that, across all areas included, charters’ estimated impacts on test scores, vis-à-vis the regular public schools to which they are compared, are positive and statistically discernible. The magnitude of the overall estimated effect is somewhat modest in reading, and larger in math. In both cases, as always, results vary substantially by location, with very large and positive impacts in some places and negative impacts in others.
These “horse race” charter school studies are certainly worthwhile, and their findings have useful policy implications. In another sense, however, the public’s relentless focus on the “bottom line” of these analyses is tantamount to continually asking a question ("do charter schools boost test scores?") to which we already know the answer (some do, some do not). This approach is somewhat inconsistent with the whole idea of charter schools, and with harvesting what is their largest potential contribution to U.S. public education. But there are also a few more specific issues and findings in this report that merit further discussion, and we’ll start with those.
Interpreting overall effect sizes. CREDO’s first few reports, including their highly influential 2009 national study, attempted to make findings more accessible to the general public by presenting them in the form of the percentage of charters that performed significantly better, significantly worse, and no differently from the regular public schools to which they were compared (and we should always bear in mind that these are relative, not absolute, estimated effects). CREDO still carries out these better than/worse than comparisons, but they have since started to emphasize a different means of expressing effect sizes accessibly – “additional days of learning.”
The latter is a very simple transformation of coefficients (often called "effects") expressed in standard deviations (s.d.), which are not particularly meaningful to most people. The conversion employs a benchmark of “typical growth” among U.S. students in grades 4-8, which is estimated at 0.25 s.d. per year – i.e., one fourth of one standard deviation annually.
To illustrate how the “days of learning” metric works, you can calculate it by dividing a given effect size by this 0.25 benchmark (this yields the percentage of one year of typical growth represented by the coefficient), and then multiplying it by 180, the length of a standard school year in days (for a quick example using CREDO's results, which come out to 28 additional "days of learning" in reading and 40 in math, see the first footnote).*
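As a minimal sketch of the arithmetic just described, the conversion can be written as a short Python function. The function and parameter names are my own, chosen for illustration; the 0.25 s.d. benchmark, the 180-day school year, and the two effect sizes (0.039 in reading, 0.055 in math) are the figures cited in this post.

```python
def days_of_learning(effect_sd, typical_growth_sd=0.25, school_year_days=180):
    """Convert an effect size in s.d. into "days of learning".

    typical_growth_sd: assumed typical annual growth in s.d. (0.25 here).
    school_year_days: assumed length of a standard school year in days.
    """
    # Share of one typical year of growth represented by the effect.
    share_of_year = effect_sd / typical_growth_sd
    # Scale that share to the length of the school year.
    return share_of_year * school_year_days


reading_days = days_of_learning(0.039)  # overall reading effect
math_days = days_of_learning(0.055)     # overall math effect

print(f"reading: {reading_days:.2f} days, math: {math_days:.2f} days")
```

Rounded to whole days, these reproduce the 28 (reading) and 40 (math) figures. Changing either default argument changes the answer proportionally, which is why the choice of benchmark matters.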
This manner of expressing measured effect sizes is certainly more accessible to the general public than s.d. (which is great). It is not uncommon in the literature (e.g., MET Project 2012), and really just expresses the same information in a different fashion. But there are alternative conversion factors. For example, another common set of benchmarks (Hill et al. 2007) is based on averages across seven nationally-normed tests in reading and six in math. If we perform a ballpark transformation of CREDO’s estimates using these "typical growth" estimates in grades 4-8 (which I average separately for math and reading), they translate into about 23 “days of learning” in reading and 25 in math. This is not to say that one transformation is “more correct” than the other (the 0.25 conversion is, again, very conventional), but it does illustrate that these are relative measures.**
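To see how much the benchmark drives the result, one can run the conversion in reverse and ask what "typical growth" value is implied by a given number of days. The sketch below does this for the ballpark figures above (23 days in reading, 25 in math). The function name is hypothetical, and the outputs are back-calculated from rounded day counts, so they should be read as approximations rather than as the published benchmark values.

```python
def implied_benchmark(effect_sd, days, school_year_days=180):
    """Back out the annual-growth benchmark (in s.d.) implied by a
    reported "days of learning" figure for a given effect size."""
    # days = (effect / benchmark) * school_year_days, solved for benchmark.
    return effect_sd * school_year_days / days


# Effect sizes and ballpark day counts are the ones quoted in this post.
reading_benchmark = implied_benchmark(0.039, 23)
math_benchmark = implied_benchmark(0.055, 25)

print(f"implied reading benchmark: {reading_benchmark:.3f} s.d. per year")
print(f"implied math benchmark:    {math_benchmark:.3f} s.d. per year")
```

Both implied values come out above 0.25, which is why the same coefficients translate into fewer "days of learning" under the alternative benchmarks.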
A different means of expressing estimated effect sizes in an accessible manner is in terms of percentile point changes. For instance, CREDO's overall reading effect (0.039) is equivalent to a student moving from the 50th percentile to roughly 51.5, while their math estimate is equivalent to moving from the 50th to just above the 52nd percentile.
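The percentile conversion can be sketched the same way. This version assumes test scores are approximately normally distributed, so that a student starting at the 50th percentile who gains a given number of s.d. ends up at the percentile given by the standard normal cumulative distribution function. That normality assumption is mine, as is the function name; the effect sizes are the ones cited in this post.

```python
from statistics import NormalDist


def percentile_after_effect(effect_sd, start_percentile=50.0):
    """Percentile reached by a student who starts at start_percentile
    and gains effect_sd standard deviations, assuming normal scores."""
    standard_normal = NormalDist()  # mean 0, s.d. 1
    # Locate the starting point on the z-scale, then shift by the effect.
    start_z = standard_normal.inv_cdf(start_percentile / 100)
    return 100 * standard_normal.cdf(start_z + effect_sd)


reading_percentile = percentile_after_effect(0.039)
math_percentile = percentile_after_effect(0.055)

print(f"reading: {reading_percentile:.1f}, math: {math_percentile:.1f}")
```

Because the normal curve is densest near its center, the same effect size moves a student fewer percentile points when the starting point is far from the 50th percentile, so these figures apply to the median student only.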
Again, neither of these two conversions – “days of learning” and percentile changes – is more or less "correct" than the other, but they might leave the average reader with rather different impressions of the "real world" magnitude of the estimated impact. And they are not the only possibilities, of course – for example, one can express effects in terms of the percentage of the typical black/white achievement gap (about one s.d.), or, even better, one can compare the measured impact to that of other educational interventions. (And, finally, of course, CREDO's "percent better, worse, no different" metric is a different approach, one that is very useful for summarizing not the magnitude of overall effects, but rather the degree to which they vary between schools.)
In any case, there are no set-in-stone rules for interpreting effect sizes, and even seemingly small increases can have a very positive impact for many students. It’s best to evaluate the magnitude of impacts using several different “tools” (including a level head). Remember, also, that these are annual effects, and students attend schools for multiple years – the gains can add up (CREDO tests this directly).
Breaking down results by subgroup. One might notice a tendency in recent years for charter school advocates to downplay the overall results (particularly when those impacts are small), and instead highlight the estimated impacts for specific groups of students – e.g., minority students or those who are eligible for subsidized lunch. Charter opponents, in contrast, tend to focus on the overall impacts (again, particularly when those impacts are small).
On the one hand, it is important to see how schools’ impact on test scores (or other outcomes) varies by subgroup, and there is certainly a case for, in a sense, “prioritizing” traditionally lower-performing subgroups. On the other hand, this can go too far, since, in its extreme form, it implies that we can gauge charters’ success based solely (or even mostly) on either the estimated test-based impacts among just one or two subgroups or, perhaps worse, the estimates for subgroups defined by a combination of multiple characteristics (e.g., black students who are eligible for subsidized lunch assistance, which implies that eligible students who are members of other ethnicities, including Hispanics, are somehow less important). And this is particularly salient in the case of this CREDO report, which is limited to urban charters.***
That said, in terms of results for specific subgroups across all areas included in the study, CREDO finds:
- Among students eligible for subsidized lunch, effects are positive but modest in math and reading (0.033 and 0.024, respectively);
- The estimated charter impact among black students was roughly the same as overall in both subjects, whereas among Hispanic students the impact is modest in math (0.029) and essentially nil in reading (0.008);
- And, finally, the estimates are either very modest or not statistically significant among ELL and special education students.
On the whole, these results suggest that relative charter impacts on math and reading test scores are either positive or small/zero among traditionally low-scoring student subgroups. These are at least marginally positive results, and there is no evidence of any negative impact on the math and reading testing progress of these groups.
Finally, buried underneath all the debate about subgroup effects is a very important policy question: Why might charter results vary by subgroup? Based on CREDO's results, the implicit explanation would seem to be that whatever it is that charter schools do “works better” on the testing results of a few key traditionally lower-scoring subgroups. This is possible, but it is difficult to reconcile with the fact that charters’ policies and practices (and their estimated impacts) are far from uniform.
An alternative possibility is that the more effective charter schools (at least in terms of test-based outcomes) are located in neighborhoods serving larger proportions of specific populations (and/or that the regular public schools serving these students are generally less effective in boosting math and reading test scores). Again, though, the estimated impacts by subgroup, like the impacts overall, vary widely by location. For example, the math estimates for free and reduced price lunch eligible students are positive and meaningful for Austin charters, negative for Orlando charters, and nil for charter schools in New Orleans. Interpretations and explanations of subgroup effects might therefore depend quite a bit on context.
Explaining variation by urban area. To their credit, CREDO presents some simple correlations between their measured effect sizes and a number of urban area-level characteristics that are available to them. Most of the correlations are either weak or moderate (and charter effects almost certainly vary more within these areas than between them). But a few are worth mentioning briefly, with the standard caveat that these are just bivariate correlations, and not necessarily evidence of a causal relationship (or lack thereof):
- There is no relationship between estimated effects and states’ charter law “rankings,” which does not support (but, of course, does not preclude) the common argument that charter legislation has a substantial mediating effect on (test-based) performance;
- Nor is there any relationship between measured performance and charter market share (i.e., the percentage of students in a given area served by charter schools). In fact, the correlation is weak and positive, which does not support the argument that some people (myself included) have made about the possibility that smaller charter sectors perform better because there is less competition for finite resources (e.g., foundation funding, teacher labor, etc.);
- There is a moderate, significant association between estimated performance and the growth of the charter sector within each region between 2006 and 2010. It may be that sectors are to some degree doing well because they are growing, or that sectors are growing because they are doing well.
Where do we go from here? As mentioned above, these charter versus regular public school “horse race” studies are certainly important – e.g., for evaluating sectorwide performance and trends. But, to me, what they illustrate time and time again is that it is not particularly useful from a policy perspective to judge U.S. charter schools en masse, whether by test scores or any other outcome. The charter school idea’s primary contribution to U.S. public education is giving rise to variation in policies and practices within districts, and thus expanding schools’ ability to try different things, test their impact (hopefully on a variety of different outcomes), and inform the design of all schools, regardless of their governance structures. The merits of this idea can hardly be summed up solely (or even mostly) by a math or reading coefficient estimated with data that include a huge variety of different models in many different policy environments.
That’s more akin to judging the impact of experimentation, when we should be equally if not more focused on evaluating the impact of the treatments. Schools' performance varies by what they do, not by what they are.
The empirical evidence from 2-3 decades of charter research, including this study, bears out this argument. It shows that charters’ estimated effectiveness relative to regular public schools’ varies enormously within and between locations. The fact that pooling together thousands of schools across two dozen states yields a modest-to-moderate positive relative impact for charters is obviously a noteworthy finding, one that should not be dismissed or downplayed, but the real policy value of these results is hidden beneath – how and whether these estimated effects vary by specific policies and practices, at the state-, district- and especially the school-level.
Needless to say, however, teasing out these relationships will not be easy – it will require a lot of additional data collection, and the results will never be conclusive. (CREDO is already situated to play a potentially breakthrough role, given their massive collection of testing data from almost every significant charter school market in the nation, and their above-mentioned area-level correlations are an excellent start.)
Another, more immediate problem with realizing this approach is that charter schools are so controversial, and supporters and opponents so dug into their positions, that many people are focused intensely on the question of whether charter schools "work," even though everyone already agrees on the best answer – some do, some do not. Shifting the focus from “whether” to “why” may be the best way to find some productive common ground. And it has the additional virtue of encouraging good education research and policymaking.
* For example, the overall estimated reading effect in this latest CREDO study (0.039 s.d.) is equivalent to 15.6 percent of a “typical year of growth” (0.039 / 0.250 = 0.156), which, given a 180 day school year, is in turn equivalent to 28 “days of learning” (180 X 0.156 = 28.08). Similarly, CREDO's overall math coefficient (0.055) is equivalent to 22 percent of a "typical year of growth" (0.055 / 0.250 = 0.220), which is 40 additional "days of learning" (0.220 X 180 = 39.6).
** Note that it would not be appropriate to apply this benchmark to any of the individual urban areas, since they might progress at different rates than the national sample of students who take the NAEP exam. In addition, as I’ve discussed before, charter schools may in fact add additional “days of learning” in terms of test scores because they actually do add additional days of learning – and sometimes 2-4 months of learning – to the school calendar.
*** It bears mentioning, also, that subsidized lunch eligibility is a poor proxy for income (and CREDO calls it “poverty,” which is not really what it measures).