The Year In Research On Market-Based Education Reform

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post.

Race to the Top and Waiting for Superman made 2010 a banner year for the market-based education reforms that dominate our national discourse. By contrast, a look at the “year in research” presents a rather different picture for the three pillars of this paradigm: merit pay, charter schools, and using value-added estimates in high-stakes decisions.

There will always be exceptions (especially given the sheer volume of reports generated by think tanks, academics, and other players), and one year does not a body of research make.  But a quick review of high-quality studies from independent, reputable researchers shows that 2010 was not a particularly good year for these policies.

First and perhaps foremost, the first and best experimental evaluation of teacher merit pay (by the National Center on Performance Incentives) found that teachers eligible for bonuses did not increase their students’ test scores more than those not eligible. Earlier in the year, a Mathematica study of Chicago’s TAP program (which includes data on the first two of the program’s three years) reached the same conclusion.

The almost universal reaction from the market-based reformers was that merit pay is not supposed to generate short-term increases in test scores, but rather to improve the quality of applicants to the profession and their subsequent retention. This viewpoint, while reasonable on the surface, not only implies that merit pay is a leap of faith, one that will likely never have its benefits “proven” with any degree of rigor, but also that the case against teacher experience and education (criticized by some of the very same people for their weak association with short-term student test score gains) must be reassessed. On that note, a 2010 working paper showed that previous studies may have underestimated the returns to teacher experience, and that teachers’ value-added scores may improve for ten or more years.

In the area of charter schools, Mathematica researchers also released an experimental evaluation showing no test score benefits of charter middle schools. This directly followed a preliminary report on KIPP schools (also from Mathematica), which showed positive gains. The latter was widely touted, while the former was largely ignored. The conflicting results begged for a deeper discussion of the specific policies and practices that differentiate KIPP from the vast majority of charters (also see this single-school study on KIPP from this year), which produce results that are no better or worse than comparable public schools.

On this topic, an article published in the American Journal of Education not only found no achievement advantage for charters, but also that a measure of “innovation” was actually negatively associated with score gains. The aforementioned study of charter middle schools likewise found few positive correlations between school policies and achievement. As a result, explanations of why a few charters seem to do well remain elusive (I proposed school time as the primary mechanism in the case of KIPP).

So, some of the best work ever on charters and merit pay came in 2010, with very lackluster results, just as a massive wave of publicity and funding awarded these policy measures a starring role in our national education policy.

2010 was also a bountiful year for value-added research. Strangely, the value-added analysis that got the most attention by far – and which became the basis for a series of LA Times stories – was also among the least consequential. The results were very much in line with more than a decade of studies on teacher effects. The questionable decision to publish teachers’ names and scores, on the other hand, garnered incredible public controversy.

From a purely research perspective, other studies were far more important. Perhaps most notably in the area of practical implications, a simulation by Mathematica researchers (published by the Education Department) showed high “Type I and Type II” error rates (classification errors that occur even when estimates are statistically significant), which persisted even with multiple years of data. On a similar note, a look at teacher value-added scores in New York City – the largest district in the nation – found strikingly large error margins.
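
To make the classification-error point concrete, here is a minimal simulation sketch. It is not the Mathematica method, and every number in it (the spread of true teacher effects, the size of the yearly noise, the bottom-20% cutoff) is an assumption chosen purely for illustration. It asks: if we flag the lowest-scoring fifth of teachers based on noisy estimates, what share of those flagged are not actually in the lowest fifth?

```python
import random


def misclassification_rate(n_teachers=5000, years=1, true_sd=1.0,
                           noise_sd=1.5, cutoff=0.2, seed=0):
    """Share of teachers flagged as 'bottom group' who are not truly in it.

    All parameters are hypothetical; none are taken from the studies above.
    """
    rng = random.Random(seed)
    # Each teacher has a fixed "true" effect on test scores.
    true_effects = [rng.gauss(0.0, true_sd) for _ in range(n_teachers)]

    # The observed estimate is the average of noisy yearly measurements.
    estimates = []
    for effect in true_effects:
        yearly = [effect + rng.gauss(0.0, noise_sd) for _ in range(years)]
        estimates.append(sum(yearly) / years)

    k = int(n_teachers * cutoff)
    ids = range(n_teachers)
    truly_low = set(sorted(ids, key=lambda i: true_effects[i])[:k])
    flagged_low = set(sorted(ids, key=lambda i: estimates[i])[:k])

    # Flagged teachers who do not belong in the bottom group.
    return len(flagged_low - truly_low) / k


for n_years in (1, 3):
    rate = misclassification_rate(years=n_years)
    print(f"{n_years} year(s) of data: {rate:.0%} of flagged teachers misclassified")
```

Under these assumed settings, averaging over more years lowers the error rate but does not remove it, which is the qualitative pattern described above. The actual magnitudes depend entirely on how noisy real estimates are.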

The news wasn’t all bad, of course, and value-added research almost never lends itself to simple “yes/no verdicts.”  For instance, the recently-released preliminary results from the Gates-funded MET study provide new evidence that alternative measures of teacher quality, most notably student perceptions of their effectiveness, maintain modest but significant correlations with value-added scores (similarly, another 2010 paper found an association between principal evaluations and value-added scores, and also demonstrated that principals may use this information productively).

In contrast to absurd mass media claims that these preliminary MET results validate the use of value-added scores in high-stakes decisions, the first round of findings represents the beginning of a foundation for building composite measures of teacher effectiveness. The final report of this effort (scheduled for release this fall) will be of greater consequence.  In the meantime, a couple of studies this year (here and here) provided some evidence that certain teacher instructional practices are associated with better student achievement results (a major focus of the MET project).

There were also some very interesting teacher quality papers that didn’t get much public attention, all of which suggest that our understanding of teacher effects on test scores is still very much a work in progress. There are too many to list, but one particularly clever and significant working paper (from NBER) found that the “match quality” between teachers and schools explains about one-quarter of the variation in teacher effects (i.e., teachers would get different value-added scores in different schools). 

A related, important paper (from CALDER researchers) found that teachers in high-poverty schools get lower value-added scores than those in more affluent schools, but that the differences are small and do not arise among the top teachers (and cannot be attributed to higher attrition in poorer schools).  The researchers also found that the effects of experience are less consistent in higher-poverty schools, which may explain the discrepancies by school poverty.  These contextual variations in value-added estimates carry substantial implications for the use of these estimates in high-stakes decisions (also see this article, published in 2010). They also show how we're just beginning to address some of the most important questions about these measures' use in actual decisions.

In the longer term, though, the primary contribution of the value-added literature has been to show that teachers vary widely in their effect on student test scores, and that most of the variation is unexplained by conventional variables. These findings remain well-established. But whether or not we can use value-added estimates to identify persistently high- and low-performers is still very much an open question.

Nevertheless, 2010 saw hundreds of states and districts move ahead with incorporating heavily-weighted value-added measures into their evaluation systems. The reports above (and many others) sparked important debates about the imprecision of all types of teacher quality measures, and how to account for this error while building new, more useful evaluation systems.  The RTTT-fueled rush to design these new systems might have benefited from this discussion, and from more analysis to guide it.

Overall, while 2010 will certainly be remembered as a watershed year for market-based reforms, this wave of urgency and policy changes unfolded concurrently with a steady flow of solid research suggesting that extreme caution, not haste, is in order.


Good summary overall.

I'd quibble with this description of the Mathematica middle school study: "Mathematica researchers also released an experimental evaluation showing no test score benefits of charter middle schools."

No benefits on average, yes, but the overall average masks a crucial distinction:

"we found that study charter schools serving more low income or low achieving students had statistically significant positive effects on math test scores, while charter schools serving more advantaged students—those with higher income and prior achievement—had significant negative effects on math test scores."

Based on that more detailed description of their findings, it looks as if charter schools are doing very different things for different students -- raising up low income and poor scoring students while actually harming richer and higher scoring students.

So if one wanted to expand charter schools in impoverished urban areas, the "no overall average benefit" finding would be beside the point.
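
The arithmetic behind this point can be sketched in a few lines. The effect sizes and school counts below are made up for illustration and are not the study's estimates. They simply show how two subgroup effects of opposite sign can pool to an average of zero.

```python
def pooled_effect(groups):
    """Weighted average of subgroup effects, weighted by number of schools."""
    total_schools = sum(n for _, n in groups.values())
    return sum(effect * n for effect, n in groups.values()) / total_schools


# Hypothetical values: (effect on math scores in standard deviations, schools).
groups = {
    "charters serving more low-income students": (0.15, 50),
    "charters serving more advantaged students": (-0.15, 50),
}

for name, (effect, n_schools) in groups.items():
    print(f"{name}: {effect:+.2f} SD across {n_schools} schools")
print(f"pooled average: {pooled_effect(groups):+.2f} SD")
```

A pooled figure near zero is therefore consistent with real, offsetting effects in each subgroup, so the average alone says little about either one.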


Thanks for the kind word, Stuart.

Fair enough on the Mathematica report (had to resist detail to keep the post reasonably short), though the effect among lower-income students was significant in math only (and, of course, the coefficients measure the relative, not the absolute, charter impact). Also, as you know, my point there was less about whether charters work than about why. From that angle, the discrepancy in results between high- and low-poverty schools is, in my view, not at all beside the point. It’s a critical issue. If the charter concept is sound, shouldn’t it produce better results in most cases, instead of in just one subject and “type” of school? Why would effects vary by student characteristics (and, for that matter, subject)? I could speculate, but that’s all it would be. It’s an interesting question, and an important one.


Here's what I would theorize:

1. I tend to agree with Mike Petrilli's explanation (…): urban charter schools tend to have the express focus of raising achievement, while lots of suburban charter schools have a very different purpose (to offer a more creative and progressive alternative to the traditional public schools).

So both types of charter schools could be succeeding at what they're trying to do, even though the "average" achievement gain is nil.

2. Finding achievement gains in math but not reading is what seems to happen in just about every educational study ever done. I exaggerate perhaps, but not by much. I suspect this is because schools are the main place in life where children learn and do math, whereas reading is a skill that most parents practice with their children. (Parents are more likely to read a book every night to their kids than to do a worksheet of long division problems.)

More than that: reading comprehension in higher grades is closely tied to background knowledge, and background knowledge is something that would seem to be affected much more by out-of-school factors than math.


Just on an anecdotal basis, the two charter schools near where I'm from are the Benton County School of the Arts, and Haas Hall (a science/math academy). If test scores at the latter go up (because math is one of their focuses) while test scores at the former don't do as well (because their whole purpose is to cater to kids who are more into ballet, music, art, etc., rather than test scores), then averaging their performance together would create the headline: "NO AVERAGE TEST SCORE GAINS FROM CHARTER SCHOOLS!"

But wouldn't that be the wrong way to look at it?



To the school focus explanation (which is plausible), I would add differences between high- and low-poverty schools in terms of: teacher labor markets; the applicant pool; the influence of peer effects and other unobserved school-level factors; and the relative performance of oversubscribed charter schools. I’m sure there are multiple explanations, varying by context, some substantive and some methodological. I’d like to see some actual work on this, and I’m not aware of much already done.

As for the persistent elusiveness of reading-improvement policies, it’s certainly true to a degree, but there is plenty of evidence of reading effects, in the charter literature and elsewhere (math effects are always stronger, but even small relative reading differences are detectable). Nevertheless, on the core issues - we’re in full agreement here. Reading is a skill that requires background knowledge largely taught by parents and other figures out of school. I think the best way to address these issues within the education system would be to complement the common core standards with a huge emphasis on common curriculum – making sure all students are being taught the best, research-based content that they need to read and write (and which would comprise the standards). This would, I think, go a LONG way. But even then, reading achievement is likely to remain highly resistant to intervention from education policies per se. I wish more commentators and especially policymakers would acknowledge this point and its implications.

Finally, on your headline point, I suppose it would be misleading if the reason for the difference was definitely the different foci of the schools (as it may be for the two schools near you but not necessarily for the [unknown] schools in the study). Also, charter schools produced benefits in only one quadrant of the “poverty/subject matrix,” so I think it’s fair to characterize the results as no effect, especially given the fact that I was breezing through many papers, that I had described the findings in detail when I wrote about the study in July, and, most importantly, that I was making a point about why charters work rather than whether they do so. If you disagree, then we’ll have to add that to the list!

Please keep reading/commenting.


I think pages 70-73 of the Mathematica middle school study are important. Quote:

"In study charter schools that served more economically disadvantaged students (that is, schools in which the percentage eligible for free or reduced-price meals was above the sample median), the estimated impact on Year 2 mathematics scores was positive and significant (impact = 0.18; p-value = 0.002)."

It goes on to say that charter schools serving black kids were better than those serving white kids.

So YMMV, but for me, if charter schools are doing a good job helping poor black kids (who need improved schooling more than rich whites) at increasing their math abilities (which we expect schools to be able to affect more than reading), then that's a big plus.


Forty years of work as an inner city public school teacher, administrator, PTA president, researcher and public school advocate convince me that there is no single typical "district" or charter public school. They vary widely, from Chinese, German, and Spanish immersion to Core Knowledge to project-based, etc. We are wasting a huge amount of time, effort and money trying to decide which is better, district or charter. Instead, we should, as we did in Cincinnati (with help from the Cincinnati Federation of Teachers), learn from the best, whether district or charter. That led to significantly increased graduation rates and an elimination of the high school graduation gap between white and African American students.


Mr. Di Carlo,

Could you help me understand what you meant in the following statement?

"Strangely, the value-added analysis that got the most attention by far – and which became the basis for a series of LA Times stories – was also among the least consequential. The results were very much in line with more than a decade of studies on teacher effects."

I looked at the link briefly, and it seems that the study shows that there are large differences in teacher quality as measured by value-added analysis. If this conclusion was used by the LA Times as part of their story, why was the study inconsequential?

I'm not disagreeing. I'm just not familiar enough with this story to know what you mean, and I'd like to understand this important issue.

Put another way, you cite lots of studies that show problems with value-added measures, yet for the LA study, you say that it is in line with more than a decade of research.

How can these both be true? Is there a difference between the kinds of studies that I'm not understanding?

Again, I'm not trying to be critical. I'm a teacher and I'd like to have a solid understanding of what the scholarship is teaching us.

Thank you.


Hi Jeffrey,

Thanks for your comment.

Your misunderstanding is entirely my fault, as I should have been more clear with my language. First of all, when I said the LAT analysis was not especially “consequential,” I meant that the results were not new or surprising from a research perspective (i.e., they are in line with over a decade of prior work). The research on value-added demonstrates wide variation in teacher effects on test scores (I think convincingly), and so does the LAT analysis (which was a high-quality analysis, by the way).

I suppose “consequential” is the wrong word to use there, especially since, in terms of public attention and impact, the LAT analysis (or at least the articles based on it) had a huge effect. But the findings were nothing new. “Surprising” or “original” might have been a better choice.

As for whether there is a contradiction between the studies I cite and the value-added literature, the studies are less “problems with value-added” than contributions to it. As you know, this is how empirical research works – a body of evidence on a given topic accumulates, and new work addresses new angles and contexts, which adds to the greater understanding of the topic.

However, it’s a whole different ballgame when you use this research in “real life” – in this case, using value-added estimates in high-stakes decisions about teachers. To do so, you have to be confident that the methods are “ready,” and that there is a good idea of how to use them properly. And this stuff is still evolving rapidly (in part because the special datasets linking teachers and students haven’t been widely collected until fairly recently). Many of the most important questions – such as whether school poverty or school “match” affect the results (and a dozen other issues), and how one might account for these factors – are just recently being addressed.

Yet so many states and districts, which might have benefited from this research, have already put new systems in place (and, for some of them, the haste shows).

So, what I was trying to say (again, apparently not very clearly) was that the research demonstrates that teacher effects vary, but it’s still an open question as to whether the estimates are accurate and/or stable enough to identify where *individual* teachers fall within that varying distribution. The LAT analysis didn’t contribute much new information to this research effort, even though it had a huge effect in a different way (sparking debate and controversy).

If you’re interested, I discuss some of these “theory/practice” issues in greater detail here:

I hope this answers your question. Let me know if it doesn’t. And thanks for being a dedicated teacher who cares enough to muddle through this stuff.



Mr. Di Carlo,

Thanks for your detailed response. I think I get it now. There are (sometimes large) differences between teachers, in terms of quality, but it's difficult to know with precision exactly who those people are, especially given the usually small amounts of data we have to work with.

I find this research fascinating and will continue to read your blog. Thanks again for taking the time to respond.