The Year In Research On Market-Based Education Reform
** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post.
Race to the Top and Waiting for Superman made 2010 a banner year for the market-based education reforms that dominate our national discourse. By contrast, a look at the “year in research” presents a rather different picture for the three pillars of this paradigm: merit pay, charter schools, and using value-added estimates in high-stakes decisions.
There will always be exceptions (especially given the sheer volume of reports generated by think tanks, academics, and other players), and one year does not a body of research make. But a quick review of high-quality studies from independent, reputable researchers shows that 2010 was not a particularly good year for these policies.
First and perhaps foremost, the best experimental evaluation of teacher merit pay to date (by the National Center on Performance Incentives) found that teachers eligible for bonuses did not increase their students’ test scores more than those who were not eligible. Earlier in the year, a Mathematica study of Chicago’s TAP program (which includes data on the first two of the program’s three years) reached the same conclusion.
The almost universal reaction from market-based reformers was that merit pay is not supposed to generate short-term increases in test scores, but rather to improve the quality of applicants to the profession and their subsequent retention. This viewpoint, while reasonable on the surface, implies not only that merit pay is a leap of faith, one whose benefits will likely never be “proven” with any degree of rigor, but also that the case against teacher experience and education (criticized by some of the very same people for their weak association with short-term student test score gains) must be reassessed. On that note, a 2010 working paper showed that previous studies may have underestimated the returns to teacher experience, and that teachers’ value-added scores may continue improving for ten or more years.
In the area of charter schools, Mathematica researchers also released an experimental evaluation showing no test score benefits of charter middle schools. It directly followed a preliminary report on KIPP schools (also from Mathematica), which showed positive gains. The latter was widely touted, while the former was largely ignored. The conflicting results called for a deeper discussion of the specific policies and practices that differentiate KIPP from the vast majority of charters (also see this single-school study on KIPP from this year), which produce results no better and no worse than those of comparable public schools.
On this topic, an article published in the American Journal of Education not only found no achievement advantage for charters, but also that a measure of “innovation” was actually negatively associated with score gains. The aforementioned study of charter middle schools likewise found few positive correlations between school policies and achievement. As a result, explanations of why a few charters seem to do well remain elusive (I proposed school time as the primary mechanism in the case of KIPP).
So, some of the best work ever on charters and merit pay came in 2010, with very lackluster results, just as a massive wave of publicity and funding awarded these policy measures a starring role in our national education policy.
2010 was also a bountiful year for value-added research. Strangely, the value-added analysis that got the most attention by far – and which became the basis for a series of LA Times stories – was also among the least consequential. The results were very much in line with more than a decade of studies on teacher effects. The questionable decision to publish teachers’ names and scores, on the other hand, generated enormous public controversy.
From a purely research perspective, other studies were far more important. Perhaps most notable for its practical implications, a simulation by Mathematica researchers (published by the Education Department) showed high “Type I and Type II” error rates (classification errors that occur even when estimates are statistically significant), and these persisted even with multiple years of data. On a similar note, an analysis of teacher value-added scores in New York City – the largest district in the nation – found strikingly large error margins.
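To see why large error margins produce classification errors, consider a hypothetical Monte Carlo sketch (my illustration, not taken from any of the studies cited): if the noise in a value-added estimate is roughly as large as the true variation between teachers, a sizable share of teachers flagged as “bottom quintile” by their estimates are not actually in the bottom quintile.

```python
import random

random.seed(0)

N_TEACHERS = 1000
NOISE_SD = 1.0  # hypothetical: measurement error as large as the true spread

# True teacher effects (standardized) and noisy value-added estimates
true_effects = [random.gauss(0, 1) for _ in range(N_TEACHERS)]
estimates = [t + random.gauss(0, NOISE_SD) for t in true_effects]

# Flag the bottom 20 percent of teachers by their *estimated* value-added
cutoff = sorted(estimates)[int(0.2 * N_TEACHERS)]
flagged = [i for i, e in enumerate(estimates) if e < cutoff]

# Count flagged teachers who are NOT truly in the bottom 20 percent
true_cutoff = sorted(true_effects)[int(0.2 * N_TEACHERS)]
false_positives = sum(1 for i in flagged if true_effects[i] >= true_cutoff)
print(f"{false_positives / len(flagged):.0%} of flagged teachers misclassified")
```

The exact misclassification rate depends on the assumed noise level, but the qualitative point holds across reasonable settings: noisy estimates plus hard cutoffs yield many wrong classifications, which is the concern the Mathematica simulation quantified with real data structures.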
The news wasn’t all bad, of course, and value-added research almost never lends itself to simple “yes/no” verdicts. For instance, the recently-released preliminary results from the Gates-funded MET study provide new evidence that alternative measures of teacher quality, most notably students’ perceptions of their teachers’ effectiveness, maintain modest but significant correlations with value-added scores (similarly, another 2010 paper found an association between principal evaluations and value-added scores, and also demonstrated that principals may use this information productively).
In contrast to absurd mass media claims that these preliminary MET results validate the use of value-added scores in high-stakes decisions, the first round of findings represents the beginning of a foundation for building composite measures of teacher effectiveness. The final report of this effort (scheduled for release this fall) will be of greater consequence. In the meantime, a couple of studies this year (here and here) provided some evidence that certain teacher instructional practices are associated with better student achievement results (a major focus of the MET project).
There were also some very interesting teacher quality papers that didn’t get much public attention, all of which suggest that our understanding of teacher effects on test scores is still very much a work in progress. There are too many to list, but one particularly clever and significant working paper (from NBER) found that the “match quality” between teachers and schools explains about one-quarter of the variation in teacher effects (i.e., teachers would get different value-added scores in different schools).
A related, important paper (from CALDER researchers) found that teachers in high-poverty schools get lower value-added scores than those in more affluent schools, but that the differences are small and do not arise among the top teachers (and cannot be attributed to higher attrition in poorer schools). The researchers also found that the effects of experience are less consistent in higher-poverty schools, which may explain the discrepancies by school poverty. These contextual variations in value-added estimates carry substantial implications for the use of these estimates in high-stakes decisions (also see this article, published in 2010). They also show how we're just beginning to address some of the most important questions about these measures' use in actual decisions.
In the longer term, though, the primary contribution of the value-added literature has been to show that teachers vary widely in their effects on student test scores, and that most of this variation is unexplained by conventional variables. These findings remain well-established. But whether we can use value added to identify persistently high- and low-performing teachers is still very much an open question.
Nevertheless, 2010 saw many states and districts move ahead with incorporating heavily-weighted value-added measures into their evaluation systems. The reports above (and many others) sparked important debates about the imprecision of all types of teacher quality measures, and about how to account for this error while building new, more useful evaluation systems. The RTTT-fueled rush to design these systems might have benefited from this discussion, and from more analysis to guide it.
Overall, while 2010 will certainly be remembered as a watershed year for market-based reforms, this wave of urgency and policy changes unfolded concurrently with a steady flow of solid research suggesting that extreme caution, not haste, is in order.