The Year In Research On Market-Based Education Reform: 2011 Edition

** Also posted here on Valerie Strauss' “Answer Sheet” in the Washington Post

If 2010 was the year of the bombshell in research in the three “major areas” of market-based education reform – charter schools, performance pay, and value-added in evaluations – then 2011 was the year of the slow, sustained march.

Last year, the landmark Race to the Top program was accompanied by a set of extremely consequential research reports, ranging from the policy-related importance of the first experimental study of teacher-level performance pay (the POINT program in Nashville) and the preliminary report of the $45 million Measures of Effective Teaching project, to the political controversy of the Los Angeles Times’ release of teachers’ scores from their commissioned analysis of Los Angeles testing data.

In 2011, on the other hand, as new schools opened and states and districts went about the hard work of designing and implementing new evaluation and compensation systems, the research almost seemed to adapt to the situation. There were few (if any) "milestones," but rather a steady flow of papers and reports focused on the finer-grained details of actual policy.*

Nevertheless, a review of this year's research shows that one thing remained constant: Despite all the lofty rhetoric, what we don’t know about these interventions outweighs what we do know by an order of magnitude.

2011 was perhaps an unprecedented year for charter school proliferation. Several states either passed laws allowing charters to open or lifted the caps on the number permitted to operate, while federal and private funds flowed in to assist with the expansion. Within the body of research on charters, there was, predictably, no movement in the standard finding on the test-based effects of these schools – that they vary widely, and are on the whole no more or less effective than comparable regular public schools.**

Accordingly, the annual meta-analysis from the Center on Reinventing Public Education (CRPE), which includes a large selection of high-quality charter studies released in recent years, once again concluded that charters do well in some contexts but not others. For its part, CREDO issued two supplemental state reports this year, finding moderate negative charter effects in Pennsylvania, and moderate positive effects in Indiana.

On a similar note, an important report (by Mathematica and CRPE) on charter management organizations (CMOs) presented an analysis of schools operated by 22 of these organizations around the nation. Insofar as the CMOs included tended to be well-established and heavily funded, the analysis was in many respects a test of schools that should theoretically be among the better ones in operation. The results were, however, in line with the general conclusion of the literature – schools run by some CMOs did well, others did poorly, and differences in both cases tended to be modest.

Despite this ongoing stalemate in the pointless “horse race” as to whether charters “work,” 2011 saw a substantial stream of research addressing the truly meaningful question of what makes the handful of charters that do get results succeed (discussed further here). For instance, the above-mentioned CMO report offered some insight, finding that there was a relationship between CMO effectiveness and the use of measures such as comprehensive behavior policies and teacher coaching (and, to a limited extent, more school time).

There were also several studies taking a look at the effects of the so-called “no excuses” model (most famously used by KIPP), which is generally characterized by an extended day/year, intensive tutoring, stepped-up recruitment and performance pay of teachers and administrators, strict discipline and data-informed instruction via interim assessments.

Perhaps most notably, we received the first round of results on a pilot program in Houston, in which 20 regular public schools are being “converted” to the “no excuses” model. The preliminary results for the Houston intervention’s first year were, however, mixed. There were no discernible gains in reading, to go along with strong increases in math achievement, which seemed to be in no small part driven by the schools’ tutoring program (further discussion here).

An evaluation of Massachusetts charter schools found significant gains (relative to comparable students in regular public schools) among urban charter students (but not non-urban students), especially those who were enrolled in oversubscribed, “no excuses” schools that provide extended time and strong emphases on discipline and achievement.

Finally, an analysis of New York City charters also found positive results, as well as, once again, evidence that practices such as extended time, high-dosage tutoring and data-driven instruction accounted for a substantial portion of the variation in performance. The same was found for the SEED boarding school in Washington, D.C., which employs what might be viewed as an extreme version of extended time (students stay overnight during the week). There were substantial gains in exchange for the school’s exorbitant cost (though, as is often the case with such intensive interventions, questions remain about the possible confounding role of attrition).

Overall, what these 2011 studies have done is finally start to shed productive light on why some few charters, such as KIPP and other “no excuses” chains, consistently succeed (also see this 2011 paper on hiring practices). In that sense, it was an important year for charter research, even if it lagged behind legislation/proliferation that it could have helped to guide (as discussed here).

In contrast, 2011 was a quiet year in the performance pay arena, at least compared with last year's succession of seminal studies, including the above-mentioned Nashville evaluation.

There were, of course, a few exceptions. The most consequential studies of actual programs were evaluations of New York City’s schoolwide bonus program. Two analyses – one from RAND and the other from Harvard professor Roland Fryer – found that the incentives had little or no effect on most outcomes, including student performance and teacher retention. To its credit, the city reacted to this evidence by ending the program.

There was also an evaluation of Denver’s ProComp system, which was presented in a working paper from the Center for Education Data and Research. ProComp is a multifaceted compensation system that includes incentives, professional development and other components. There was evidence of some improvement among participating teachers, but it wasn’t possible to prove definitively that it was a direct result of the program itself.

The biggest splash in the area of performance pay (and incentives/accountability in general) was not a program evaluation, but rather a large, extensive report released by the National Research Council (NRC). The impressive team of scholars presented a top-notch review of the evidence on incentives and test-based accountability, along with the primary conclusion that these measures had not been successful enough to warrant any expectations that they would bring the U.S. to the level of top-performing nations. This conclusion was widely repeated, sometimes overinterpreted and strongly contested.

In general, though, the 2011 research on performance pay did little to break the stalemate that was established last year. Proponents claim that these programs cannot be assessed by short-term testing outcomes, as their purpose is to attract higher-achieving, non-risk-averse young people to the profession, and keep them around once they get in the classroom (see discussions here and here, as well as this 2011 working paper, which presents some early, highly tentative evidence on this score).

Insofar as this outcome is extraordinarily difficult to measure/test, and would have to unfold over a period of many years, the debate on performance pay seems unlikely to budge much in response to empirical evidence, at least in the short term. As a result, the increasing number of states and districts adopting these programs, including Indiana, Florida and Idaho, are in many respects taking a leap of faith.

In the final major area of market-based reform – the use of value-added and other growth models in teacher evaluations – 2011 was an historic year. There was an astonishing increase in the adoption and implementation of new systems. In fact, the majority of U.S. public school students are now enrolled in states/districts that either currently use test-based productivity measures (e.g., value-added scores) in teacher evaluations, or will be using them very soon.

Despite this explosion, there remains virtually no concrete evidence as to the proper design or efficacy of these systems. To a significant degree, these questions will have to be resolved by observing and analyzing what happens on the ground over the coming years, though the evidence that does exist suggests that many states and districts are ignoring some of the critical details that will determine success or failure. Nevertheless, there was very important progress by the research community during 2011.***

For instance, an analysis of Cincinnati’s teacher evaluation system found that it did indeed generate improvements in math (but not reading) achievement, even among students of mid-career teachers, and even during the years after the evaluations were conducted. This study represents one of the first pieces of evidence that evaluations (this one uses student performance measures but not test scores) can potentially affect performance, but it also suggests that the structure of systems (e.g., constituent measures, timing) is important and in need of further examination, including, the paper suggests, "systems willing to experimentally vary the components of their evaluation system."

A couple of other papers used the results from new and existing evaluations to examine the relationships between the measures (e.g., value-added, observations) that comprise these systems. This is important because, put simply, different measures and weights can either compound or mitigate each other's imprecision.

A report from the Consortium on Chicago School Research found a complementary relationship between teachers’ value-added scores and their ratings on the Danielson observational framework, especially at the “tails” of the distribution (i.e., the teachers judged most and least effective). This squares with previous research on the relationship between principal and test-based assessments of teachers’ performance (also here).

On a similar note, researchers began to explore the properties and feasibility of performance measures other than the “big two” – observations and growth models using state assessments. For instance, a RAND report released in December 2010 offered a detailed look at the state of the research and practice on a variety of student performance measures, and how they are being combined with other indicators. A couple of papers examined the possibility of incorporating measures of productivity in evaluations of high school teachers (see here and here), while there were initial examinations of the properties of alternative assessments (tests other than the official state exams).

In an attempt to provide a framework for assessing all these different components, Brookings released a paper presenting a concrete formula for assessing the validity of non-testing components for teacher evaluations (e.g., observations) based on the degree to which they match up with a “benchmark” measure (e.g., growth model estimates; see Bruce Baker's discussion). This is similar to the controversial approach (discussed here) used by the Gates Foundation’s Measures of Effective Teaching project (the project’s final report was scheduled to be released this year, but was delayed).

Finally, this year saw a bunch of research tackling other very basic practical and/or technical details, such as how to deal with missing data in value-added models, and documentation of how teachers are (non-randomly) assigned to classrooms (also here).****

Unfortunately, however, just as researchers started to get a handle on these critical details of how to design better teacher evaluations, a large group of states had already determined many of the key features of these new systems, while some initiated rapid, full-blown implementation without so much as a pilot year.

Overall, then, it was a productive research year in the three areas discussed above, and it might have been even more productive but for the fact that, in too many cases, the policy decisions this work could have guided had already been made.

- Matt Di Carlo


* Needless to say, this is not a comprehensive presentation of the year’s research. It focuses on a broad selection of high-quality, mostly quantitative analyses in the three areas of charter schools, performance pay and the use of value-added and other growth model estimates in teacher evaluations. For the most part, this review is limited to papers/reports that were actually released for the first time in 2011, rather than released in previous years and finally published in 2011 (many are therefore still in the working paper phase, and should be interpreted with that in mind).

** It always bears mentioning that most of this work pertains solely to testing outcomes, and there is some limited evidence that charters may do better on other outcomes, such as parental satisfaction (e.g., this paper). In addition, many of the test-based analyses, including the oft-cited CREDO report, find that charters are more effective with certain subgroups, such as lower-scoring students.

*** One notable “carry over” from 2010 was a substantive response to the Los Angeles Times’ value-added analysis: A critique (and replication) of the paper’s methods by researchers Derek Briggs and Ben Domingue (released early this year by the National Education Policy Center, discussed here).

**** A few other important and/or interesting papers on teacher quality-related topics: a working paper about the achievement effects stemming from the assignment of teachers to “well-matched” students; an analysis showing that teacher mobility harms student achievement (also see this paper on how mobility affects the distribution of teacher quality); a look at one district’s web-based data tool for teachers, which finds relatively low levels of use and no association with achievement outcomes; an analysis finding that pre-service measures of leadership experience and perseverance maintain a discernible relationship with future test-based effectiveness; and a working paper examining the association between schools’ effectiveness and their practices surrounding teacher hiring, assignment, development and retention.