Teacher Evaluation

  • What Value-Added Research Does And Does Not Show

    Written on December 1, 2011

    Value-added and other types of growth models are probably the most controversial issue in education today. These methods, which use sophisticated statistical techniques to attempt to isolate a teacher’s effect on student test score growth, are rapidly assuming a central role in policy, particularly in the new teacher evaluation systems currently being designed and implemented. Proponents view them as a primary tool for differentiating teachers based on performance/effectiveness.

    Opponents, on the other hand, including a great many teachers, argue that the models’ estimates are unstable over time, subject to bias and imprecision, and that they rely entirely on standardized test scores, which are, at best, an extremely partial measure of student performance. Many have come to view growth models as exemplifying all that’s wrong with the market-based approach to education policy.

    It’s very easy to understand this frustration. But it's also important to separate the research on value-added from the manner in which the estimates are being used. Virtually all of the contention pertains to the latter, not the former. Actually, you would be hard-pressed to find many solid findings in the value-added literature that wouldn't ring true to most educators.
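    To make the post's description of these methods concrete, here is a toy sketch (my own illustration, not any district's actual model) of the basic idea behind a value-added estimate: regress current test scores on prior scores plus teacher indicators, and read each teacher's estimated "effect" off the indicator coefficients. The data, teacher count, and effect sizes are all invented for the example.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_per_teacher = 200
    true_effects = np.array([0.0, 0.5, -0.5])  # assumed "true" teacher effects

    prior = rng.normal(size=n_per_teacher * 3)               # prior-year scores
    teacher = np.repeat(np.arange(3), n_per_teacher)         # teacher assignment
    # Current score = persistence of prior score + teacher effect + noise
    current = 0.7 * prior + true_effects[teacher] + rng.normal(scale=0.5, size=prior.size)

    # Design matrix: intercept, prior score, and dummies for teachers 1 and 2
    # (teacher 0 is the reference category).
    X = np.column_stack([
        np.ones_like(prior),
        prior,
        (teacher == 1).astype(float),
        (teacher == 2).astype(float),
    ])
    beta, *_ = np.linalg.lstsq(X, current, rcond=None)

    # beta[2] and beta[3] estimate teachers 1 and 2 relative to teacher 0;
    # with this much data they land close to 0.5 and -0.5.
    print(beta[2], beta[3])
    ```

    Real value-added models are far more elaborate (student covariates, multiple years, shrinkage), but the debates in the post are about this core move: attributing the residual growth to the teacher.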

  • The Categorical Imperative In New Teacher Evaluations

    Written on November 22, 2011

    There is a push among many individuals and groups advocating new teacher evaluations to predetermine the number of outcome categories – e.g., highly effective, effective, developing, ineffective, etc. – that these new systems will include. For instance, a "statement of principles" signed by 25 education advocacy organizations recommends that the reauthorized ESEA law require “four or more levels of teacher performance." The New Teacher Project’s primary report on redesigning evaluations made the same suggestion.* For their part, many states have followed suit, mandating new systems with a minimum of 4-5 categories.

    The rationale here is pretty simple on the surface: Those pushing for a minimum number of outcome categories believe that teacher performance must be adequately differentiated, a goal on which prior systems, most of which relied on dichotomous satisfactory/unsatisfactory schemes, fell short. In other words, the categories in new evaluation systems must reflect the variation in teacher performance, and that cannot be accomplished when there are only a couple of categories.

    It’s certainly true that the number of categories matters – it is an implicit statement as to the system’s ability to tease out the “true” variation in teacher performance. The number of categories a teacher evaluation system employs should depend on how well it can differentiate teachers with a reasonable degree of accuracy. If a system is unable to pick up this “true” variation, then using several categories may end up doing more harm than good, because it will be providing faulty information. And, at this early stage, despite the appearance of certainty among some advocates, it remains unclear whether all new teacher evaluation systems should require four or more levels of “effectiveness."

  • The Impact Of The Principal In The Classroom

    Written on November 3, 2011

    Direct observation is a way of gathering data by watching behavior or events as they occur; for example, a teacher teaching a lesson. This methodology is important to teacher induction and professional development, as well as teacher evaluation. Yet, direct observation has a major shortcoming: it is a rather obtrusive data gathering technique. In other words, we know the observer can influence the situation and the behavior of those being observed. We also know people do not behave the same way when they know they are being watched. In psychology, these forms of reactivity are known as the Hawthorne effect and the observer-expectancy (or experimenter-expectancy) effect (also here).

    Social scientists and medical researchers are well aware of these issues and the fact that research findings don’t mean a whole lot when the researcher and/or the study participants know the purpose of the research and/or are aware that they are being observed or tested. To circumvent these obstacles, techniques like “mild deception” and “covert observation” are frequently used in social science research.

    For example, experimenters often take advantage of “cover stories” which give subjects a sensible rationale for the research while preventing them from knowing (or guessing) the true goals of the study, which would threaten the experiment’s internal validity – see here. Also, researchers use double-blind designs, which, in the medical field, mean that neither the research participant nor the researcher knows when the treatment or the placebo is being administered.

  • A Few Other Principles Worth Stating

    Written on October 12, 2011

    Last week, a group of around 25 education advocacy organizations, including influential players such as Democrats for Education Reform and The Education Trust, released a "statement of principles" on the role of teacher quality in the reauthorization of the Elementary and Secondary Education Act (ESEA). The statement, which is addressed to the chairs and ranking members of the Senate and House committees handling the reauthorization, lays out some guidelines for teacher-focused policy in ESEA (a draft of the full legislation was released this week; summary here).

    Most of the statement is the standard fare from proponents of market-based reform, some of which I agree with in theory if not practice. What struck me as remarkable was the framing argument presented in the statement's second sentence:

    Research shows overwhelmingly that the only way to close achievement gaps – both gaps between U.S. students and those in higher-achieving countries and gaps within the U.S. between poor and minority students and those more advantaged – and transform public education is to recruit, develop and retain great teachers and principals.
    This assertion is false.

  • Quality Control, When You Don't Know The Product

    Written on August 30, 2011

    Last week, New York State’s Supreme Court issued an important ruling on the state’s teacher evaluations. The aspect of the ruling that got the most attention was the proportion of evaluations – or “weight” – that could be assigned to measures based on state assessments (in the form of estimates from value-added models). Specifically, the Court ruled that these measures can only comprise 20 percent of a teacher’s evaluation, compared with the option of up to 40 percent for which Governor Cuomo and others were pushing. Under the decision, the other 20 percent must consist entirely of alternative test-based measures (e.g., local assessments).
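    The weights at issue in the ruling combine mechanically, as in this hypothetical illustration (the component names and scores are invented; only the 20/20 split for state and local test-based measures comes from the post):

    ```python
    # Under the ruling: at most 20% state value-added, 20% other test-based
    # measures, leaving 60% for everything else (observations, etc.).
    weights = {"state_vam": 0.20, "local_assessments": 0.20, "other_measures": 0.60}
    scores  = {"state_vam": 3.0,  "local_assessments": 2.5,  "other_measures": 3.5}

    composite = sum(weights[k] * scores[k] for k in weights)
    print(round(composite, 2))  # 0.2*3.0 + 0.2*2.5 + 0.6*3.5 = 3.2
    ```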

    Joe Williams, head of Democrats for Education Reform, one of the flagship organizations of the market-based reform movement, called the ruling “a slap in the face” and “a huge win for the teachers unions." He characterized the policy impact as follows: “A mediocre teacher evaluation just got even weaker."

    This statement illustrates perfectly the strange reasoning that seems to be driving our debate about evaluations.

  • Certainty And Good Policymaking Don't Mix

    Written on August 24, 2011

    Using value-added and other types of growth model estimates in teacher evaluations is probably the most controversial and oft-discussed issue in education policy over the past few years.

    Many people (including a large proportion of teachers) are opposed to using student test scores in their evaluations, as they feel that the measures are not valid or reliable, and that they will incentivize perverse behavior, such as cheating or competition between teachers. Advocates, on the other hand, argue that student performance is a vital part of teachers’ performance evaluations, and that the growth model estimates, while imperfect, represent the best available option.

    I am sympathetic to both views. In fact, in my opinion, there are only two unsupportable positions in this debate: Certainty that using these measures in evaluations will work; and certainty that it won’t. Unfortunately, that’s often how the debate has proceeded – two deeply-entrenched sides convinced of their absolutist positions, and resolved that any nuance in or compromise of their views will only preclude the success of their efforts. You’re with them or against them. The problem is that it's the nuance – the details – that determines policy effects.

    Let’s be clear about something: I'm not aware of a shred of evidence – not a shred – that the use of growth model estimates in teacher evaluations improves performance of either teachers or students.

  • Test-Based Teacher Evaluations Are The Status Quo

    Written on August 18, 2011

    We talk a lot about the “status quo” in our education debates. For instance, there is a common argument that the failure to use evidence of “student learning” (in practice, usually defined in terms of test scores) in teacher evaluations represents the “status quo” in this (very important) area.

    Now, the implication that “anything is better than the status quo” is a rather massive fallacy in public policy, as it assumes that the benefits of alternatives will outweigh their costs, and that there is no chance the replacement policy will have a negative impact (almost always an unsafe assumption). But, in the case of teacher evaluations, the “status quo” is no longer what people seem to think.

    Not counting Puerto Rico and Hawaii, the ten largest school districts in the U.S. are (in order): New York City; Los Angeles; Chicago; Dade County (FL); Clark County (NV); Broward County (FL); Houston; Hillsborough (FL); Orange County (FL); and Palm Beach County (FL). Together, they serve about eight percent of all K-12 public school students in the U.S., and over one in ten of the nation’s low-income children.

    Although details vary, every single one of them is either currently using test-based measures of effectiveness in its evaluations, or is in the process of designing/implementing these systems (most due to statewide legislation).

  • Again, Niche Reforms Are Not The Answer

    Written on August 9, 2011

    Our guest author today is David K. Cohen, John Dewey Collegiate Professor of Education and professor of public policy at the University of Michigan, and a member of the Shanker Institute’s board of directors.

    A recent response to my previous post on these pages helps to underscore one of my central points: If there is no clarity about what it will take to improve schools, it will be difficult to design a system that can do it. In a recent essay in the Sunday New York Times Magazine, Paul Tough wrote that education reformers who advocated "no excuses" schooling were now making excuses for reformed schools' weak performance. He explained why: "Most likely for the same reason that urban educators from an earlier generation made excuses: successfully educating large numbers of low-income kids is very, very hard."

    In his post criticizing my initial essay, "What does it mean to ‘fix the system’?," the Fordham Institute’s Chris Tessone told the story of how Newark Public Schools tried to meet the requirements of a federal school turnaround grant. The terms of the grant required that each of three failing high schools replace at least half of its staff. The schools, he wrote, met this requirement largely by swapping a portion of their staffs with one another, a process which Tessone and school administrators refer to as the “dance of the lemons.” Would such replacement be likely to solve the problem?

    Even if all of the replaced teachers had been weak (which we do not know), I doubt that such replacement could have done much to help.

  • Success Via The Presumption Of Accuracy

    Written on July 27, 2011

    In our previous post, Professor David K. Cohen argued that reforms such as D.C.’s new teacher evaluation system (IMPACT) will not by themselves lead to real educational improvement, because they focus on the individual rather than systemic causes of low performance. He framed this argument in terms of the new round of IMPACT results, which were released two weeks ago. While the preliminary information was limited, it seems that the distribution of teachers across the four ratings categories (highly effective, effective, minimally effective, and ineffective) was roughly similar to last year’s - including a small group of teachers fired for receiving the lowest “ineffective” rating, and a somewhat larger group (roughly 200) fired for having received the “minimally effective” label for two consecutive years.

    Cohen’s argument on the importance of infrastructure does not necessarily mean that we should abandon the testing of new evaluation systems, only that we should be very careful about how we interpret their results and the policy conclusions we draw from them (which is good advice at all times). Unfortunately, however, it seems that caution is in short supply. For instance, shortly after the IMPACT results were announced, the Washington Post ran an editorial, entitled “DC Teacher Performance Evaluations Are Working," in which a couple of pieces of “powerful evidence” were put forward in an attempt to support this bold claim. The first was that 58 percent of the teachers who received a “minimally effective” rating last year and remained in the district were rated either “effective” or “highly effective” this year. The second was that around 16 percent of DC teachers were rated “highly effective” this year, and will be offered bonuses, which the editorial writers argued shows that most teachers “are doing a good job” and being rewarded for it.

    The Post’s claim that these facts represent evidence - much less “powerful evidence” - of IMPACT’s success is a picture-perfect example of the flawed evidentiary standards that too often drive our education debate. The unfortunate reality is that we have virtually no idea whether IMPACT is actually “working," and we won’t have even a preliminary grasp for some time. Let’s quickly review the Post’s evidence.
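    One reason category movement alone is weak evidence is regression to the mean, which a toy simulation (my own illustration, not drawn from the post or from IMPACT data) makes vivid: teachers whose "true" performance never changes, measured with noise, still switch categories between years in large numbers.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    true_perf = rng.normal(size=10_000)  # fixed underlying performance, never changes
    # Each year's rating = true performance + independent measurement noise
    year1 = true_perf + rng.normal(size=true_perf.size)
    year2 = true_perf + rng.normal(size=true_perf.size)

    # Label the bottom 15% of year-1 ratings "minimally effective"
    low_cutoff = np.quantile(year1, 0.15)
    low_in_y1 = year1 <= low_cutoff

    # Share of those teachers who score above the cutoff the following year,
    # despite zero actual change in performance
    moved_up = float(np.mean(year2[low_in_y1] > low_cutoff))
    print(round(moved_up, 2))
    ```

    With this (assumed) noise level, well over half of the bottom category "improves" out of it purely by chance - a pattern hard to distinguish from the 58 percent figure the Post cited as evidence of success.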

  • Evaluating Individual Teachers Won't Solve Systemic Educational Problems

    Written on July 26, 2011

    ** Also posted here on "Valerie Strauss' Answer Sheet" in the Washington Post

    Our guest author today is David K. Cohen, John Dewey Collegiate Professor of Education and professor of public policy at the University of Michigan, and a member of the Shanker Institute’s board of directors.  

    What are we to make of recent articles (here and here) extolling IMPACT, Washington DC’s fledgling teacher evaluation system, for how many "ineffective" teachers have been identified and fired, how many "highly effective" teachers rewarded? It’s hard to say.

    In a forthcoming book, Teaching and Its Predicaments (Harvard University Press, August 2011), I argue that fragmented school governance in the U.S., coupled with the lack of coherent educational infrastructure, makes it difficult either to broadly improve teaching and learning or to have valid knowledge of the extent of improvement. Merriam-Webster defines "infrastructure" as: "the underlying foundation or basic framework (as of a system or organization)." The term is commonly used to refer to the roads, rail systems, and other frameworks that facilitate the movement of things and people, or to the physical and electronic mechanisms that enable voice and video communication. But social systems also can have such "underlying foundations or basic frameworks". For school systems around the world, the infrastructure commonly includes student curricula or curriculum frameworks, exams to assess students’ learning of the curricula, instruction that centers on teaching that curriculum, and teacher education that aims to help prospective teachers learn how to teach the curricula. The U.S. has had no such common and unifying infrastructure for schools, owing in part to fragmented government (including local control) and traditions of weak state guidance about curriculum and teacher education.

    Like many recent reform efforts that focus on teacher performance and accountability, IMPACT does not attempt to build infrastructure, but rather assumes that weak individual teachers are the problem. There are some weak individual teachers, but the chief problem has been a non-system that offers no guidance or support for strong teaching and learning, precisely because there has been no infrastructure. IMPACT frames reform as a matter of solving individual problems when the weakness is systemic.


