The Data Are In: Experiments In Policy Are Worth It

Our guest author today is David Dunning, professor of psychology at Cornell University, and a fellow of both the American Psychological Society and the American Psychological Association. 

When I was a younger academic, I often taught a class on research methods in the behavioral sciences. On the first day of that class, I took it as my mission to teach students only one thing—that conducting research in the behavioral sciences ages a person. I meant that in two ways. First, conducting research is humbling and frustrating. I cannot count the number of pet ideas I have had through the years, all of them beloved, that have gone to die in the laboratory at the hands of data unwilling to verify them.

But, second, there is another, more positive way in which research ages a person. At times, data come back and verify a cherished idea, or even reveal a more provocative or valuable one that no one ever expected. It is a heady experience in those moments for the researcher to know something that perhaps no one else knows, to be wiser—more aged if you will—in a small corner of the human experience that he or she cares about deeply.

Gender Pay Gaps And Educational Achievement Gaps

There is an ongoing rhetorical war of sorts over the gender wage gap. One “side” makes the common argument that women earn around 75 cents on the male dollar (see here, for example).

Others assert that the gender gap is a myth, or that it is so small as to be unimportant.

Often, these types of exchanges are enough to exasperate the casual observer, and inspire claims such as “statistics can be made to say anything." In truth, however, the controversy over the gender gap is a good example of how descriptive statistics, by themselves, say nothing. What matters is how they’re interpreted.
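To make that point concrete, here is a minimal sketch in Python, using simulated data and made-up coefficients rather than actual wage statistics, of how the very same dataset can yield both a large "raw" gap and a much smaller regression-adjusted gap. Neither number is wrong; they answer different questions, which is why interpretation does all the work.

```python
# A minimal sketch, using simulated (not real) wage data, of how a "raw"
# gender gap and a regression-adjusted gap can both be computed from the
# same dataset -- and can tell very different stories.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

female = rng.integers(0, 2, n)                     # 1 = female, 0 = male
hours = 40 - 4 * female + rng.normal(0, 5, n)      # assumption: fewer paid hours for women
experience = 10 - 2 * female + rng.normal(0, 3, n) # assumption: less accumulated experience

# Assumed wage process: pay depends on hours and experience,
# plus a smaller "unexplained" penalty for women.
log_wage = 2.0 + 0.02 * hours + 0.03 * experience - 0.05 * female + rng.normal(0, 0.2, n)
wage = np.exp(log_wage)

# Raw gap: simple ratio of average wages (the "cents on the dollar" style figure).
raw_ratio = wage[female == 1].mean() / wage[female == 0].mean()

# Adjusted gap: OLS of log wage on gender, controlling for hours and experience.
X = np.column_stack([np.ones(n), female, hours, experience])
coef, *_ = np.linalg.lstsq(X, log_wage, rcond=None)
adjusted_penalty = coef[1]  # approximate proportional gap, holding controls constant

print(f"Raw female/male wage ratio: {raw_ratio:.2f}")
print(f"Adjusted gap (log points), holding hours/experience constant: {adjusted_penalty:.3f}")
```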

Moreover, the manner in which one must interpret various statistics on the gender gap applies almost perfectly to the achievement gaps that are so often mentioned in education debates.

Quality Control In Charter School Research

There's a fairly large body of research showing that charter schools vary widely in test-based performance relative to regular public schools, by location as well as by subgroup. Yet, you'll often hear people point out that the highest-quality evidence suggests otherwise (see here, here and here) - i.e., that there are a handful of studies using experimental methods (randomized controlled trials, or RCTs) and these analyses generally find stronger, more uniform positive charter impacts.

Sometimes, this argument is used to imply that the evidence, as a whole, clearly favors charters, and, perhaps by extension, that many of the rigorous non-experimental charter studies - those using sophisticated techniques to control for differences between students - would lead to different conclusions were they RCTs.*

Though these latter assertions are based on a valid point about the power of experimental studies (the few we have are often ignored in the debate over charters), they are overstated, for a couple of reasons discussed below. But a new report from the (indispensable) organization Mathematica addresses the issue head on, by directly comparing estimates of charter school effects that come from an experimental analysis with those from non-experimental analyses of the same group of schools.

The researchers find that there are differences in the results, but many are not statistically significant, and those that are significant do not usually alter the conclusions. This is an important (and somewhat rare) study, one that does not, of course, settle the issue, but does provide some additional tentative support for the use of strong non-experimental charter research in policy decisions.
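To give a rough sense of what such a comparison involves, here is a hedged Python sketch on simulated data; it is not the Mathematica methodology itself, and every coefficient is an illustrative assumption. It contrasts an experimental estimate (a simple difference in means under lottery assignment) with a non-experimental estimate that controls for prior achievement in the presence of self-selection.

```python
# A rough sketch, on simulated data, of the kind of comparison the report makes:
# an experimental (lottery-based) estimate of a charter effect versus a
# non-experimental estimate that controls for students' prior achievement.
# The numbers and model here are illustrative assumptions, not the study's.
import numpy as np

rng = np.random.default_rng(42)
n = 5_000
true_effect = 0.10                                  # assumed charter effect (in SD units)

prior_score = rng.normal(0, 1, n)                   # baseline achievement

# Experimental setup: admission assigned by lottery, independent of prior_score.
lottery_win = rng.integers(0, 2, n)
outcome_rct = 0.5 * prior_score + true_effect * lottery_win + rng.normal(0, 0.8, n)
experimental_est = outcome_rct[lottery_win == 1].mean() - outcome_rct[lottery_win == 0].mean()

# Non-experimental setup: enrollment is related to prior achievement (self-selection),
# so a naive comparison is biased; controlling for prior_score removes much of the bias.
enroll_prob = 1 / (1 + np.exp(-0.7 * prior_score))
enrolled = (rng.uniform(0, 1, n) < enroll_prob).astype(int)
outcome_obs = 0.5 * prior_score + true_effect * enrolled + rng.normal(0, 0.8, n)

naive_est = outcome_obs[enrolled == 1].mean() - outcome_obs[enrolled == 0].mean()

X = np.column_stack([np.ones(n), enrolled, prior_score])
coef, *_ = np.linalg.lstsq(X, outcome_obs, rcond=None)
adjusted_est = coef[1]

print(f"True (simulated) effect:       {true_effect:.3f}")
print(f"Experimental estimate:         {experimental_est:.3f}")
print(f"Naive non-experimental:        {naive_est:.3f}")
print(f"Regression-adjusted estimate:  {adjusted_est:.3f}")
```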

Staff Matters: Social Resilience In Schools

In the world of education, particularly in the United States, educational fads, policy agendas, and funding priorities tend to change rapidly. The attention of education research fluctuates accordingly. And, as David Cohen persuasively argues in Teaching and Its Predicaments, the nation has little coherent educational infrastructure to fall back upon. As a result of all this, teachers’ work is almost always surrounded by considerable uncertainty (e.g., the lack of a common curriculum) and variation. In such a context, it is no surprise that collaboration and collegiality figure prominently in teachers’ world (and work) views.

After all, difficulties can be dealt with more effectively when/if individuals are situated in supportive and close-knit social networks from which to draw strength and resources. In other words, in the absence of other forms of stability, the ability of a group – a group of teachers in this case – to work together becomes indispensable to cope with challenges and change.

The idea that teachers’ jobs are surrounded by uncertainty made me think of problems often encountered in the field of security. In this sector, because threats are increasingly complex and unpredictable, much of the focus has shifted away from heightened protection and toward increased resilience. Resilience is often understood as the ability of communities to survive and thrive after disasters or emergencies.

The Test-Based Evidence On New Orleans Charter Schools

Charter schools in New Orleans (NOLA) now serve over four out of five students in the city – the largest market share of any big city in the nation. As of the 2011-12 school year, most of the city’s schools (around 80 percent), charter and regular public, are overseen by the Recovery School District (RSD), a statewide agency created in 2003 to take over low-performing schools, which assumed control of most NOLA schools in Katrina’s aftermath.

Around three-quarters of these RSD schools (50 out of 66) are charters. The remainder of NOLA’s schools are overseen either by the Orleans Parish School Board (which is responsible for 11 charters and six regular public schools, and taxing authority for all parish schools) or by the Louisiana Board of Elementary and Secondary Education (which is directly responsible for three charters, and also supervises the RSD).

New Orleans is often held up as a model for the rapid expansion of charter schools in other urban districts, based on the argument that charter proliferation since 2005-06 has generated rapid improvements in student outcomes. There are two separate claims potentially embedded in this argument. The first is that the city’s schools perform better than they did pre-Katrina. The second is that NOLA’s charters have outperformed the city’s dwindling supply of traditional public schools since the hurricane.

Although I tend strongly toward the viewpoint that whether charter schools "work" is far less important than why - e.g., specific policies and practices - it might nevertheless be useful to quickly address both of the claims above, given all the attention paid to charters in New Orleans.

Value-Added Versus Observations, Part Two: Validity

In a previous post, I compared value-added (VA) and classroom observations in terms of reliability – the degree to which they are free of error and stable over repeated measurements. But even the most reliable measures aren’t useful unless they are valid – that is, unless they’re measuring what we want them to measure.

Arguments over the validity of teacher performance measures, especially value-added, dominate our discourse on evaluations. There are, in my view, three interrelated issues to keep in mind when discussing the validity of VA and observations. The first is definitional – in a research context, validity is less about a measure itself than the inferences one draws from it. The second point might follow from the first: The validity of VA and observations should be assessed in the context of how they’re being used.

Third and finally, given the difficulties in determining whether either measure is valid in and of itself, as well as the fact that so many states and districts are already moving ahead with new systems, the best approach at this point may be to judge validity in terms of whether the evaluations are improving outcomes. And, unfortunately, there is little indication that this is happening in most places.

Value-Added Versus Observations, Part One: Reliability

Although most new teacher evaluations are still in various phases of pre-implementation, it’s safe to say that classroom observations and/or value-added (VA) scores will be the most heavily weighted components of teachers’ final scores, depending on whether teachers are in tested grades and subjects. One gets the general sense that many - perhaps most - teachers strongly prefer the former (observations, especially peer observations) over the latter (VA).

One of the most common arguments against VA is that the scores are error-prone and unstable over time - i.e., that they are unreliable. And it's true that the scores fluctuate between years (also see here), with much of this instability due to measurement error, rather than “real” performance changes. On a related note, different model specifications and different tests can yield very different results for the same teacher/class.
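To illustrate how measurement error alone can produce this kind of instability, here is a small simulation sketch (the variance figures are assumptions, not estimates from any real VA model): teachers' "true" effects are held identical across two years, yet the year-to-year correlation of their estimated scores falls well below 1 simply because each year's estimate is noisy.

```python
# A small simulation showing how measurement error alone produces unstable
# value-added estimates: true teacher effects are identical in both years,
# but noisy estimation makes year-to-year scores only moderately correlated.
# The variance numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(7)
n_teachers = 2_000

true_effect = rng.normal(0, 0.15, n_teachers)     # stable "real" performance
noise_sd = 0.15                                   # sampling/measurement error per year

est_year1 = true_effect + rng.normal(0, noise_sd, n_teachers)
est_year2 = true_effect + rng.normal(0, noise_sd, n_teachers)

year_to_year_r = np.corrcoef(est_year1, est_year2)[0, 1]
reliability = true_effect.var() / est_year1.var()  # signal share of total variance

print(f"Year-to-year correlation of estimates: {year_to_year_r:.2f}")  # roughly 0.5 here
print(f"Approximate reliability of a single year's score: {reliability:.2f}")
```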

These findings are very important, and often too casually dismissed by VA supporters, but the issue of reliability is, to varying degrees, endemic to all performance measurement. Actually, many of the standard reliability-based criticisms of value-added could also be leveled against observations. Since we cannot observe “true” teacher performance, it’s tough to say which is “better” or “worse," despite the certainty with which both “sides” often present their respective cases. And, the fact that both entail some level of measurement error doesn't by itself speak to whether they should be part of evaluations.*

Nevertheless, many states and districts have already made the choice to use both measures, and in these places, the existence of imprecision is less important than how to deal with it. Viewed from this perspective, VA and observations are in many respects more alike than different.

If Your Evidence Is Changes In Proficiency Rates, You Probably Don't Have Much Evidence

Education policymaking and debates are under constant threat from an improbable assailant: Short-term changes in cross-sectional proficiency rates.

The use of rate changes is still proliferating rapidly at all levels of our education system. These measures, which play an important role in the provisions of No Child Left Behind, are already prominent components of many states’ core accountability systems (e.g., California), while several others will be using some version of them in their new, high-stakes school/district “grading systems." New York State is awarding millions in competitive grants, with almost half the criteria based on rate changes. District consultants issue reports recommending widespread school closures and reconstitutions based on these measures. And, most recently, U.S. Secretary of Education Arne Duncan used proficiency rate increases as “preliminary evidence” supporting the School Improvement Grants program.

Meanwhile, on the public discourse front, district officials and other national leaders use rate changes to “prove” that their preferred reforms are working (or are needed), while their critics argue the opposite. Similarly, entire charter school sectors are judged, up or down, by whether their raw, unadjusted rates increase or decrease.

So, what’s the problem? In short, it’s that year-to-year changes in proficiency rates are not valid evidence of school or policy effects. These measures cannot do the job we are asking them to do, even on a limited basis. This really has to stop.
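One core reason can be seen in a simple simulated illustration (the score distributions and proficiency cutoff below are assumptions chosen for clarity, not real test data): cross-sectional rates compare different cohorts of students in different years, so the rate can move substantially even when the school's contribution is held exactly constant.

```python
# A simulated illustration of why year-to-year proficiency rate changes are
# weak evidence of school effects: the two years cover different cohorts, so
# the rate can move even when the school's contribution is held constant.
# Distributions and the proficiency cutoff are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
cutoff = 0.0                    # "proficient" means scoring above this threshold
school_effect = 0.10            # identical in both years by construction
n = 100                         # one grade cohort per year

# Two different cohorts with slightly different incoming achievement.
cohort_year1 = rng.normal(-0.10, 1.0, n) + school_effect
cohort_year2 = rng.normal(+0.10, 1.0, n) + school_effect

rate_year1 = (cohort_year1 > cutoff).mean()
rate_year2 = (cohort_year2 > cutoff).mean()

print(f"Proficiency rate, year 1: {rate_year1:.0%}")
print(f"Proficiency rate, year 2: {rate_year2:.0%}")
print(f"Change: {rate_year2 - rate_year1:+.0%} -- with no change in school effectiveness")
```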

The Uses (And Abuses?) Of Student Data

Knewton, a technology firm founded in 2008, has developed an “adaptive learning platform” that received significant media attention (also here, here, here and here), as well as funding and recognition early last fall and, again, in February this year (here and here). Although the firm is not alone in the adaptive learning game – e.g., Dreambox, Carnegie Learning – Knewton’s partnership with Pearson puts the company in a whole different league.

Adaptive learning takes advantage of student-generated information; thus, important questions about data use and ownership need to be brought to the forefront of the technology debate.

Adaptive learning software adjusts the presentation of educational content to students' needs, based on students’ prior responses to such content. In the world of research, such ‘prior responses’ would count as, and be treated as, data. To the extent that adaptive learning is a mechanism for collecting information about learners, questions about privacy, confidentiality and ownership should be addressed.
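As a rough illustration of the mechanism, and of how much data it generates, here is a minimal sketch of an adaptive item-selection loop. It is a generic toy, not Knewton's, Pearson's, or any other vendor's actual algorithm, and all of its parameters are made up. The point to notice is that every response is stored; that stored history is precisely the "data" at issue.

```python
# A toy adaptive-learning loop: pick the next question whose difficulty is
# closest to the current estimate of the student's ability, update that
# estimate from the response, and log everything. This is a generic sketch,
# not any vendor's actual algorithm.
from dataclasses import dataclass, field
from typing import List, Tuple
import random

@dataclass
class AdaptiveSession:
    item_difficulties: List[float]            # one difficulty value per question
    ability: float = 0.0                      # running estimate of the student
    step: float = 0.5                         # how much each response moves the estimate
    log: List[Tuple[int, bool, float]] = field(default_factory=list)  # (item, correct, ability)

    def next_item(self) -> int:
        # Choose the unseen item whose difficulty best matches current ability.
        seen = {item for item, _, _ in self.log}
        candidates = [i for i in range(len(self.item_difficulties)) if i not in seen]
        return min(candidates, key=lambda i: abs(self.item_difficulties[i] - self.ability))

    def record(self, item: int, correct: bool) -> None:
        # Nudge the ability estimate up or down and keep the response as data.
        self.ability += self.step if correct else -self.step
        self.log.append((item, correct, self.ability))

# Example run with simulated responses standing in for a real student.
session = AdaptiveSession(item_difficulties=[-1.0, -0.5, 0.0, 0.5, 1.0, 1.5])
for _ in range(4):
    item = session.next_item()
    correct = random.random() < 0.7           # stand-in for an actual answer
    session.record(item, correct)

print(session.log)   # every response is retained -- this is the data in question
```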

Learning From Teach For America

There is a small but growing body of evidence about the (usually test-based) effectiveness of teachers from Teach for America (TFA), an extremely selective program that trains and places new teachers in mostly higher needs schools and districts. Rather than review this literature paper-by-paper, which has already been done by others (see here and here), I’ll just give you the super-short summary of the higher-quality analyses, and quickly discuss what I think it means.*

The evidence on TFA teachers focuses mostly on comparing their effect on test score growth vis-à-vis other groups of teachers who entered the profession via traditional certification (or through other alternative routes). This is no easy task, and the findings do vary quite a bit by study, as well as by the group to which TFA corps members are compared (e.g., new or more experienced teachers). One can quibble endlessly over the methodological details (and I’m all for that), and this area is still underdeveloped, but a fair summary of these papers is that TFA teachers are no more or less effective than comparable peers in terms of reading tests, and sometimes but not always more effective in math (the differences, whether positive or negative, tend to be small and/or only surface after 2-3 years). Overall, the evidence thus far suggests that TFA teachers perform comparably, at least in terms of test-based outcomes.
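Because the choice of comparison group matters so much, here is a hedged sketch with simulated data showing how an apparent TFA "effect" can change sign depending on whether corps members are compared with all teachers or only with other novices. Every coefficient here is an assumption chosen for illustration, not an estimate from the literature.

```python
# A simulated illustration of why TFA effect estimates depend on the comparison
# group: comparing corps members with all teachers (including veterans) versus
# with other novice teachers can give different answers. Numbers are assumptions.
import numpy as np

rng = np.random.default_rng(3)
n = 6_000

tfa = rng.integers(0, 2, n)
# Assumption: TFA teachers are almost all novices; comparison teachers vary in experience.
experience = np.where(tfa == 1, rng.integers(0, 2, n), rng.integers(0, 15, n))

# Assumed data-generating process: growth rises with experience; TFA adds a small math bump.
math_growth = 0.02 * experience + 0.03 * tfa + rng.normal(0, 0.3, n)

vs_all = math_growth[tfa == 1].mean() - math_growth[tfa == 0].mean()
novice = experience < 3
vs_novices = math_growth[tfa == 1].mean() - math_growth[(tfa == 0) & novice].mean()

print(f"TFA vs. all other teachers:     {vs_all:+.3f}")   # negative: veterans do better
print(f"TFA vs. other novice teachers:  {vs_novices:+.3f}")  # small positive bump
```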

Somewhat in contrast with these findings, TFA has been the subject of both intensive criticism and fawning praise. I don’t want to engage this debate directly, except to say that there has to be some middle ground on which a program that brings talented young people into the field of education is not such a divisive issue. I do, however, want to make a wider point specifically about the evidence on TFA teachers – what it might suggest about the current push to “attract the best people” to the profession.