Revisiting The "5-10 Percent Solution"

In a post over a year ago, I discussed the common argument that dismissing the “bottom 5-10 percent” of teachers would increase U.S. test scores to the level of high-performing nations. This argument is based on a calculation by economist Eric Hanushek, which suggests that dismissing the lowest-scoring teachers based on their math value-added scores would, over a period of around ten years (when the first cohort of students would have gone through the schooling system without the “bottom” teachers), increase U.S. math scores dramatically – perhaps to the level of high-performing nations such as Canada or Finland.*

This argument is, to say the least, controversial, and it provokes the full spectrum of reactions. In my opinion, it's best seen as a policy-relevant illustration of the wide variation in test-based teacher effects, one that might suggest the potential of a course of action but can't really tell us how it will turn out in practice. To highlight this point, I want to take a look at one issue mentioned in that previous post – that is, how the instability of value-added scores over time (which Hanushek’s simulation doesn’t address directly) might affect the projected benefits of this type of intervention, and how this in turn might temper one's view of the huge projected benefits.

One (admittedly crude) way to do this is to use the newly released New York City value-added data and look at 2010 outcomes for the “bottom 10 percent” of math teachers in 2009.
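As a rough sketch of what that check looks like, the snippet below assumes a hypothetical file (nyc_va.csv) with one row per teacher and columns for 2009 and 2010 math percentile ranks; the file name and column names are illustrative, not the actual format of the published reports.

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("nyc_va.csv")  # columns: teacher_id, math_pctile_2009, math_pctile_2010

# Teachers in the "bottom 10 percent" of 2009 math percentile ranks.
bottom_2009 = df[df["math_pctile_2009"] <= 10]

# Where did those same teachers land a year later?
print(bottom_2009["math_pctile_2010"].describe())
print("Share still in the bottom decile in 2010:",
      (bottom_2009["math_pctile_2010"] <= 10).mean())
```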

Reign Of Error: The Publication Of Teacher Data Reports In New York City

Late last week and over the weekend, New York City newspapers, including the New York Times and Wall Street Journal, published the value-added scores (teacher data reports) for thousands of the city’s teachers. Prior to this release, I and others argued that the newspapers should present margins of error along with the estimates. To their credit, both papers did so.

In the Times’ version, for example, each individual teacher’s value-added score (converted to a percentile rank) is presented graphically, for math and reading, in both 2010 and over a teacher’s “career” (averaged across previous years), along with the margins of error. In addition, both papers provided descriptions and warnings about the imprecision in the results. So, while the decision to publish was still, in my personal view, a terrible mistake, the papers at least made a good-faith attempt to highlight the imprecision.

That said, they also published data from the city that use teachers’ value-added scores to label them as one of five categories: low, below average, average, above average or high. The Times did this only at the school level (i.e., the percent of each school’s teachers that are “above average” or “high”), while the Journal actually labeled each individual teacher. Presumably, most people who view the databases, particularly the Journal's, will rely heavily on these categorical ratings, as they are easier to understand than percentile ranks surrounded by error margins. The inherent problems with these ratings are what I’d like to discuss, as they illustrate important concepts about estimation error and what can be done about it.
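To see how wide error margins interact with categorical labels, consider a purely schematic example (the cutoffs and margin below are assumptions, not the thresholds the city actually used): a single estimate with a large margin of error can be consistent with most of the five categories at once.

```python
# Schematic illustration only: assumed percentile cutoffs for the five labels.
CUTOFFS = [(0, 20, "low"), (20, 40, "below average"), (40, 60, "average"),
           (60, 80, "above average"), (80, 100, "high")]

def categories_in_range(estimate, margin):
    """Return every category label that overlaps the estimate's error range."""
    lo, hi = max(0, estimate - margin), min(100, estimate + margin)
    return [label for c_lo, c_hi, label in CUTOFFS if c_lo < hi and c_hi > lo]

# A teacher estimated at the 43rd percentile with a margin of error of 30 points
# is consistent with four of the five labels.
print(categories_in_range(43, 30))  # ['low', 'below average', 'average', 'above average']
```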

Do Value-Added Models “Control For Poverty”?

There is some controversy over the fact that Florida’s recently-announced value-added model (one of a class often called “covariate adjustment models”), which will be used to determine merit pay bonuses and other high-stakes decisions, doesn’t include a direct measure of poverty.

Personally, I support adding a direct income proxy to these models, if for no other reason than to avoid this type of debate (and to facilitate the disaggregation of results for instructional purposes). It does bear pointing out, however, that the measure that’s almost always used as a proxy for income/poverty – students’ eligibility for free/reduced-price lunch – is terrible as a poverty (or income) gauge. It tells you only whether a student’s family has earnings below (or above) a given threshold (usually 185 percent of the poverty line), and this masks most of the variation among both eligible and non-eligible students. For example, families with incomes of $5,000 and $20,000 might both be coded as eligible, while families earning $40,000 and $400,000 are both coded as not eligible. A lot of hugely important information gets ignored this way, especially when the vast majority of students are (or are not) eligible, as is the case in many schools and districts.
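A toy example of that information loss, using an assumed dollar cutoff (the real threshold, 185 percent of the poverty line, varies by year and household size):

```python
# Assumed cutoff, for illustration only; not an actual eligibility guideline.
FRL_CUTOFF = 35_000

for income in [5_000, 20_000, 40_000, 400_000]:
    eligible = income < FRL_CUTOFF
    print(f"income ${income:>7,}: FRL-eligible = {eligible}")

# The $5,000 and $20,000 families land in the same category, as do the
# $40,000 and $400,000 families; the binary indicator discards nearly all
# of the variation in income within each group.
```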

That said, it’s not quite accurate to assert that Florida and similar models “don’t control for poverty.” The model may not include a direct income measure, but it does control for prior achievement (a student’s test score in the previous year[s]). And a student’s test score is probably a better proxy for income than whether or not they’re eligible for free/reduced-price lunch.

Even more importantly, however, the key issue about bias is not whether the models “control for poverty,” but rather whether they control for the range of factors – school and non-school – that are known to affect student test score growth, independent of teachers’ performance. Income is only one part of this issue, which is relevant to all teachers, regardless of the characteristics of the students that they teach.
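For readers who want to see what a covariate adjustment model looks like in practice, here is a bare-bones sketch: current scores are regressed on prior scores (with or without an income proxy) and teacher indicators. The data file and variable names are hypothetical, and real models are considerably more elaborate.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical student-level data: current score, prior-year score,
# a free/reduced-price lunch indicator, and a teacher identifier.
df = pd.read_csv("students.csv")  # columns: score, prior_score, frl, teacher_id

# A minimal covariate adjustment model: prior achievement plus an income proxy,
# with teacher fixed effects. Note that prior_score already captures much of
# what the frl indicator would otherwise pick up.
model = smf.ols("score ~ prior_score + frl + C(teacher_id)", data=df).fit()

# The teacher coefficients serve as (crude) value-added estimates.
teacher_effects = model.params.filter(like="C(teacher_id)")
print(teacher_effects.sort_values().head())
```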

A Case For Value-Added In Low-Stakes Contexts

Most of the controversy surrounding value-added and other test-based models of teacher productivity centers on the high-stakes use of these estimates. This is unfortunate – no matter what you think about these methods in the high-stakes context, they have a great deal of potential to improve instruction.

When supporters of value-added and other growth models talk about low-stakes applications, they tend to assert that the data will inspire and motivate teachers who are completely unaware that they’re not raising test scores. In other words, confronted with the value-added evidence that their performance is subpar (at least as far as tests are an indication), teachers will rethink their approach. I don’t find this very compelling. Value-added data will not help teachers – even those who believe in its utility – unless they know why their students’ performance appears to be comparatively low. It’s rather like telling a baseball player they’re not getting hits, or telling a chef that the food is bad – it’s not constructive.

Granted, a big problem is that value-added models are not actually designed to tell us why teachers get different results – i.e., whether certain instructional practices are associated with better student performance. But the data can be made useful in this context; the key is to present the information to teachers in the right way, and rely on their expertise to use it effectively.

The Persistence Of Both Teacher Effects And Misinterpretations Of Research About Them

In a new National Bureau of Economic Research working paper on teacher value-added, researchers Raj Chetty, John Friedman and Jonah Rockoff present results from their analysis of an incredibly detailed dataset linking teachers and students in one large urban school district. The data include students’ testing results between 1991 and 2009, as well as proxies for future student outcomes, mostly from tax records, including college attendance (whether they were reported to have paid tuition or received scholarships), childbearing (whether they claimed dependents) and eventual earnings (as reported on the returns). Needless to say, the actual analysis includes only those students for whom testing data were available, and who could be successfully linked with teachers (with the latter group of course limited to those teaching math or reading in grades 4-8).

The paper caused a remarkable stir last week, and for good reason: It’s one of the most dense, important and interesting analyses on this topic in a very long time. Much of the reaction, however, was less than cautious, specifically the manner in which the research findings were interpreted to support actual policy implications (also see Bruce Baker’s excellent post).

What this paper shows – using an extremely detailed dataset and sophisticated, thoroughly documented methods – is that teachers matter, perhaps in ways that some didn’t realize. What it does not show is how to measure and improve teacher quality, which are still open questions. This is a crucial distinction, one which has been discussed on this blog numerous times (also here and here), as it is frequently obscured or outright ignored in discussions of how research findings should inform concrete education policy.

What Value-Added Research Does And Does Not Show

Value-added and other growth models are probably the most controversial topic in education today. These methods, which use sophisticated statistical techniques to attempt to isolate a teacher’s effect on student test score growth, are rapidly assuming a central role in policy, particularly in the new teacher evaluation systems currently being designed and implemented. Proponents view them as a primary tool for differentiating teachers based on performance/effectiveness.

Opponents, on the other hand, including a great many teachers, argue that the models’ estimates are unstable over time, subject to bias and imprecision, and that they rely entirely on standardized test scores, which are, at best, an extremely partial measure of student performance. Many have come to view growth models as exemplifying all that’s wrong with the market-based approach to education policy.

It’s very easy to understand this frustration. But it's also important to separate the research on value-added from the manner in which the estimates are being used. Virtually all of the contention pertains to the latter, not the former. Actually, you would be hard-pressed to find many solid findings in the value-added literature that wouldn't ring true to most educators.

Has Teacher Quality Declined Over Time?

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

One of the common assumptions lurking in the background of our education debates is that the “quality” of the teaching workforce has declined a great deal over the past few decades (see here, here, here and here [slide 16]). There is a very plausible storyline supporting this assertion: Prior to the dramatic rise in female labor force participation since the 1960s, professional women were concentrated in a handful of female-dominated occupations, chief among them teaching. Since then, women’s options have changed, and many have moved into professions such as law and medicine rather than into the classroom.

The result of this dynamic, so the story goes, is that the pool of candidates for the teaching profession has been “watered down.” This in turn has generated a decline in the aggregate “quality” of U.S. teachers, and, it follows, a stagnation of student achievement growth. This portrayal is often used as a set-up for a preferred set of solutions – e.g., remaking teaching in the image of the other professions into which women are moving, largely by increasing risk and rewards.

Although the argument that “teacher quality” has declined substantially is sometimes taken for granted, its empirical backing is actually quite thin, and not as clear-cut as some might believe.

A Big Fish In A Small Causal Pond

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

In three previous posts, I discussed what I’ve begun to call the “trifecta” of teacher-focused education reform talking points:

In many respects, this “trifecta” is driving the current education debate. You would have trouble finding many education reform articles, reports, or speeches that don’t use at least one of these arguments.

Indeed, they are guiding principles behind much of the Obama Administration’s education agenda, as well as the philosophies of high-profile market-based reformers, such as Joel Klein and Michelle Rhee. The talking points have undeniable appeal. They imply, deliberately or otherwise, that policies focused on improving teacher quality can, in and of themselves, take us a very long way – not all the way, but perhaps most of the way – towards solving all of our education problems.

This is a fantasy.

How Many Teachers Does It Take To Close An Achievement Gap?

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

Over the weekend, New York Times columnist Nick Kristof made a persuasive argument that teachers should be paid more. In making his case, he also put forth a point that you’ve probably heard before: “One Los Angeles study found that having a teacher from the 25 percent most effective group of teachers for four years in a row would be enough to eliminate the black-white achievement gap.”

This is an instance of what we might call the “X consecutive teachers” argument (sometimes it’s three, sometimes four or five). It is often invoked to support, directly or indirectly, specific policy prescriptions, such as merit pay, ending tenure, or, in this case, higher salaries (see also here and here). To his credit, Kristof’s use of the argument is on the cautious side, but there are plenty of examples in which it is used as evidence supporting particular policies.

Actually, the day after the column ran, in a 60 Minutes segment featuring “The Equity Project,” a charter school that pays its teachers $125,000 a year, the school’s principal was asked how he planned to narrow the achievement gap at his school. His reply was: “The difference between a great teacher and a mediocre or poor teacher is several grade levels of achievement in a given year. A school that focuses all of its energy and its resources on fantastic teaching can bridge the achievement gap.”

Indeed, it is among the most common arguments in our education policy debate today. In reality, however, it is little more than a stylistic riff on empirical research findings, and a rough one at that. It is not at all useful when it comes to choosing between different policy options.
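To see the kind of back-of-the-envelope arithmetic on which the “X consecutive teachers” claim rests, consider the sketch below. The effect sizes are illustrative assumptions, not estimates from the Los Angeles study; the point is that the claim simply multiplies an annual top-quartile advantage by a number of years and compares the total to the size of the gap, assuming each year’s gain persists in full.

```python
# Illustrative assumptions only (not the study's estimates):
annual_effect_sd = 0.2   # assumed per-year advantage of a top-quartile teacher, in SD
gap_sd = 0.8             # assumed size of the black-white achievement gap, in SD
years = 4

# The claim's arithmetic: gains are simply added across years and assumed to
# persist fully, with no fade-out and no change in teacher assignments.
cumulative = annual_effect_sd * years
print(f"Cumulative effect after {years} years: {cumulative:.1f} SD vs. a gap of {gap_sd:.1f} SD")
```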

The 5-10 Percent Solution

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post.

In the world of education policy, the following assertion has become ubiquitous: If we just fire the bottom 5-10 percent of teachers, our test scores will be at the level of the highest-performing nations, such as Finland. Michelle Rhee likes to make this claim. So does Bill Gates.

The source and sole support for this claim is a calculation by economist Eric Hanushek, which he sketches out roughly in a chapter of the edited volume Creating a New Teaching Profession (published by the Urban Institute). The chapter is called “Teacher Deselection” (“deselection” is a polite way of saying “firing”). Hanushek is a respected economist, who has been researching education for over 30 years. He is willing to say some of the things that many other market-based reformers believe and say privately, but won’t always admit to in public.

So, would systematically firing large proportions of teachers every year based solely on their students’ test scores improve overall scores over time? Of course it would, at least to some degree. When you repeatedly select (or, in this case, deselect) on a measured variable, even when the measurement is imperfect, you can usually move the overall distribution of that variable in the desired direction.
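To make that logic (and its limits) concrete, here is a small simulation sketch: true teacher effects are drawn from a distribution, the observed score adds measurement error, and each round the bottom 10 percent on the observed score are replaced with new hires. All of the distributions and parameters are assumptions for illustration; this is not Hanushek’s calculation.

```python
import numpy as np

rng = np.random.default_rng(0)

N_TEACHERS = 10_000
NOISE_SD = 1.0    # assumed measurement error, relative to a true-effect SD of 1.0
CUT = 0.10        # dismiss the bottom 10 percent each round
ROUNDS = 10

true_effect = rng.normal(0, 1, N_TEACHERS)

for _ in range(ROUNDS):
    observed = true_effect + rng.normal(0, NOISE_SD, N_TEACHERS)  # noisy value-added score
    keep = observed > np.quantile(observed, CUT)                  # deselect the bottom decile
    n_replaced = N_TEACHERS - keep.sum()
    # Dismissed teachers are replaced with new hires drawn from the original distribution.
    true_effect = np.concatenate([true_effect[keep], rng.normal(0, 1, n_replaced)])

print(f"Mean true teacher effect after {ROUNDS} rounds: {true_effect.mean():.3f} SD")
# With NOISE_SD = 0 the gain would be larger; the noisier the measure,
# the smaller the improvement produced by the same amount of deselection.
```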

But anyone who says that firing the bottom 5-10 percent of teachers is all we have to do to boost our scores to Finland-like levels is selling magic beans – and not only because of cross-national poverty differences or the inherent limitations of most tests as valid measures of student learning (we’ll put these very real concerns aside for this post).