The 5-10 Percent Solution

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post.

In the world of education policy, the following assertion has become ubiquitous: If we just fire the bottom 5-10 percent of teachers, our test scores will be at the level of the highest-performing nations, such as Finland. Michelle Rhee likes to make this claim. So does Bill Gates.

The source and sole support for this claim is a calculation by economist Eric Hanushek, which he sketches out roughly in a chapter of the edited volume Creating a New Teaching Profession (published by the Urban Institute). The chapter is called "Teacher Deselection" (“deselection” is a polite way of saying “firing”). Hanushek is a respected economist, who has been researching education for over 30 years. He is willing to say some of the things that many other market-based reformers also believe, and say privately, but won’t always admit to in public.

So, would systematically firing large proportions of teachers every year based solely on their students’ test scores improve overall scores over time? Of course it would, at least to some degree. When you repeatedly select (or, in this case, deselect) on a measurable variable, even when the measurement is imperfect, you can usually change that outcome overall.

But anyone who says that firing the bottom 5-10 percent of teachers is all we have to do to boost our scores to Finland-like levels is selling magic beans—and not only because of cross-national poverty differences or the inherent limitations of most tests as valid measures of student learning (we’ll put these very real concerns aside for this post).

Before addressing the argument directly, it bears noting that this policy, even if it went down perfectly, would not be a quick fix. The simulation does not entail a one-time layoff. We would have to fire the “bottom” 5-10 percent of teachers every year, permanently. Then, according to the calculation, and if everything went as planned, it would take around 10 years for U.S. test scores to rise to the level of the world’s higher-performing nations.

It also seems improbable that we could ever legislate, design, and carry out such a policy on a large, nationwide scale, even if it had widespread support (which it doesn’t). Yet that’s what would be needed to produce the promised benefits (again, assuming everything went perfectly).

But what if we could do it? Would it work? As I said, there would almost certainly be some increase in overall test scores, at least in the short term (whether or not that would signal proportional true improvement is a different matter entirely). But would the gains be large and sustained? It's always difficult to project the impact of an untried, drastic intervention like this, but I would argue probably not. In fact, there is a risk that this type of policy would end up hurting overall education performance in the long run, especially in higher-poverty, hard-to-staff schools and districts.

The presumed benefits of this proposal rely on several shaky assumptions, some of which would, if violated, carry negative consequences. One assumption, which I have discussed before, is that the replacement teachers will be of sufficient quality (on the whole) to produce at least average student test score gains. Hanushek’s calculation assumes that the replacements will do so (though, among other things, it’s unclear whether he uses the average gains for a first-year teacher, which are lower).

Currently, around 8-9 percent of teachers leave the profession every year, and this will probably increase as baby boomers retire. Maintaining the deselection might place substantial strain on the labor pool (of course, there would be some overlap – teachers who would be fired under the proposal would have left anyway).

In particular, high-poverty and other hard-to-staff schools—which already have problems finding good new teachers—would have to replace even more teachers every year, while choosing from an ever-narrowing applicant pool (it seems that much of California is in trouble right now). The assumption that the quality of replacements would remain stable is rather unsafe, and the calculation hinges on it.

Moreover, you can bet that many teachers, faced with the annual possibility of being fired based on test scores alone, would be even more likely to switch to higher-performing, lower-poverty schools (and/or schools that didn’t have the layoff policy). This would create additional, disruptive churn, as well as exacerbate the shortage of highly-qualified teachers in poorer schools and districts.

When all is said and done, it’s conceivable that, taking the firings, attrition, and switching into account, the total annual mobility rate for all teachers could approach 25 percent, and it would be much higher in poorer schools and districts (making these students bear a disproportionate burden for this unintended consequence). It’s hard to imagine a public education system that could function effectively under those circumstances, let alone thrive.

Remember also that a widespread test-based firing policy would almost certainly change the “type” of person who chooses to pursue teaching (or, for that matter, chooses to remain). I find it hard to believe that any top-notch applicant would be attracted to a low-paying profession because of a systematic layoff policy (see here for an alternative view). There’s no way to know, but my guess is that the opposite is true. If so, the policy’s projected benefits would be further mitigated.

The simulation also assumes that all the dismissed teachers would leave the profession permanently. Again, this seems highly unlikely, especially if replacements are in short supply. Rather, I would speculate that a significant proportion of dismissed teachers would get jobs in other districts. In doing so, they would seriously dilute the policy’s effects, while also creating needless turnover for schools.

Then there is the issue of error. Due to the well-known imprecision of value-added models, and the year-to-year fluctuation of teacher effects, many replacement teachers would be no better or worse than the fired teachers would have been (error will be particularly high among newer teachers, due to small samples). There is something unethical about firing people based solely on measures that may be wrong due to nothing more than random statistical error, yet these mistakes would have to be tolerated, as collateral damage, in the name of productivity. But, if the replacement pool runs dry, there would also be practical consequences: we will have fired many solid teachers, whom we might have identified as such with more nuanced measures.
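To make the error point concrete, here is a rough sketch of how often a noisy measure picks the "wrong" teachers. It is not Hanushek's calculation or any published value-added model. Everything in it is an assumption for illustration: the function name `share_correctly_fired`, normally distributed true effects, independent normal measurement error, and a `reliability` of 0.4 (the share of variance in the observed score that reflects true effectiveness).

```python
import random

def share_correctly_fired(n_teachers=10000, reliability=0.4, cut=0.10, seed=0):
    """Share of teachers fired on a noisy score who are truly in the bottom `cut`.

    Hypothetical illustration only: true effects are drawn from N(0, 1) and
    the observed score adds independent normal noise.
    """
    rng = random.Random(seed)
    # Pick the noise level so that var(true) / var(observed) == reliability.
    noise_sd = ((1 - reliability) / reliability) ** 0.5
    true = [rng.gauss(0, 1) for _ in range(n_teachers)]
    observed = [t + rng.gauss(0, noise_sd) for t in true]

    k = int(n_teachers * cut)
    # Teachers who really are in the bottom `cut` of true effectiveness.
    truly_bottom = set(sorted(range(n_teachers), key=lambda i: true[i])[:k])
    # Teachers who would be fired based on the observed (noisy) score.
    fired = sorted(range(n_teachers), key=lambda i: observed[i])[:k]
    return sum(i in truly_bottom for i in fired) / k

if __name__ == "__main__":
    for r in (1.0, 0.7, 0.4):
        print(f"reliability={r}: share correctly fired = {share_correctly_fired(reliability=r):.2f}")
```

With a perfectly reliable measure, every fired teacher is truly in the bottom group. As the assumed reliability falls, a growing share of the dismissals land on teachers who are not. The exact figures depend entirely on the assumptions above, so they should be read as a sketch of the mechanism, not as an estimate.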

Finally, on a similar note, the quality of teachers who constitute the "bottom" 5-10 percent varies by location, and by poverty level (though not drastically). Imposing a widespread dismissal system would therefore result in the deselection of many teachers who would have done quite well in a different school or district. Firing these teachers solely to meet a quota is a harmful practice (again - especially if there are shortages).

In short, this proposal would be slow, risky, and unfair, and it would require us to deliberately engineer test score gains for their own sake, in the most brutal manner possible. It would also be, I argue, unlikely to work, at least not to anywhere near the advertised degree.

Is this really our best option?

Hanushek doesn’t think so. Talking about the systematic firings, he notes, “In the long run, it would probably be superior…to develop systems that upgrade the overall effectiveness of teachers." He points out, however, that these efforts have not been successful in the past. But have we really tried?

Instead of trying to fire our way to the high performance of Finland or anywhere else, why not try to emulate the policies that these nations actually employ? It seems very strange to shoot for the achievement levels of these nations by doing the exact opposite of what they do.

In any case, Gates, Rhee, et al. constantly repeat the “fire 5-10 percent” talking point, along with the promise of miracle results, because of its potent political message: all we have to do is fire bad teachers, and everything will be fixed. They use Hanushek’s calculation to provide an empirical basis for this message. They do not, however, seem at all attuned to the fact that the proposal is less an actual policy recommendation than a stylized illustration of the wide variation in teacher effects.

Let’s stick with meaningful conversations about how to identify, improve, and, failing that, remove ineffective teachers. Test-based measures may have a role in the evaluation of both teachers and overall school performance, but not a dominant one, and certainly not an exclusive one.

Systematically firing large numbers of teachers based solely on test scores is an incredibly crude, blunt instrument, fraught with risk. We’re better than that.


The fact that we haven't figured out a way to do something systematically and consistently may mean that a systematic solution doesn't exist. Rather than looking for another systematic solution, it might be better to leave some control to the schools to implement their own non-systematic, inconsistent solutions.

The Huff Po report on VAMs and layoffs (http://www.huffingtonpost.com/2010/12/23/teacher-layoffs-seniority_n_80…) reported:

"Dan Goldhaber, lead author of the study and the center's director, projected that student achievement after seniority-based layoffs would drop by an estimated 2.5 to 3.5 months of learning per student, when compared to laying off the least effective teachers."

But the report says:
"Teachers RIFed in our simulation are approximately 20% of a standard deviation in student performance less effective in student performance than teachers RIFed in reality."

Goldhaber misquoted the Boyd et al. study of 2010, but then he said his results were similar to their conclusion that "The typical teacher who is laid off under a value-added system is 26 percent of a standard deviation in student achievement less effective than the typical teacher laid off under the seniority-based policy." Boyd then says that the gap would shrink as the new teachers gained experience.

Am I missing something or is this a huge bluff by Goldhaber et al? After all, the latest Gates MET study’s conclusions contradicted their findings.

John

I think I understand now what the Goldhaber report says and what it means. I just can’t tell what they actually did.

Yes, they ran a simulation and the teachers RIFed would be estimated to be less effective by 2.5 to 3.5 months. I did not understand that that was what they were doing for two reasons.

Firstly, that would require them to run that simulation for each district and each subject matter. Had they done so, I figured, they would have said that was what they did. They have a convoluted footnote that might address that, but I couldn’t figure out what it means, and I assumed that such an effort would be reported in the text. So, maybe they did that but did not mention it in prose. But I wonder if they just ran a macro simulation for the entire state where the bottom 145 teachers were RIFed without regard for whether the replacement worked in that district or not. (Boyd didn’t have that to worry about, and speaking of that, their typo threw me off also.) That would be intellectually dishonest, but if they used a more complex alternative method of running the more complex simulation, wouldn’t they have said they had done so?

Secondly, they described six different VAM scenarios, but they never said precisely which one they used. Had they chosen one method for the simulation, I thought, they would say which one they used. But they promised a VAM that took into account comparability of schools and districts. So, I read the article wondering if they had started with some generic VAMs for the simulation, and used six more refined models for their final tables.

Rereading again, they said at one point that they incorporated parts of scenarios #2 and #3, and at another point #2 and #5. I didn’t understand #5.

Regarding #2 they said its weakness was that it could not account for student, classroom, and district characteristics. But then they said it could be adjusted to reach those characteristics. But they didn’t say whether they did that or not in the simulation.

So when I read and reread the report, I concentrated on the Tables, which was the only place where they said what they were controlling for.

I have to say that even though I’m not a statistician, I never had these problems reading academic papers on, say, econometrics. There are reasons why scholarship has certain conventions, and if the Gates people followed them, their reports would be more intellectually honest.

But now I realize they meant what was reported and that their simulation would mean that RIFs based on effectiveness would increase student performance by up to 3.5 months. They just didn’t say how they did the simulation. I’m assuming now that they meant that using VAMs to determine which 2200 teachers get layoff notices would mean that the students of 145 teachers would benefit by that much, but still they say little or nothing on how they reached the headlined conclusion, in contrast to the detail they announced for minor points.

For instance, could they be running a simulation where those gains would be produced by replacing a senior teacher with low test score gains by a teacher in another type of school in another district who had high gains? That seems too absurd. But it also seems absurd that they would run a simulation without reporting what that simulation was.

Sorry to bother you about this.

Great theoretical discussions, but no one has mentioned the overriding factor in new teacher hires: money. You can fire as many as you want, but the discussion in my school district is about going out to hire inexpensive/inexperienced teachers, not the most experienced/expensive teachers with a proven track record.

And really, with the number of teachers that leave the profession every year through retirement or just being fed up, plus this decimation, is there any way to find 500K to 700K new teachers a year to fill the gap?