** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post
The New Teacher Project (TNTP) has a new, highly-publicized report about what it calls “irreplaceables,” a catchy term that is supposed to describe those teachers who are “so successful they are nearly impossible to replace.” The report’s primary conclusion is that these “irreplaceable” teachers often leave the profession voluntarily, and TNTP offers several recommendations for how to improve their retention.
I’m not going to discuss this report fully. It shines a light on teacher retention, which is a good thing. Its primary purpose is to promulgate the conceptual argument that not all teacher turnover is created equal – i.e., that it depends on whether “good” or “bad” teachers are leaving (see here for a strong analysis on this topic). The report’s recommendations are standard fare – improve working conditions, tailor pay to “performance” (see here for a review of evidence on incentives and retention), etc. Many are widely-supported, while others are more controversial. All of them merit discussion.
I just want to make one quick (and, in many respects, semantic) point about the manner in which TNTP identifies high-performing teachers, as I think it illustrates larger issues. In my view, the term “irreplaceable” doesn't apply, and I think it would have been a better analysis without it.
The report includes performance data for four large districts and one charter management organization (CMO). In the four regular public school districts, “irreplaceable” is defined in terms of growth model estimates (for the CMO teachers, it is based on overall evaluation ratings).
To TNTP’s credit, in these four districts, they do attempt to account for the imprecision in the estimates - e.g., by employing confidence intervals (if only more states and districts were doing likewise). But, even so, right off the bat, most of their estimates (three out of four districts) are based on only one year of data. These scores are hardly useless, but it’s not appropriate to draw strong conclusions about teachers’ effectiveness from such small samples, and that includes grand labels such as “irreplaceable.” For instance, a decent-sized proportion of these teachers will not make the “irreplaceable” cut the following year, due mostly to error rather than "real" change in performance.
To illustrate this (albeit somewhat crudely), I used the New York City “Teacher Data Reports” (described here) to code teachers as “irreplaceable” according to a rough approximation of one of TNTP’s district-specific definitions (“District B”). Based on single-year estimates in math and reading, a full 43 percent of the NYC teachers classified as “irreplaceable” in 2009 were not classified as such in 2010. (In fairness, the year-to-year stability may be a bit higher using the other district-specific definitions.)
Such instability and misclassification are inevitable no matter how the term is defined and how much data are available – it’s all a matter of degree – but, in general, one must be cautious when interpreting single-year estimates (see here, here and here for related analyses).
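To get a feel for why single-year classifications churn, here is a minimal simulation sketch in Python. It is not the NYC analysis above and uses no real data: the teacher count, the normally distributed "true" effects, the `noise_sd` parameter and the top-20-percent cutoff are all assumptions chosen for illustration. Each teacher's true effect is held fixed across the two years, so any change in who gets flagged comes from estimation error alone.

```python
import random


def simulate_churn(n_teachers=10000, noise_sd=1.0, top_share=0.2, seed=0):
    """Share of teachers flagged in year 1 who are NOT flagged in year 2.

    All inputs are hypothetical; this is an illustration, not TNTP's method.
    """
    rng = random.Random(seed)
    # Assumed "true" effectiveness, identical in both years.
    true_effect = [rng.gauss(0.0, 1.0) for _ in range(n_teachers)]
    # Each single-year estimate is the true effect plus independent noise.
    year1 = [t + rng.gauss(0.0, noise_sd) for t in true_effect]
    year2 = [t + rng.gauss(0.0, noise_sd) for t in true_effect]

    def flagged(scores):
        # Flag teachers at or above the cutoff for the top `top_share`.
        cutoff = sorted(scores)[int(len(scores) * (1 - top_share))]
        return {i for i, s in enumerate(scores) if s >= cutoff}

    top1, top2 = flagged(year1), flagged(year2)
    return len(top1 - top2) / len(top1)


if __name__ == "__main__":
    for sd in (0.0, 0.5, 1.0, 1.5):
        print(f"noise_sd={sd}: share dropping out = {simulate_churn(noise_sd=sd):.2f}")
```

With no noise, the flagged group is identical in both years. As `noise_sd` grows relative to the spread of true effects, a larger share of the year-one "irreplaceables" fall out of the category in year two, even though nobody's actual performance has changed.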
Perhaps more importantly, if you look at how they actually sorted teachers into categories, the label “irreplaceable," at least as I interpret it, seems inappropriate no matter how much data are available.
For example, in “District B," it is teachers with at least one median growth percentile rank (e.g., in one subject) above 65 and none below 35, while in “District D," it is teachers with at least one statistically significant, positive percentile rank and none below the median. In "District C," teachers are coded as "irreplaceable" if they have at least one score significantly above average, and none significantly below average.
I would characterize these (test-based) definitions as “probably above average.” Remember, though, that teachers need only score highly in one subject: half or most of a teacher's estimates can be statistically average, and those in Districts B and C can actually have most of their point estimates below the mean/median, so long as one of them (e.g., one subject) is discernibly above.
In other words, calling them “irreplaceable” seems, at best, an exaggeration.
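For readers who want to see how permissive these rules can be, the District B and District C definitions described above can be sketched in a few lines of Python. This is my own rough rendering, not TNTP's code: the `Estimate` class, the function names and the example numbers are hypothetical, and I am assuming that "significantly above (below) average" means a confidence interval that lies entirely above (below) the average.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    point: float  # point estimate, relative to the district average
    lower: float  # lower bound of the confidence interval
    upper: float  # upper bound of the confidence interval


def district_b(percentiles):
    """At least one median growth percentile above 65 and none below 35."""
    return any(p > 65 for p in percentiles) and all(p >= 35 for p in percentiles)


def district_c(estimates, average=0.0):
    """At least one estimate significantly above average, none significantly below.

    Assumption: "significant" means the interval excludes the average.
    """
    sig_above = any(e.lower > average for e in estimates)
    sig_below = any(e.upper < average for e in estimates)
    return sig_above and not sig_below


# Hypothetical teacher: one high percentile, two below the median of 50.
print(district_b([66, 40, 45]))

# Hypothetical teacher: one significantly positive estimate, and two
# negative point estimates whose intervals still include the average.
teacher = [
    Estimate(0.30, 0.05, 0.55),
    Estimate(-0.10, -0.40, 0.20),
    Estimate(-0.05, -0.35, 0.25),
]
print(district_c(teacher))
```

Both hypothetical teachers qualify, even though most of their point estimates sit below the median or mean.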
Now, I fully acknowledge that there is no widely-accepted definition of concepts such as “irreplaceable," and that such definitions inevitably entail subjective judgment (mine is on display in this post). In addition, as stated above, I give them a lot of credit for paying attention to error in constructing their definitions.*
But I think it might have been a better analysis had TNTP avoided the overblown, media-friendly characterizations of their performance categories. They could have made their points and presented their results, many of which are interesting and meaningful, without doing so. There's a big difference between characterizing teachers as "higher-performing than average" and saying they're "nearly impossible to replace."
Many of their readers might not be aware of the issues involved, and the report itself is very light on discussion of them.
(I would add that a bunch of the core results are based on surveys of teachers. I’m all for querying teachers’ opinions, but, while TNTP does have a large sample, the survey is voluntary. This is not their “fault," but it does mean that the results cannot necessarily be used to make generalizations about what the teachers in these districts think.**)
Don’t get me wrong: there are certainly “irreplaceable” teachers, and there's no doubt that they often leave the profession for reasons that can be partially prevented. TNTP has a viewpoint on how to do this, and none of the discussion above speaks to whether their recommendations are good or bad. But, when it comes to the characteristics, attitudes and behavior among teachers who are "nearly impossible to replace," I would interpret the results of this particular report with a healthy dose of caution.
- Matt Di Carlo
* My guess is that TNTP predetermined that roughly the "top 20 percent" of teachers should be classified as “irreplaceable” in each of the four districts (perhaps in part to provide a sufficiently large sample to use their survey data), and then calibrated their definitions to produce that result. Yet - and remember I'm speculating here - the scores in some districts were so imprecisely-estimated that they couldn't achieve the 20 percent sub-sample without relaxing their definitions to the point where they (at least in my view) no longer reflected "irreplaceability." This may be why, in “District A," for example, where three years of data were available, the bar is higher (e.g., teachers cannot have any estimates below the mean) than in the other districts – because the more precisely-estimated value-added scores allowed for a more stringent identification of “top performers." If data availability varies between districts, there is no shame in acknowledging that the ability to use these data to identify high-performing teachers might also vary. Also, seeing as the term is so central to the report, it would have been impressive for them to have presented results for one or two alternative definitions of “irreplaceable," to see if they were different.
** TNTP does not seem to report response rates, perhaps to protect the confidentiality of their districts, but they do say they required 20-30 percent in a given school, depending on the district. Another thing that would have been helpful is a full set of survey results, disaggregated by district and performance category. For instance, Figure 3 represents compelling evidence that the high-performing teachers differ in attitudes, but in order to evaluate this, one would really need to see the responses for other, similar questions (if there were others).
Thanks for posting, Matt. A quick question--I've just read the report, but have not gotten into the nitty gritty of measures. Are these teachers considered "irreplaceable" based on standardized test scores alone? Were any other measures used to determine this "top 20%"? How does TNTP suggest schools identify those "irreplaceable" teachers who work in non-testing subjects or areas?
I found some valuable information in the report (i.e., appreciate good teachers, work to keep them, etc.) but admit I'm a bit concerned that this is a performance-pay-in-sheep's-clothes piece of policy work. Overblown concern, considering the players at the table?
In the four regular public school districts, yes - they use only test-based productivity measures (growth or value-added models) to sort teachers. In their CMO sub-sample, they use full evaluation results, which (if I recall correctly) do not employ any test-based components.
I'm afraid I cannot speak to the second question you pose, except to say that (and I'm generalizing here) TNTP tends to support performance-based pay, and I think they're quite open and honest about that position.
Thanks for the comment,
You've made a crucial point because words matter, and this clear exaggeration may undercut the significance of retaining quality teachers in public schools.
As somebody who left teaching high school in public schools after two years almost 27 years ago, I can also confirm that working conditions and pay matter. It's difficult to sustain one's own standards if you feel surrounded by hopelessness, mediocrity and student indifference.
Beyond the statistical imprecision issue (although they do have uncertainty estimates for District A and incorporate them into their "irreplaceable" algorithm), what I don't think is very clear from the report is that the thresholds for "exceptional" vary from district to district because the growth models aren't pooled (they got the estimates from the districts, rather than running the models themselves, which is bizarre). Teachers, after conditioning on some characteristics, are compared only to other teachers within the same district. So, the top 20% in one district may be quite different from the top 20% in another district.
What's doubly bizarre is that they then used these disparate growth models to standardize learning gains in terms of months of learning (footnote 4) without providing the formula they used to do this. Since the distribution and magnitude of these effects may differ by district, I can't see how this could mathematically or even substantively work. A standard deviation may be different in District A than in District B, so the learning gain yield must also be different. Right?
They're also unclear about whether the 20% threshold was pre-determined (as it indicates in the Technical Appendix), or whether it was based on the distribution of teacher effect estimates from the growth models (as it suggests on page 2 with the phrase "fell into this category").