A recent New York Times "teacher diary" presents the compelling account of a New York City teacher whose value-added rating was in the 6th percentile in 2009 – one of the lowest scores in the city – and in the 96th percentile the following year, one of the highest. Similar articles - for example, about teachers with errors in their rosters or scores that conflict with the opinions of their colleagues or principals - have been published since the release of the city’s teacher data reports (also see here). These accounts provoke a lot of outrage and disbelief, and that makes sense – they can sound absurd.
Stories like these can be useful as illustrations of larger trends and issues - in this case, of the unfairness of publishing the NYC scores, most of which are based on samples that are too small to provide meaningful information. But, in the debate over using these estimates in actual policy, we need to be careful not to focus too much on anecdotes. For every one NYC teacher whose value-added rank changed by more than 90 points between 2009 and 2010, there are almost 100 teachers whose ranks in the two years were within 10 points of each other (and percentile ranks overstate the actual size of all these differences). Moreover, even if the models yielded perfect measures of test-based teacher performance, there would still be many implausible fluctuations between years - those that are unlikely to be "real" change - due to nothing more than random error.*
The reliability of value-added estimates, like that of all performance measures (including classroom observations), is an important issue, and is sometimes dismissed by supporters in a cavalier fashion. There are serious concerns here, and no absolute answers. But none of this can be examined or addressed with anecdotes.
In observing the reaction to the value-added stories mentioned above, I was reminded of a New York Times article from a couple of months ago, which featured a District of Columbia Public Schools teacher who received a large bonus for her high evaluation rating. The reporter quoted this teacher as saying that the bonus was the reason she stayed in the district. Her story was reprinted in other newspapers, and some supporters of merit pay argued that it showed the DC bonus program was serving to retain the district's great teachers.
Yet it was no different from the value-added stories – it was just one instance, which meant virtually nothing on its own. Rather, it had to be placed in the larger context of the research on the relationship between merit pay and teacher retention (we published a post reviewing this evidence, directly in response to the NYT article).
Look, I fully acknowledge the power of “putting a face” on larger problems - the practice is as old as political discourse. These stories have their place (in my view, the truly important argument made by the "6th to 96th" teacher mentioned above was not the size of the change between years, but rather his broadly applicable point that he had no idea what he had done - e.g., with his lesson plans - to bring about the change).
But, as unfortunate as it may be to people who are (understandably) sick of hearing about correlations and error margins, the debate over using value-added in evaluations and other decisions will have to proceed in a manner that keeps anecdotes in proper perspective. It's an obvious point, but the evidence will have to be systematic. Policy discussions, like value-added estimates, require large samples.
- Matt Di Carlo
* Given the small samples, the single-year estimates are pretty much guaranteed to be imprecise, and thus unstable over time. Putting aside the publication of the NYC data, this instability only matters to the degree that policymakers choose to use value-added in a completely irresponsible fashion - by attaching stakes to estimates without sufficient observations. Unfortunately, that's exactly what's happening in some states and districts. The responsible approach would be, at the very least, to require a minimum sample size and account for error in the estimates. For example, the correlation between the multi-year estimates in 2009 (those based on 2-4 years of prior data) and value-added in 2010 is around 0.50, which is "solidly moderate."
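
For readers who want a feel for what a correlation of around 0.50 implies for year-to-year percentile ranks, here is a minimal simulation sketch. It does not use the NYC data. It assumes, purely for illustration, that scores in the two years are normally distributed with a correlation of 0.50; the function name `simulate_rank_changes`, the number of simulated teachers, and the random seed are all hypothetical choices, so the results should not be expected to match the NYC figures.

```python
import numpy as np


def simulate_rank_changes(rho=0.5, n_teachers=100_000, seed=0):
    """Simulate two years of scores with correlation `rho` and summarize
    how far percentile ranks move between the years.

    Assumption (not from the NYC data): scores are bivariate normal.
    """
    rng = np.random.default_rng(seed)

    # Year-1 scores, and year-2 scores built to correlate with them at rho.
    year1 = rng.standard_normal(n_teachers)
    noise = rng.standard_normal(n_teachers)
    year2 = rho * year1 + np.sqrt(1.0 - rho**2) * noise

    def percentile_rank(scores):
        # argsort of argsort gives each score's rank (0 = lowest).
        ranks = np.argsort(np.argsort(scores))
        return 100.0 * ranks / (len(scores) - 1)

    change = np.abs(percentile_rank(year2) - percentile_rank(year1))
    return {
        "correlation": float(np.corrcoef(year1, year2)[0, 1]),
        "share_within_10": float(np.mean(change <= 10)),
        "share_over_90": float(np.mean(change > 90)),
    }


if __name__ == "__main__":
    for name, value in simulate_rank_changes().items():
        print(f"{name}: {value:.4f}")
```

Under these assumptions, rank changes of 10 points or less should be far more common than swings of more than 90 points, though a handful of extreme swings can still appear among a large enough group of simulated teachers. The exact shares depend on the assumed distribution and correlation.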