Teacher Value-Added Scores: Publish And Perish
On the heels of the Los Angeles Times’ August decision to publish a database of teachers’ value-added scores, New York City newspapers are poised to do the same, with a hearing on the matter scheduled for late November.
Here’s a proposition: Those who support the use of value-added models (VAM) for any purpose should be lobbying against the release of teachers’ names and value-added scores.
The reason? Publishing the names directly compromises the accuracy of an already-compromised measure. Those who blindly advocate for publication – often saying things like “what’s the harm?” – betray their lack of knowledge about the importance of the models’ core assumptions, and the implications they carry for the accuracy of results. Indeed, the widespread publication of these databases may even threaten VAM’s future utility in public education.
Let me explain. Value-added models, which are statistical techniques for isolating the effect of individual teachers on gains in their students’ test scores, rely on a set of core assumptions. Some of these can be tested; others cannot. One of the most important assumptions – and one that has recently gotten a lot of public attention – is that students are assigned to teachers based on factors that are either measured by the models or that are not correlated with achievement outcomes. In other words, that the models can control for the fact that teachers get assigned different groups of students.
For example, if students and teachers were randomly assigned to classes, with enough years of data, value-added models could produce relatively accurate teacher effect estimates. Random assignment would ensure that, over time, with multiple years or classes, teachers would all get roughly the same mix of students (in terms of ability, behavioral issues, and other traits) to work with.
But this is almost never the case. It has long been known that students and teachers are assigned to classrooms in very deliberate ways (see here and here), often for good educational reasons. However, since the VAMs assume that sorting is independent of all factors the models can't account for, each teacher is being “treated” as if he or she has the same types of students as all other teachers (VAMs assess teacher performance entirely by comparison with other teachers/students). The fact that this isn't the case creates serious systematic inaccuracy in the results. For example, teachers who are particularly good with students who have behavioral issues may be assigned a disproportionate share of those students (and we should want it that way). But since students with behavioral issues tend to score poorly on tests, compared with students without these difficulties, the teachers who get them are handed an unmeasured “burden” that the models do not account for (to the degree that behavioral issues are not "absorbed" by other variables in the model). Instead of being rewarded for their ability to work with difficult students, the teachers may be punished for it.
Put simply, any time students are assigned to classrooms based on unmeasured factors, such as behavioral issues or motivation, that are associated with test performance, VAM results are corrupted. The extent of the bias that results from non-random assignment – the amount of inaccuracy it causes – is sometimes overstated (and simple random error, mostly due to small sample sizes, is arguably a bigger problem when there are only a few years of data), but it remains a huge issue in the use of these methods.
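The mechanics here are easy to see in a toy simulation (the numbers below are hypothetical, not drawn from any actual district's model): give two teachers identical true effectiveness, but assign students with an unmeasured, score-depressing trait disproportionately to one of them, and a naive value-added comparison will rank that teacher lower.

```python
# A minimal sketch of non-random-assignment bias (hypothetical numbers).
# Teachers A and B have identical true effects (zero); the only difference
# is how students with an unmeasured trait are sorted between them.
import numpy as np

rng = np.random.default_rng(0)

def naive_vam_estimates(share_to_b=0.8, n=100_000, trait_penalty=10.0):
    """Return naive 'value-added' estimates (mean score gains) for A and B.

    share_to_b   – probability a trait student is assigned to Teacher B
                   (0.5 reproduces random assignment)
    trait_penalty – gain lost by students with the unmeasured trait
    """
    trait = rng.random(n) < 0.2                      # 20% of students have the trait
    to_b = np.where(trait,
                    rng.random(n) < share_to_b,       # trait students sorted to B
                    rng.random(n) < 0.5)              # others split evenly
    gains = rng.normal(0.0, 10.0, n) - trait_penalty * trait
    return gains[~to_b].mean(), gains[to_b].mean()

a_rand, b_rand = naive_vam_estimates(share_to_b=0.5)   # random assignment
a_sort, b_sort = naive_vam_estimates(share_to_b=0.8)   # deliberate sorting
```

Under random assignment the two estimates come out nearly identical; under sorting, Teacher B's estimate drops well below Teacher A's even though both teachers are, by construction, equally effective. That gap is pure bias from the unmeasured trait.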
Which brings me back to New York City. If newspapers publish teachers’ VAM scores, it will almost certainly inflame the non-random assignment problem, perhaps even to a large degree. Many parents will never check the rankings of their children’s current or future teachers in the online databases. But many will. And those who do check (especially those who might act on the information) will, on the whole, be different from those who don’t. The checkers will, on average, tend to be better-informed and better-educated. They will be the parents who have the time, resources, and know-how to peruse data online, and they will be, again on average, more motivated and involved in their kids’ education than the non-checkers. These are all characteristics that are associated with test performance, and they are all largely unmeasured by value-added models (controls for, say, free lunch eligibility and prior year’s test scores might capture some of these differences in parental characteristics, but only partially in most cases).
Every time a parent finds that his or her child has been assigned to a “low-rated” teacher and successfully requests a new teacher, bias is created – not just in the first teacher’s estimate, but in the new teacher’s as well. The inflow of higher-performing students into the classes of the “higher-rated” teachers will artificially boost those teachers’ VAM scores, while their removal from the classrooms of “lower-rated” teachers will artificially depress these teachers’ scores. In many cases, the teachers who are deliberately assigned lower-performing students (because they are more effective with them) will be punished even further for their laudable efforts.
Just as importantly, any time a parent checks a teacher’s rating and finds it acceptable – and therefore does not request a new teacher – this also creates bias of sorts, by reinforcing the assignment of “effective” teachers to higher-performing students.
Even if only a relatively small proportion of parents check teachers’ ratings with the intention of acting on them (if they don’t meet with their approval), it could have considerable implications for the accuracy of VAMs in these cities. And remember that VAM results are already subject to tremendous amounts of imprecision. For example, the average margin of error for NYC VAM scores is plus or minus 30 percentile points – a 60-point spread. Some of this is random error; some is due to “systematic” factors like non-random assignment. It’s often hard to separate the two. Increasing non-random assignment (by publishing databases) is likely to increase the imprecision of these already-imprecise models, possibly by a substantial margin (at the very least, it is likely that hundreds of individual teachers' scores will be biased). And this is accuracy that the models cannot afford to spare.
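To see what a ±30-percentile-point margin of error means in practice, consider a toy check (the helper below is illustrative, not any district's actual reporting rule): two teachers can only be reliably distinguished if their error ranges do not overlap, and with a margin that wide, most pairs overlap.

```python
# A toy illustration of a +/- 30 percentile-point margin of error
# (hypothetical helper, not an actual NYC reporting formula).

def percentile_range(score, margin=30):
    """Clamp score +/- margin to the 0-100 percentile scale."""
    return max(0, score - margin), min(100, score + margin)

def distinguishable(score_a, score_b, margin=30):
    """True only if the two teachers' error ranges do not overlap."""
    lo_a, hi_a = percentile_range(score_a, margin)
    lo_b, hi_b = percentile_range(score_b, margin)
    return hi_a < lo_b or hi_b < lo_a

# A 40th-percentile teacher spans (10, 70); a 70th-percentile teacher
# spans (40, 100). The ranges overlap, so the data cannot separate them.
```

With this margin, a teacher at the 40th percentile and one at the 70th are statistically indistinguishable; only teachers more than 60 points apart clear the bar. That is the precision the published rankings would be asking parents to act on.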
So, putting aside all the serious and very valid concerns about the fairness of publishing teacher ratings that are known to be incomplete, imprecise, and unstable, there is a very real degree to which the publication of names and scores represents a threat to a potentially useful research tool. Anyone who believes in using VAMs for any purpose – whether it’s to improve instruction or to determine teachers’ pay and evaluations – should think three times before they support the publication of these databases.