** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post
About two weeks ago, the National Education Policy Center (NEPC) released a review of last year’s Los Angeles Times (LAT) value-added analysis – with a specific focus on the technical report upon which the paper’s articles were based (done by RAND’s Richard Buddin). In line with prior research, the critique’s authors – Derek Briggs and Ben Domingue – redid the LAT analysis, and found that teachers’ scores vary widely, but that the LAT estimates would be different under different model specifications; are error-prone; and conceal systematic bias from non-random classroom assignments. They were also, for reasons yet unknown, unable to replicate the results.
Since then, the Times has issued two responses. The first was a quickly-published article, which claimed (including in the headline) that the LAT results were confirmed by Briggs/Domingue – even though the review reached the opposite conclusions. The basis for this claim, according to the piece, was that both analyses showed wide variation in teachers’ effects on test scores (see NEPC’s reply to this article). Then, a couple of days ago, there was another response, this time on the Times’ ombudsman-style blog. This piece quotes the paper’s Assistant Managing Editor, David Lauter, who stands by the paper’s findings and the earlier article, arguing that the biggest question is:
...whether teachers have a significant impact on what their students learn or whether student achievement is all about ... factors outside of teachers’ control. ... The Colorado study comes down on our side of that debate. ... For parents and others concerned about this issue, that’s the most significant finding: the quality of teachers matters.
Saying “teachers matter” is roughly equivalent to saying that teacher effects vary widely – the more teachers vary in their effectiveness, controlling for other relevant factors, the more they can be said to “matter” as a factor explaining student outcomes. Since both analyses found such variation, the Times claims that the NEPC review confirms their “most significant finding.”
The review’s authors had a much different interpretation (see their second reply). This may seem frustrating. All the back and forth has mostly focused on somewhat technical issues, such as model selection, sample comparability, and research protocol (with some ethical charges thrown in for good measure). These are essential matters, but there is also an even simpler reason for the divergent interpretations, one that is critically important and arises constantly in our debates about value-added.
Here’s the first key point: The finding that teachers matter – that there is a significant difference overall between the most and least effective teachers – is not in dispute. Indeed, the fact that there is wide variation in teacher “quality” has been acknowledged by students, parents, and pretty much everyone else for centuries – and has been studied empirically for decades (see here and here for older examples). The more recent line of value-added research has made enormous (and fascinating) contributions to this knowledge, using increasingly sophisticated methods (see here, here, here, and here for just a few influential examples).
Therefore, the Times’ claim that the NEPC analysis confirmed their findings because they too found wide variation in teacher effects is kind of missing the point. Teacher effects will vary overall with virtually any model specification that’s even remotely complex. The real issue, both in this case and in the larger debate over value-added, is whether we can measure the effectiveness of individual teachers.
Now, if the Times had simply published a few articles reporting their overall findings – for example, the size of the aggregate difference between the most and least effective teachers, and how it varies by school, student, and teacher characteristics – I suspect there would have been relatively little controversy. The core criticisms by Briggs and Domingue would still have been relevant and worth presenting, of course – their review is focused on the analysis, not how the Times used it. But the LAT technical paper (and articles based on it) would really have just been one of dozens reaching the same conclusion – albeit one presented more accessibly (in the articles), using a large new database in the newspaper’s home town.
Of course, the Times did not stop there. They published the value-added scores for individual teachers in an online database. Just as the academic literature on value-added is different from using the estimates in high-stakes employment decisions, the paper’s publication of the database is very different from their presenting overall results.
Let’s say I was working for a private company, and I told my boss that I had an analysis showing that there was wide variation in productivity among the company’s employees. She probably already knew that, or at least suspected as much, but she might be interested to see the size of the differences between the most and least productive workers. The results might even lead her to implement particular policies – in hiring, mentoring, supervision, and the like. But this is still quite different from saying that I could use this information to accurately identify which specific employees are the most and least productive, both now and in the future.
The same goes for teachers, and that is the context in which the criticisms by Briggs and Domingue are most consequential. They address a set of important questions: How many teachers’ estimates change with a different model with different variables (and what does that mean if they do)? Did the model omit important variables that influenced individual teachers’ estimates? Were the estimates biased by school-based decisions such as classroom assignment? How many teachers were misclassified due to random error?
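To make the distinction concrete, here is a minimal simulation sketch in Python. Every number in it is a hypothetical assumption (the count of teachers, the spread of "true" effects, and the amount of random error); none of it comes from the LAT data or the NEPC review. It only illustrates how estimates can vary widely overall while many individual teachers still land in the wrong category.

```python
import random
import statistics

def simulate(n_teachers=2000, true_sd=0.20, noise_sd=0.20, seed=0):
    """Draw hypothetical true teacher effects and noisy estimates of them."""
    rng = random.Random(seed)
    true_effects = [rng.gauss(0, true_sd) for _ in range(n_teachers)]
    # Each estimate is the true effect plus independent random error.
    estimates = [t + rng.gauss(0, noise_sd) for t in true_effects]
    return true_effects, estimates

def quintile_labels(values):
    """Assign each value a quintile from 0 (lowest) to 4 (highest) by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * 5 // len(values)
    return labels

true_effects, estimates = simulate()
true_q = quintile_labels(true_effects)
est_q = quintile_labels(estimates)

# Aggregate finding: the estimates vary widely across teachers.
spread = statistics.pstdev(estimates)

# Individual-level finding: share of teachers placed in the wrong quintile.
misclassified = sum(a != b for a, b in zip(true_q, est_q)) / len(true_q)

print(f"spread of estimates: {spread:.2f}")
print(f"share in wrong quintile: {misclassified:.0%}")
```

Under these assumed values, both things are true at once: the spread is large, and a substantial share of teachers is assigned to the wrong quintile. Changing `noise_sd` shows how strongly the second result depends on the amount of error.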
From this perspective, with an eye toward individual-level accuracy, the Times might have proceeded differently. They might have accounted for error margins in assigning teachers effectiveness ratings (as I have discussed before). When confronted with the failure to replicate their results, they might have actually shown concern, and taken steps to figure it out. And they might have reacted to the fact that their results vary by model specification and were likely biased by non-random classroom assignment (which will likely be made worse by the publication of the database) by, at the very least, agreeing to make public their sensitivity analyses, and defending their choices.
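As a rough illustration of what accounting for error margins could look like, here is a short Python sketch. It is one possible approach under stated assumptions, not the method used by the Times or proposed by Briggs and Domingue: the function name `rate`, the three labels, and the 95 percent interval (`z=1.96`) are all hypothetical choices, and estimates are assumed to be centered so that zero means "average."

```python
def rate(estimate, std_error, z=1.96):
    """Label a teacher only when the interval around the estimate excludes zero.

    estimate:  hypothetical value-added estimate, centered so 0 is average
    std_error: hypothetical standard error of that estimate
    z:         interval width in standard errors (1.96 is roughly 95 percent)
    """
    lower = estimate - z * std_error
    upper = estimate + z * std_error
    if lower > 0:
        return "above average"
    if upper < 0:
        return "below average"
    # The interval includes zero, so the data cannot separate this
    # teacher from the average.
    return "not distinguishable from average"

print(rate(0.30, 0.10))   # interval lies entirely above zero
print(rate(0.10, 0.10))   # interval includes zero
print(rate(-0.30, 0.10))  # interval lies entirely below zero
```

The design choice here is simply to withhold a label when the data cannot support one, rather than forcing every teacher into a category.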
Instead, they persisted in defending a conclusion that was never in question. They argued – twice – that the NEPC review also found variation in teacher effects, and therefore supported their “most significant” conclusion, even if it disagreed with their other findings. On this basis, they downplayed the other issues raised by Briggs/Domingue (who are, by the way, reputable researchers pointing out inherent, universally-accepted flaws in these methods). In other words, the Times seems to have conflated the importance of teacher quality with the ability to measure it at the individual level.
And, unfortunately, they are not alone. I hear people – including policymakers – advocate constantly for the use of value-added in teacher evaluations or other high-stakes decisions by saying that “research shows” that there are huge differences between “good” and “bad” teachers.
This overall variation is a very important finding, but for policy purposes, it doesn’t necessarily mean that we can differentiate between the good, the bad, and the average at the level of individual teachers. How we should do so is an open question. Conflating the importance of teacher quality with the ability to measure it carries the risk of underemphasizing all the methodological and implementation details – such as random error, model selection, and data verification – that will determine whether value-added plays a productive role in education policy. These details are critical, and way too many states and districts, like the Los Angeles Times, actually seem to be missing the trees for the forest.