Value-Added: Theory Versus Practice

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

About two weeks ago, the National Education Policy Center (NEPC) released a review of last year’s Los Angeles Times (LAT) value-added analysis – with a specific focus on the technical report upon which the paper’s articles were based (done by RAND’s Richard Buddin). In line with prior research, the critique’s authors – Derek Briggs and Ben Domingue – redid the LAT analysis, and found that teachers’ scores vary widely, but that the LAT estimates would be different under different model specifications; are error-prone; and conceal systematic bias from non-random classroom assignments. They were also, for reasons yet unknown, unable to replicate the results.

Since then, the Times has issued two responses. The first was a quickly-published article, which claimed (including in the headline) that the LAT results were confirmed by Briggs/Domingue – even though the review reached the opposite conclusions. The basis for this claim, according to the piece, was that both analyses showed wide variation in teachers’ effects on test scores (see NEPC’s reply to this article). Then, a couple of days ago, there was another response, this time on the Times’ ombudsman-style blog. This piece quotes the paper’s Assistant Managing Editor, David Lauter, who stands by the paper’s findings and the earlier article, arguing that the biggest question is:

...whether teachers have a significant impact on what their students learn or whether student achievement is all about ... factors outside of teachers’ control. ... The Colorado study comes down on our side of that debate. ... For parents and others concerned about this issue, that’s the most significant finding: the quality of teachers matters.

Saying “teachers matter” is roughly equivalent to saying that teacher effects vary widely – the more teachers vary in their effectiveness, controlling for other relevant factors, the more they can be said to “matter” as a factor explaining student outcomes. Since both analyses found such variation, the Times claims that the NEPC review confirms their “most significant finding.”

The review’s authors had a much different interpretation (see their second reply). This may seem frustrating. All the back and forth has mostly focused on somewhat technical issues, such as model selection, sample comparability, and research protocol (with some ethical charges thrown in for good measure). These are essential matters, but there is also an even simpler reason for the divergent interpretations, one that is critically important and arises constantly in our debates about value-added.

Here’s the first key point: The finding that teachers matter – that there is a significant difference overall between the most and least effective teachers – is not in dispute. Indeed, the fact that there is wide variation in teacher “quality” has been acknowledged by students, parents, and pretty much everyone else for centuries – and has been studied empirically for decades (see here and here for older examples). The more recent line of value-added research has made enormous (and fascinating) contributions to this knowledge, using increasingly sophisticated methods (see here, here, here, and here for just a few influential examples).

Therefore, the Times’ claim that the NEPC analysis confirmed their findings because it too found wide variation in teacher effects misses the point. Teacher effects will vary overall with virtually any model specification that’s even remotely complex. The real issue, both in this case and in the larger debate over value-added, is whether we can measure the effectiveness of individual teachers.

Now, if the Times had simply published a few articles reporting their overall findings – for example, the size of the aggregate difference between the most and least effective teachers, and how it varies by school, student, and teacher characteristics – I suspect there would have been relatively little controversy. The core criticisms by Briggs and Domingue would still have been relevant and worth presenting, of course – their review is focused on the analysis, not how the Times used it. But the LAT technical paper (and articles based on it) would really have just been one of dozens reaching the same conclusion – albeit one presented more accessibly (in the articles), using a large new database in the newspaper’s home town.

Of course, the Times did not stop there. They published the value-added scores for individual teachers in an online database. Just as the academic literature on value-added is different from using the estimates in high-stakes employment decisions, the paper’s publication of the database is very different from their presenting overall results.

Let’s say I was working for a private company, and I told my boss that I had an analysis showing that there was wide variation in productivity among the company’s employees. She probably already knew that, or at least suspected as much, but she might be interested to see the size of the differences between the most and least productive workers. The results might even lead her to implement particular policies – in hiring, mentoring, supervision, and the like. But this is still quite different from saying that I could use this information to accurately identify which specific employees are the most and least productive, both now and in the future.

The same goes for teachers, and that is the context in which the criticisms by Briggs and Domingue are most consequential. They address a set of important questions: How many teachers’ estimates change with a different model with different variables (and what does that mean if they do)? Did the model omit important variables that influenced individual teachers’ estimates? Were the estimates biased by school-based decisions such as classroom assignment? How many teachers were misclassified due to random error?

From this perspective, with an eye toward individual-level accuracy, the Times might have proceeded differently. They might have accounted for error margins in assigning teachers effectiveness ratings (as I have discussed before). When confronted with the failure to replicate their results, they might have actually shown concern, and taken steps to figure it out. And they might have reacted to the fact that their results vary by model specification and were likely biased by non-random classroom assignment (which will likely be made worse by the publication of the database) by, at the very least, agreeing to make public their sensitivity analyses, and defending their choices.

Instead, they persisted in defending a conclusion that was never in question. They argued – twice – that the NEPC review also found variation in teacher effects, and therefore supported their “most significant” conclusion, even if it disagreed with their other findings. On this basis, they downplayed the other issues raised by Briggs/Domingue (who are, by the way, reputable researchers pointing out inherent, universally-accepted flaws in these methods). In other words, the Times seems to have conflated the importance of teacher quality with the ability to measure it at the individual level.

(Incidentally, they made a similar mistake in their article about the Gates MET report.)

And, unfortunately, they are not alone. I hear people – including policymakers – advocate constantly for the use of value-added in teacher evaluations or other high-stakes decisions by saying that “research shows” that there are huge differences between “good” and “bad” teachers.

This overall variation is a very important finding, but for policy purposes, it doesn’t necessarily mean that we can differentiate between the good, the bad, and the average at the level of individual teachers. How we should do so is an open question. Conflating the importance of teacher quality with the ability to measure it carries the risk of underemphasizing all the methodological and implementation details – such as random error, model selection, and data verification – that will determine whether value-added plays a productive role in education policy. These details are critical, and way too many states and districts, like the Los Angeles Times, actually seem to be missing the trees for the forest.

Would the Times claim that, because there is wide variation in the number of assists by NBA players, the way to win championships is to cut the players who don't make enough assists according to a statistical model? Or would it say the same about rebounds? Or points scored?

Would it say that we should go ahead and fire basketball players, now, based on any one of those factors, because someday over the rainbow, a statistical model may be developed that takes all three into account?

I'm not sure I understand this. I agree that probably the quality of teaching varies, as in any profession, but I'm not sure how we can know this for sure, if we can't ascertain or measure what the actual quality of any individual teacher might be.

If the estimates of a teacher's effectiveness (based solely on test scores) vary widely from year to year, or from one VA formula to another, or from one specific test to another, or even perhaps for one class vs. another -- how do we know how much the "quality" of teachers varies over an entire cohort?

Which does not even begin to address the issue that test scores should never be the sole measure of teacher quality. There are some teachers who might be good at keeping students from dropping out and engaging their interest -- but not at raising their test scores, for example. Are you counting that as "quality" as well?

Thanks for continuing to shine a light on the misuse of statistics, and value-added analysis. As Mark Twain's famous quote*, "Lies--damned lies--and statistics," reveals, numbers are easily manipulated to convey truth or a highly nuanced meaning, when in fact, their value is highly dependent upon the logic used to produce them. The methods that produce quantifiable data are assumed to be valid on their face simply since data allow easy comparisons. It reduces the complex to the facile, allowing anyone with an opinion to rail against the object quantified, irrespective of the validity of the measure, or model, and hence, the data. "If the LA Times reports this, and it is data, surely it must be correct," they think.

Sadly, right is not might in our world very often. Those with access to nearly limitless amounts of capital, and lest we forget - axes to grind, persist in perpetuating falsehoods like the LA Times value-added analysis of LAUSD teachers. While value-added offers some value as a management tool, its misuse at the individual teacher level is not apparent to the average person. So, like the days of old, good citizens rally around the town square, pitchforks and torches in hand, ready to rid the castle of the purported evildoers who, when viewed properly, are simply those trying to do what the townsfolk want and could not do themselves.

* Twain quoting Benjamin Disraeli in his autobiography. According to Stephen Goranson at http://is.gd/iejZcD, "Twain's Autobiography attribution of a remark about lies and statistics to Disraeli is generally not accepted. Evidence is now available to conclude that the phrase originally appeared in 1895 in an article by Leonard H. Courtney."

Hey Leonie,

If I understand you correctly, you are asking how we can determine that there is overall variation in teacher effects if we can’t isolate them at the individual level.

If that’s what you meant, it’s an excellent question. I might have explained it more clearly in the post.

I think the best way to answer is with an extremely ironic (albeit imperfect) analogy. Let’s say I’m gathering data on dart throwing – I have a large group of regular dart players, and I give each one ten tries to hit the bull’s eye, recording the results for each throw. Over a large sample of throwers, there would probably be a lot of “spread” in the results. Some people would hit it 8-9 times, some 6-7 times, some 3-4 times. Using these data, I could demonstrate reasonably – with statistical measures or just eyeballing it (I might need more than ten tries for each person to do it statistically) – that some players were better than others at this particular task. That is, there is overall variation in performance that is not just random fluctuation (if it were all just “luck,” most people would hit roughly the same number, and there’d be less spread).

But let’s say I then took it a step further, and tried to assign each *individual* a score – percentage of throws hitting the bull’s eye – and I called that their “bull’s eye performance index.” This is a whole different matter, as anyone who has played darts will tell you. At the *individual* level, my index would be pretty weak. Maybe some of my throwers were drunk, or distracted by poor lighting, or just having a bad day, etc. I might *tentatively* say that the people who hit 8-9 bull’s eyes were at least above average, those who hit none were likely below average (at least for regular dart players), but for the vast majority, it would be tough to tell either way.

So, my darts “analysis” could certainly show that there seems to be quite a bit of variation in the ability to hit the bull’s eye (or at least that it wasn’t just random), but NOT which individual players are definitely good or bad, at least in the vast majority of cases. (Note that, as is the case with teachers, accuracy would improve if I gave them 100 throws instead of ten, or if I made an effort to control for factors like environment or sobriety.)
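For anyone who wants to see the darts point in numbers, here is a rough simulation sketch in Python. Everything in it is made up for illustration: the function name `simulate`, the idea that each player has a fixed "true" chance of hitting the bull's eye drawn evenly between 20% and 80%, and the use of "above the median" as the rating. It is not the LAT or NEPC model, just the dart analogy written out.

```python
import random
import statistics


def simulate(n_players=1000, n_throws=10, seed=0):
    """Simulate dart players and compare group-level and individual-level conclusions."""
    rng = random.Random(seed)

    # Assumption: each player's "true" skill is a fixed hit probability.
    skills = [rng.uniform(0.2, 0.8) for _ in range(n_players)]

    # Each throw is an independent hit or miss with that probability.
    hits = [sum(rng.random() < p for _ in range(n_throws)) for p in skills]
    rates = [h / n_throws for h in hits]

    # Group-level question: is the spread in observed hit rates bigger than
    # what luck alone would give if every player had the same average skill?
    mean_rate = statistics.mean(rates)
    observed_var = statistics.pvariance(rates)
    luck_only_var = mean_rate * (1 - mean_rate) / n_throws

    # Individual-level question: how often does a player's observed rate land
    # on the wrong side of the median compared with their true skill?
    median_skill = statistics.median(skills)
    median_rate = statistics.median(rates)
    wrong = sum(
        (s > median_skill) != (r > median_rate) for s, r in zip(skills, rates)
    )
    return observed_var, luck_only_var, wrong / n_players


if __name__ == "__main__":
    for throws in (10, 100):
        observed, luck_only, share_wrong = simulate(n_throws=throws)
        print(
            f"{throws} throws: observed variance {observed:.4f}, "
            f"luck-only variance {luck_only:.4f}, "
            f"share rated on the wrong side of the median {share_wrong:.1%}"
        )
```

Under these assumptions, the observed spread should exceed the luck-only benchmark even with ten throws, which is the "some players are better than others" finding. The share of players rated on the wrong side of the median should still be substantial at ten throws and should shrink as the number of throws grows, which is the individual-level problem.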

Obviously, teaching is nothing like darts, but I think this gives the basic idea. Dart throwing, like teaching, is a skill, and some people are better than others at it. But when it comes to showing WHO is good or bad with any degree of accuracy, that’s a far more difficult endeavor.

And we could also, of course, make an even more important point: That being a good dart player is about much more than hitting bull’s eyes, just as good teaching is about much more than test results. For instance, someone might be poor at hitting bull’s eyes, but very good at hitting other parts of the board. So, a good performance measure – for dart players and teachers – must be multidimensional.

In the case of the LAT/NEPC, we obviously cannot know the “correct model,” but the fact that results varied with different models is itself one of the critical issues (I would add that Briggs and Domingue did, in my view, present compelling evidence that alternative specifications were superior). All models will show that teacher effects (at least as measured by test score gains) are not solely random, but different models will show different teachers to be more or less effective.

And that was my main argument: The overall variation was not really the point, and by focusing so strongly on it, the LA Times (and others) kind of miss all the other issues that bear on the debate over using value-added.

Sorry for the unacceptably long reply, but I hope this answers your question. Thanks for the comment.