When The Legend Becomes Fact, Print The Fact Sheet
The New Teacher Project (TNTP) just released a "fact sheet" on value-added (VA) analysis. I’m all for efforts to clarify complex topics such as VA, and, without question, there is a great deal of misinformation floating around on this subject, both "pro-" and "anti-."
The fact sheet presents five sets of “myths and facts." Three of the “myths” seem somewhat unnecessary: that there’s no research behind VA; that teachers will be evaluated based solely on test scores; and that VA is useless because it’s not perfect. Almost nobody believes or makes these arguments (at least in my experience). But I guess it never hurts to clarify.
In contrast, the other two are very common arguments, but they are not myths. They are serious issues with concrete policy implications. If there are any myths, they're in the "facts" column.
The first objection – that the models aren’t “fair to teachers who work in high-needs schools, where students tend to lag far behind academically” - is a little confusing. In one sense, it’s correct to point out that value-added models focus on growth, not absolute scores, and teachers aren’t necessarily penalized just because their students “start out” low.
But most of the response to this "myth" addresses a rather different question - whether or not the models can fully account for the many factors out of teachers' hands. TNTP's take is that VA models “control for students’ past academic performance and demographic factors," which, they say, means that teachers “aren’t penalized for the effects of factors beyond their control." Even under ideal circumstances, that's just not accurate.
The evidence they cite is a frequently-misinterpreted paper by researchers at Vanderbilt University and the SAS Institute, published in 2004. What the analysis finds is that the results of a specific type of VA model (TVAAS) – one with very extensive data requirements, spanning multiple (in this analysis, five) years and subjects, in one specific location (Tennessee) - are not substantially different when variables measuring student characteristics (i.e., free/reduced lunch eligibility and race) are added to the models.
This does not, however, mean that the TVAAS model – or any other – can account for all the factors that teachers can’t control. For one thing, the free/reduced-price lunch variable is not a very good income proxy. Eligible students vary widely in family circumstances, which is a particular problem in high-poverty areas where virtually all the students qualify.
That paper aside, it's true that students' prior achievement scores account for much of the income-based variation in achievement gains (ironically, prior test scores are probably better at this than free/reduced-price lunch). But not all of poverty's impacts are measurable or observed, and, perhaps more importantly, there are several other potential sources of bias, including the fact that students are not randomly assigned to classrooms (also here). VA scores are also affected by the choice of model, data quality and the test used. And, of course, even if there is no bias at all, many teachers will be “treated unfairly” by simple random error.
These are the important issues, the ones that need discussion. If we're going to use these VA estimates in education policy, we need to at least do it correctly and minimize mistakes. In many places around the nation, this isn't happening (also see Bruce Baker's discussion of growth models). As a result, the number of teachers "penalized" unfairly - whether because they have high-needs students or for other reasons beyond their control - may actually be destructively high. TNTP calls this a "myth." It's not.
The second “myth” they look at is the very common argument that VA scores are too volatile between years to be useful. This too is not a “myth," but it is indeed an issue that could use some clarifying discussion. TNTP points out that all performance measures fluctuate between years, and that they all entail uncertainty. These are valid points. However, their strongest rebuttal is that “teachers who earn very high value-added scores early in their career rarely go on to earn low scores later, and vice-versa."
Their “evidence” is an influential paper by researchers from Florida State University and the RAND Corporation (it was published in 2009). The analysis focuses on the stability of VA estimates over time. While everyone might have a different definition of “rarely," it’s safe to say that the word doesn’t quite apply in this case. Across all teachers, for instance, only about 25-40 percent of the top-quintile (top 20%) teachers in one year were in the top quintile the next year, while between 20 and 30 percent of them ended up in the bottom 40%. Some of this volatility appears to have been a result of “true” improvement or degradation (within-teacher variation), but a very large proportion was due to nothing more than random error.
The accurate interpretation of this paper is that value-added estimates are, on average, moderately stable from year to year, but that stability improves with multiple years of data and better models (also see here and here for papers reaching similar conclusions). This does not mean that teachers' scores "rarely" change over time, nor does it disprove TNTP's "myth." In fact, the papers' results show that VA estimates from poorly-specified models with smaller samples are indeed very unstable, probably to the point of being useless. And, again, since many states and districts are making these poor choices, the instability "myth" is to some degree very much a reality.
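To make the role of random error concrete, here is a minimal simulation sketch. It is not the model used in the paper discussed above, and everything in it is an assumption made for illustration: the function name `quintile_persistence`, normally distributed "true" teacher effects that never change, and a noise level (`noise_sd`) picked arbitrarily. It simply asks how many top-quintile teachers stay in the top quintile in a second period when their true performance is held fixed and only the error is redrawn, and how that changes when each estimate averages several years of data.

```python
import random

def quintile_persistence(n_teachers=5000, noise_sd=1.5, n_years_averaged=1, seed=0):
    """Simulate two periods of noisy VA-style estimates (illustrative only).

    Each hypothetical teacher has a fixed true effect; every yearly estimate
    adds independent random error. Returns a tuple:
    (share of period-1 top-quintile teachers still in the top 20% in period 2,
     share of them who land in the bottom 40% in period 2).
    """
    rng = random.Random(seed)
    true_effect = [rng.gauss(0, 1) for _ in range(n_teachers)]

    def estimate(effect):
        # Average of n_years_averaged noisy yearly estimates of one teacher.
        draws = [effect + rng.gauss(0, noise_sd) for _ in range(n_years_averaged)]
        return sum(draws) / n_years_averaged

    period1 = [estimate(t) for t in true_effect]
    period2 = [estimate(t) for t in true_effect]

    def rank_fraction(scores):
        # Map each teacher to a percentile rank in [0, 1).
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        ranks = [0.0] * len(scores)
        for position, i in enumerate(order):
            ranks[i] = position / len(scores)
        return ranks

    r1, r2 = rank_fraction(period1), rank_fraction(period2)
    top = [i for i in range(n_teachers) if r1[i] >= 0.8]
    stay = sum(1 for i in top if r2[i] >= 0.8) / len(top)
    fall = sum(1 for i in top if r2[i] < 0.4) / len(top)
    return stay, fall

if __name__ == "__main__":
    for years in (1, 3):
        stay, fall = quintile_persistence(n_years_averaged=years)
        print(f"{years} year(s) of data: {stay:.0%} stay in top 20%, "
              f"{fall:.0%} fall to bottom 40%")
```

In this toy setup no teacher actually gets better or worse, so every move between quintiles is pure noise. Averaging more years shrinks the error and makes the rankings more stable, which is the general pattern described above; the specific percentages it prints depend entirely on the assumed noise level and should not be read as estimates of anything real.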
Value-added models are sophisticated and have a lot of potential, but we have no idea how they are best used or whether they will work. It is, however, likely that poor models implemented in the wrong way would "penalize" critically large numbers of teachers for reasons beyond their control, as well as generate estimates that are too unstable to be useful for any purpose, even low-stakes decisions. These are not myths; they are serious risks. Given that TNTP is actively involved in redesigning teacher quality policies in dozens of states and large districts, it is somewhat disturbing that they don't seem to know the difference.
- Matt Di Carlo
Stuart: When I said “let’s not accuse people,” I was talking about you, not TNTP. Maybe “accuse” is a strong word. As for the question itself, let’s just put it this way: Among people involved in education debates, the three arguments, as stated in the document, are too infrequent to be “myths” per se, but among the general public, they’re probably more common. Either way, once again, I don’t have a problem with TNTP pointing them out.
Mary: I appreciate your candor. If I could say I’m relatively certain that any use, no matter how it’s done, of test-based teacher productivity measures in evaluations will destroy public education, I would say so without hesitation. Based on my appraisal of the relevant research, as well as the fact that these things have never really been tried before, I can’t say that. And so I don’t.
I do think, on the other hand, that the specific manner in which this is being done – heavily-weighted estimates, ignoring error margins, rushed implementation – is wrong, that it won’t work and that it might cause harm. This I’ve said many times.
In general, all I can do is interpret the evidence as best I can, and point out when I think others aren’t doing likewise. If that represents betrayal in your eyes, then I’ll have to accept that, but I hope you keep reading and commenting nonetheless.
Wait a minute, Matt. How did you slip over to this extravagantly mealy-mouthed action threshold?
"...relatively certain that any use, no matter how it’s done, of test-based teacher productivity measures in evaluations will destroy public education, I would say so without hesitation. Based on my appraisal of the relevant research, as well as the fact that these things have never really been tried before, I can’t say that."
My degrees are in the natural sciences. I don't have a doctorate of my own, but I've spent hundreds of hours drafting and editing peer-reviewed research papers for my husband's medical biochemistry research (and now my son's computational biology). You don't really need me to tell you how unsound, unreliable, and invalid these methods are. The very research you discuss shows it, and you do a fair job of summarizing the results. No method this weak would ever (ever, ever!) be accepted or applied in the natural sciences, and a finding based on it wouldn't be publishable, let alone mandated by law for massive implementation. All I'm asking is that people admit that, so a real public discussion can begin of what just got rammed down our legislatures.
But very powerful forces demand that all employed commentators pay it lip service. So, your standard for actually opposing its implementation has slipped all the way to being "relatively certain" it "will destroy public education". And you can't be certain, of course, because it has never been tried!
You (and Weingarten and the soon-to-be-former AFT delegates who caved in to Gates) are all off the hook until public education is actually destroyed.
Meanwhile, according to the sanitized article in Wikipedia,
"TNTP is a revenue-generating nonprofit. The majority of its revenue comes from contracts with districts and states to supply services; additional funding for new program development and research is provided by donors such as the Bill and Melinda Gates Foundation."
And here's the bullet-list for their "Smart Spending for Better Teacher Evaluations" publication:
*Tools and Systems to guide and support the evaluation process.
*Training for evaluations and key school district staff.
*Communications to key audiences, especially teachers and school leaders.
*Monitoring to ensure consistent implementation across schools and districts.
*Sustainability of the new system over time, fiscally and substantively.
I wish you would take a less sanguine view of the myths around multiple measures. The issue should be stated with more nuance. Value-added (whether it's a good idea or not) is valid enough to COMPLEMENT or SUPPLEMENT human judgments, but it should never DRIVE evaluations. Real world, being indicted as ineffective by a VAM would often be no different than being convicted. In many (most) systems that are under the gun (and that means most poor systems), it would be a rare evaluator who dared trust his or her lying eyes and not convict a person with a low value-added. Gotham Schools has reported on NYC principals who already have delayed granting tenure to teachers who they see as worthy because the district has made its preference clear.
And in Tennessee, D.C., Florida, and elsewhere, multiple measures just means multiple hoops to jump through. When you have an evaluation rubric that doesn't account for difficult-to-educate populations, being used by evaluators who have been trained to believe like Rhee, Huffman, Daly, Klein, et al., then the second measure is not a check or a balance, but another gotcha.
The validity measure, by the way, should consider George Soros' example of a row of poisoned water. Even if one is poisoned, all are worthless. The issue is not the principals with extraordinary moral character who won't give in to pressure from above. The issue is a) principals who love control who are being given a loaded gun, without a trigger lock, and b) the pressure on the majority of principals to go along and get along with the accountability hawks who believe we can chop up knowledge into measurable pieces.
There will be two metrics that matter more than any other, and we won't know where the tipping point will be. Firstly, at what point does the fear generated by VAMs create enough pressure to mandate Cover Your Ass teacher-proof scripted rote instruction? The Gateses say, correctly, that that strategy would not be rational. But the world is not rational. CYA is the rational and predictable response by powerless institutions under siege.
Secondly, at what point do VAMs prompt an exodus of teachers from schools where it is harder to raise test scores? And once that starts, the CYA tactic becomes even more rational, and that further incentivizes bureaucrats and principals to surround themselves with "yes men."
It looks like we're already seeing all of the above in turnaround schools. Speaking from my experience, I'm seeing school leaders going along with the top-down aligned and paced curriculum, taught by scared 23-year-olds, who are being socialized into just following orders. Since these school leaders know that the tactics they were forced to embrace are doomed, the new metric is "exiting" Baby Boomers. And when VAMs get applied to principals' evaluations, this fear and loathing could metastasize.
Finally, when the TNTP apologizes for its false and misleading statements about unions in general, and peer review in particular, I'll embrace your tone. But now, I see them as an enemy to be defeated. After all, who drafted the VAM language in RttTs? I've hardly seen a RttT application that did not cite the TNTP as a contributor, and as an organization that would be a consultant.