Value-Added, For The Record

People often ask me for my “bottom line” on using value-added (or other growth model) estimates in teacher evaluations. I’ve written on this topic many times, and while I have in fact given my overall opinion a couple of times, I have avoided expressing it in a strong “yes or no” format. There's a reason for this, and I thought maybe I would write a short piece and explain myself.

My first reaction to the queries about where I stand on value-added is a shot of appreciation that people are interested in my views, followed quickly by an acute rush of humility and reticence. I know think tank people aren’t supposed to say things like this, but when it comes to sweeping, big picture conclusions about the design of new evaluations, I’m not sure my personal opinion is particularly important.

Frankly, given the importance of how people on the ground respond to these types of policies, as well as, of course, their knowledge of how schools operate, I would be more interested in the views of experienced, well-informed teachers and administrators than my own. And I am frequently taken aback by the unadulterated certainty I hear coming from advocates and others about this completely untested policy. That’s why I tend to focus on aspects such as design details and explaining the research – these are things I feel qualified to discuss.  (I also, by the way, acknowledge that it’s very easy for me to play armchair policy general when it's not my job or working conditions that might be on the line.)

That said, here’s my general viewpoint, in two parts. First, my sense, based on the available evidence, is that value-added should be given a try in new teacher evaluations.

I cannot say how large a role these estimates should play, but I confess that I am uncomfortable with the 40-50 percent weighting requirements, not only because it strikes me as too high a starting point, but also because districts vary in their needs, preferences and current policy environments. In addition, I am seriously concerned about other details – the treatment of error, nominal versus actual weights, the manner in which the estimates are converted to ratings, etc. I think these issues are being largely ignored in many states and districts, even though they might, in my view, compromise this whole endeavor. I have strong opinions on these fronts, and I express them regularly.

Now, at this point, you might ask: How can you say you take teachers' opinions seriously and still support value-added, which most teachers don’t like? That’s a fair question. In my experience, the views of teachers who oppose value-added are not absolute.

Here’s my little field test (I recommend trying it yourself): When I’m talking to someone, especially a teacher, who is dead set against value-added, I ask them whether they could support using these estimates as 10-15 percent of their final evaluation score, with checks on the accuracy of the datasets and other basic precautions. Far more often than not, they are receptive (if not enthusiastic).

In other words, I would suggest that views on this issue, as is usually the case, are not binary – it’s a continuum. Teachers are open to trying new things, even if they're not crazy about them; they do this all the time (experienced teachers can [and will] tell you about dozens of programs and products that have come and gone over the years). So, while there’s plenty of blame to go around, this debate might have been a bit less divisive but for the unfounded, dealbreaker insistence on very high weights, for every district, right out of the gate.

This brings me to the second thing I want to say, which is more of a meta-opinion: Whether or not we use these measures in teacher evaluations is an important decision, but the attention it gets seems way overblown.

And I think this is because the intense debate surrounding value-added isn’t entirely – or perhaps even mostly – about value-added itself. Instead, for many people on both “sides” of this issue, it has become intertwined with – a kind of symbol of – firing teachers.

Supporters of these measures are extremely eager to use the estimates as a major criterion for dismissals, as many believe (unrealistically, in my view) that this will lead to very quick, drastic improvements in aggregate performance. Opponents, on the other hand, frequently assert (perhaps unfairly) that value-added represents an attempt to erect a scientific facade around the institutionalization of automatic dismissals that will end up being arbitrary and harmful. Both views (my descriptions of them are obviously generalizations) are less focused on the merits of the measures than on the connected but often severely conflated issue of how they’re going to be used.

Think about it: If, hypothetically, we were designing new evaluations solely for the purpose of helping teachers improve, without also tying them to dismissals or other high-stakes decisions, would there be as much controversy? I very much doubt it. We would certainly find plenty to argue about, but the areas of major disagreement today – e.g., how high the weights should be – might not be particularly salient, since teachers and administrators would presumably be given all the information to use as they saw fit.

Now, here’s a more interesting hypothetical: If we were designing new evaluations and did plan to use them for dismissals and other high-stakes decisions, but value-added wasn’t on the table for whatever reason, would there still be relentless controversy over the measures we were using and how they were combined? I suspect there would be (actually, for teachers in untested grades/subjects, there already is).

That’s because, again, much of the fuss is about the decisions for which the ratings will be used, and the manner in which many of these systems are being imposed on teachers and other school staff. Value-added is the front-line soldier in that larger war.

Thus, when I say that I think we should give value-added a try, that is really just saying that I believe, based on the available evidence, that the estimates transmit useful, albeit imperfect, information about teacher performance. Whether and how this information – or that from other types of measures - is appropriate for dismissals or other high-stakes decisions is a related yet in many respects separate question, and a largely empirical one at that. That's the whole idea of giving something a try - to see how it works out.
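As a purely illustrative aside (nothing in this post specifies a particular model), the "estimates" under discussion are, at heart, regression residuals: a teacher's value-added is typically the average gap between her students' actual scores and the scores predicted from prior achievement and other controls. Here is a minimal sketch with invented data and an invented effect size – real systems use far richer covariates, multiple years of data, and shrinkage:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 3 teachers, 30 students each (all values hypothetical).
n_per = 30
teacher = np.repeat([0, 1, 2], n_per)
prior = rng.normal(50, 10, size=3 * n_per)          # last year's score
true_effect = np.array([-2.0, 0.0, 2.0])[teacher]   # built-in teacher effects
current = 5 + 0.9 * prior + true_effect + rng.normal(0, 5, size=3 * n_per)

# Step 1: regress current scores on prior scores (plus an intercept).
X = np.column_stack([np.ones_like(prior), prior])
coef, *_ = np.linalg.lstsq(X, current, rcond=None)
residual = current - X @ coef

# Step 2: a teacher's "value-added" is the mean residual of her students.
va = np.array([residual[teacher == t].mean() for t in range(3)])
print(np.round(va, 2))  # noisy recovery of the built-in effects
```

Even in this toy version, the estimates recover the built-in effects only approximately – which is exactly why the treatment of error mentioned earlier matters so much in real systems.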

- Matt Di Carlo


Hi Matt,

As always, your balanced and nuanced approach is important. You are quite right that the battle is more about the firing (but also shaming) of teachers. I think if teachers and administrators were simply given the calculations as information for reference, there would be no problem. But when something is used for policy and other high-stakes decision making (down to the opening and closing of entire schools), it makes sense to require a high degree (.9+) of reliability and validity, which, in their totality, I don't think these models can make a very strong claim to.

But there is one last reason, perhaps most important, that these models should be used for information more than for strict/automatic decisions. That is the role the tests themselves play in the curriculum. As a teacher, I was not at all worried about being fired. I was worried about how my entire curriculum was designed not to teach conceptual understanding of mathematics but to narrowly produce the scores that would make people look good. As a test prep coach and math major, I was acutely aware of the difference and how much it impacted all decision making – including mine. In my opinion, the overuse of the tests is the disease; the model and the debate are more the symptoms.


If value-added ratings based on one year's worth of data are highly volatile and even random, as they seem to be, why should they be included as part of teacher evaluation? Instead they will give teachers a false sense of confidence if they are high, and unfairly wreck their morale if they are low.


I'm stunned. I'm absolutely stunned.

I can't even gather myself to reply. So, I'll just ask this. What available evidence do you have that value-added is valid for high-poverty high schools?

After I get over my shock, depending on whether you cite a source for high schools (and please don't cite Chetty et al., because it excludes classes in which over 25% of students are on IEPs), I'll want to know what evidence you find to be persuasive. Or, are we teachers (and our students, who will be sacrificed to more rote instruction) in the inner city to be seen as pawns to be sacrificed on the off chance that the evidence you cite might prove valid?

I just can't believe you wrote that ...


Hi John,

I'm not sure why you're stunned, but I guess it's good I wrote this post. In either case, I like the candor.

For me to answer your question, you would have to define "valid," and, specifically, "valid for what?" As you know, validity is a property of the inferences one draws from the measures, rather than the measures themselves. This means that one cannot really address the validity of value-added without reference to how the estimates are used.

Correct me if I'm wrong, but the language in your comment (“pawns to be sacrificed”) makes it sound like you’re objecting to firing teachers with value-added. One of my big points in this post, which I may not have expressed clearly enough, is the distinction between value-added as a potentially useful signal and value-added as the basis for high-stakes decisions.

Accordingly, I would ask you the following questions: Are you okay with firing teachers based on classroom observations, or on some combination of non-test measures? If that's acceptable to you, can you produce evidence that observation ratings (or the other non-test measures) are "valid" as you define it? If you're not okay with firing, then this becomes a different conversation.

Thanks for the comment,

P.S. There is some research on value-added at the high school level, but it's a little thin, since most states only test in grades 3-8.



Would a system be valid if 5 or 10 or 15% of teachers at a high-poverty high school PER YEAR have their careers damaged or destroyed due to a statistical model that can't control for poverty? How many teachers must see how many of their friends damaged by flawed guesstimates before they throw in the towel? How could urban teachers have any peace of mind when they can be fired, in part, because of circumstances beyond their control? Is it valid, in terms of policy, to take such risks in high schools without first doing the research on high schools?

Of course, I support the firing of ineffective teachers. My preference would be PAR, but I'd prefer empowered, even unreformed, principals over a system that encourages more bubble-in testing. I'd even reluctantly support the Grand Bargain where peer evaluators consider value-added, although there is a huge difference between using it to complement or check human observations, as opposed to the system that is now incentivized – where value-added can indict a teacher as ineffective. For many principals, the indictment would be tantamount to a conviction.

But value-added is systemically different for three reasons. It's most likely to produce an exodus of teaching talent from schools where it is harder to raise test scores. Secondly, abusive human evaluators, I bet, are distributed pretty equally across all types of schools, but value-added stacks EVERYTHING against urban schools, where fearful administrators clearly are more willing to sacrifice their teachers and play games with numbers.

Steve nailed the biggest reason. Value-added will sentence more students to educational malpractice as teachers are pressured to teach to the test.

And, given our victories recently, I'm reconsidering my support for the Grand Bargain. Now is the time to drive a stake through the heart of high-stakes test-driven experiments.

Finally, you indicated you support a 10 to 15% high-stakes weight, which lowers the potential harm. But graduation rates and attendance rates often are only 10% each of NCLB accountability. How many districts responded by completely fabricating those metrics? I bet they are the most falsified data in existence. In other words, even fairly small metrics are huge for some administrators and thus become their job #1. And that gets us back to why value-added is systemically more dangerous for poor schools. Certainly, that has been my experience, where under-the-gun urban systems react with the most fear and retribution.


John Thompson- to your point, check out empirical evidence from TN on their public website:

Click "Reports" - "Scatterplots" - Pick a subject/grade/year - "Growth vs. % Econonomically Disadvantaged" - and "Add All" schools.

You can view scatterplots of all TN schools ranked by % FRPL and see that high-poverty schools have a fair and equal ability to show growth as compared to affluent schools. Actually, many more of these high-poverty schools are above the growth standard (making more than the expected amount of progress with their students as measured by value-added) than not.

Full disclosure, I work for SAS, which provides TN's value-added analysis. I think that looking through the actual data is the best way to see the level playing field across socioeconomic, demographic, SPED, and achievement groups. However, I also understand that not all value-added models are the same across states, nor are the policy decisions placed around them the same.

Here is one of my blogs where I interviewed two NC teachers to gather their opinions of value-added being used as one component of their evaluation system:…



You need a password to access it.

Can you access it and find evidence that high-poverty NEIGHBORHOOD high schools are not disadvantaged? If so, I'd like to read it. And if so, can you find a single objective study that confirms it?

While we're comparing scatterplots, check out Tulsa's web site. Between 85% and 100% of all of their high schools have the identical pattern - high performing schools have high growth while low performing have low growth. So, maybe, Tennessee can do mass firings of high poverty high school teachers and not do injustices, but this is a big diverse nation. "Reformers" are imposing it on 90% low income districts, like mine, as well as places with many times as much per student funding.

By the way, have you taught in a high-poverty NEIGHBORHOOD high school? If you have, you know that they typically have nothing in common with high-poverty magnet and charter schools.

Please note that I choose my words carefully and I wrote of "schools where it is harder to raise test scores." If you've been in NEIGHBORHOOD schools, you've seen that the out-of-control violence, the prohibitions on enforcing attendance, disciplinary and academic policies, and other policies set from above correlate pretty well with poverty, but it's not one-to-one. Maybe you've seen a high-poverty NEIGHBORHOOD high school where teachers were allowed to enforce the rules, but I never have. Who knows? If the two teachers you cite have taught in the inner city, perhaps they've experienced something other than mayhem. I can't imagine a teacher who has taught in schools like mine who would trust a value-added model.

And please remember, the issue isn't whether value-added might work in some places. The issue is whether it will cause devastating harm to many. And the poorer the district, the more likely that value-added could prove an existential threat. In a place like my OKC, where there are twenty-something other districts in the county, so that teachers just need to extend their commute a few minutes to get away from the perfect storm of value-added and intense concentrations of trauma and generational poverty, I don't think our district has a snowball's chance of surviving an extended period of high-stakes value-added.


I'm very appreciative of the thoughtful comments you have made. In the abstract, I can agree that value-added estimates might provide a useful but limited signal regarding school or teacher performance. (As already noted, the tests themselves are very limited as measures.) However, I think you would agree, the major problems arise in how states are using this information. It seems that there has been far too little research done to validate the overall evaluation systems being enacted, not just the role value-added measures will play. (I worry that many policymakers harbor a bias towards what they believe are objective quantitative measures of performance, having little understanding of or appreciation for the many subjective judgments that go into their construction.)
I was pleased to see that Ms. Young has entered this conversation. I’ll state up front that I'm not convinced that EVAAS is purged of the bias introduced by unmeasured demographic characteristics. I was wondering what SAS's response was to the concerns raised by Dr. Bruce Baker in this post:… Baker's work suggests that the school level Ohio value-added scores are negatively related to poverty (free and reduced price lunch) and special education status.
When I looked at school-level EVAAS (PVAAS) indexes published for Pennsylvania, I found that truancy rates (not typically part of state data collections) have a moderate negative (-.4) correlation to PVAAS reading scores. The Ballou, Mokher, and Cavalluzzo paper from last March's AEFP conference suggests that model choice and unmeasured demographic characteristics can have a very substantial impact on which teachers end up in the tails of the value-added distribution. So it seems to me that John Thompson’s concerns are very important and not easily dismissed. I think it is incumbent on states to provide evidence regarding potential VAM bias that is evaluated by qualified third parties. From what I have observed, states don’t even seem to bother with putting out RFPs or soliciting competitive bids for their vendors of value-added analyses. Relying solely on vendors for technical reviews of their own products is indefensible.
Finally, if you know of any independent organization that is evaluating either the value-added models being sold to states or the overall evaluation systems they are being incorporated into, please pass that along. I recently saw a piece put out by the Center for American Progress. At least in the case of Pennsylvania, they only interviewed a couple of officials of the State Department of Education. I have no doubt they would have come up with a different picture had they bothered to dig deeper. Until these concerns are addressed and the evaluation systems undergo rigorous validation, I will retain my belief that value-added remains a noisy signal of performance and will find much greater value as a powerful research tool.
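For what it's worth, the kind of bias check being debated here is something anyone can approximate once school-level files are published: correlate the value-added index with the demographic variable in question. A minimal sketch, with entirely invented numbers (the real check would of course use actual state data):

```python
import numpy as np

# Hypothetical school-level data: published growth index and truancy rate (%).
va_index = np.array([1.2, 0.4, -0.3, -1.1, 0.8, -0.6, -1.4, 0.2])
truancy = np.array([3.0, 5.0, 9.0, 14.0, 4.0, 11.0, 16.0, 6.0])

# Pearson correlation: a sizable negative value is a red flag that the
# model may not be fully purging the influence of student background.
r = np.corrcoef(va_index, truancy)[0, 1]
print(round(r, 2))
```

A correlation near zero would be consistent with the "level playing field" claim; a moderate negative one, like the -.4 reported above for Pennsylvania, is precisely the pattern that warrants third-party scrutiny.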


(I posted this once and it disappeared. Sorry if this is a dupe)

Given the "thin research" on VAM and high school, why would you write an entire post in favor of using VA to assess teachers without mentioning whether you would limit it to teachers of 3-8 or not? Seems like a big hole. So, do you think it's only appropriate for elementary school teachers, or do you think that all new teachers (and apparently new teachers only) should be assessed with VA?

Second question, and this one is real: does VA take average proficiency level per classroom into account?

Teacher A has 30 students: 5 advanced, 15 proficient, 10 just below basic.

Teacher B has 30 students: 5 basic, 10 below basic, 15 far below basic.

Are the expectations for each teacher the same for each proficiency level?

Or, a real-life example:…

This is the actual distribution of incoming algebra ability at a Title I school (I collected the data myself, as one of the teachers). Should each teacher be assessed based on incoming ability only, even though some teachers had far fewer below basic students than others?