
Premises, Presentation And Predetermination In The Gates MET Study


Thanks for the detailed review and response to Rothstein's work. Looks like this is one I'll dig into myself. There are a couple of other issues that aren't raised here, issues that, as a classroom teacher, I find essential to understanding what actually happens in schools. First, there's a necessary assumption that no significant changes occurred from year to year if you intend to compare those data from year to year. But in a given year, how many changes might there be in my school or classroom? New curriculum, technology, collaborative partners, class sizes, schedules, administrative/discipline policies... and based on studies of each item in that list, I can argue that separately, each of those changes might produce a corresponding effect on test scores. Can researchers identify all the relevant factors? If they could identify those factors, would they have enough data to do anything about it? If they had the data, what research would guide their decisions about the proper formula to weight each factor? I anticipate that researchers would argue those factors affect all teachers, and those who perform best are still worth identifying so that we see who makes the most of those situations. That argument assumes that each relevant in-school factor affects all teachers equally - a faulty assumption that could only be made by someone who hasn't taught in many schools or for very long. See how ridiculous this becomes, from a classroom perspective? Furthermore, individual teacher effects cannot reasonably be isolated on reading tests. My students read outside of my class, and receive relevant instruction outside of my class; they read in every other class, and have teachers with varying abilities helping or hindering their progress. You would need many years of randomized student groups (holding all other factors constant - ha! See above) to make up for the fact that my students have other means of learning reading. 
That is probably less true in math, especially in higher grades, and I would speculate that is the reason that reading scores are less stable. (Less true because while students read in all classes, they are not doing arithmetic or algebra in all classes).

I agree with David -- great piece. But we have a problem: the MET study will get tons of coverage; analyses that criticize it won't. This is a major problem that can only be resolved by promotion -- and it would appear that Gates will win that round handily. Regarding the theme of pre-determined conclusions, if you ever worked with Mr. Gates, or you ever worked at Microsoft, you may have detected not so much a sense of pre-determined conclusions but an emphatic confidence of being right in the moment. As a young man, Bill Gates was never afraid to go to the mat for what he believed -- and to shred anyone who challenged him. This produced some very good results in the software business, a survival of the fittest in the idea fitness landscape. I never knew Mr. Gates, however, to stick dogmatically or ideologically to many points. He switched to new ideas when old ones failed. He was just supremely confident all the time. The Gates Foundation's work in education has been supremely confident all the time. It has never been tentative about anything. At the same time, it has yet to be right about anything. I would submit that as the organization does more research, we see more and more of this surety -- until we don't. That is, Gates can continue to study the problem of education indefinitely in ways other organizations cannot. And it would be foolish to think that the style of research they conduct will ever change. They're just never going to do that humble, self-conscious, reverence for the past stuff that is the stock-in-trade of traditional academic research. It's not that they don't believe in it; it's just not their style. The Gates Foundation doesn't invest in things it isn't sure about. It's not a think tank or a research organization. Research is a strategy for policy development which is a strategy for problem-solving. Policies are right until proven wrong, as are the problems they are designed to solve. Small schools was a big success until it wasn't. 
This may seem inappropriate, but if you've worked in the culture of technology -- and especially with Microsoft in the pre-1995 years -- you would recognize this not so much as a dubious stance on educational research but as a competitive approach to educational problem-solving -- with a certain kind of research as a strategy, not as an end in itself. Gates may eventually "get it right". But if they do, it will be the same way Microsoft "got it right" -- by taking so many opportunities to get it wrong.

Thanks. I'm curious why you stressed "more than 20 percent of the teachers in the bottom quartile (lowest 25 percent) on the state math test are in the top two quartiles on the alternative assessment," when the more important issue, it seems to me, is a couple of sentences later. Rothstein wrote "More than 40% of those whose actually available state exam scores place them in the bottom quarter are in the top half on the alternative assessment." As I read it, districts that use those tests would have to risk the sacrifice of two effective teachers to get rid of three allegedly ineffective ones. Also, the Appendix says "Value-added for state ELA scores is the least predictable of the measures, with an R-squared of 0.167. A teacher with predicted value-added for state ELA scores at the 25th percentile is actually more likely to fall in the top half of the distribution than in the bottom quarter!" Does that mean that the VAM is less likely than a coin flip to be accurate?

Hey John, The finding you cite is the same as mine, but the estimates are corrected for measurement error (from the testing instrument). I chose the uncorrected figures because they represent the "best case scenario," as well as because they make the case just fine on their own. I might have mentioned that the disattenuated estimates are even worse, but I had to pick and choose (the post is already too long!). As for the coin flip situation, it depends on how you define "accurate." In the coin flip analogy, accuracy is defined in terms of stability between years, which is probably the relevant perspective when we're talking about using VA in high-stakes decisions (there's no point in punishing/rewarding teachers based on end-of-year VA scores if they don't predict performance the next year). On the other hand, keep in mind that the MET project is only using two years of data, and stability increases with sample size (a point that would not matter if these measures are used for probationary teachers, of course). Perhaps more importantly, remember that the stability between the 25th percentile in one year, and what happens the next year, is one particular slice of the distribution. For instance, the chance of a 10th percentile teacher ending up in the top half the next year would be lower than 50/50, and that is just as much a measure of accuracy as any other comparison, right? The coin flip analogy is catchy, and it is a vivid demonstration of the instability (though interpretations of the size of these correlations seem to vary widely). But we might be careful about using it to characterize the accuracy of VA in general. The instability varies depending on what you choose as the "starting" and "ending" points, as well as with sample size.
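
For readers who want to see how much the answer moves with the "starting" point, here is a rough sketch in Python. It is not the MET project's method. It assumes that predicted and actual value-added are jointly normal, that actual value-added is standardized, and that the prediction explains a share of the variance equal to the R-squared of 0.167 quoted from the Appendix above. The function name and its parameters are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def outcome_probabilities(predicted_percentile, r_squared):
    """Return (P(actual in top half), P(actual in bottom quarter)).

    Illustrative assumption, not taken from the MET report: predicted and
    actual value-added are jointly normal, actual value-added has mean 0
    and sd 1, and the prediction explains `r_squared` of its variance.
    """
    predicted_sd = sqrt(r_squared)        # spread of the predictions
    residual_sd = sqrt(1.0 - r_squared)   # spread left unexplained

    # Where a teacher at this percentile of the *predicted* distribution sits.
    predicted_value = STD_NORMAL.inv_cdf(predicted_percentile) * predicted_sd

    # Distribution of the teacher's actual value-added given that prediction.
    actual = NormalDist(mu=predicted_value, sigma=residual_sd)

    median_cutoff = 0.0
    bottom_quarter_cutoff = STD_NORMAL.inv_cdf(0.25)

    p_top_half = 1.0 - actual.cdf(median_cutoff)
    p_bottom_quarter = actual.cdf(bottom_quarter_cutoff)
    return p_top_half, p_bottom_quarter


if __name__ == "__main__":
    for percentile in (0.25, 0.10):
        top, bottom = outcome_probabilities(percentile, r_squared=0.167)
        print(f"predicted percentile {percentile:.0%}: "
              f"P(top half)={top:.2f}, P(bottom quarter)={bottom:.2f}")
```

Under these assumptions, a teacher predicted at the 25th percentile lands in the top half a bit more often than in the bottom quarter, which matches the direction of the Appendix statement. Starting from the 10th percentile, the ordering reverses. The exact numbers depend entirely on the normality assumption, so treat them as a feel for the shape of the problem, not as estimates.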

Thanks again. So it would be accurate to say that districts adopting that method would be risking the invalid firing of two effective teachers in order to fire three ineffective ones? That, in itself, should be a show-stopper, at least if VAMs are solely in the hands of management. I also appreciate any reluctance to speculate on legal matters, but I'd think that would call into question the legal basis of many or most terminations based significantly on VAMs in the hands of management. It is a fundamental principle of American jurisprudence that the government can't take a person's property rights without evidence that applies to that person's case. I wouldn't think that a 60% chance that this evidence applies to a teacher would meet judicial scrutiny in many states.

John, One has to wonder how you leave comments so frequently and on so many blogs, and respond so quickly to replies, while also blogging yourself and having a day job. Doesn't seem fair to the rest of us. The legal matters are way beyond my purview. I would check out Bruce Baker's work as a starting point. And my response to the "firing three validly requires firing two invalidly" is the same as my response to the coin toss: in both cases, you are choosing a specific point in the distribution, and generalizing from it to value-added as a whole. You could only do so if a district or state adopted a policy that specifically fired teachers based *exclusively* on value-added scores that fell at or below the 25th percentile (and that won't happen [let's hope!]). On a point unrelated to your comment, while I'm all about finding ways to express complex information in a vivid, understandable manner, and while I'm still highly skeptical about the role of value-added in evaluations and other policies (especially HOW it is used, and how quickly we are adopting it), I know you agree that everyone needs to be (or stay) more careful about how we present our case. Call me naive, but I still believe that most (but not all) of the people pushing VA are, at least to some extent, reasonable and receptive to discussion. We make our case more effectively by staying cool and presenting our evidence fairly. There is plenty, and it stands on its own. Just my opinion... Thanks again for your comment, MD

Just to be clear, John: I'm holding you up as an example of someone who WAS careful with words (that's what I meant by "unrelated to your post;" didn't copyedit). You found a finding, and instead of using it in arguments, you asked whether it would be fair. Others don't always do the same (and I include myself in that, by the way), and that causes a lot of unnecessary polarization. It's an old and mundane point, but it always bears repeating, especially when the subject is as complicated as value-added.

Matthew (or others), I read your post as well as Jesse Rothstein's piece. I was especially interested in the idea that there may be a difference between "teacher quality" depending on what kind of test is used to measure the outcomes. Specifically, Rothstein argues that the MET study shows that teachers who are good at improving scores on the regular state assessments are not necessarily good at improving scores on tests that measure higher-order thinking skills, despite the study authors' claims to the contrary. Can you point me toward other research in this general area, namely, that the kind of test has some impact on a teacher's value-added scores? Thanks.


This web site and the information contained herein are provided as a service to those who are interested in the work of the Albert Shanker Institute (ASI). ASI makes no warranties, either express or implied, concerning the information contained on or linked from this site. The visitor uses the information provided herein at his/her own risk. ASI, its officers, board members, agents, and employees specifically disclaim any and all liability from damages which may result from the utilization of the information provided herein. The content in the Shanker Blog may not necessarily reflect the views or official policy positions of ASI or any related entity or organization.