Premises, Presentation And Predetermination In The Gates MET Study
** Also posted here on Valerie Strauss’s “Answer Sheet” blog in the Washington Post
The National Education Policy Center today released a scathing review of last month’s preliminary report from the Gates Foundation-funded Measures of Effective Teaching (MET) project. The critique was written by Jesse Rothstein, a highly respected Berkeley economist and the author of an elegant and oft-cited paper demonstrating how non-random classroom assignment biases value-added estimates (also see the follow-up analysis).
Very quickly on the project: Over two school years (this year and last), MET researchers, working in six large districts—Charlotte-Mecklenburg, Dallas, Denver, Hillsborough County (FL), Memphis, and New York City—have been gathering an unprecedented collection of data on teachers and students, grades 4-8. Using a variety of assessments, videotapes of classroom instruction, and surveys (student surveys are featured in the preliminary report), the project is attempting to address some of the heretofore under-addressed issues in the measurement of teacher quality (especially non-random classroom assignment and how different classroom practices lead to different outcomes, neither of which is part of this preliminary report). The end goal is to use the information to guide the creation of more effective teacher evaluation systems that incorporate high-quality multiple measures.
Despite my disagreements with some of the Gates Foundation’s core views about school reform, I think that they deserve a lot of credit for this project. It is heavily resourced, the research team is top-notch, and the issues they’re looking at are huge. The study is very, very important, if done correctly.
But Rothstein’s general conclusion about the initial MET report is that the results “do not support the conclusions drawn from them.” Very early in the review, the following assertion also jumps off the page: “there are troubling indications that the Project’s conclusions were predetermined.”
On first read, it might sound like Rothstein is saying that the MET researchers cooked the books. He isn’t. In the very next sentence, he points out that the MET project has two stated premises guiding its work: that, whenever feasible, teacher evaluations should be based “to a significant extent” on student test score gains; and that other components of evaluations (such as observations), in order to be considered valid, must be correlated with test score gains. (For the record, there is also a third premise – that evaluations should include feedback on teacher practice to support growth and development – but it is not particularly relevant to his review.)
So, by “predetermined,” Rothstein is saying that the MET team’s acceptance of the two premises colors the methods they choose and, more importantly, how they interpret and present their results. That is, since the project does not ask all the right questions, it cannot provide all of the most useful answers in the most useful manner. This issue is evident throughout Rothstein’s entire critique, and I will return to it later. But first, a summary of the MET report’s three most important findings and Rothstein’s analysis of each (I don’t review every one of his points):
MET finding: In every grade and subject, a teacher’s past track record of value-added is among the strongest predictors of their students’ achievement gains in other classes and academic years.
This was perhaps the major conclusion of the preliminary report. The implication is that value-added scores are good signals of future teacher effects, and it therefore makes sense to evaluate teachers using these estimates. In response, Rothstein points out that one cannot assess whether value-added is “among the strongest predictors” of teacher effects without comparing it with a broad array of alternative predictors. Although the final MET report will include several additional “competitors” (including scoring of live and videotaped classroom instruction), this initial report compares only two: value-added scores and the surveys of students’ perception of teachers’ skills and performance — not the strongest basis to call something “among the best.” (Side note: One has to wonder whether it is reasonable to expect that any alternative measure would ever predict value-added better than value-added itself, and what it really proves if none does.)
Moreover, notes Rothstein, while it’s true that past value-added is a better predictor than are student perceptions, the actual predictive power of the former is quite low. Because the manner in which the data are presented makes this difficult to assess, Rothstein makes his own calculations (see his appendix). I’ll spare you the technical details, but he finds that the explanatory power is not impressive, and the correlations it implies are actually very modest. Even in math (which was more stable than reading), a teacher with a value-added score at the 25th percentile in one year (less effective than 75 percent of other teachers) is just as likely to be above average the next year as she is to be below average. And there is only a one-in-three chance that she will score that far below average again in year two. This is the old value-added story — somewhat low stability, a lot of error (though this error likely decreases with more years of data). The report claims that this volatility is not so high as to preclude the utility of value-added in evaluations, and indeed uses the results as evidence of value-added’s potential; Rothstein is not nearly so sure.
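To see how the “just as likely to be above average as below” point follows from modest correlations, here is a small, purely illustrative sketch. It assumes standardized, bivariate-normal value-added scores across two years; the correlation values are hypothetical, not MET’s or Rothstein’s estimates.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

Z_25TH = -0.6745  # z-score at the 25th percentile of a standard normal

def p_above_average_next_year(r, z1=Z_25TH):
    """P(year-2 score above the median | year-1 z-score z1), assuming
    standardized bivariate-normal scores with year-to-year correlation r."""
    cond_mean = r * z1                    # E[z2 | z1]
    cond_sd = math.sqrt(1 - r ** 2)       # SD[z2 | z1]
    return 1 - norm_cdf((0 - cond_mean) / cond_sd)

# The weaker the year-to-year correlation, the closer a 25th-percentile
# teacher's odds of landing above average next year get to 50/50.
for r in (0.0, 0.2, 0.4, 0.6):
    print(f"r = {r:.1f}: P(above average in year 2) = "
          f"{p_above_average_next_year(r):.2f}")
```

With a correlation near zero the probability is exactly one half; even at moderate hypothetical correlations it stays well above one third, which is the sense in which low stability makes a single year’s score a weak signal.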
MET finding: Teachers with high value-added on state tests tend to promote conceptual understanding as well.
This finding is about whether value-added scores differ between tests (see here and here for prior work). It is based on a comparison of two different value-added scores for teachers: one derived from the regular state assessment (which varies by participating district) and the other from an alternative assessment that is specifically designed to measure students’ depth of higher-order conceptual understanding in each subject. The MET results, as presented, are meant to imply that teachers whose students show gains on the state test also produce gains on the test of conceptual understanding—that teachers are not just “teaching to the test.”
Rothstein, on the other hand, concludes that these correlations are also weak, so much so that “it casts serious doubt on the entire value-added enterprise.” For example, more than 20 percent of the teachers in the bottom quartile (lowest 25 percent) on the state math test are in the top two quartiles on the alternative assessment. And it’s even worse for the reading test. Rothstein characterizes these results as “only slightly better than coin tosses.” So, while the MET finding — that high value-added teachers “tend to” promote conceptual understanding—is technically true, the modest size of these correlations makes this characterization somewhat misleading. According to Rothstein, critically so.
MET finding: Teacher performance—in terms of both value-added and student perceptions—is stable across different classes taught by the same teacher.
This is supposed to represent evidence that estimates of a given teacher’s effects are fairly consistent, even with different groups of students. But, as Rothstein notes, the stability of value-added and student perceptions across classes is not necessarily due to the teacher. There may be other stable but irrelevant factors that are responsible for the consistency, but remain unobserved. For example, high scores on surveys querying a teacher’s ability to control students’ behavior may be largely a result of non-random classroom assignment (some teachers are deliberately assigned more or less difficult students), rather than actual classroom management skills. Since the initial MET report makes no attempt to adjust methods (especially the survey questions) to see if the stability is truly a teacher effect, the results, says Rothstein, must be considered inconclusive (the non-random assignment issue also applies to most of the report's other findings on value-added and student surveys).
The final MET report will, however, directly address this issue in its examination of how value-added estimates change when students are randomly assigned to classes (also see here for previous work). In the meantime, Rothstein points out that the MET researchers could have checked their results for evidence of bias but did not, and in ignoring non-random assignment, they seem to be anticipating the results of the unfinished final study.
In addition to these points, Rothstein raises a bunch of important general problems with the study. Many of them are well-known, and they mostly pertain to the presentation of results, but a few are worth repeating:
- Because it contains no literature review, the MET report largely ignores many of the serious concerns about value-added established by prior research – concerns that may not be well-known to the average reader (for the record, there are citations in the study, but no formal review of prior research).
- The value-added model that the MET project employs, while common in the literature, is not designed to address how the distribution of teacher effects varies between high- and low-performing classrooms (e.g., teachers of ELL classes are assumed to be of the same average effectiveness as teachers of gifted/talented classes). The choice of models can have substantial impacts on results.
- Technical point (important, but I’ll keep it short here, and you can read the review if you want more detail): the report’s methods overstate the size of correlations by focusing only on that portion of performance variation that is “explained” by who the teacher is (even though most of the total variation is not explained).
- Perhaps most importantly, MET operates in a no-stakes context, but its findings—both preliminary and final—will be used to recommend high-stakes policies. There is no way to know how the introduction of these stakes will affect results (this goes for both value-added and the student surveys).
Pretty brutal, but not necessarily as perplexing as it seems. With a few exceptions, most of Rothstein’s criticisms pertain to how the results are presented and the conclusions drawn from them, rather than to actual methods. This is telling, and it brings us back to the two premises (out of three) that guide the MET project—that value-added measures should be included in evaluations, and that other measures should only be included if they are predictive of students’ test score growth.
You may also notice that both premises are among the most contentious issues in education policy today. They are less assumptions than empirical questions, which might be subject to testing (especially the first one). Now, in a sense, the MET project addresses many of the big issues surrounding these questions – e.g., by checking how value-added varies across classrooms, tests, and years. And the researchers are very clear about acknowledging the limitations of value-added, but they still think it needs to be, whenever possible, a part of multi-measure evaluations. In other words, they’re not asking whether value-added should be used, only how. Rothstein, on the other hand, is still asking the former question (and he is most certainly not alone).
Moreover, the two premises represent a tautology—student test score growth is the most important measure, and we have to choose other teacher evaluation measures based on their correlation with student test score growth because student test score growth is the most important measure… This point, by the way, has already been made about the Gates study, as well as about seniority-based layoffs and about test-based policies in general.
There is tension inherent in a major empirical research project being guided by circular assumptions that are the very questions the research might be trying to answer. Addressing these questions would be very difficult, and doing so might not change the technical methods of MET, but it would almost certainly influence how the researchers interpret and present their results. This, I think, goes a long way towards explaining how Rothstein could draw opposite conclusions from the study based on the same set of results.
For example, some of Rothstein’s most important arguments pertain to the size of the correlations (e.g., value-added between years and classrooms). He finds many of these correlations to be perilously low, while the MET report concludes that they are high enough to serve as evidence of value-added’s utility. This may seem exasperating. But think about it: If your premise, like the MET project’s, is that value-added is the most essential component of an evaluation—so much so that all other measures must be correlated with it in order to be included — your bar for precision may not be as high as that of someone who is still skeptical. You may be more likely to tolerate mistakes, even a lot of mistakes, as collateral damage. And you will be less concerned about asking whether evaluations should target short-term testing gains versus some other outcome.
For Rothstein and many others, the MET premises are still unanswered questions. Assuming them away leaves little room for the possibility that performance measures that are not correlated with value-added might be transmitting crucial information about teaching quality, or that there is a disconnect between good teaching and testing gains. The proper approach, as Rothstein notes, is not to ask whether all these measures correlate with each other or over time or across classrooms, but whether they lead to various types of better student outcomes in a high-stakes, real life context. In other words, testing policies, not measures.
The MET project's final product will provide huge insights into teacher quality and practice no matter what, but how it is received and interpreted by policymakers and the public is also critical. We can only hope that, in making these assumptions, the project is not compromising the usefulness and impact of what is potentially one of the most important educational research projects in a long time.
Thanks for the detailed review and response to Rothstein's work. Looks like this is one I'll dig into myself. There are a couple of other issues that aren't raised here, issues that, as a classroom teacher, I find essential to understanding what actually happens in schools.
First, there's a necessary assumption that no significant changes occurred from year to year if you intend to compare those data from year to year. But in a given year, how many changes might there be in my school or classroom? New curriculum, technology, collaborative partners, class sizes, schedules, administrative/discipline policies... and based on studies of each item in that list, I can argue that separately, each of those changes might produce a corresponding effect on test scores. Can researchers identify all the relevant factors? If they could identify those factors, would they have enough data to do anything about it? If they had the data, what research would guide their decisions about the proper formula to weight each factor? I anticipate that researchers would argue those factors affect all teachers, and those who perform best are still worth identifying so that we see who makes the most of those situations. That argument assumes that each relevant in-school factor affects all teachers equally - a faulty assumption that could only be made by someone who hasn't taught in many schools or for very long. See how ridiculous this becomes, from a classroom perspective?
Furthermore, individual teacher effects cannot reasonably be isolated on reading tests. My students read outside of my class, and receive relevant instruction outside of my class; they read in every other class, and have teachers with varying abilities helping or hindering their progress. You would need many years of randomized student groups (holding all other factors constant - ha! See above) to make up for the fact that my students have other means of learning reading. That is probably less true in math, especially in higher grades, and I would speculate that is the reason that reading scores are less stable. (Less true because while students read in all classes, they are not doing arithmetic or algebra in all classes).
I agree with David -- great piece. But we have a problem: the MET study will get tons of coverage; analyses that criticize it won't. This is a major problem that can only be resolved by promotion -- and it would appear that Gates will win that round handily.
Regarding the theme of pre-determined conclusions, if you ever worked with Mr. Gates, or you ever worked at Microsoft, you may have detected not so much a sense of pre-determined conclusions but an emphatic confidence of being right in the moment.
As a young man, Bill Gates was never afraid to go to the mat for what he believed -- and to shred anyone who challenged him. This produced some very good results in the software business, a survival of the fittest in the idea fitness landscape.
I never knew Mr. Gates, however, to stick dogmatically or ideologically to many points. He switched to new ideas when old ones failed. He was just supremely confident all the time.
The Gates Foundation's work in education has been supremely confident all the time. It has never been tentative about anything. At the same time, it has yet to be right about anything.
I would submit that as the organization does more research, we see more and more of this surety -- until we don't. That is, Gates can continue to study the problem of education indefinitely in ways other organizations cannot. And it would be foolish to think that the style of research they conduct will ever change. They're just never going to do that humble, self-conscious, reverence-for-the-past stuff that is the stock-in-trade of traditional academic research. It's not that they don't believe in it; it's just not their style.
The Gates Foundation doesn't invest in things it isn't sure about. It's not a think tank or a research organization. Research is a strategy for policy development, which is a strategy for problem-solving. Policies are right until proven wrong, as are the problems they are designed to solve. Small schools were a big success until they weren't.
This may seem inappropriate, but if you've worked in the culture of technology -- especially with Microsoft in the pre-1995 years -- you would recognize this not so much as a dubious stance on educational research but as a competitive approach to educational problem-solving -- with a certain kind of research as a strategy, not as an end in itself.
Gates may eventually "get it right". But if they do, it will be the same way Microsoft "got it right" -- by taking so many opportunities to get it wrong.
Thanks. I'm curious why you stressed "more than 20 percent of the teachers in the bottom quartile (lowest 25 percent) on the state math test are in the top two quartiles on the alternative assessment," when the more important issue, it seems to me, is a couple of sentences later. Rothstein wrote "More than 40% of those whose actually available state exam scores place them in the bottom quarter are in the top half on the alternative assessment."
As I read it, districts that use those tests would have to risk the sacrifice of two effective teachers to get rid of three allegedly ineffective ones.
Also, the Appendix says "Value-added for state ELA scores is the least predictable of the measures, with an R-squared of 0.167. A teacher with predicted value-added for state ELA scores at the 25th percentile is actually more likely to fall in the top half of the distribution than in the bottom quarter!"
Does that mean that the VAM is less likely than a coin flip to be accurate?
The finding you cite is the same as mine, but those estimates are corrected for measurement error (from the testing instrument). I chose the uncorrected figures because they represent the "best case scenario," and because they make the case just fine on their own. I might have mentioned that the disattenuated estimates are even worse, but I had to pick and choose (the post is already too long!).
As for the coin flip situation, it depends on how you define "accurate." In the coin flip analogy, accuracy is defined in terms of stability between years, which is probably the relevant perspective when we're talking about using VA in high-stakes decisions (there's no point in punishing/rewarding teachers based on end-of-year VA scores if they don't predict performance the next year). On the other hand, keep in mind that the MET project is only using two years of data, and stability increases with sample size (a point that would not matter if these measures are used for probationary teachers, of course).
Perhaps more importantly, remember that the stability between the 25th percentile in one year, and what happens the next year, is one particular slice of the distribution. For instance, the chance of a 10th percentile teacher ending up in the top half the next year would be lower than 50/50, and that is just as much a measure of accuracy as any other comparison, right?
The coin flip analogy is catchy, and it is a vivid demonstration of the instability (though interpretations of the size of these correlations seem to vary widely). But we should be careful about using it to characterize the accuracy of VA in general. The instability varies depending on what you choose as the "starting" and "ending" points, as well as with sample size.
So it would be accurate to say that districts adopting that method would be risking the invalid firing of two effective teachers in order to fire three ineffective ones?
That, in itself, should be a show-stopper, at least if VAMs are solely in the hands of management.
I also appreciate any reluctance to speculate on legal matters, but I'd think that would call into question the legal basis of many or most terminations based significantly on VAMs in the hands of management. It is a fundamental principle of American jurisprudence that the government can't take a person's property rights without evidence that applies to that person's case. I wouldn't think that a 60% chance that this evidence applies to a teacher would meet judicial scrutiny in many states.
One has to wonder how you leave comments so frequently and on so many blogs, and respond so quickly to replies, while also blogging yourself and having a day job. Doesn't seem fair to the rest of us.
The legal matters are way beyond my purview. I would check out Bruce Baker's work as a starting point.
And my response to the "firing three validly requires firing two invalidly" is the same as my response to the coin toss: in both cases, you are choosing a specific point in the distribution, and extrapolating it to value-added in general. You could only do so if a district or state adopted a policy that specifically fired teachers based *exclusively* on value-added scores that fell at or below the 25th percentile (and that won't happen [let's hope!]).
On a point unrelated to your comment, while I'm all about finding ways to express complex information in a vivid, understandable manner, and while I'm still highly skeptical about the role of value-added in evaluations and other policies (especially HOW it is used, and how quickly we are adopting it), I know you agree that everyone needs to be (or stay) more careful about how we present our case.
Call me naive, but I still believe that most (but not all) of the people pushing VA are, at least to some extent, reasonable and receptive to discussion. We make our case more effectively by staying cool and presenting our evidence fairly. There is plenty, and it stands on its own.
Just my opinion...
Thanks again for your comment,
Just to be clear, John: I'm holding you up as an example of someone who WAS careful with words (that's what I meant by "unrelated to your post;" didn't copyedit). You found a finding, and instead of using it in arguments, you asked whether it would be fair. Others don't always do the same (and I include myself in that, by the way), and that causes a lot of unnecessary polarization. It's an old and mundane point, but it always bears repeating, especially when the subject is as complicated as value-added.
Matthew (or others),
I read your post as well as Jesse Rothstein's piece. I was especially interested in the idea that there may be a difference between "teacher quality" depending on what kind of test is used to measure the outcomes. Specifically, Rothstein argues that the MET study shows that teachers who are good at improving scores on the regular state assessments are not necessarily good at improving scores on tests that measure higher-order thinking skills, despite the study authors' claims to the contrary.
Can you point me toward other research in this general area, namely, that the kind of test has some impact on a teacher's value-added scores?