Premises, Presentation And Predetermination In The Gates MET Study
** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post
The National Education Policy Center today released a scathing review of last month’s preliminary report from the Gates Foundation-funded Measures of Effective Teaching (MET) project. The critique was written by Jesse Rothstein, a highly respected Berkeley economist and author of an elegant and oft-cited paper demonstrating how non-random classroom assignment biases value-added estimates (also see the follow-up analysis).
Very quickly on the project: Over two school years (this year and last), MET researchers, working in six large districts—Charlotte-Mecklenburg, Dallas, Denver, Hillsborough County (FL), Memphis, and New York City—have been gathering an unprecedented collection of data on teachers and students in grades 4-8. Using a variety of assessments, videotapes of classroom instruction, and surveys (student surveys are featured in the preliminary report), the project is attempting to address some of the heretofore under-addressed issues in the measurement of teacher quality (especially non-random classroom assignment and how different classroom practices lead to different outcomes, neither of which is part of this preliminary report). The end goal is to use the information to guide the creation of more effective teacher evaluation systems that incorporate high-quality multiple measures.
Despite my disagreements with some of the Gates Foundation’s core views about school reform, I think that they deserve a lot of credit for this project. It is heavily resourced, the research team is top-notch, and the issues they’re looking at are huge. The study is very, very important — if done correctly.
But Rothstein’s general conclusion about the initial MET report is that the results “do not support the conclusions drawn from them.” Very early in the review, the following assertion also jumps off the page: “there are troubling indications that the Project’s conclusions were predetermined.”
On first read, it might sound like Rothstein is saying that the MET researchers cooked the books. He isn’t. In the very next sentence, he points out that the MET project has two stated premises guiding its work — that, whenever feasible, teacher evaluations should be based “to a significant extent” on student test score gains; and that other components of evaluations (such as observations), in order to be considered valid, must be correlated with test score gains. (For the record, there is also a third premise – that evaluations should include feedback on teacher practice to support growth and development – but it is not particularly relevant to his review.)
So, by “predetermined,” Rothstein is saying that the MET team’s acceptance of the two premises colors the methods they choose and, more importantly, how they interpret and present their results. That is, since the project does not ask all the right questions, it cannot provide all of the most useful answers in the most useful manner. This issue is evident throughout Rothstein’s entire critique, and I will return to it later. But first, a summary of the MET report’s three most important findings and Rothstein’s analysis of each (I don’t review every one of his points):
MET finding: In every grade and subject, a teacher’s past track record of value-added is among the strongest predictors of their students’ achievement gains in other classes and academic years.
This was perhaps the major conclusion of the preliminary report. The implication is that value-added scores are good signals of future teacher effects, and it therefore makes sense to evaluate teachers using these estimates. In response, Rothstein points out that one cannot assess whether value-added is “among the strongest predictors” of teacher effects without comparing it with a broad array of alternative predictors. Although the final MET report will include several additional “competitors” (including scoring of live and videotaped classroom instruction), this initial report compares only two: value-added scores and the surveys of students’ perceptions of teachers’ skills and performance — not the strongest basis for calling something “among the best.” (Side note: one has to wonder whether it is reasonable to expect that any alternative measure would ever predict value-added better than value-added itself, and what it really proves if none does.)
Moreover, notes Rothstein, while it’s true that past value-added is a better predictor than student perceptions, the actual predictive power of the former is quite low. Because the manner in which the data are presented makes this difficult to assess, Rothstein makes his own calculations (see his appendix). I’ll spare you the technical details, but he finds that the explanatory power is not impressive, and the correlations it implies are actually very modest. Even in math (which was more stable than reading), a teacher with a value-added score at the 25th percentile in one year (less effective than 75 percent of other teachers) is just as likely to be above average the next year as she is to be below average. And there is only a one-in-three chance that she will be that far below average in year two compared with year one. This is the old value-added story — somewhat low stability, a lot of error (though the error likely decreases with more years of data). The report claims that this volatility is not so high as to preclude the use of value-added in evaluations, and indeed uses the results as evidence of value-added’s potential; Rothstein is not nearly so sure.
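(For readers who like to see the arithmetic, here is a quick back-of-the-envelope simulation of my own, not taken from the report or the review. It assumes, purely for illustration, that a teacher’s value-added scores in two consecutive years behave like correlated normal draws with a year-to-year correlation of about 0.2; the actual correlations implied by the report differ by subject and model.)

```python
# Back-of-the-envelope sketch (mine, not MET's or Rothstein's): what a modest
# year-to-year correlation implies for a teacher at the 25th percentile.
import numpy as np

rng = np.random.default_rng(0)
r = 0.2          # assumed year-to-year correlation, chosen purely for illustration
n = 1_000_000    # simulated teachers

year1 = rng.standard_normal(n)
year2 = r * year1 + np.sqrt(1 - r**2) * rng.standard_normal(n)

# Teachers who landed very close to the 25th percentile in year one.
near_25th = np.abs(year1 - np.quantile(year1, 0.25)) < 0.02

above_average = (year2[near_25th] > np.median(year2)).mean()
bottom_quartile_again = (year2[near_25th] <= np.quantile(year2, 0.25)).mean()
print(f"chance of being above average in year 2:     {above_average:.0%}")
print(f"chance of being at/below 25th pct in year 2: {bottom_quartile_again:.0%}")
```

With that assumed correlation, the simulation lands in the same ballpark as the figures above: the 25th-percentile teacher comes out nearly as likely to be above average as below it the following year, and falls that low again only about three times in ten.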
MET finding: Teachers with high value-added on state tests tend to promote conceptual understanding as well.
This finding is about whether value-added scores differ between tests (see here and here for prior work). It is based on a comparison of two different value-added scores for teachers: one derived from the regular state assessment (which varies by participating district) and the other from an alternative assessment that is specifically designed to measure students’ depth of higher-order conceptual understanding in each subject. The MET results, as presented, are meant to imply that teachers whose students show gains on the state test also produce gains on the test of conceptual understanding—that teachers are not just “teaching to the test.”
Rothstein, on the other hand, concludes that these correlations are also weak, so much so that, in his words, “it casts serious doubt on the entire value-added enterprise.” For example, more than 20 percent of the teachers in the bottom quartile (lowest 25 percent) on the state math test are in the top two quartiles on the alternative assessment. And it’s even worse for the reading test. Rothstein characterizes these results as “only slightly better than coin tosses.” So, while the MET finding — that high value-added teachers “tend to” promote conceptual understanding — is technically true, the modest size of these correlations makes this characterization somewhat misleading. According to Rothstein, critically so.
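(As before, a purely illustrative sketch of mine rather than anything in the report: it treats the two value-added scores as correlated normal draws and asks how often a bottom-quartile teacher on one test would land in the top half on the other, across a few hypothetical cross-test correlations.)

```python
# Illustrative only (my numbers, not the report's): how often teachers in the
# bottom quartile on one test land in the top half on another, for a few
# hypothetical correlations between the two value-added scores.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
score_a = rng.standard_normal(n)                       # value-added on test A
bottom_q = score_a <= np.quantile(score_a, 0.25)       # bottom quartile on test A

for r in (0.2, 0.4, 0.6):
    score_b = r * score_a + np.sqrt(1 - r**2) * rng.standard_normal(n)
    top_half = score_b > np.median(score_b)
    share = top_half[bottom_q].mean()
    print(f"assumed correlation {r}: {share:.0%} of bottom-quartile teachers are in the top half")
```

Even at the high end of those hypothetical correlations, roughly one in five bottom-quartile teachers shows up in the top half on the other test, which is the kind of disagreement Rothstein is pointing to.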
MET finding: Teacher performance—in terms of both value-added and student perceptions—is stable across different classes taught by the same teacher.
This is supposed to represent evidence that estimates of a given teacher’s effects are fairly consistent, even with different groups of students. But, as Rothstein notes, the stability of value-added and student perceptions across classes is not necessarily due to the teacher. There may be other stable but irrelevant factors that are responsible for the consistency, but remain unobserved. For example, high scores on surveys querying a teacher’s ability to control students’ behavior may be largely a result of non-random classroom assignment (some teachers are deliberately assigned more or less difficult students), rather than actual classroom management skills. Since the initial MET report makes no attempt to adjust methods (especially the survey questions) to see if the stability is truly a teacher effect, the results, says Rothstein, must be considered inconclusive (the non-random assignment issue also applies to most of the report's other findings on value-added and student surveys).
The final MET report will, however, directly address this issue in its examination of how value-added estimates change when students are randomly assigned to classes (also see here for previous work). In the meantime, Rothstein points out that the MET researchers could have checked their results for evidence of bias but did not, and in ignoring non-random assignment, they seem to be anticipating the results of the unfinished final study.
General Issues
In addition to these points, Rothstein raises a bunch of important general problems with the study. Many of them are well-known, and they mostly pertain to the presentation of results, but a few are worth repeating:
- Because it contains no literature review, the MET report largely ignores many of the serious concerns about value-added established in prior research – concerns that may not be well known to the average reader (for the record, the report does include citations, but no formal review of prior work).
- The value-added model that the MET project employs, while common in the literature, is not designed to address how the distribution of teacher effects varies between high- and low-performing classrooms (e.g., teachers of ELL classes are assumed to be, on average, as effective as teachers of gifted/talented classes). The choice of model can have substantial impacts on results.
- Technical point (important, but I’ll keep it short here, and you can read the review if you want more detail): the report’s methods overstate the size of correlations by focusing only on the portion of performance variation that is “explained” by who the teacher is, even though most of the total variation is not explained (see the short sketch just after this list for the intuition).
- Perhaps most importantly, MET operates in a no-stakes context, but its findings—both preliminary and final—will be used to recommend high-stakes policies. There is no way to know how the introduction of these stakes will affect results (this goes for both value-added and the student surveys).
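On the technical point about overstated correlations (third bullet above), the basic attenuation logic can be sketched in one line. Suppose, as a simplification of my own rather than the report’s actual model, that each observed measure is a stable teacher component plus independent noise:

```latex
% Attenuation sketch (my simplification, not MET's model):
% observed measure m_i = stable teacher component t_i + independent noise e_i.
m_i = t_i + \varepsilon_i,
\qquad
\operatorname{corr}(m_1, m_2)
  = \operatorname{corr}(t_1, t_2)\,\sqrt{R_1 R_2},
\qquad
R_i = \frac{\operatorname{Var}(t_i)}{\operatorname{Var}(t_i) + \operatorname{Var}(\varepsilon_i)} < 1 .
```

Because the reliabilities are below one, the correlation between the scores an evaluator would actually see is smaller than the correlation between the “explained” teacher components that the report emphasizes.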
Pretty brutal, but not necessarily as perplexing as it seems. With a few exceptions, most of Rothstein’s criticisms pertain to how the results are presented and the conclusions drawn from them, rather than to actual methods. This is telling, and it brings us back to the two premises (out of three) that guide the MET project—that value-added measures should be included in evaluations, and that other measures should only be included if they are predictive of students’ test score growth.
You may also notice that both premises are among the most contentious issues in education policy today. They are less assumptions than empirical questions, ones that might be subject to testing (especially the first). Now, in a sense, the MET project addresses many of the big issues surrounding these questions – e.g., by checking how value-added varies across classrooms, tests, and years. And the researchers are clear in acknowledging the limitations of value-added, but they still hold that it should, whenever possible, be part of multi-measure evaluations. In other words, they’re not asking whether value-added should be used, only how. Rothstein, on the other hand, is still asking the former question (and he is most certainly not alone).
Moreover, the two premises represent a tautology—student test score growth is the most important measure, and we have to choose other teacher evaluation measures based on their correlation with student test score growth because student test score growth is the most important measure… This point, by the way, has already been made about the Gates study, as well as about seniority-based layoffs and about test-based policies in general.
There is an inherent tension in a major empirical research project being guided by circular assumptions about the very questions the research might otherwise be trying to answer. Addressing those questions would be very difficult, and doing so might not change MET’s technical methods, but it would very likely influence how the researchers interpret and present their results. This, I think, goes a long way toward explaining how Rothstein could look at the same set of results and draw conclusions opposite to the study’s.
For example, some of Rothstein’s most important arguments pertain to the size of the correlations (e.g., of value-added between years and classrooms). He finds many of these correlations to be perilously low, while the MET report concludes that they are high enough to serve as evidence of value-added’s utility. This may seem exasperating. But think about it: if your premise, like the MET project’s, is that value-added is the most essential component of an evaluation—so much so that all other measures must be correlated with it in order to be included—your bar for precision may not be as high as that of someone who is still skeptical. You may be more likely to tolerate mistakes, even a lot of mistakes, as collateral damage. And you will be less concerned about asking whether evaluations should target short-term testing gains versus some other outcome.
For Rothstein and many others, the MET premises are still unanswered questions. Assuming them away leaves little room for the possibility that performance measures that are not correlated with value-added might be transmitting crucial information about teaching quality, or that there is a disconnect between good teaching and testing gains. The proper approach, as Rothstein notes, is not to ask whether all these measures correlate with each other, or over time, or across classrooms, but whether they lead to better student outcomes of various types in a high-stakes, real-life context. In other words, we should be testing policies, not measures.
The MET project's final product will provide huge insights into teacher quality and practice no matter what, but how it is received and interpreted by policymakers and the public is also critical. We can only hope that, in making these assumptions, the project is not compromising the usefulness and impact of what is potentially one of the most important educational research projects in a long time.