Evaluating Individual Teachers Won't Solve Systemic Educational Problems

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

Our guest author today is David K. Cohen, John Dewey Collegiate Professor of Education and professor of public policy at the University of Michigan, and a member of the Shanker Institute’s board of directors.  

What are we to make of recent articles (here and here) extolling IMPACT, Washington DC’s fledgling teacher evaluation system, for how many "ineffective" teachers have been identified and fired, and how many "highly effective" teachers have been rewarded? It’s hard to say.

In a forthcoming book, Teaching and Its Predicaments (Harvard University Press, August 2011), I argue that fragmented school governance in the U.S., coupled with the lack of coherent educational infrastructure, makes it difficult either to broadly improve teaching and learning or to have valid knowledge of the extent of improvement. Merriam-Webster defines "infrastructure" as "the underlying foundation or basic framework (as of a system or organization)." The term is commonly used to refer to the roads, rail systems, and other frameworks that facilitate the movement of things and people, or to the physical and electronic mechanisms that enable voice and video communication. But social systems also can have such "underlying foundations or basic frameworks." For school systems around the world, the infrastructure commonly includes student curricula or curriculum frameworks, exams to assess students’ learning of those curricula, instruction that centers on teaching the curricula, and teacher education that aims to help prospective teachers learn how to teach them. The U.S. has had no such common and unifying infrastructure for schools, owing in part to fragmented government (including local control) and traditions of weak state guidance about curriculum and teacher education.

Like many recent reform efforts that focus on teacher performance and accountability, IMPACT does not attempt to build infrastructure, but rather assumes that weak individual teachers are the problem. There are some weak individual teachers, but the chief problem has been a non-system that offers no guidance or support for strong teaching and learning, precisely because there has been no infrastructure. IMPACT frames reform as a matter of solving individual problems when the weakness is systemic.

Teacher Evaluations: Don't Begin Assembly Until You Have All The Parts

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

Over the past year or two, roughly 15-20 states have passed or are considering legislation calling for the overhaul of teacher evaluation. The central feature of most of these laws is a mandate to incorporate measures of student test score growth, in most cases specifying a minimum percentage of a teacher’s total score that must consist of these estimates.

There’s some variation across states, but the percentages are all quite high. For example, Florida and Colorado both require that at least 50 percent of an evaluation must be based on growth measures, while New York mandates a minimum of 40 percent. These laws also vary in terms of other specifics, such as the degree to which the growth measure proportion must be based on state tests (rather than other assessments), how much flexibility districts have in designing their systems, and how teachers in untested grades and subjects are evaluated. But they all share that defining feature of mandating a minimum proportion – or “weight” – that must be attached to a test-based estimate of teacher effects (at least for those teachers in tested grades and subjects).

Unfortunately, this is typical of the misguided manner in which many lawmakers (and the advocates advising them) have approached the difficult task of overhauling teacher evaluation systems. For instance, I have previously discussed the failure of most systems to account for random error. The weighting issue is another important example, and it violates a basic rule of designing performance assessment systems: You should exercise extreme caution in pre-deciding the importance of any one component until you know what the other components will be. Put simply, you should have all the parts in front of you before you begin the assembly process.
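To see why this matters, consider a rough illustration. The short Python sketch below uses entirely hypothetical score distributions, not data from any real district: value-added estimates that spread widely across a 100-point scale, and observation ratings that cluster near the top, as tends to happen when most teachers are rated proficient. Even though each component carries a nominal weight of 50 percent, the high-variance component ends up driving nearly all of the variation in teachers' final scores.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 10_000

# Two hypothetical components, both rescaled to a 0-100 point range.
# Value-added estimates spread widely...
value_added = rng.normal(loc=50, scale=20, size=n_teachers).clip(0, 100)
# ...while observation ratings cluster tightly near the top of the scale.
observation = rng.normal(loc=85, scale=5, size=n_teachers).clip(0, 100)

# Nominal weights fixed in advance: 50 percent each.
composite = 0.5 * value_added + 0.5 * observation

# "Effective" weight: how much of the variation in final scores
# each component actually accounts for.
for name, component in [("value-added", value_added), ("observation", observation)]:
    r = np.corrcoef(component, composite)[0, 1]
    print(f"{name:12s} nominal weight: 50%  share of composite variance: {r**2:.0%}")
```

In this toy setup, the growth measure accounts for the overwhelming majority of the differences in final ratings despite its 50 percent label. The point is not the particular numbers, which are invented, but that a weight written into statute says little about a component's actual influence until you know how the other components are scored and how much they vary.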

The Faulty Logic Of Using Student Surveys In Accountability Systems

In a recent post, I discussed the questionable value of student survey data to inform teacher evaluation models. Not only is there little research support for such surveys, but the very framing of the idea often reflects faulty reasoning.

A quote from a recent Educators 4 Excellence white paper helps to illustrate the point:

For a system that aims to serve students, young people’s interests are far too often pushed aside. Students’ voices should be at the forefront of the education debate today, especially when it comes to determining the effectiveness of their teacher.

This sounds noble… but, seriously, why should students’ opinions be "at the forefront of the education debate"? Are students’ needs better served when we ask students directly what they need? The research on this question is fairly clear: no, not really.

Student Surveys of Teachers: Be Careful What You Ask For

Many believe that current teacher evaluation systems are a formality, a bureaucratic process that tells us little about how to improve classroom instruction. In New York, for example, 40 percent of all teacher evaluations must consist of student achievement data by 2013. Additionally, some are proposing the inclusion of alternative measures, such as “independent outside observations” and “student surveys.” Here, I focus on the latter.

Educators for Excellence (E4E), an “organization of education professionals who seek to provide an independent voice for educators in the debate surrounding education reform”, recently released a teacher evaluation white paper proposing that student surveys account for 10 percent of teacher evaluations.

The paper quotes a teacher saying: “for a system that aims to serve students, young people’s interests are far too often pushed aside. Students’ voices should be at the forefront of the education debate today, especially when it comes to determining the effectiveness of their teacher." The authors argue that “the presence of effective teachers […] can be determined, in part, by the perceptions of the students that interact with them." Also, “student surveys offer teachers immediate and qualitative feedback, recognize the importance of student voice […]". In rare cases, the paper concedes, “students could skew their responses to retaliate against teachers or give high marks to teachers who they like, regardless of whether those teachers are helping them learn."

But student evaluations are not new.

The Ethics of Testing Children Solely To Evaluate Adults

The recent New York Times article, “Tests for Pupils, but the Grades Go to Teachers,” alerts us to an emerging paradox in education – the development and use of standardized student testing solely as a means to evaluate teachers, not students. “We are not focusing on teaching and learning anymore; we are focusing on collecting data,” says one mother quoted in the article. Now, let’s see: collecting data on minors that is not explicitly for their benefit – does this ring a bell?

In the world of social/behavioral science research, such an enterprise – collecting data on people, especially on minors – would inevitably require approval from an Institutional Review Board (IRB). For those not familiar, an IRB is a committee that oversees research involving people and is responsible for ensuring that studies are designed ethically. Even for a seemingly harmless interview on political attitudes, or observation of a group studying in a public library, the researcher would almost certainly be required to go through a series of steps to safeguard participants and ensure that the norms governing ethical research are observed.

Very succinctly, IRBs’ mission is to see that (1) the risk-benefit ratio of conducting the research is favorable; (2) any suffering or distress that participants may experience during or after the study is understood, minimized, and addressed; and (3) research participants agree to take part freely and knowingly – usually, subjects are asked to sign an informed consent form, which includes a description of the study’s risks and benefits, a discussion of how confidentiality will be guaranteed, a statement on the voluntary nature of involvement, and a clarification that refusal or withdrawal at any time will involve no penalty or loss of benefits. When the research involves minors, parental consent and sometimes child assent are needed.

In short, IRB procedures exist to protect people. To my knowledge, student evaluation procedures and standardized testing are exempt from this sort of scrutiny. So the real question is: Should they be? Perhaps not.

Value-Added In Teacher Evaluations: Built To Fail

With all the controversy and acrimonious debate surrounding the use of value-added models in teacher evaluation, few seem to be paying much attention to the implementation details in those states and districts that are already moving ahead. This is unfortunate, because most new evaluation systems that use value-added estimates are literally being designed to fail.

Much of the criticism of value-added (VA) focuses on systematic bias, such as that stemming from non-random classroom assignment (also here). But the truth is that most of the imprecision of value-added estimates stems from random error. Months ago, I lamented the fact that most states and districts incorporating value-added estimates into their teacher evaluations were not making any effort to account for this error. Everyone knows that there is a great deal of imprecision in value-added ratings, but few policymakers seem to realize that there are relatively easy ways to mitigate the problem.

This is the height of foolishness. Policy is details. The manner in which one uses value-added estimates is just as important as – perhaps even more important than – the properties of the models themselves. By ignoring error when incorporating these estimates into evaluation systems, policymakers virtually guarantee that most teachers will receive incorrect ratings. Let me explain.
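Here is a small, purely illustrative simulation of what that means in practice. The numbers are hypothetical, chosen only to reflect the well-documented fact that single-year value-added estimates carry sampling errors roughly as large as the teacher effects being measured, and the sketch is mine, not any state's actual procedure. It compares labeling teachers "ineffective" based on point estimates alone with flagging them only when the estimate falls below average by about two standard errors.

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers = 10_000

# Hypothetical "true" teacher effects (in student test score standard deviations),
# with sampling error of roughly the same magnitude -- a rough stand-in for the
# imprecision of single-year value-added estimates.
true_effect = rng.normal(0.0, 0.10, n_teachers)
sampling_error_sd = 0.12
estimate = true_effect + rng.normal(0.0, sampling_error_sd, n_teachers)

# Approach 1: label the bottom quartile of point estimates "ineffective",
# ignoring error entirely.
naive_flag = estimate < np.quantile(estimate, 0.25)

# Approach 2: flag a teacher only if the estimate is below average by
# about two standard errors.
cautious_flag = estimate < -2 * sampling_error_sd

# Compare each approach against the (unobservable) truth: membership in the
# bottom quartile of true effects.
truly_low = true_effect < np.quantile(true_effect, 0.25)

for label, flag in [("point estimates only", naive_flag),
                    ("with error margins", cautious_flag)]:
    n_flagged = int(flag.sum())
    n_wrong = int((flag & ~truly_low).sum())
    print(f"{label}: {n_flagged} flagged, "
          f"{n_wrong / n_flagged:.0%} not actually in the bottom quartile")
```

In this toy example, accounting for error flags fewer teachers but mislabels a much smaller share of them. The particular rates depend entirely on the assumed numbers; the point is that how estimates are used, including whether error is acknowledged at all, does as much to determine the accuracy of ratings as the models that produce them.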

The New Layoff Formula Project

In a previous post about seniority-based layoffs, I argued that, although seniority may not be the optimal primary factor upon which to base layoff decisions, we do not yet have an acceptable alternative in most places—one that would permit the “quality-based” layoffs that we often hear mentioned. In short, I am completely receptive to other layoff criteria, but insofar as new teacher evaluation systems are still in the design phase in most places, states and districts might want to think twice about chucking a longstanding criterion that has (at least some) evidence of validity before they have a workable replacement in hand.

The New Teacher Project (TNTP) recently released a short policy brief outlining a proposed alternative. Let’s take a quick look at what they have to offer.

In Performance Evaluations, Subjectivity Is Not Random

Employment policies associated with unions – e.g., seniority, salary schedules – are frequently criticized for not placing the highest premium on performance. Detractors also argue that such policies, originally designed to protect workers against discrimination (by gender, race, etc.), are no longer necessary now that federal laws are in place. Accordingly, those seeking to limit collective bargaining among teachers have proposed that current policies be replaced by “performance-based” evaluations – or at least a system that would make it easier to reward and punish based on performance.

Be careful, argues Samuel A. Culbert in a recent New York Times article, “Why Your Boss Is Wrong About You.” Culbert warns that there are serious risks to deregulating the employment relationship and leaving it even partially in the hands of the employer and his/her performance review:

Now, maybe your boss is all-knowing. But I’ve never seen one that was. In a self-interested world, where imperfect people are judging other imperfect people, anybody reviewing somebody else’s performance ... is subjective.

This viewpoint may sound obvious, but social science research reminds us that the whims of subjective human judgment are not random. The inefficiencies that Culbert mentions are inevitable, but so is the fact that bias tends to operate in a manner that disproportionately affects workers from traditionally disadvantaged social groups, such as women and African Americans. What’s worse, it’s just as likely to occur within groups as between them, and we often do it without realizing it.

Value-Added: Theory Versus Practice

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

About two weeks ago, the National Education Policy Center (NEPC) released a review of last year’s Los Angeles Times (LAT) value-added analysis – with a specific focus on the technical report upon which the paper’s articles were based (done by RAND’s Richard Buddin). In line with prior research, the critique’s authors – Derek Briggs and Ben Domingue – redid the LAT analysis and found that, while teachers’ scores vary widely, the LAT estimates would differ under different model specifications, are error-prone, and conceal systematic bias from non-random classroom assignment. They were also, for reasons yet unknown, unable to replicate the results.

Since then, the Times has issued two responses. The first was a quickly-published article, which claimed (including in the headline) that the LAT results were confirmed by Briggs/Domingue – even though the review reached the opposite conclusions. The basis for this claim, according to the piece, was that both analyses showed wide variation in teachers’ effects on test scores (see NEPC’s reply to this article). Then, a couple of days ago, there was another response, this time on the Times’ ombudsman-style blog. This piece quotes the paper’s Assistant Managing Editor, David Lauter, who stands by the paper’s findings and the earlier article, arguing that the biggest question is:

...whether teachers have a significant impact on what their students learn or whether student achievement is all about ... factors outside of teachers’ control. ... The Colorado study comes down on our side of that debate. ... For parents and others concerned about this issue, that’s the most significant finding: the quality of teachers matters.

Saying “teachers matter” is roughly equivalent to saying that teacher effects vary widely: the more teachers vary in their effectiveness, controlling for other relevant factors, the more they can be said to “matter” as a factor explaining student outcomes. Since both analyses found such variation, the Times claims that the NEPC review confirms their “most significant finding."

The review’s authors had a much different interpretation (see their second reply). All of this back and forth may seem frustrating, and it has mostly focused on somewhat technical issues, such as model selection, sample comparability, and research protocol (with some ethical charges thrown in for good measure). These are essential matters, but there is also a simpler reason for the divergent interpretations, one that is critically important and arises constantly in our debates about value-added.

Premises, Presentation And Predetermination In The Gates MET Study

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

The National Education Policy Center today released a scathing review of last month’s preliminary report from the Gates Foundation-funded Measures of Effective Teaching (MET) project. The critique was written by Jesse Rothstein, a highly-respected Berkeley economist and author of an elegant and oft-cited paper demonstrating how non-random classroom assignment biases value-added estimates (also see the follow-up analysis).

Very quickly on the project: Over two school years (this year and last), MET researchers, working in six large districts—Charlotte-Mecklenburg, Dallas, Denver, Hillsborough County (FL), Memphis, and New York City—have been gathering an unprecedented collection of data on teachers and students in grades 4-8. Using a variety of assessments, videotapes of classroom instruction, and surveys (student surveys are featured in the preliminary report), the project is attempting to address some of the heretofore under-addressed issues in the measurement of teacher quality (especially non-random classroom assignment and how different classroom practices lead to different outcomes, neither of which is addressed in this preliminary report). The end goal is to use the information to guide the creation of more effective teacher evaluation systems that incorporate high-quality multiple measures.

Despite my disagreements with some of the Gates Foundation’s core views about school reform, I think they deserve a lot of credit for this project. It is heavily resourced, the research team is top-notch, and the issues they’re looking at are huge. Done correctly, this is a very, very important study.

But Rothstein’s general conclusion about the initial MET report is that the results “do not support the conclusions drawn from them." Very early in the review, the following assertion also jumps off the page: "there are troubling indications that the Project’s conclusions were predetermined."