Teacher Evaluation

  • What Value-Added Research Does And Does Not Show

    Written on December 1, 2011

    Value-added and other types of growth models are probably the most controversial issue in education today. These methods, which use sophisticated statistical techniques to attempt to isolate a teacher’s effect on student test score growth, are rapidly assuming a central role in policy, particularly in the new teacher evaluation systems currently being designed and implemented. Proponents view them as a primary tool for differentiating teachers based on performance/effectiveness.

    Opponents, on the other hand, including a great many teachers, argue that the models’ estimates are unstable over time, subject to bias and imprecision, and that they rely entirely on standardized test scores, which are, at best, an extremely partial measure of student performance. Many have come to view growth models as exemplifying all that’s wrong with the market-based approach to education policy.

    It’s very easy to understand this frustration. But it's also important to separate the research on value-added from the manner in which the estimates are being used. Virtually all of the contention pertains to the latter, not the former. Actually, you would be hard-pressed to find many solid findings in the value-added literature that wouldn't ring true to most educators.
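    To make the post's description of these methods concrete, here is a toy sketch (my own illustration, not any district's actual model) of the basic idea behind a value-added estimate: regress current test scores on prior scores plus teacher indicators, and read each teacher's estimated "effect" off the indicator coefficients. The data, teacher count, and effect sizes are all invented for the example.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_per_teacher = 200
    true_effects = np.array([0.0, 0.5, -0.5])  # assumed "true" teacher effects

    prior = rng.normal(size=n_per_teacher * 3)               # prior-year scores
    teacher = np.repeat(np.arange(3), n_per_teacher)         # teacher assignment
    # Current score = persistence of prior score + teacher effect + noise
    current = 0.7 * prior + true_effects[teacher] + rng.normal(scale=0.5, size=prior.size)

    # Design matrix: intercept, prior score, and dummies for teachers 1 and 2
    # (teacher 0 is the reference category).
    X = np.column_stack([
        np.ones_like(prior),
        prior,
        (teacher == 1).astype(float),
        (teacher == 2).astype(float),
    ])
    beta, *_ = np.linalg.lstsq(X, current, rcond=None)

    # beta[2] and beta[3] estimate teachers 1 and 2 relative to teacher 0;
    # with this much data they land close to 0.5 and -0.5.
    print(beta[2], beta[3])
    ```

    Real value-added models are far more elaborate (student covariates, multiple years, shrinkage), but the debates in the post are about this core move: attributing the residual growth to the teacher.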

  • The Categorical Imperative In New Teacher Evaluations

    Written on November 22, 2011

    There is a push among many individuals and groups advocating new teacher evaluations to predetermine the number of outcome categories – e.g., highly effective, effective, developing, ineffective, etc. – that these new systems will include. For instance, a "statement of principles" signed by 25 education advocacy organizations recommends that the reauthorized ESEA law require “four or more levels of teacher performance." The New Teacher Project’s primary report on redesigning evaluations made the same suggestion.* For their part, many states have followed suit, mandating new systems with a minimum of 4-5 categories.

    The rationale here is pretty simple on the surface: Those pushing for a minimum number of outcome categories believe that teacher performance must be adequately differentiated, a goal on which prior systems, most of which relied on dichotomous satisfactory/unsatisfactory schemes, fell short. In other words, the categories in new evaluation systems must reflect the variation in teacher performance, and that cannot be accomplished when there are only a couple of categories.

    It’s certainly true that the number of categories matters – it is an implicit statement as to the system’s ability to tease out the “true” variation in teacher performance. The number of categories a teacher evaluation system employs should depend on how well it can differentiate teachers with a reasonable degree of accuracy. If a system is unable to pick up this “true” variation, then using several categories may end up doing more harm than good, because it will be providing faulty information. And, at this early stage, despite the appearance of certainty among some advocates, it remains unclear whether all new teacher evaluation systems should require four or more levels of “effectiveness."

  • The Impact Of The Principal In The Classroom

    Written on November 3, 2011

    Direct observation is a way of gathering data by watching behavior or events as they occur; for example, a teacher teaching a lesson. This methodology is important to teacher induction and professional development, as well as teacher evaluation. Yet, direct observation has a major shortcoming: it is a rather obtrusive data gathering technique. In other words, we know the observer can influence the situation and the behavior of those being observed. We also know people do not behave the same way when they know they are being watched. In psychology, these forms of reactivity are known as the Hawthorne effect and the observer-expectancy (or experimenter-expectancy) effect (also here).

    Social scientists and medical researchers are well aware of these issues and the fact that research findings don’t mean a whole lot when the researcher and/or the study participants know the purpose of the research and/or are aware that they are being observed or tested. To circumvent these obstacles, techniques like “mild deception” and “covert observation” are frequently used in social science research.

    For example, experimenters often take advantage of “cover stories” which give subjects a sensible rationale for the research while preventing them from knowing (or guessing) the true goals of the study, which would threaten the experiment’s internal validity – see here. Also, researchers use double-blind designs, which, in the medical field, mean that neither the research participant nor the researcher knows when the treatment or the placebo is being administered.

  • A Few Other Principles Worth Stating

    Written on October 12, 2011

    Last week, a group of around 25 education advocacy organizations, including influential players such as Democrats for Education Reform and The Education Trust, released a "statement of principles" on the role of teacher quality in the reauthorization of the Elementary and Secondary Education Act (ESEA). The statement, which is addressed to the chairs and ranking members of the Senate and House committees handling the reauthorization, lays out some guidelines for teacher-focused policy in ESEA (a draft of the full legislation was released this week; summary here).

    Most of the statement is the standard fare from proponents of market-based reform, some of which I agree with in theory if not practice. What struck me as remarkable was the framing argument presented in the statement's second sentence:

    Research shows overwhelmingly that the only way to close achievement gaps – both gaps between U.S. students and those in higher-achieving countries and gaps within the U.S. between poor and minority students and those more advantaged – and transform public education is to recruit, develop and retain great teachers and principals.
    This assertion is false.

  • Quality Control, When You Don't Know The Product

    Written on August 30, 2011

    Last week, New York State’s Supreme Court issued an important ruling on the state’s teacher evaluations. The aspect of the ruling that got the most attention was the proportion of evaluations – or “weight” – that could be assigned to measures based on state assessments (in the form of estimates from value-added models). Specifically, the Court ruled that these measures can only comprise 20 percent of a teacher’s evaluation, compared with the option of up to 40 percent for which Governor Cuomo and others were pushing. Under the decision, the other 20 percent must consist entirely of alternative test-based measures (e.g., local assessments).
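    The weights at issue in the ruling combine mechanically, as in this hypothetical illustration (the component names and scores are invented; only the 20/20 split for state and local test-based measures comes from the post):

    ```python
    # Under the ruling: at most 20% state value-added, 20% other test-based
    # measures, leaving 60% for everything else (observations, etc.).
    weights = {"state_vam": 0.20, "local_assessments": 0.20, "other_measures": 0.60}
    scores  = {"state_vam": 3.0,  "local_assessments": 2.5,  "other_measures": 3.5}

    composite = sum(weights[k] * scores[k] for k in weights)
    print(round(composite, 2))  # 0.2*3.0 + 0.2*2.5 + 0.6*3.5 = 3.2
    ```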

    Joe Williams, head of Democrats for Education Reform, one of the flagship organizations of the market-based reform movement, called the ruling “a slap in the face” and “a huge win for the teachers unions." He characterized the policy impact as follows: “A mediocre teacher evaluation just got even weaker."

    This statement illustrates perfectly the strange reasoning that seems to be driving our debate about evaluations.

  • Certainty And Good Policymaking Don't Mix

    Written on August 24, 2011

    Using value-added and other types of growth model estimates in teacher evaluations is probably the most controversial and oft-discussed issue in education policy over the past few years.

    Many people (including a large proportion of teachers) are opposed to using student test scores in their evaluations, as they feel that the measures are not valid or reliable, and that they will incentivize perverse behavior, such as cheating or competition between teachers. Advocates, on the other hand, argue that student performance is a vital part of teachers’ performance evaluations, and that the growth model estimates, while imperfect, represent the best available option.

    I am sympathetic to both views. In fact, in my opinion, there are only two unsupportable positions in this debate: Certainty that using these measures in evaluations will work; and certainty that it won’t. Unfortunately, that’s often how the debate has proceeded – two deeply-entrenched sides convinced of their absolutist positions, and resolved that any nuance in or compromise of their views will only preclude the success of their efforts. You’re with them or against them. The problem is that it's the nuance – the details – that determines policy effects.

    Let’s be clear about something: I'm not aware of a shred of evidence – not a shred – that the use of growth model estimates in teacher evaluations improves performance of either teachers or students.

  • Test-Based Teacher Evaluations Are The Status Quo

    Written on August 18, 2011

    We talk a lot about the “status quo” in our education debates. For instance, there is a common argument that the failure to use evidence of “student learning” (in practice, usually defined in terms of test scores) in teacher evaluations represents the “status quo” in this (very important) area.

    Now, the implication that “anything is better than the status quo” is a rather massive fallacy in public policy, as it assumes that the benefits of alternatives will outweigh their costs, and that there is no chance the replacement policy will have a negative impact (almost always an unsafe assumption). But, in the case of teacher evaluations, the “status quo” is no longer what people seem to think.

    Not counting Puerto Rico and Hawaii, the ten largest school districts in the U.S. are (in order): New York City; Los Angeles; Chicago; Dade County (FL); Clark County (NV); Broward County (FL); Houston; Hillsborough (FL); Orange County (FL); and Palm Beach County (FL). Together, they serve about eight percent of all K-12 public school students in the U.S., and over one in ten of the nation’s low-income children.

    Although details vary, every single one of them is either currently using test-based measures of effectiveness in its evaluations, or is in the process of designing/implementing these systems (most due to statewide legislation).

  • Again, Niche Reforms Are Not The Answer

    Written on August 9, 2011

    Our guest author today is David K. Cohen, John Dewey Collegiate Professor of Education and professor of public policy at the University of Michigan, and a member of the Shanker Institute’s board of directors.

    A recent response to my previous post on these pages helps to underscore one of my central points: If there is no clarity about what it will take to improve schools, it will be difficult to design a system that can do it. In a recent essay in the Sunday New York Times Magazine, Paul Tough wrote that education reformers who advocated "no excuses" schooling were now making excuses for reformed schools' weak performance. He explained why: "Most likely for the same reason that urban educators from an earlier generation made excuses: successfully educating large numbers of low-income kids is very, very hard."

    In his post criticizing my initial essay, "What does it mean to ‘fix the system’?," the Fordham Institute’s Chris Tessone told the story of how Newark Public Schools tried to meet the requirements of a federal school turnaround grant. The terms of the grant required that each of three failing high schools replace at least half of its staff. The schools, he wrote, met this requirement largely by swapping a portion of their staffs with one another, a process which Tessone and school administrators refer to as the “dance of the lemons.” Would such replacement be likely to solve the problem?

    Even if all of the replaced teachers had been weak (which we do not know), I doubt that such replacement could have done much to help.

  • Success Via The Presumption Of Accuracy

    Written on July 27, 2011

    In our previous post, Professor David K. Cohen argued that reforms such as D.C.’s new teacher evaluation system (IMPACT) will not by themselves lead to real educational improvement, because they focus on the individual rather than systemic causes of low performance. He framed this argument in terms of the new round of IMPACT results, which were released two weeks ago. While the preliminary information was limited, it seems that the distribution of teachers across the four ratings categories (highly effective, effective, minimally effective, and ineffective) was roughly similar to last year’s - including a small group of teachers fired for receiving the lowest “ineffective” rating, and a somewhat larger group (roughly 200) fired for having received the “minimally effective” label for two consecutive years.

    Cohen’s argument on the importance of infrastructure does not necessarily mean that we should abandon the testing of new evaluation systems, only that we should be very careful about how we interpret their results and the policy conclusions we draw from them (which is good advice at all times). Unfortunately, however, it seems that caution is in short supply. For instance, shortly after the IMPACT results were announced, the Washington Post ran an editorial, entitled “DC Teacher Performance Evaluations Are Working," in which a couple of pieces of “powerful evidence” were put forward in an attempt to support this bold claim. The first was that 58 percent of the teachers who received a “minimally effective” rating last year and remained in the district were rated either “effective” or “highly effective” this year. The second was that around 16 percent of DC teachers were rated “highly effective” this year, and will be offered bonuses, which the editorial writers argued shows that most teachers “are doing a good job” and being rewarded for it.

    The Post’s claim that these facts represent evidence - much less “powerful evidence” - of IMPACT’s success is a picture-perfect example of the flawed evidentiary standards that too often drive our education debate. The unfortunate reality is that we have virtually no idea whether IMPACT is actually “working," and we won’t have even a preliminary grasp for some time. Let’s quickly review the Post’s evidence.
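    One reason category movement alone is weak evidence is regression to the mean, which a toy simulation (my own illustration, not drawn from the post or from IMPACT data) makes vivid: teachers whose "true" performance never changes, measured with noise, still switch categories between years in large numbers.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    true_perf = rng.normal(size=10_000)  # fixed underlying performance, never changes
    # Each year's rating = true performance + independent measurement noise
    year1 = true_perf + rng.normal(size=true_perf.size)
    year2 = true_perf + rng.normal(size=true_perf.size)

    # Label the bottom 15% of year-1 ratings "minimally effective"
    low_cutoff = np.quantile(year1, 0.15)
    low_in_y1 = year1 <= low_cutoff

    # Share of those teachers who score above the cutoff the following year,
    # despite zero actual change in performance
    moved_up = float(np.mean(year2[low_in_y1] > low_cutoff))
    print(round(moved_up, 2))
    ```

    With this (assumed) noise level, well over half of the bottom category "improves" out of it purely by chance - a pattern hard to distinguish from the 58 percent figure the Post cited as evidence of success.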

  • Evaluating Individual Teachers Won't Solve Systemic Educational Problems

    Written on July 26, 2011

    ** Also posted here on "Valerie Strauss' Answer Sheet" in the Washington Post

    Our guest author today is David K. Cohen, John Dewey Collegiate Professor of Education and professor of public policy at the University of Michigan, and a member of the Shanker Institute’s board of directors.  

    What are we to make of recent articles (here and here) extolling IMPACT, Washington DC’s fledgling teacher evaluation system, for how many "ineffective" teachers have been identified and fired, how many "highly effective" teachers rewarded? It’s hard to say.

    In a forthcoming book, Teaching and Its Predicaments (Harvard University Press, August 2011), I argue that fragmented school governance in the U.S., coupled with the lack of coherent educational infrastructure, makes it difficult either to broadly improve teaching and learning or to have valid knowledge of the extent of improvement. Merriam-Webster defines "infrastructure" as: "the underlying foundation or basic framework (as of a system or organization)." The term is commonly used to refer to the roads, rail systems, and other frameworks that facilitate the movement of things and people, or to the physical and electronic mechanisms that enable voice and video communication. But social systems also can have such "underlying foundations or basic frameworks". For school systems around the world, the infrastructure commonly includes student curricula or curriculum frameworks, exams to assess students’ learning of the curricula, instruction that centers on teaching that curriculum, and teacher education that aims to help prospective teachers learn how to teach the curricula. The U.S. has had no such common and unifying infrastructure for schools, owing in part to fragmented government (including local control) and traditions of weak state guidance about curriculum and teacher education.

    Like many recent reform efforts that focus on teacher performance and accountability, IMPACT does not attempt to build infrastructure, but rather assumes that weak individual teachers are the problem. There are some weak individual teachers, but the chief problem has been a non-system that offers no guidance or support for strong teaching and learning, precisely because there has been no infrastructure. IMPACT frames reform as a matter of solving individual problems when the weakness is systemic.


