On Teacher Evaluation: Slow Down And Get It Right

** Reprinted here in the Washington Post

The following is written by Morgan S. Polikoff and Matthew Di Carlo. Morgan is Assistant Professor in the Rossier School of Education at the University of Southern California.

One of the primary policy levers now being employed in states and districts nationwide is teacher evaluation reform. Well-designed evaluations, which should include measures that capture both teacher practice and student learning, have great potential to inform and improve the performance of teachers and, thus, students. Furthermore, most everyone agrees that the previous systems were largely pro forma, failed to provide useful feedback, and needed replacement.

The attitude among many policymakers and advocates is that we must implement these systems and begin using them rapidly for decisions about teachers, while design flaws can be fixed later. Such urgency is undoubtedly influenced by the history of slow, incremental progress in education policy. However, we believe this attitude to be imprudent.

The risks of excessive haste are likely higher than whatever opportunity costs would be incurred by proceeding more cautiously. Moving too quickly gives policymakers and educators less time to devise and test the new systems, and to become familiar with how they work and the results they provide.

Moreover, careless rushing may result in avoidable, erroneous high-stakes decisions about individual teachers. Such decisions are harmful to the profession, they threaten the credibility of the evaluations, and they may well promote widespread backlash (such as the recent Florida lawsuits and the growing "opt-out" movement). Making things worse, the opposition will likely "spill over" into other promising policies, such as the already-fragile effort to enact the Common Core standards and aligned assessments.

Finally, we must not underestimate the costs, financial and otherwise, of making large changes to these systems once they are in place. A perfect example is NCLB – it had many obvious design flaws that were known early on, but few of these have been corrected, even in states’ ESEA “flexibility” applications.

In short, given these risks and the difficulty of fairly and accurately measuring teacher effectiveness, it seems short-sighted to rush into full-blown implementation without ensuring that the new systems are up to the task.

To that end, we would like to highlight four issues to which states and districts must pay attention in the short term.

The first is that the details of the evaluations, some of which may seem esoteric, in fact matter tremendously. Important choices include (but are not limited to): selecting measures, particularly for teachers in non-tested grades and subjects; reporting evaluation results to educators in a manner that is useful to their practice; ensuring accuracy in state data systems; choosing cut scores (if desired) to separate more and less effective educators; and designing scoring systems that preserve each measure’s intended importance, or “weight.” All of these decisions are important, but even a cursory read of states' new evaluation policies under the waivers or Race to the Top reveals many decisions that contradict what little we know about effective teacher evaluation systems.
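To illustrate the point about weights, here is a minimal sketch in Python. It is not drawn from any state's actual scoring rules: the two measures, the five teachers' scores, and the function names are all hypothetical. It assumes a composite rating built as a simple weighted sum, and it treats a measure's "effective" weight as the share of the composite's variance that the measure contributes. When one measure varies far more than the other, a nominal 50/50 split can behave very differently in practice.

```python
from statistics import mean

def covariance(xs, ys):
    """Sample covariance of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def effective_weights(measures, nominal):
    """Share of composite-score variance contributed by each measure.

    The composite is the nominally weighted sum of the measures.
    The returned shares sum to 1.
    """
    n = len(next(iter(measures.values())))
    composite = [sum(nominal[k] * measures[k][i] for k in measures)
                 for i in range(n)]
    var_c = covariance(composite, composite)
    return {k: nominal[k] * covariance(measures[k], composite) / var_c
            for k in measures}

def standardize(xs):
    """Rescale a measure to mean 0 and standard deviation 1."""
    m = mean(xs)
    sd = covariance(xs, xs) ** 0.5
    return [(x - m) / sd for x in xs]

# Hypothetical scores for five teachers (illustrative only).
measures = {
    "observation": [3.0, 3.1, 3.0, 3.2, 3.1],  # ratings cluster tightly
    "growth": [20.0, 80.0, 50.0, 35.0, 65.0],  # scores spread widely
}
nominal = {"observation": 0.5, "growth": 0.5}   # intended 50/50 split

raw = effective_weights(measures, nominal)
scaled = effective_weights(
    {k: standardize(v) for k, v in measures.items()}, nominal
)

print("raw scores:   ", raw)
print("standardized: ", scaled)
```

With these made-up numbers, the growth measure accounts for more than 99 percent of the variation in the raw composite, despite the nominal 50 percent weight. Putting both measures on a common scale first restores the intended even split. Real systems involve more measures and more complicated rules, but the basic issue is the same.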

And, as is often the case with new policies, the flow of research in this area lags far behind the breakneck pace of policy making. For instance, a large number of states have chosen as their growth models for teacher evaluation a variant of what’s commonly called the “student growth percentile” (SGP) model. However, recent evidence suggests that value-added models can do a better job of leveling the playing field across classes. Similarly, the Measures of Effective Teaching project offered useful guidance for designing evaluation systems, but its results were released after many states and districts had already made these decisions.

A second issue is simple bad timing: The roll-out of the Common Core standards and new Core-aligned assessments creates serious complications for new teacher evaluation systems. Perhaps the most important of these is that curriculum, standards, and assessments are not yet in sync. New York has recently experienced this issue, administering new assessments before teachers had received the curriculum materials and support needed to implement the Common Core. And, while the stated hope is that the tests, curricula, and standards will be seamlessly aligned in a few years, if history is any guide this is far from guaranteed.

Doing evaluation reform and Common Core implementation at the same time may well be too much for states, districts, and schools to handle. Furthermore, evaluating teachers on the basis of tests that are not aligned with what they are supposed to be teaching is a fundamentally invalid use of those data.

The third issue is the need for states to avoid being overly prescriptive. Most notably, many schools and districts have well-established evaluation systems already in place, and it makes little sense to uproot these systems and impose a state-mandated model. Similarly, districts should be given room to experiment with system design and with different ways to use the results for personnel decisions. The state's optimal role may be to enforce a minimum standard for teacher evaluation, rather than mandating a particular evaluation model statewide.

Fourth and finally, new evaluations, as with any major policy, require significant time and resources to plan and pilot, and there must be substantial capacity building for educators to understand and carry out these systems. Policies should not move directly from the drawing board to high-stakes implementation if the goal is maximizing the policies' effectiveness and minimizing their negative unintended consequences. We recommend that schools and districts have a year for planning and two years of implementation prior to tying ratings to high-stakes decisions.

We conclude where we began – as two individuals who believe that improved teacher evaluation systems could indeed help elevate teaching and learning in U.S. schools. We are concerned that the overly quick, insufficiently careful manner in which many new systems are being installed threatens their likelihood of success.

Put simply, we need to slow down and work to create the best systems possible. Schools and districts in the middle of the design and implementation process should focus on the details of their systems and partner with researchers and other sites to study system effectiveness. In those places where evaluations are already in force, we would strongly advise policymakers to take a step back and consider our suggestions.

And, no matter the situation, high-stakes decisions about teachers should not be made on the basis of assessment data collected during Common Core roll-out. Doing so is unfair and inappropriate and may cause serious harm.

We acknowledge that our arguments here do not fit neatly into the polarized, tribal framework that defines education policy discourse today. In fact, they may not resonate with either “side” of the reform debates, as we support evaluation reform but not unconditionally. To be clear, we do not expect that the new systems will ever be "perfect," and we fully acknowledge that there will be mistakes and adjustments going forward. Nevertheless, we believe that research and history show that time and attention to detail are usually the difference between policies that succeed and those that fail to improve outcomes. If this is worth doing, it is worth doing correctly.

- Morgan Polikoff and Matt Di Carlo

The challenges and delays you describe could be minimized if we could escape the grip of test-score fetishization. Why are we so determined to be unique in the world for our willingness to try evaluating incredibly complex work through incredibly simplistic tools, distorted with complex and subjective formulations and calibrations? No top-performing countries are doing this. No independent schools I've ever heard of are doing this. And why not? Because it's actually not that hard to do evaluations well - it's merely hard to do it well in the current economy and political climate. Good evaluations are time-consuming, and therefore, expensive. Or, if we make it less time-consuming for individual evaluators by distributing the responsibility, it's expensive to train more evaluators. Let's be honest about that and decide what we can do about it. And maybe, several years from now, when there's enough data to evaluate CCSS assessments, we'll find we actually improved evaluations in the interim, without having to reduce a student's "growth" to a test score. Let teachers and their evaluators choose a variety of actual student work to examine as part of evaluation, and help teachers focus on the most important work they do in any and every subject and setting. If anyone ever proposed evaluating my work as a high-school English teacher based on the shoddy tests currently used by California, I'd fight it every step of the way. I'm very thankful I've never worked for a school or district that insulted my work or my intelligence, or my students, by suggesting there was anything useful for us in the simplistic and flawed tests we've had throughout my career.