How Not To Improve New Teacher Evaluation Systems

One of the more interesting recurring education stories over the past couple of years has been the release of results from several states’ and districts’ new teacher evaluation systems, including those from New York, Indiana, Minneapolis, Michigan and Florida. In most of these instances, the primary focus has been on the distribution of teachers across ratings categories. Specifically, there seems to be a pattern emerging, in which the vast majority of teachers receive one of the higher ratings, whereas very few receive the lowest ratings.

This has prompted some advocates, and even some high-level officials, essentially to deem as failures the new systems, since their results suggest that the vast majority of teachers are “effective” or better. As I have written before, this issue cuts both ways. On the one hand, the results coming out of some states and districts seem problematic, and these systems may need adjustment. On the other hand, there is a danger here: States may respond by making rash, ill-advised changes in order to achieve “differentiation for the sake of differentiation,” and the changes may end up undermining the credibility and threatening the validity of the systems on which these states have spent so much time and money.

Granted, whether and how to alter new evaluations are difficult decisions, and there is no tried and true playbook. That said, New York Governor Andrew Cuomo’s proposals provide a stunning example of how not to approach these changes. To see why, let’s look at some sound general principles for improving teacher evaluation systems based on the first rounds of results, and how they compare with the New York approach.*

Do not judge teacher evaluation systems based solely (or even mostly) on the distribution of results. One would think that this goes without saying. That is, looking at the results of an evaluation system is just one of many ways to assess the design and implementation of these systems. State and district officials, researchers, and other stakeholders need to be looking at a range of other outcomes, such as recruitment/retention and the opinions of and feedback from educators. And, in the long run, the only truly important outcome is whether the new systems generate improvements.

Yet, Governor Cuomo seems to have rendered his judgment based entirely on the proportion of teachers statewide rated “ineffective” or “developing,” juxtaposed with inappropriate comparisons of these results with student proficiency rates. Moreover, as shown below, most of his proposed “solutions” are designed not to improve the measures, but rather to engineer a distribution of results that he considers more appropriate (i.e., lower ratings), even when the latter goal comes at the expense of the former. Teacher evaluation systems cannot be assessed by eyeballing four percentages, and shifting the ratings distribution to the left is not by itself improvement.

Don’t make huge changes immediately; gather 2-3 years of data, perform a detailed analysis, and then proceed. Teacher evaluations in many states and districts were hastily designed and implemented, sometimes without so much as a pilot year. Administrators had to be trained, new data systems set up, and so on. Moreover, in many states, districts had to install brand new local assessments, sometimes dozens of them, as well as decide on how these assessments were going to be incorporated into teachers’ evaluation scores.

There is always some shakedown time with new policies, and it’s entirely possible that teacher evaluations would produce different results in the first couple of years than they do in subsequent years, even without any design changes. Accordingly, while small adjustments are often a good idea in the first few years, states should commit themselves to avoid making large, sweeping changes to their new systems until they have collected at least two, and preferably three, years of data, and have analyzed it thoroughly and discussed all options. This includes the suggestions discussed below, as well as many others.

Governor Cuomo, on the other hand, has rendered blanket judgment after the first year: The system is a failure (in his words, “baloney”). And he has proposed what amounts to a complete overhaul. To some people, that may seem like “bold, assertive leadership.” In a sense, perhaps it is (it’s at least bold, give him that). But it’s also bad policy. One can only cringe when imagining educators' reactions to all this.

Examine inter-district differences in results and design/implementation. In many states that mandated new evaluations, systems vary pretty widely by district in terms of design (and, of course, in how they were implemented). There are often different observation rubrics, weights, local assessments, etc. Accordingly, as would be expected, results differ as well.

This is absolutely true in New York. Each of the state's several hundred school districts had to negotiate its own local “learning measures” for all teachers (20 percent of teachers’ final scores), as well as alternative measures for teachers in non-tested grades and subjects (also 20 percent). They also had some flexibility in choosing an observation protocol and a scoring system for it. As a consequence, predictably, districts varied quite a bit in their first year results. In fact, in the districts serving the largest proportions of disadvantaged students, the evaluation results are quite a bit lower than elsewhere in the state (see here).

If there is a problem with how the ratings turned out overall, one key to understanding why the results occurred and how to proceed is to exploit this inter-district variation – i.e., to examine differences in district design, and see if they are associated with variation in the results. The idea here is that some districts will inevitably do a better job than others, and we should learn from them.

If this has been done in New York, it was not made public, and it is not at all reflected in the Governor’s remarks or plan. The proposal would essentially cancel all existing systems and impose a new one. This is the opposite of thoughtful, data-driven education reform.

Disaggregate overall ratings by component, part one (classroom observations). Ratings that seem high overall might very well be driven by just one or two subcomponents of the system. In most instances, this will be the classroom observation component. Observations, unlike standard test-based growth measures, are not constrained to produce a spread of scores – theoretically, all teachers can receive the highest scores on their classroom observations. 

So, what can be done in situations where observation scores seem to be producing high ratings? That is a tough question. There may be a problem with the protocol itself, or with how the raw observation scores are incorporated into teachers’ scores. Part of the issue may simply be a case of principals’ decisions. For example, principals may be hesitant to give their teachers poor ratings due to the very understandable (perhaps even beneficial) fear that they won’t be able to replace their weaker teachers with superior alternatives. Or maybe principals just “overrate” their own people, which is a very human thing to do.

In any case, states and districts have options. They may, for instance, want to employ highly-trained external observers to supplement and/or check principal observations. They may also analyze the data to identify principals whose observations turn out better (or worse) than expected, and provide additional training and guidance. Finally, requiring 3-4 observations per year, while costly, can improve the reliability and quality of observational data.

In New York, Governor Cuomo characterized the state’s observations as “not standard across districts and entirely manipulable.” In order to improve classroom observations, he proposes that each teacher still receive (at least) two observations per year: One from a “third party” observer, and another from the school’s principal or other administrator within the building (currently, school personnel conduct both observations). And, under the proposal, observations would count as 50 percent of teachers’ final ratings, rather than 60 percent.

The idea of employing “third party” observers is reasonable on the surface. But there are a couple of big problems with the (somewhat vague) approach in the governor's proposal. For one thing, there is nothing in the proposal that focuses on training/monitoring principals to improve observations of their own teachers (and/or to use the "third party" observations toward this end). Principals' observations are simply characterized, in a painful generalization, as "manipulable" (ironic, given that the governor's proposals, including those pertaining to classroom observations, seem largely intended to produce lower ratings), and reduced in "importance" from 60 to 15 percent of teachers' final scores. If you have a problem with a component of your evaluation system, particularly one as long-standing and with as much potential formative value as principal observations, you should at least make an effort to address it directly, not just downgrade its importance.

Second, the governor's plan is to weight the “third party” observation scores as 35 percent of teachers’ scores, and the principal observations as 15 percent. This implies that the “third party” observations are far more valid/reliable than principals’, which is an empirically shaky assumption even under ideal circumstances**, but particularly questionable in the context of the governor's plan.

For example, setting up a system of "third party" observers requires careful planning and meticulous attention to standardization (the lack of the latter is one of the governor's primary criticisms of current observation systems). Yet the proposal allows these external observers to be one of the following: State-trained observers; principals/administrators from another school inside or outside the district; or state-appointed SUNY/CUNY education school faculty. There are so many unanswered questions here. Who makes the decision as to which "type" of external observer will be used? Regardless of who performs them, will external observations be used to "check" principal observations, and vice-versa? Will external administrators have to be certified/appointed by the state, and will requests that they observe teachers in other schools be voluntary? In short, if you're going to base over one-third of teachers' final ratings on a single classroom observation by people who don't even work in those teachers' schools, these are not just trivial little details. The use of "third party" observers requires up-front investment; it's not something you can throw together on the cheap.

Third, all of this sends a very clear signal to principals (and their staffs) that their judgments of their teachers’ performance are not to be trusted, and are significantly less important than those of almost any nominally qualified person from outside their building. This potentially undermines principals’ authority among teachers. It is also disrespectful to say the least.

(By the way, if these changes are indeed intended to improve the quality of information provided by the observational component, why does the governor’s proposal also downweight observations as a proportion of teachers’ final scores - from 60 to 50 percent?)

Finally, this change will represent a massive setback. The plan does not state whether it will impose a single observation protocol on all New York districts, but it does imply as much (e.g., the current system is criticized as “not standard across districts,” and the use of “third party” observers would probably require a statewide protocol, as this would facilitate the training of external observers).

If so, it would require many districts to scrap their current observation systems (their scoring systems, and, in some cases, part of their protocols). The time they spent negotiating those systems will have been lost, and they will have to train for a new system. Furthermore, the fact that districts and unions chose these protocols and scoring systems, albeit from among somewhat constrained options, represented some degree of “buy-in” and personalization. That too will be forfeited, and the proposal provides virtually no assurance that all this will be worth the costs, or even much acknowledgment that there are costs.

(Side note: Overall, the governor's reaction to the observation results is very ironic. The fact that observations are not standard across districts, and that they are conducted by building administrators, are both a direct result of New York State's evaluation law - districts were required to do it this way. And the state, in its ill-advised rush to get new evaluations up and running as quickly as possible, also required districts/unions to negotiate their observation protocols and scoring systems before they were given a chance to see how the results would turn out, and to do so while having to deal with the complications of “fitting in” these scores to the overall scoring bands pre-determined by the state. This was a very difficult situation that constrained many districts' options as to how they scored observations. Now, observations are being criticized for their lack of consistency across districts, school officials are being chastised for the results, and the state, under Governor Cuomo’s proposal, would once again be imposing observations and scoring systems on districts without knowing how they’ll turn out. It’s almost comical.)

Disaggregate overall ratings by component, part two (value-added and other "learning measures"). The decision of how much weight to assign to growth model estimates (i.e., value-added) was among the most contentious in all states that went ahead with new systems. As mentioned above, value-added models, unlike observations, are designed to produce a spread of ratings. As a result, states can predict with reasonable confidence how the statewide results will turn out, and they can be confident that the value-added estimates will be spread out – some high, some low, most near the average (see this post for a discussion of how this variation affects the "true" weights of different measures).

This very convenient property of the estimates can be useful in policy applications, but it might also represent a temptation to many policymakers and advocates to assign a weight to value-added that is as high as possible, as doing so would tend to distribute overall ratings more evenly. Some states, such as D.C., went a little overboard (in my view), and assigned a 50 percent weight (they have since lowered it to 35). Others, including New York, went considerably lower. There is no “correct weight,” and the best decision going in was probably to let districts make their own decisions (some states did this).

In any case, faced with first year results that seem too high to some people, there may be a knee-jerk reaction of simply raising the weight assigned to value-added scores. This is a clumsy, ham-handed approach. The weights assigned to various components should not be chosen just to produce a certain distribution of results. They are substantive choices about the importance and reliability of different measures.
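The interaction between nominal weights and score spread can be made concrete with a small sketch. It uses a rough heuristic (nominal weight times the standard deviation of each component's scores, ignoring correlations between components) and entirely hypothetical scores and weights; the function name and the 60/40 split are illustrative assumptions, not any state's actual formula.

```python
from statistics import pstdev


def effective_weights(nominal_weights, scores_by_component):
    """Rough share of composite-score variation attributable to each component.

    Heuristic only: nominal weight times the standard deviation of that
    component's scores, normalized to sum to 1. Correlations between
    components are ignored.
    """
    contributions = {
        name: nominal_weights[name] * pstdev(scores)
        for name, scores in scores_by_component.items()
    }
    total = sum(contributions.values())
    return {name: value / total for name, value in contributions.items()}


# Hypothetical scores for five teachers, both on a 0-100 scale.
scores = {
    "observation": [88, 90, 91, 92, 94],  # tightly clustered near the top
    "value_added": [20, 40, 50, 60, 80],  # spread out by design
}
# Hypothetical nominal weights, chosen only for illustration.
nominal = {"observation": 0.60, "value_added": 0.40}

shares = effective_weights(nominal, scores)
for name, share in shares.items():
    print(f"{name}: nominal {nominal[name]:.0%}, effective {share:.0%}")
```

In this toy case the value-added component carries a nominal weight of 40 percent but accounts for roughly 87 percent of the variation in composite scores, simply because its scores are far more spread out. Raising its nominal weight would amplify that further, which is why the choice deserves more care than a quick fix for the ratings distribution.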

Governor Cuomo was, thankfully (unless you’re in New York), one of the few state officials who took the value-added bait here (other districts, including D.C., are actually moving in the opposite direction). He proposes to increase drastically the weight assigned to value-added from 25 to 50 percent. In addition, teachers who receive ineffective ratings on either the observation or value-added/“learning” component will be unable to receive overall ratings higher than developing.

He characterizes these changes as "simplifying" evaluation systems. But performance evaluation, for teachers or any other professionals, is not simple.

In reality, this is a crude, empirically questionable approach that seems intended to lower teachers’ ratings regardless of the costs in terms of the quality of the overall ratings. The available research on teacher evaluations suggests that different components (e.g., observations, growth model estimates, surveys, SLOs, etc.) tend to be only modestly correlated with each other, but that, done correctly, they each can add unique information to teachers’ final ratings (e.g., Jacob and Lefgren 2008; Harris and Sass 2009; Gates Foundation 2013; Mihaly et al. 2013). Rather than examining and improving the "learning measures" in current systems for teachers in tested grades/subjects, the governor’s proposal simply removes one of them and doubles the weight assigned to the other.

And the fact that ineffective ratings on either component would, under this proposal, mean an automatic rating of developing or lower in some respects contradicts the idea of using weights in the first place – i.e., a single component can “overrule” the other ones. This assumes that teachers who score poorly in terms of "learning measures" or classroom observations could not possibly be considered "effective teachers." Either you believe in “multiple measures,” or you do not.
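To see how such a rule overrides the weights, here is a minimal sketch of the cap as described above. The names `Rating` and `final_rating` are hypothetical, and the sketch takes the category implied by the weighted composite as a given input, since the proposal's cut scores are not specified here.

```python
from enum import IntEnum


class Rating(IntEnum):
    INEFFECTIVE = 1
    DEVELOPING = 2
    EFFECTIVE = 3
    HIGHLY_EFFECTIVE = 4


def final_rating(weighted_rating, observation_rating, learning_rating):
    """Apply the proposed cap: an ineffective rating on either component
    limits the overall rating to developing at best.

    `weighted_rating` is the category the weighted composite would yield
    on its own; how it is computed is outside this sketch.
    """
    if Rating.INEFFECTIVE in (observation_rating, learning_rating):
        return min(weighted_rating, Rating.DEVELOPING)
    return weighted_rating


# A teacher whose composite lands in "effective" and whose observations are
# "highly effective", but whose "learning" component is "ineffective".
capped = final_rating(
    weighted_rating=Rating.EFFECTIVE,
    observation_rating=Rating.HIGHLY_EFFECTIVE,
    learning_rating=Rating.INEFFECTIVE,
)
print(capped.name)  # DEVELOPING
```

Whatever the weights say, the single low component decides the outcome for this teacher, which is the sense in which one measure "overrules" the others.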

It also bears mentioning that, since value-added scores will tend to be more evenly distributed across the categories, this "overruling" most likely will disproportionately affect teachers in tested grades and subjects, which may serve as a disincentive to teach in those grades and subjects (not to mention an incentive to focus on tested math and reading content for those teachers who remain).

In addition, according to this proposal, teachers in non-tested grades and subjects, who don’t receive growth scores from the state, will instead be judged (50 percent of their scores) based on “a student growth measure that measures one year of academic growth.” The casual tone of this recommendation, and the complete lack of details accompanying it (it does not even specify whether these measures will be statewide or vary by district), suggest that the governor may not realize how incredibly complicated it will be to design and implement this kind of system. It will require assessments for dozens of grade/subject combinations, longitudinal data collection, and methods for determining "one year of academic growth." Again, these details are not just trivial side issues that can be addressed later. These measures will constitute half of most teachers' final ratings. 

Finally, note that the governor frames this decision as a means to “reduce unnecessary testing.” This is difficult to understand. Given the assessments required to measure "academic growth" in non-tested grades and subjects, it is far from certain that these new measures will reduce testing. Depending on the details (which, to reiterate, are largely unspecified), the testing burden could very well increase.

And, of course, once again, districts will lose some, and perhaps most, of the work they’ve done over the past two years designing and implementing alternative “learning measures,” and they will have new ones imposed on them from above.


In conclusion, New York districts spent a lot of time and effort and resources negotiating their systems, which were constrained by New York’s law. All of the hard work fell upon administrators and educators in individual districts, many of which lack the resources and personnel for implementing large, complex policy changes in short order, and all of which are already dealing with other issues, such as Common Core implementation and, of course, running their schools. Many of these districts did not want (or need) new evaluations. Tensions are already high.

Many people (myself included) discuss teacher evaluations in rather technocratic language. At their core, however, they are, like all accountability systems, primarily about encouraging productive behavioral changes. This means that credibility and buy-in are absolutely crucial. These systems will not work if administrators and teachers resent them and/or do not believe they are valuable.

Governor Cuomo's proposal for New York teacher evaluations is imperious, ill-considered and unsupportable. If the state goes ahead with it, the quality of the measures will suffer, as will the credibility of the systems. At the risk of sounding overly alarmist, I believe this proposal would threaten New York’s entire teacher evaluation policy. This is strictly my opinion, and it is speculation. I hope I am wrong. But I hope even more that we don’t have to find out.


* Quick background: According to New York’s evaluation law, every teacher was to receive a final score between 0 and 100: 60 points for classroom observation; 20 points for state “learning” measures; and 20 points for local “learning” measures (the weight assigned to state "learning measures" was to increase to 25 percent after the first year). The law imposed the scheme for sorting these final 0-100 scores into performance categories (highly effective, effective, developing, ineffective). The state also assigned a score of 0-20 (the state “learning” measure) to teachers in tested grades and subjects. For all other measures (observations, local “learning,” and state “learning” for teachers in non-tested grades and subjects), each district had to negotiate with its union what the measures consisted of, and how they were scored. This was a somewhat strange mix of prescription and flexibility, one that presented severe challenges for districts.

** The comparison of observations conducted by “third party” observers suggests that differences in reliability/validity compared to principals’ observations vary. This report, for example, finds that principals’ observations are more reliable than those conducted by observers from outside the school, but less valid (i.e., less strongly associated with value-added scores the following year). But the differences tend to be rather minor, and they most certainly do not support the 35/15 split recommended by the governor’s proposal.
