## The Great Teacher Evaluation Evaluation: New York Edition

A couple of weeks ago, the New York State Education Department (NYSED) released data [1] from the first year of the state's new teacher and principal evaluation system (called the “Annual Professional Performance Review,” or APPR). In what has become a familiar pattern [2], this prompted a wave of criticism from advocates, much of it focused on the proportion of teachers in the state who received the lowest ratings.

To be clear, evaluation systems that produce non-credible results should be examined and improved, and that includes those that put implausible proportions of teachers in the highest and lowest categories. Much of the commentary surrounding this and other issues has been thoughtful and measured. As usual [3], though, there have been some oversimplified reactions, as exemplified by this piece [4] on the APPR results from Students First NY (SFNY).

SFNY notes what it considers to be the low proportion of teachers rated “ineffective,” and points out that there was more differentiation across rating categories for the state growth measure (worth 20 percent of teachers’ final scores), compared with the local “student learning” measure (20 percent) and the classroom observation components (60 percent). Based on this, they conclude that New York’s “*state test is the only reliable measure of teacher performance*” (they are actually talking about validity, not reliability, but we’ll let that go). Again, this argument is not representative of the commentary surrounding the APPR results, but let’s use it as a springboard for making a few points, most of which are not particularly original. (**UPDATE**: After publication of this post, SFNY changed the headline of their piece from “the only reliable measure of teacher performance” to “the most reliable measure of teacher performance.”)

First, and most basically, if estimates derived from the state tests are the “only reliable measure of teacher performance,” we’re in big trouble, since it means that most teachers simply cannot be evaluated (as they don’t teach in tested grades/subjects). In that case, it would seem that the only responsible recommendation would be to scrap the entire endeavor. I doubt that’s what SFNY is trying to argue here, and so it follows that they might want to be more careful with their rhetoric.

If, on the other hand, they had advocated for increasing the weight, or importance, assigned to the state growth model results, this at least is a defensible suggestion (approached crudely though it sometimes is). But weighting, even putting aside all the substantive issues (e.g., value judgments, reliability), is a lot more complicated in practice than in theory. In reality, the "true weight" of any given measure depends on how much it varies compared with the other measures [5]. (It should also be pointed out that the nominal weight assigned to state growth estimates is already scheduled to increase to 25 percent.)
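To illustrate the nominal-versus-effective weight point, here is a minimal sketch in Python using made-up scores (none of these numbers come from the actual APPR data): a component worth 60 percent of the points can still contribute less spread to the final score than a 20 percent component, if its scores are tightly clustered.

```python
import statistics

# Hypothetical component scores for five teachers (illustrative only).
# Growth scores vary widely; observation scores are tightly clustered.
growth = [5, 10, 15, 18, 20]    # state growth component (0-20 points)
local = [14, 15, 15, 16, 16]    # local learning component (0-20 points)
obs = [52, 53, 54, 55, 56]      # observation component (0-60 points)

composite = [g + l + o for g, l, o in zip(growth, local, obs)]

# Despite its 60-point nominal weight, the clustered observation
# component adds less spread to the composite than the growth component.
print(statistics.stdev(growth))     # widest spread of the three
print(statistics.stdev(obs))        # narrow spread despite the big weight
print(statistics.stdev(composite))
```

In this toy example, the growth component's standard deviation is roughly four times that of the observation component, so it drives most of the variation in final scores even though it carries only a third of the observation component's nominal weight.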

Second, regardless of the particular aspect of the APPR results one is discussing, it is absolutely crucial to bear in mind that the results released *did not include New York City*. In other words, a non-random block of teachers -- roughly 35 percent of the state’s teacher workforce -- was not included, which means that any blanket statements about the statewide results are, to put it kindly, premature.

Third, some of the coverage [6] of these results paid a great deal of attention to the low proportion of teachers (0.47 percent) rated “ineffective” on the classroom observations component (SFNY, in a terribly misleading fashion, claimed that “zero percent” of teachers were “ineffective” on the observations component). The distributions of APPR categorical ratings, by component and overall, are presented in the table below.*

For one thing, these categorical ratings, at least for the observation component, by themselves mean nothing for teachers’ final ratings. Observation scores are incorporated directly into teachers’ final scores as a total number of points (0-60).

Now, to be fair, the raw point totals for classroom observations varied quite a bit less than those for the learning measures (this has been the case in other states as well). The primary reason why the state test-based growth model results tend to exhibit more variation is most likely that *they are designed to do so*. And, by the way, one could easily force classroom observation scores to produce greater spread across the 60 points and/or four categories if one wanted to do so.
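As a rough sketch of how one could force that spread (with hypothetical scores, not APPR data), a percentile-based rescaling stretches tightly clustered observation scores across the full 0-60 point range:

```python
import statistics

# Hypothetical clustered observation scores (illustrative only).
obs = [52, 53, 53, 54, 55, 56]

def percentile_rescale(scores, max_points=60):
    """Map each score to its rank among all scores, scaled to 0-max_points.

    Ties share the rank of their first occurrence; a real system would
    need to handle ties and category cutoffs more carefully.
    """
    ranked = sorted(scores)
    return [max_points * ranked.index(s) / (len(scores) - 1) for s in scores]

rescaled = percentile_rescale(obs)
print(statistics.stdev(obs))       # small spread in the raw scores
print(statistics.stdev(rescaled))  # much larger spread after rescaling
```

The point is simply that spread across points or categories is largely a design choice: the same underlying judgments can be made to look compressed or highly differentiated depending on how scores are scaled.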

This doesn’t mean that there’s anything wrong with the imposed distribution of state growth model results (there is a case to be made that it is a strength), but it’s important to remember why it occurs, particularly when using the results to make sweeping statements about which measures do and do not belong in the system. And to argue that classroom observations -- or any measure, for that matter -- simply have no business in teacher evaluations based on the spread across categories is absurd on its face, at least for those of us who believe that the quality of measures matters, and that the evaluation process should have some formative value. It's amazing how quickly and forcefully some people are willing to judge measures, or even entire systems, based on nothing more than relative frequencies.

The fourth point I’d like to make about the APPR results is that the results vary widely by district. Districts were able to design their own systems for the local “learning” and observation components of the evaluations, and for the state learning component among non-tested teachers. As a result, APPR results were really all over the map. To illustrate this variation, take a look at the distribution of results for districts of different “levels of need” (these categories are provided by the state).

In the larger, more urban districts of Buffalo, Rochester, Yonkers and Syracuse, the spread of ratings was very different from the statewide distribution, with over one-quarter of all teachers receiving one of the two lowest ratings. There was also more variation in final ratings, compared with the state as a whole, in the districts classified as “urban/suburban high needs.”

On the one hand, as discussed here [7], the fact that teachers in higher needs districts received lower APPR ratings, on average, than teachers in more affluent districts may indicate bias in the measures. It may also be a result of “real” differences.**

In either case, the statewide results mask the variation, both overall and by school “type.” Instead of passing judgment on the system as a whole based on aggregate results, a better approach would be to see whether and why districts with different systems/scoring produced different results.

(Side note: If the results for other "higher needs" districts are any indication, the inclusion of New York City in these figures would have made a substantial difference in the distribution of final teacher ratings.)

The fifth and final point I’d like to make is a discussion of what these results *actually* mean, and what we can learn from them. Well, when it comes to the “technical” properties of these evaluations, there are a couple of worthwhile lessons here. For instance, I think it’s fair to say that the evaluation system overall produced a somewhat implausible distribution in some places, and adjustments should be considered (carefully) on a district-by-district basis. It would seem, however, that the systems producing the least variation across categories are those in more affluent districts, which tend not to be the focus of reform efforts.

Another thing that bears mentioning about the results is that NY districts were allowed to submit their own measures for approval (e.g., the local learning measures, and the state learning measures for non-tested teachers), which in some cases varied by school/classroom. It seems that they did so in a manner that produced results that, while varying a great deal between districts, were not grossly maldistributed. This is actually substantively important, not because of the “ineffective” ratings fetish, but rather because there are (valid) concerns that districts/schools might try to set up their systems such that teachers scored very well. This does not seem to have been the case in most places.

But the biggest thing to keep in mind about these results is that most of the important lessons cannot be gleaned *from the results alone*. Perhaps the most important consideration is how teachers and other stakeholders (e.g., principals) respond to the system. For example, do teachers change their classroom practice based on the scores or feedback from observations? Do the ratings and feedback influence teachers’ decisions to stay in the profession (or in their school/district)? How do these outcomes vary between districts using different measures or scoring? These are questions that really matter, but they are not answerable in the short term, and they certainly cannot be addressed by looking at highly aggregated distributions across rating categories and imposing one’s pre-existing beliefs about how they should turn out.

- Matt Di Carlo

*****

* Note that roughly 25,000 teachers represented in the first row ("state growth model") are also represented in the second row, and they are combined with the approximately 100,000 non-NYC teachers who did not receive growth model estimates from the state test, but whose state learning component is measured differently (with "Student Learning Objectives," or SLOs).

** There is, for example, clear evidence [8] that the state growth model is systematically associated with subsidized lunch eligibility. However, given that only about one in five teachers receive these estimates, as well as the fact that they are only one component, that by itself is not driving the overall differences.