A Look At The Changes To D.C.'s Teacher Evaluation System

D.C. Public Schools (DCPS) recently announced a few significant changes to its teacher evaluation system (called IMPACT), including the alteration of its test-based components, the creation of a new performance category (“developing”), and a few tweaks to the observational component (discussed below). These changes will be effective starting this year.

As with any new evaluation system, a period of adjustment and revision should be expected and encouraged (though it might be preferable if the first round of changes occurs during a phase-in period, prior to stakes becoming attached). Yet, despite all the attention given to the IMPACT system over the past few years, these new changes have not been discussed much beyond a few quick news articles.

I think that’s unfortunate: DCPS is an early adopter of the “new breed” of teacher evaluation policies being rolled out across the nation, and any adjustments to IMPACT’s design – presumably based on results and feedback – could provide valuable lessons for states and districts in earlier phases of the process.

Accordingly, I thought I would take a quick look at three of these changes.

Less importance for measures based on state assessments, more on local assessments. Although a few news stories reported that test scores will now be less heavily weighted in IMPACT, this is not quite true. Under the previous/current system, the district’s value-added measure (which is based on DC’s “state” assessment) constituted 50 percent of a final evaluation (at least for teachers in tested grades/subjects). It is going to be downweighted to 35 percent.

For the other 15 percent, “each teacher will work with his or her principal to collaboratively select an assessment and set learning goals against which the teacher will be evaluated." In other words, the results of student assessments will still represent 50 percent of tested teachers’ scores, but these measures are now “diversified."

Such diversification is quite common around the nation, with several states opting to use a combination of state and local measures (some districts, such as Austin, TX, have been using "student learning objectives" and similar systems for a while). It is, needless to say, very difficult to know how this will turn out in terms of results, and it will vary within and between districts and states. States and districts will have to pay close attention to "quality control" and comparability (it's also quite a logistical effort).

This also means that, in DC and elsewhere, research on the properties and use of these measures will be extremely important, and I hope states and districts have set up ways to facilitate this work.

One thing is clear: Many teachers seem to strongly prefer designing their own “student learning” measures over value-added from state tests – and credibility among teachers is an important consideration in gauging the utility of any evaluation system.

Dropping the lowest observation score. Under IMPACT, DCPS teachers are observed five times a year, three times by administrators and twice by "master educators." Starting with the implementation of these new changes, if any one of those scores is at least one point lower than the average of the other four, the low score will be dropped. The idea here is that sometimes teachers (or observers) have a bad day or, for whatever reason, just don’t get as high a score as they would normally. When that happens, they shouldn’t be penalized for it.
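The rule can be sketched in a few lines of Python. This is one reading of the policy as described above, not DCPS's implementation: the function name is hypothetical, and I am assuming that at most one score is dropped even when two equally low scores would both qualify.

```python
from statistics import mean


def adjusted_observation_average(scores, gap=1.0):
    """Average a teacher's observation scores, dropping at most one low outlier.

    The lowest score is dropped only if it sits at least `gap` points below
    the mean of the remaining scores (one point, per the post). Checking the
    lowest score is enough: if any score meets the condition, the lowest does.
    Assumption: only one score is ever dropped.
    """
    scores = list(scores)
    if len(scores) < 2:
        return mean(scores)

    lowest_idx = min(range(len(scores)), key=scores.__getitem__)
    others = scores[:lowest_idx] + scores[lowest_idx + 1:]

    if mean(others) - scores[lowest_idx] >= gap:
        return mean(others)   # outlier dropped
    return mean(scores)       # no score qualifies; keep all of them


if __name__ == "__main__":
    # Hypothetical scores for five observations.
    print(adjusted_observation_average([3.0, 3.0, 3.0, 3.0, 1.5]))  # low score dropped
    print(adjusted_observation_average([3.0, 3.0, 3.0, 3.0, 2.5]))  # nothing dropped
```

Note that the rule is one-sided: an unusually high score is never removed, so the adjustment can only raise a teacher's observation average.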

This seems like a sensible change, representing a kind of rough error correction. In D.C. and elsewhere, one of the advantages of taking the time and expense to perform so many annual observations is that you can do things like this.*

However, just to prove that no good deed goes unpunished, I would add that I find it curious that DCPS puts forth so much effort to improve the reliability of teacher observations (five times a year is a lot, and it’s expensive and difficult to do it), yet they take value-added scores at face value, with no attention to error margins (see here for more discussion of this issue).**

New performance rating scheme. Perhaps the most significant change of all is the one to IMPACT’s scoring rubric. The simple graphic below summarizes the old and new scoring systems.

Under both the old and new systems, "highly effective" teachers are eligible for bonuses, while teachers receiving an “ineffective” rating in any given year are subject to dismissal, as are those receiving two consecutive “minimally effective” ratings. But now, teachers who receive between 175 and 199 points will be rated “ineffective” rather than “minimally effective." In addition, there is a new category of “developing," which is basically created by partitioning the previous “effective” category. Teachers who receive three consecutive “developing” ratings will be subject to dismissal.
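For readers who want to see the two rubrics side by side, here is a hedged Python sketch. The move of the 175-199 range into "ineffective", the splitting of "effective", and the dismissal rules come from the description above. The 100-400 scale, the 300-point cut between "developing" and "effective", and the 350 threshold for "highly effective" are assumptions for illustration, as are all names in the code.

```python
import bisect

# (lower bound, label) pairs, sorted by lower bound.
# ASSUMED: the 100-400 scale, the 300 cut point, and the 350 threshold.
OLD_BANDS = [
    (100, "ineffective"),
    (175, "minimally effective"),
    (250, "effective"),
    (350, "highly effective"),
]
NEW_BANDS = [
    (100, "ineffective"),          # now also absorbs 175-199
    (200, "minimally effective"),
    (250, "developing"),           # lower part of the old "effective" band
    (300, "effective"),
    (350, "highly effective"),
]


def rate(score, bands):
    """Map a final score to the label of the band it falls in."""
    lowers = [lower for lower, _ in bands]
    idx = bisect.bisect_right(lowers, score) - 1
    if idx < 0:
        raise ValueError(f"score {score} is below the lowest band")
    return bands[idx][1]


def subject_to_dismissal(ratings):
    """Apply the dismissal rules from the post to yearly ratings, oldest first.

    How mixed sequences (e.g. "developing" followed by "minimally effective")
    are treated is not covered here because the post does not say.
    """
    if not ratings:
        return False
    if ratings[-1] == "ineffective":
        return True
    if ratings[-2:] == ["minimally effective"] * 2:
        return True
    return ratings[-3:] == ["developing"] * 3


if __name__ == "__main__":
    for score in (180, 260, 320):
        print(score, rate(score, OLD_BANDS), "->", rate(score, NEW_BANDS))
```

Under these assumed cut points, a hypothetical score of 180 moves from "minimally effective" to "ineffective", and a 260 moves from "effective" to "developing", while a 320 keeps its "effective" label.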

The new scoring system is almost certain to result in more terminations over the medium- to long-term. However, the actual extent to which this will occur is still an open question, since it's unclear how the other changes – e.g., dropping outlier observations and using 15 percent local assessments – will affect final evaluation scores. (In other words, DCPS is changing the scoring rubric at the same time they're changing the components of those scores. This is something that must be monitored going forward.)

That said, the press release asserts that the purpose of the scoring change is to create "higher standards," and that the decisions were made “after carefully analyzing three years of performance data and talking with an extensive group of stakeholders.” To DCPS's credit, their fact sheet elaborates on the justification with a bit of empirical evidence, and, if you're interested, I discuss it in the third footnote, below.***

But my main point here is actually applicable around the nation, and it's that I am concerned about the degree to which the design of teacher evaluations is being driven by a desire for the results to match up with preconceptions of how they should look.

In this instance, DCPS is saying that rating two-thirds (68 percent) of teachers as “effective” does not represent sufficiently "high standards," that the “true performance” of these teachers varies widely, and that further differentiation is necessary to separate the higher from the lower performers. Thus, dividing the category in two will address the problem.

Setting high standards is certainly fine, and I have no doubt that the performance of the large group of “effective” DC teachers varies a great deal. However, while I fully acknowledge that human judgment inevitably plays a starring role in these determinations, the big question here is not whether performance varies within a category or categories but rather whether an evaluation system, especially one that's relatively new, can make these distinctions with a reasonable degree of accuracy.

And, around the nation, evaluation systems are being judged and designed/altered based largely on the idea that “real” teacher effectiveness varies considerably, and therefore teacher evaluation ratings must necessarily vary accordingly, with a good spread of teachers across several categories.****

To a degree, this makes plenty of sense, as differentiation is essential to any performance evaluation (it's kind of the whole point). If, for example, the vast majority of teachers are getting the same rating, then it’s safe to conclude that the system is failing to draw meaningful distinctions between them. And adjusting scoring rubrics or, preferably, the components and weights that constitute those scores, if done thoughtfully, is how you get this done.

There is, however, a limit to "differentiation for differentiation's sake." At this extremely early phase of designing an entirely new evaluation system, the simple fact that the percentages of teachers assigned to each rating seem realistic or appropriate isn’t necessarily positive if you’ve set up or altered the system so as to achieve those results, no matter what more credulous observers might say.

If anything, the available evidence suggests that performance measures are not that good at distinguishing between teachers in the middle of the distribution, and this goes for both value-added estimates and principal observations (with the latter typically assessed vis-a-vis value-added). This is obviously not to say that it's impossible or shouldn't be tried, but it must be done very carefully, and, at the very least, a concentration of teachers in a single mid-distribution category is not automatically a red flag.

It may very well be the case that the new evaluation systems are simply not (or perhaps not yet) equipped to achieve the level of differentiation that we might otherwise want. Over many years of gathering and analyzing results, the situation will likely improve. In the meantime, there is no shame in admitting that measuring teacher performance is a long-term process, and it calls for a little patience, humility and judiciousness.

- Matt Di Carlo

*****

* On the flip side of this, I have (what may seem like trivial) concerns about DCPS’ decision, announced last year, to let teachers who receive “highly effective” ratings for two consecutive years forgo three of their required five observations the next year. This is certainly a welcome change for these teachers, and that matters, but it also carries research implications – there will be fewer data with which to analyze the properties of relationships between IMPACT’s components, and this will be the case for teachers receiving one of the high-stakes ratings (“highly effective” teachers are eligible for large bonuses, which means that the accuracy of this rating is very important).

** To DCPS’s credit, however, they do use a strong model that employs a statistical technique (called “shrinkage”) that, put simply, adjusts estimates for sample size. It's therefore not fair to say they do nothing to mitigate imprecision.

*** A few points about the evidence DCPS presents for splitting up the "effective" category (and it's likely they have more, as this is a one-page fact sheet), which is that roughly two-thirds of teachers fall into this category, but "teachers scoring at the low end ... produced 8 fewer months of learning in math and 6 fewer months of learning in reading than did teachers at the top end of the category (350)" (note that this only applies to the minority of DC teachers - those in tested grades/subjects). There's a little bit of circularity here – insofar as 50 percent of IMPACT scores are based on value-added, it’s hardly surprising that there is a difference in value-added between teachers with final IMPACT scores 100 points apart (it may be disproportionately large relative to the full distribution, but that's not entirely clear). It also focuses on the extremes of the soon-to-be-partitioned "effective" category (250 versus 350), which means that you might roughly recast this datum by saying that teachers near the low end of the "highly effective" range produce 6-8 months more "learning" than those at the high end of "minimally effective" (and that would make sense). A more useful comparison would be between teachers in the middle of each of the two new categories (“effective” and “developing”), not at the outskirts of the old one. Most importantly, my guess is that a large chunk of the teachers in the "effective" category are above average (just some more so than others), while many of those with point estimates below the average are too close to be considered meaningfully different from it (and the models impose this distribution). Given the instability of the value-added scores, as well as the importance of prediction in any performance evaluation, a better illustration would have been to compare 2011-12 value-added scores among teachers rated "effective" in the prior year; the discrepancy would have almost certainly been smaller.

**** As another example, see the recommendations from a different “early adopter” state – Tennessee.

Interesting. Question -- statistically speaking, is there any meaningful difference between dropping the low score, versus dropping both the low and high score?

MG,

If you're asking whether teachers' overall observation scores would be different under two scenarios - dropping the low score versus dropping the low and high scores - the answer is almost certainly yes (though the extent might not be as large as one would anticipate, depending on the distribution [and on whether you retain the DCPS policy that the outlier must be one point different from the teacher-level average]).

If, on the other hand, you're asking whether dropping the low and high score would have a large impact on results versus a scenario in which neither was dropped (i.e., whether they would cancel out), then my answer is that I'm not sure, without having the data in front of me. Again, it depends on the distribution.

Finally, if you're asking a different question altogether that I'm missing, let me know.

Thanks for the comment,
MD

thanks MD.