Teacher Evaluations: Don't Begin Assembly Until You Have All The Parts

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

Over the past year or two, roughly 15-20 states have passed or are considering legislation calling for the overhaul of teacher evaluation. The central feature of most of these laws is a mandate to incorporate measures of student test score growth, in most cases specifying a minimum percentage of a teacher’s total score that must consist of these estimates.

There’s some variation across states, but the percentages are all quite high. For example, Florida and Colorado both require that at least 50 percent of an evaluation be based on growth measures, while New York mandates a minimum of 40 percent. These laws also vary in terms of other specifics, such as the degree to which the growth measure proportion must be based on state tests (rather than other assessments), how much flexibility districts have in designing their systems, and how teachers in untested grades and subjects are evaluated. But they all share the defining feature of mandating a minimum proportion – or “weight” – that must be attached to a test-based estimate of teacher effects (at least for those teachers in tested grades and subjects).

Unfortunately, this is typical of the misguided manner in which many lawmakers (and the advocates advising them) have approached the difficult task of overhauling teacher evaluation systems. For instance, I have discussed previously the failure of most systems to account for random error. The weighting issue is another important example, and it violates a basic rule of designing performance assessment systems: You should exercise extreme caution in pre-deciding the importance of any one component until you know what the other components will be. Put simply, you should have all the parts in front of you before you begin the assembly process.

States’ weighting decisions have been (rightly) criticized as arbitrary and politically motivated. It is certainly true that we will never know the “correct weight” for teacher effect estimates, so there will always be some degree of informed guessing in the weighting process. But there are empirical means by which the weighting issue might be addressed, and it’s important not to set any one component’s relative importance until you can also consider the properties and importance of all of the other parts.

There’s a rather large menu of these possible components for new teacher evaluation systems – the aforementioned growth model estimates, principal observations, peer observations, student surveys, alternative student learning measures (e.g., research papers, portfolios, etc.), measures of teacher “inputs” (e.g., sample lesson plans), professional development credits, school-level performance measures (e.g., graduation rates), school environment measures, etc. The manner in which these various components “come together” to form a teacher’s total score is in many respects as important as the properties of any one individual component.

For example, the degree to which any one component varies (or does not vary) can have a dramatic effect on the relative importance of the other parts. Let’s say we have a system in which teachers’ scores on classroom observations count for 50 percent, while value-added (or other growth model) estimates count for the other 50 percent. If it turns out that the vast majority of teachers receive the same rating on their observation, this essentially means that, for most teachers, their value-added estimates will count for 100 percent of their final evaluation scores.
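The arithmetic behind this point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ten teachers' scores are invented, both measures are assumed to sit on the same 1-4 scale, and "effective weight" is defined here (my choice, not a standard any state uses) as each weighted component's share of the variance in the final composite score.

```python
from statistics import mean


def covariance(xs, ys):
    """Sample covariance of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def effective_weights(scores, nominal):
    """Share of composite-score variance contributed by each component.

    scores:  dict mapping component name -> list of teacher scores
    nominal: dict mapping component name -> nominal weight (sums to 1)
    """
    n = len(next(iter(scores.values())))
    composite = [sum(nominal[k] * scores[k][i] for k in scores) for i in range(n)]
    total_var = covariance(composite, composite)
    return {
        k: covariance([nominal[k] * s for s in scores[k]], composite) / total_var
        for k in scores
    }


# Hypothetical data: nine of ten teachers get the same observation rating,
# while value-added scores are spread across the whole scale.
scores = {
    "observation": [3, 3, 3, 3, 3, 3, 3, 3, 3, 4],
    "value_added": [1, 2, 2, 3, 3, 4, 1, 2, 3, 4],
}
nominal = {"observation": 0.5, "value_added": 0.5}

shares = effective_weights(scores, nominal)
for name, share in shares.items():
    print(f"{name}: nominal {nominal[name]:.0%}, effective {share:.0%}")
```

With these made-up numbers, the nominally 50/50 system lets value-added drive most of the differences between teachers' final scores; if every teacher received an identical observation rating, its share would rise to the full 100 percent.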

Now, we should certainly hope that the new set of measures – observation protocols and others – being designed in states and districts around the nation will do a better job of differentiating teachers than that, but it’s really all a matter of degree. Measures with lower variation will necessarily affect the relative importance of the other components, as will those with high variation. But, with the exception of the student growth measures (which often impose a distribution by judging teachers relative to each other), we don’t yet know how much each individual component will vary in any given year. As a result, the high minimum weights for teacher growth measures might end up precluding states and districts from calibrating their systems’ components in a manner that reflects their distributional properties. (Side note: These issues clearly speak to the advisability of “phase-in” time for new evaluations, which would permit such testing and calibration.)

Then there’s always the issue of simple accuracy. As a rule, if any one component of an evaluation is highly imprecise, it should be less heavily weighted, and vice-versa. Needless to say, gauging the accuracy of most measures (including growth model estimates) will require time, assessment of inter-measure correlations, and feedback from teachers and administrators. But if, for instance, a district’s newly designed principal observation measure ends up being fantastic by all measures, we should want to weight it more heavily. The problem, in states that have already specified very high minimum weights (40-50 percent) for value-added or other growth model estimates, is that our ability to assign greater importance to any other measure, or combination of measures, that ends up doing a good job is severely limited. Having that type of flexibility can be crucial in achieving an evaluation system that is widely regarded as fair and accurate.
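To make the constraint concrete, here is a rough Python sketch. It uses inverse-variance weighting, a common statistical heuristic that I am substituting for illustration (no state or district is known to weight its components this way), and the error variances are entirely hypothetical. The function name and the `floor` parameter are likewise my own.

```python
def precision_weights(error_var, floor=None):
    """Weight each component in proportion to 1 / (its error variance).

    error_var: dict mapping component name -> assumed error variance
    floor:     optional (name, minimum_weight) pair, standing in for a
               legislated minimum weight on one component
    """
    precision = {k: 1.0 / v for k, v in error_var.items()}
    total = sum(precision.values())
    weights = {k: p / total for k, p in precision.items()}

    if floor is not None:
        name, minimum = floor
        if weights[name] < minimum:
            # Force the floored component up to its minimum, then split
            # what is left among the others in proportion to precision.
            rest = total - precision[name]
            weights = {
                k: (1.0 - minimum) * p / rest
                for k, p in precision.items()
                if k != name
            }
            weights[name] = minimum
    return weights


# Hypothetical error variances: the observation measure turns out to be
# the most precise, the growth estimate the least.
error_var = {
    "principal_observation": 0.5,
    "student_survey": 1.0,
    "growth_estimate": 2.0,
}

unconstrained = precision_weights(error_var)
with_floor = precision_weights(error_var, floor=("growth_estimate", 0.5))

for name in error_var:
    print(f"{name}: {unconstrained[name]:.0%} free, {with_floor[name]:.0%} with 50% floor")
```

Under these invented numbers, the 50 percent floor cuts the weight available to the most precise measure substantially, which is the loss of flexibility described above.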

Now, in fairness, pre-setting the weights of one component is not necessarily a terrible practice – a lot depends on how and why you do it. For one thing, the risk of setting minimum weights might depend on how high they are. For example, if states specified that at least 10 percent of an evaluation had to consist of growth model scores, this would be far less constraining – some districts would go higher, while others would stick with the 10 percent, and this variation in configurations could be used to inform all districts. But no states – at least of which I’m aware – have done this. Instead, those that have predetermined growth model weights have all set them around 40-50 percent (which will inevitably end up being the weight that most districts use, since they can’t go lower, and are unlikely to exceed such a high minimum).

There’s also the issue of why the minimums are chosen. There’s certainly a case for setting them, even at high levels, if we are relatively certain that the indicator is accurate and important enough to merit that relative power. For instance, the owner of a business might well decide to base a high proportion of his or her employees’ compensation on sales volume, even before working out how the remainder will be determined. That’s because sales volume is not only usually easy to measure (high reliability), but it is also perceived to be an accurate reflection of what the owner wants his or her employees to produce – i.e., profits (high validity). Even then, though, it is critical that the incentives be properly structured.

But such confidence is far more elusive in the context of teacher evaluation, at least at this early stage. It’s certainly true that, if a state wants to ensure that a given measure is included in all teacher evaluations, the easiest way is to impose a minimum weight. And there are many people who believe that growth model estimates are ready to be a mandatory part of these new systems. But even if, for the sake of argument, one concedes that point (and many experts do not), it is simply impossible to make a compelling case that these measures merit the 40-50 percent minimum weighting that they are being given at this point, especially when the risks of predetermining one component’s importance – without knowing what the others will be, and how they will perform – are so high.

The truth is that, despite completely absurd declarations to the contrary, what we don’t know about these new evaluation measures outweighs what we know by an order of magnitude. That goes not only for growth model estimates – whether and how they will improve teacher performance – but also for most of the other components being considered, and how they are all best combined. Most of this is very new. How it will play out, in the short- and long-term, is still very much an open question.

Lawmakers and advocates who argue for the incorporation of heavily weighted growth model estimates in evaluation systems are taking a massive leap of faith, one fraught with risk. They are trying to answer empirical questions without empirical evidence, and violating basic principles of system design by writing assembly instructions without knowing all the parts. It’s called an evaluation "system" for a reason.

- Matt Di Carlo