Value-Added Versus Observations, Part One: Reliability

Although most new teacher evaluations are still in various phases of pre-implementation, it’s safe to say that classroom observations and/or value-added (VA) scores will be the most heavily-weighted components toward teachers’ final scores, depending on whether teachers are in tested grades and subjects. One gets the general sense that many - perhaps most - teachers strongly prefer the former (observations, especially peer observations) over the latter (VA).

One of the most common arguments against VA is that the scores are error-prone and unstable over time - i.e., that they are unreliable. And it's true that the scores fluctuate between years (also see here), with much of this instability due to measurement error, rather than “real” performance changes. On a related note, different model specifications and different tests can yield very different results for the same teacher/class.

These findings are very important, and often too casually dismissed by VA supporters, but the issue of reliability is, to varying degrees, endemic to all performance measurement. Actually, many of the standard reliability-based criticisms of value-added could also be leveled against observations. Since we cannot observe “true” teacher performance, it’s tough to say which is “better” or “worse,” despite the certainty with which both “sides” often present their respective cases. And, the fact that both entail some level of measurement error doesn't by itself speak to whether they should be part of evaluations.*

Nevertheless, many states and districts have already made the choice to use both measures, and in these places, the existence of imprecision is less important than how to deal with it. Viewed from this perspective, VA and observations are in many respects more alike than different.

One useful way to look at the utility of performance measures in the context of high-stakes decision making is whether or not they can predict the future. If, for example, evaluation scores vary a great deal between years for reasons other than “real” changes in performance, you might give a bonus to retain a high-scoring teacher, or dismiss a low-scoring teacher, even though both teachers are likely to receive different scores the next year.

This is why reliability – again, the degree to which scores don't entail measurement error and stay the same when measured multiple times (e.g., between years) – is so important in teacher and other types of performance evaluations. Decisions made based on extremely noisy measures, even if those measures are valid, may not have the desired effect.**

There has been endless focus on VA’s reliability, and that attention is absolutely justified. But what about the reliability of the detailed observation protocols that states and districts are adopting? For instance, would a teacher get a different score if observed on a different day, or on the same day by a different observer?

The research suggests, put simply, that observation scores are also prone to variability. The recently released results of the Gates MET study found that, even with extensive training of observers and across multiple scoring rubrics, observation ratings varied widely between lessons and observers. This is consistent with other research in this area, such as this examination of observation scores in Chicago, which found discrepancies between the ratings given by principals versus other observers (also see here).

These findings make perfect sense – teachers’ observable performance might vary by lesson, and even well-trained observers using explicit criteria might score different performances differently. In addition, both teachers and students, like everyone else, have good and bad days.

So, the “error/instability argument” that is so often used to argue against value-added also applies to observations. To reiterate, this doesn't necessarily speak to whether they're valid measures - i.e., whether they are telling us what we want to know about teacher (and, ultimately, student) performance. Moreover, there are important differences between the sources and interpretations of error in test-based estimates from econometric models and those in scores from a human being observing performance in person.***

But the essential cause is similar – in both cases, we are trying to approximate “true” teacher performance based on small samples of limited information (test scores or observations of lessons), which is being “plugged in” to imperfect instruments (models and human-scored observation protocols). There's a lot of room for noise to creep into this process.

For states and districts that have already chosen their measures, the question at this point isn’t really whether they are imprecise (they are), but rather, what can be done about it. A comprehensive answer to this question is beyond the scope of a single post. In very general terms, it’s about more and better data, and using it wisely.

For instance, the MET project’s results demonstrate that observation scores are highly unstable unless the results of multiple observations are incorporated (the team recommends at least four). Combining the scores of different observations helps to “smooth out” the error that may arise from things like observer judgment or just having a bad day. In addition, quality checks must be in place: Observers must be well-trained, and there should be an ongoing effort to verify the results (e.g., measuring inter-rater reliability).
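To get a rough feel for why multiple observations help, here is a minimal sketch using the Spearman-Brown formula, a standard psychometric result for the reliability of an average of several measurements. It assumes the observations are "parallel" (same underlying performance, independent errors of equal size), which real classroom observations only approximate. The single-observation reliability of 0.35 is a made-up number for illustration, not a MET estimate, and the function name is hypothetical.

```python
def reliability_of_average(single_reliability: float, n_observations: int) -> float:
    """Spearman-Brown: reliability of the mean of n parallel measurements.

    Assumes every observation has the same reliability and that errors
    are independent across observations (an idealization).
    """
    if not 0.0 <= single_reliability <= 1.0:
        raise ValueError("single_reliability must be between 0 and 1")
    if n_observations < 1:
        raise ValueError("n_observations must be at least 1")
    r, k = single_reliability, n_observations
    return k * r / (1 + (k - 1) * r)


# Hypothetical reliability of a single observation (illustration only).
HYPOTHETICAL_SINGLE_RELIABILITY = 0.35

for k in (1, 2, 4, 8):
    combined = reliability_of_average(HYPOTHETICAL_SINGLE_RELIABILITY, k)
    print(f"{k} observation(s): reliability of the average = {combined:.2f}")
```

Under these assumptions, going from one observation to four nearly doubles reliability (from 0.35 to about 0.68), with diminishing returns after that. If the same observer's judgments are correlated across lessons, the real gain would be smaller.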

And, of course, the same goes for value-added scores, which should not be used in any high-stakes decisions without samples that are sufficiently large – at least two years of data, depending on class size, and preferably three or more. Furthermore, even when multiple years of data are available, any incorporation of these estimates into evaluations should make use of confidence intervals. These do not shrink the error itself, but they make the imprecision of the estimates explicit, both within and between years, so that the scores can be interpreted properly.
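As a rough illustration of what using confidence intervals means in practice, the sketch below builds a 95 percent interval around a single teacher's estimate and checks whether that teacher can be distinguished from the average (set to zero here). The estimate, the standard error and the function names are all hypothetical. The interval assumes normally distributed error, and the three-year case assumes the same point estimate and independent errors of equal size across years, which is a simplification.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(estimate: float, std_error: float, level: float = 0.95):
    """Normal-approximation confidence interval around an estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return estimate - z * std_error, estimate + z * std_error


def distinguishable_from(reference: float, interval) -> bool:
    """True if the reference value lies outside the interval."""
    low, high = interval
    return not (low <= reference <= high)


# Hypothetical numbers, in student-level standard deviations.
ESTIMATE = 0.10
ONE_YEAR_SE = 0.08
# Simplifying assumption: three years of independent, equally noisy data.
THREE_YEAR_SE = ONE_YEAR_SE / sqrt(3)

for label, se in (("one year", ONE_YEAR_SE), ("three years", THREE_YEAR_SE)):
    interval = confidence_interval(ESTIMATE, se)
    verdict = distinguishable_from(0.0, interval)
    print(f"{label}: interval = ({interval[0]:.3f}, {interval[1]:.3f}), "
          f"distinguishable from average: {verdict}")
```

With these made-up numbers, the one-year interval includes zero, so the teacher cannot be statistically separated from the average, while the three-year interval does not. The point estimate is the same in both cases; what changes is how much confidence it deserves.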

For both VA and observations, these precautions certainly do not eliminate error, but they can at least mitigate it to an extent.

Unfortunately, the (relatively few) states and districts that have actually finalized new evaluations seem to be coming up short on one or both counts. On the VA front, with few exceptions (Hillsborough County, FL being one), the new systems are taking at face value imprecisely-estimated growth model scores and making high-stakes decisions based on as little as one year of data.****

Steps such as sample size restrictions and using confidence intervals are very basic, and there is no justification for ignoring them.

At least on the surface, things seem a little bit better in the case of observations. States and districts appear to be grasping the need for multiple observations. But this is not helpful if the quality of the observations suffers as a result of the quantity. In Tennessee, for example, principals or assistant principals must evaluate their teachers four times a year (six times for probationary teachers). There are anecdotal reports that this burden is consuming administrators’ jobs, and that the rush to implementation (Tennessee's plan is called "First to the Top") gave teachers and leaders insufficient time to become familiar with the new protocol (but, again, this is all based on isolated accounts). Similarly, Chicago just announced a new system, which, somewhat incredibly, the district will have to fully implement (train observers, etc.) over the course of just a few months (starting next school year).

Now, to reiterate, the details of the new systems are still being hammered out in many places, and there is still hope that design and implementation will be better among states and districts that allowed at least a year or two for planning and phase-in. (Personal note: I am more optimistic about new systems taking steps to increase the reliability of observations than I am about their addressing the value-added issues.)*****

In any case, the main point here is that the mere existence of imprecision isn’t grounds for dismissing VA or any other measure. The intense focus on the reliability of value-added estimates has been useful and the issues are important, but, at the same time, massive changes are taking place. The success or failure of these efforts will be in no small part determined by the kinds of (admittedly unsexy) details discussed above. There will be plenty of disagreement about the choice of components for any new evaluation, but there should be none about making sure the measures are used in a manner that maximizes precision.

- Matt Di Carlo


* Whether one or both should be included depends as much on validity as reliability. A valid measure is, put simply, one that tells us what we want to know - in this case, about teacher performance. Put differently, whereas reliability looks at the precision of the answers, validity is about whether we’re using them to answer the right questions. Even the most reliable teacher performance measures aren’t helpful unless they are valid, and vice-versa. I will discuss the (somewhat thornier) issue of validity of value-added versus observations in a subsequent post.

** One often hears the argument that the new evaluations must necessarily be better than the current systems. I personally find the certainty with which this argument is sometimes made to be excessive. The idea that even poorly-designed new evaluations are sure to be better strikes me as an unrealistic and dangerous premise. But even if it’s true, this cannot justify making poor choices about design/implementation, especially given the cost and difficulty of putting in new systems (and changing them once they’re in place).

*** For example, insofar as evaluation should be a vehicle for improvement (and it should), it’s easier for a teacher to understand the reasons for a change in observation scores (by looking at scores on the sub-components) than a change in growth model estimates (the “black box” syndrome).

**** The choice of models (or, perhaps, of observation protocols) is a related issue here, though it is in many respects more of a validity issue. There are trade-offs in model selection, and, as usual, no “correct answer,” but many states have adopted models that are not appropriate for causal inferences, and thus for high-stakes decisions (see Bruce Baker’s discussions here and here).

***** In addition to dealing with reliability in each measure separately, states and districts must also pay attention to how these data are "combined" into a final score. There are as yet no empirically-grounded guidelines for the proper weights of VA and observations, but, as a rule, the more precise a given measure, the larger a role it should play. The smart move would be to allow districts to try different configurations and see how they work out. But this is not an option in most states with new systems, as the weights for VA/observations are being predetermined. In addition, there seems to be little attention to how VA/observation scores are actually incorporated - e.g., as categories, raw scores, etc. This matters - even if two measures have the same nominal weight, the distribution of results can cause one to have a higher effective weight. For instance, if VA and observations are each weighted at 50 percent, but most teachers receive the same observation score, the VA component will determine most of the variation in final outcomes.
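The nominal-versus-effective weight point can be made concrete with a small sketch. It uses invented scores for eight teachers on a 1-5 scale, and it defines a component's "effective weight" as its share of the variance in final scores (the covariance of the weighted component with the final score, divided by the variance of the final score). That definition is one reasonable choice among several, and the function names are hypothetical.

```python
from statistics import mean


def covariance(xs, ys):
    """Population covariance of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def effective_weights(components, nominal_weights):
    """Share of final-score variance attributable to each component.

    components: dict of name -> list of scores (one per teacher)
    nominal_weights: dict of name -> weight used in the final score
    """
    n = len(next(iter(components.values())))
    final = [
        sum(nominal_weights[name] * scores[i] for name, scores in components.items())
        for i in range(n)
    ]
    total_variance = covariance(final, final)
    if total_variance == 0:
        raise ValueError("final scores do not vary, so shares are undefined")
    return {
        name: covariance([nominal_weights[name] * s for s in scores], final)
        / total_variance
        for name, scores in components.items()
    }


# Invented data: VA scores are spread out, observation scores are nearly all 3s.
scores = {
    "va": [1, 2, 3, 4, 5, 2, 4, 3],
    "observation": [3, 3, 3, 3, 4, 3, 3, 3],
}
weights = {"va": 0.5, "observation": 0.5}

for name, share in effective_weights(scores, weights).items():
    print(f"{name}: nominal weight {weights[name]:.0%}, effective weight {share:.0%}")
```

In this invented example, both components carry a nominal weight of 50 percent, yet VA accounts for roughly 83 percent of the variation in final scores, simply because almost every teacher received the same observation rating.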


Hey Matt,

As usual, I love your insights into this. Quick question about the implementation and, specifically, time-to-implement side of this. Is there any research that quantifies the value of longer phase-in time for new initiatives? I see the side of "do it right -- take your time," and yet, as a former teacher, I also saw the urgent need for these systems ASAP. Is there any way to represent this tension mathematically, or at least find some sort of "sweet spot?"

Thanks, and keep up the great work.


I think you are missing the bigger picture. The purpose of an observation is to reinforce effective teaching behaviors and eliminate ineffective practices. The appearance of either type will vary from lesson to lesson. Observations were never designed to evaluate. We need to stop worrying about measurement and focus on improving instruction.



With respect to value-added models you state "On a related note, different model specifications and different tests can yield very different results for the same teacher/class."

You seem to acknowledge the importance of this issue in your note: "**** The choice of models (or, perhaps, of observation protocols) is a related issue here, though it is in many respects more of a validity issue. There are trade-offs in model selection, and, as usual, no “correct answer,” but many states have adopted models that are not appropriate for causal inferences, and thus for high-stakes decisions (see Bruce Baker’s discussions here and here)."

The fact that different model specifications and outcome tests result in substantially different attributions of effectiveness is certainly a validity issue, if not THE validity issue for value-added measurement. If different, defensible VAMs "can yield very different results for the same teacher/class," how can value-added measurement provide valid measures of teacher effectiveness? To my knowledge, no legislated evaluation system attempts to measure or average the results of different models and none use alternative standardized tests.

Until this validity question is addressed, the entire VAM enterprise seems fundamentally suspect to me.