In a previous post, I compared value-added (VA) and classroom observations in terms of reliability – the degree to which they are free of error and stable over repeated measurements. But even the most reliable measures aren’t useful unless they are valid – that is, unless they’re measuring what we want them to measure.
Arguments over the validity of teacher performance measures, especially value-added, dominate our discourse on evaluations. There are, in my view, three interrelated issues to keep in mind when discussing the validity of VA and observations. The first is definitional – in a research context, validity is less about a measure itself than the inferences one draws from it. The second point might follow from the first: The validity of VA and observations should be assessed in the context of how they’re being used.
Third and finally, given the difficulties in determining whether either measure is valid in and of itself, as well as the fact that so many states and districts are already moving ahead with new systems, the best approach at this point may be to judge validity in terms of whether the evaluations are improving outcomes. And, unfortunately, there is little indication that this is happening in most places.
Let’s start by quickly defining what is usually meant by validity. Put simply, whereas reliability is about the precision of the answers, validity addresses whether we’re using them to answer the correct questions. For example, a person’s weight is a reliable measure, but this doesn’t necessarily mean it’s valid for gauging the risk of heart disease. Similarly, in the context of VA and observations, the question is: Are these indicators, even if they can be precisely estimated (i.e., they are reliable), measuring teacher performance in a manner that is meaningful for student learning?
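To make the distinction concrete, here is a minimal simulation sketch in Python. Everything in it is a made-up assumption for illustration (the variable names, the noise levels, the sample size); none of it is drawn from actual teacher data. Measure A tracks a stable trait that has nothing to do with the thing we care about, so it is highly reliable but not valid. Measure B tracks the right thing but with a lot of noise, so it is less reliable yet more valid.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(n=5000, seed=0):
    """Hypothetical illustration: reliability vs. validity of two measures."""
    rng = random.Random(seed)
    # What we actually want to measure (assumed, unobservable in practice).
    target = [rng.gauss(0, 1) for _ in range(n)]
    # A stable trait that is unrelated to the target.
    unrelated = [rng.gauss(0, 1) for _ in range(n)]

    # Measure A: two repeated readings of the unrelated trait, little noise.
    a1 = [u + rng.gauss(0, 0.2) for u in unrelated]
    a2 = [u + rng.gauss(0, 0.2) for u in unrelated]
    # Measure B: two repeated readings of the target, a lot of noise.
    b1 = [t + rng.gauss(0, 1.0) for t in target]
    b2 = [t + rng.gauss(0, 1.0) for t in target]

    return {
        # Reliability: do repeated measurements agree with each other?
        "A_reliability": pearson(a1, a2),
        "B_reliability": pearson(b1, b2),
        # Validity: does the measure track what we want it to measure?
        "A_validity": pearson(a1, target),
        "B_validity": pearson(b1, target),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.2f}")
```

Under these assumed settings, measure A's repeated readings agree almost perfectly while telling us essentially nothing about the target, whereas measure B's readings agree only moderately but do track the target. Real measures, of course, do not come with a known "target" column to check against, which is precisely why validity is so hard to establish.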
Needless to say, this is an exceedingly difficult – perhaps impossible – question to answer. Due in no small part to the availability of evidence, virtually all of the discussion about VA and observations proceeds based on a somewhat abstract, academic (though still very important) standard of validity.
In the case of VA, the standard might be whether the models provide relatively unbiased estimates of teachers’ causal effects on student test scores. And it is well established that there are problems here (though, all else being equal, random error [reliability] is arguably a bigger concern).
Namely, different value-added models can yield different results, as can different tests under the same model. Similarly, systematic bias in the estimates may arise from unmeasured factors such as curricular variation, peer effects, the consistency of tested content, or the way students are assigned to classrooms. There is little dispute that these (and other) issues exist, though there is disagreement as to their severity.
But, as with reliability, the same problems can also apply to observations. You might say observations are valid if the protocols gauge the degree to which a teacher exemplifies the practices that promote student learning. It’s quite a challenge to know whether that is indeed the case. Just as different VA models produce different estimates, different observation protocols yield different results. In addition, the assignment of students to certain teachers might very well influence the results of an observation.
Since all these indicators are ostensibly trying to measure a similar phenomenon (i.e., teacher performance), one common means of getting an idea as to the validity of a given measure is to see if it corresponds with other measures. And the available evidence suggests that there is a moderate relationship between VA estimates and observation scores, and that both predict future student performance.*
Overall, to whatever extent we can draw conclusions about the validity of VA and observations by this “research-oriented” standard, it’s fair to say that both have their strengths and weaknesses, and the issue of validity is not exclusive to either.
But this doesn’t necessarily tell us very much about whether we should use one or both measures in actual evaluations, as it ignores the second key point I’d like to raise: In policy discussions, the assessment of a measure’s validity must also consider the purposes for which it is used. In other words, a given measure might be appropriate and/or useful for some types of decisions and not others (this is related to the concept of “consequential validity”).
For example, using a performance indicator like value-added to target professional development might require much lower (or different) validity standards than using the estimates to make high-stakes decisions about compensation or employment, since, among other reasons, the costs of making mistakes are considerably lower in the former situation.
Also, even if VA models or observations were perfect, it’s still possible that they would be less than effective in improving teacher practice or quality, and that actual policy use might partially threaten whatever validity they have. For instance, it is possible that high-stakes use of VA will compel so-called “teaching to the test,” which dilutes the degree to which the scores reflect “true” student learning (observations are not immune from this type of bias either). On a similar note, teacher buy-in is important: An unpopular system might increase turnover/mobility, and/or make it less likely that teachers use the results to improve their practice.
From this perspective, the implications of the evidence discussed above, though important, are somewhat limited.**
This brings us to the third and final point: The debate over VA and observations in teacher evaluations is about policy, and, in this context, it may be that the best question to ask is not whether VA and observations are valid by some absolute Platonic standard (which we cannot determine), but rather whether the manner in which they are used improves outcomes. Put simply, it’s much easier to assess the effect of policies than the absolute validity of measures they use.
Unfortunately but predictably, given that we are at an early stage, evidence regarding the effects of the kinds of new evaluation systems currently being designed and implemented is still scarce, and is mostly limited to low-stakes applications. Nevertheless, these types of studies are very important for discussions of the validity of teacher evaluations or their constituent components.
For instance, one paper reaches the encouraging conclusion that high-quality observations can improve teacher performance (as measured by value-added) among mid-career teachers. An evaluation of low-stakes use of value-added in Pennsylvania (i.e., teachers and administrators used the data to inform instruction) found no effects on student achievement, though the program was limited in its scope, and there wasn’t sufficient time to train users. In contrast, a randomized experiment in New York City, where test-based accountability is better established, found that giving principals access to value-added did have a discernible impact – it influenced their “subjective” opinions of teachers’ performance.
We know almost nothing about the validity or effects of measures in teacher evaluations for high-stakes use. Given the importance of this issue, one would expect that states implementing new systems would be making sure to have rigorous, independent program evaluations in place as an essential part of their overall plan.
They should be gathering data and closely monitoring the process and its effects – both short- and long-term, on a variety of different outcomes – at every step. And, insofar as these projects might require at least several years of data (and interviews, etc.) to reach conclusions, this effort should begin as soon as the new systems go online.
I see little indication that this is happening with teacher evaluations. Personally, my biggest concern about all this is that we’re making drastic changes but are failing to see whether and why they work (or don’t work). Instead, the design and results of new evaluation systems are being judged based on unsupported preconceptions as to what they should look like, not whether they’re accurate or improving outcomes.
Despite all the rhetoric about the validity (or lack thereof) of value-added and observations, there is little if any support for certainty. We can either argue about whether the new systems are working, or we can check.
- Matt Di Carlo
* In addition, it’s worth noting that VA has been partially validated in an experimental analysis, while there is recent evidence that teacher-induced test score improvements are associated with very small increases in future earnings, educational attainment and other outcomes.
** Much of the heated debate over evaluations, especially value-added, appears to stem from differences in beliefs as to the purpose of evaluating teachers, and how the final scores should be used in decisions. Those who view evaluations as formative tools – to be used to identify strengths and weaknesses in teacher practice – tend to be more skeptical toward value-added, which is less well-suited for these purposes than observations. Yet, even the most ardent opponents of VA in evaluations often acknowledge that VA might play a useful role in evaluations as a “trigger” for some form of corrective action, such as professional development. In fact, many of these opponents are even borderline receptive when presented with a hypothetical scenario in which VA scores comprise, say, 10 percent of a teacher’s final score. People understand validity is context-specific.