Thoughts On Using Value Added, And Picking A Model, To Assess Teacher Performance
Our guest author today is Dan Goldhaber, Director of the Center for Education Data & Research and a Research Professor in Interdisciplinary Arts and Sciences at the University of Washington Bothell.
Let me begin with a disclosure: I am an advocate of experimenting with using value added, where possible, as part of a more comprehensive system of teacher evaluation. The reasons are pretty simple (though articulated in more detail in a brief, which you can read here). The most important reason is that value-added information about teachers appears to be a better predictor of future success in the classroom than other measures we currently use. This is perhaps not surprising when it comes to test scores, certainly an important measure of what students are getting out of schools, but research also shows that value added predicts very long run outcomes, such as college going and labor market earnings. Shouldn’t we be using valuable information about likely future performance when making high-stakes personnel decisions?
It almost goes without saying, but it’s still worth emphasizing, that it is impossible to avoid making high-stakes decisions. Policies that explicitly link evaluations to outcomes such as compensation and tenure are new, but even in the absence of such policies that are high-stakes for teachers, the stakes are high for students, because some of them are stuck with ineffective teachers when evaluation systems suggest, as is the case today, that nearly all teachers are effective.
I also believe the use of value added has helped drive a much deeper conversation about evaluating teachers, a conversation that would not be occurring were it not for the value added “threat,” or, perhaps more to the point, the threat associated with using a system that forces differentiated performance measures. Put another way, value added may be a key catalyst for broader changes to teacher evaluation. Finally, it’s clearly a judgment call, but I believe the bar for experimenting with alternatives to today’s evaluation systems is pretty low given that most of them fail to recognize that there’s a big difference between the most and least effective teachers, which both casual observation and statistical analysis show exists.
But beyond whether to use value added is the question of what approach ought to be implemented. In very general terms, the idea behind value-added models is the translation of test-based measures of student achievement growth into a gauge of teacher performance; these models, however, come in a variety of different flavors, meaning policymakers employing them have to make choices. I won’t go into too much detail here about the different statistical approaches (for more on the nitty gritty details, go here), but the choice of statistical model does, at least in some cases, have consequences for how teacher performance is judged. Moreover, these implications can be masked by very high correlations in teacher effectiveness rankings when comparing rankings for the teacher workforce (covered by value added) as a whole.
As an example (more detail about the comparisons across different types of models can be found here), value-added models and student growth percentile models, the two most common general types of models being used in teacher evaluations today, are correlated at over 0.90 -- a very strong relationship -- even though they differ substantially in terms of how, and the extent to which, they account for differences in students’ backgrounds (put very simply, value-added models, unlike most student growth percentile models being used by states, often control directly for student characteristics such as free/reduced-price lunch eligibility).
In other words, these two models strongly agree with each other in the teacher rankings they yield. Yet, despite this, we see that the value added models that include student background adjustments tend to show teachers responsible for instructing more relatively disadvantaged students as performing better than the growth percentile models (and the converse is also true). Put differently, teachers with relatively high proportions of disadvantaged students tend to get better ratings from value-added than from growth percentile models. It also appears to matter a great deal whether comparisons of teacher effectiveness are made within schools or both within and between schools.
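To make the point about masked differences concrete, here is a minimal sketch using purely synthetic data. Everything in it is an assumption chosen for illustration: the number of teachers, the noise levels, the 0.5 coefficient, and the function name `compare_models` are all hypothetical, and neither score is an actual value-added or growth percentile estimate. The two scores share a common component and differ only by a term tied to each teacher's share of disadvantaged students. The sketch takes no position on which score is closer to the truth.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def compare_models(n_teachers=2000, seed=0):
    """Simulate two hypothetical teacher scores and compare them.

    Returns the overall correlation plus the mean gap (score A minus
    score B) for teachers with high and low shares of disadvantaged
    students. All numbers are invented for illustration.
    """
    rng = random.Random(seed)
    score_a, score_b, shares = [], [], []
    for _ in range(n_teachers):
        common = rng.gauss(0, 1)   # component both scores agree on
        share = rng.random()       # share of disadvantaged students, 0 to 1
        # Score A: stands in for a model that adjusts for student background.
        a = common + rng.gauss(0, 0.15)
        # Score B: stands in for a model without that adjustment, so it
        # shifts with the share of disadvantaged students (assumed size 0.5).
        b = common - 0.5 * (share - 0.5) + rng.gauss(0, 0.15)
        score_a.append(a)
        score_b.append(b)
        shares.append(share)

    gaps = [a - b for a, b in zip(score_a, score_b)]
    gap_high = statistics.fmean(g for g, s in zip(gaps, shares) if s > 0.75)
    gap_low = statistics.fmean(g for g, s in zip(gaps, shares) if s < 0.25)
    return pearson(score_a, score_b), gap_high, gap_low


corr, gap_high, gap_low = compare_models()
print(f"correlation between scores: {corr:.3f}")
print(f"mean gap (A - B), high-disadvantage classrooms: {gap_high:.3f}")
print(f"mean gap (A - B), low-disadvantage classrooms:  {gap_low:.3f}")
```

Under these assumed settings the two scores are very highly correlated across all teachers, yet the gap between them has opposite signs for the two groups of classrooms. This is the kind of pattern that a single workforce-wide correlation would not reveal.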
So, what can be made of this? What is the “right” model?
Unfortunately, it is quite difficult to know from a statistical standpoint. Differences in teacher rankings according to the kinds of students taught might reflect bias in the model, but they might also reflect true differences in teacher quality across different kinds of students (e.g. disadvantaged students tend to have less experienced and credentialed teachers, and, by some estimates, teachers with lower value added), or limitations of the way that a model adjusts for students’ backgrounds. Likewise, differences in teacher rankings associated with whether a model compares teachers only within schools or within and between schools could be based on school-level factors (e.g. the environment created by principals) or on real differences in the distribution of teacher quality across schools.
(As an important aside, while modeling dilemmas are being vetted thoroughly in the case of value added, these same fundamental issues arise for any means of evaluating teachers, such as when it comes to picking an observational rubric to use.)
But the statistical standpoint is not the only relevant perspective here. Part of the reason that there is no “right” or “wrong” answer when it comes to model choice is that we ultimately care about how the use of a particular model affects the quality of the teacher workforce and, hence, student learning. This is likely to depend on using a model that produces reasonably valid estimates of teacher effectiveness - i.e., a model that yields causal estimates of a teacher’s contribution to student learning, which must be suitable for potential uses of performance evaluations, such as deciding on tenure or dismissals.
We also care about how teachers (and prospective teachers) perceive a model and how that might affect their behavior. The best model from a statistical validity standpoint might, for instance, not be the model that teachers trust, which could affect their motivation to change their practices. The bottom line is that we can’t know the “right model” up front because we do not know how teachers will react to these estimates’ use in performance evaluations. Fortunately, the experimentation with different models will afford us the opportunity to learn more about this issue over time.
In the meantime, my view is that part of the process of model adoption ought to involve policymakers applying different models to their data in order to make the differences in teacher rankings explicit to stakeholders, explaining the reasons for those differences, and hopefully getting buy-in upfront for the model that is adopted. This type of transparency would no doubt give some ammunition to those who oppose using value added. But showing that some teacher ratings may change according to the specific model used is not conceptually different than finding that teachers may be judged to be different under different rubrics used for classroom observation.
In other words, value added, like any other system of evaluation, will be an imperfect measure of true performance (which is never observed). Thus, I come full-circle and conclude that when it comes to using value added, the question is not whether it is “right” in some absolute sense, but rather: Does it provide us with more or better information about teachers than the other feasible systems for evaluating them?
- Dan Goldhaber