
Student Sorting And Teacher Classroom Observations

Although value added and other growth models tend to be the focus of debates surrounding new teacher evaluation systems, the widely known but frequently unacknowledged reality is that most teachers don’t teach in the tested grades and subjects, and won’t even receive these test-based scores. The quality and impact of the new systems therefore will depend heavily upon the quality and impact of other measures, primarily classroom observations.

Classroom observations have been in use for decades, and yet, until recently, relatively little was known about their properties, such as their association with student and teacher characteristics. There are, as yet, only a handful of studies of their impact on teachers’ performance (e.g., Taylor and Tyler 2012). The Measures of Effective Teaching (MET) Project, conducted a few years ago, was a huge step forward in this area. At the time, though, it was perhaps underappreciated that MET’s contribution lay not just in the (very important) reports it produced, but also in the extensive dataset it collected for researchers to use going forward. A new paper, just published in Educational Evaluation and Policy Analysis, is among the many analyses that have used and will use MET data to address important questions surrounding teacher evaluation.

The authors, Rachel Garrett and Matthew Steinberg, look at classroom observation scores, specifically those from Charlotte Danielson’s widely employed Framework for Teaching (FFT) protocol. Their results are yet another example of how observation scores share most of the widely cited (statistical) criticisms of value added scores, most notably their sensitivity to which students are assigned to teachers.

Summarizing those results in brief and simple terms, Garrett and Steinberg exploit the randomization of students to teachers embedded in the MET research design, and find that FFT scores are strongly associated with student achievement – that is, students of teachers with higher FFT scores also tend to perform better on standardized tests, all else being equal. This relationship, however, is driven largely by the sorting of students across classrooms. In other words, some teachers are assigned higher performing students, on average, and these teachers tend to get higher FFT scores.

The researchers conclude that such sorting, as well as the instability of measures between years, constrains the ability of policymakers to use observations alone for identifying teacher effectiveness.

If followers of education policy heard that a teacher performance measure was associated with student characteristics, influenced by non-random assignment, and unstable between years, a great many would naturally assume that the measure in question was value added (or other growth model) scores. Yet the well-established, uncomfortable truth is that these properties seem also to apply to observation scores.

It’s important to note that the formative value of observations – i.e., the fact that the observation process and its results can offer substantive feedback for teachers to improve their practice – is not shared by value added scores. Nevertheless, most attempts by value added critics to dismiss the utility of value added measures rely on arguments that pertain equally to the most common alternatives, including observations.

The policy relevance of this issue is obvious. The reform of teacher evaluation systems, spurred by Race to the Top, was among the fastest, most widespread policy changes in recent memory. Most states are either making decisions based on new teacher evaluation scores already, or are in the process of phasing in these systems. And the new observation protocols, particularly for teachers in non-tested grades/subjects, but even for some of those who do receive growth model scores, are often the most heavily weighted component of teachers’ ratings.

This paper by Garrett and Steinberg offers a few important, policy-relevant conclusions, but the one I’d like to highlight is that it’s yet another example of research illustrating the sensitivity of teacher performance measures to the students in teachers’ classrooms. Put simply, teachers who are assigned higher performing students tend to get better scores. Such assignment may very well have beneficial effects on student performance, measured and unmeasured. Still, in the teacher evaluation context, it matters for several reasons, including the perceived fairness of observations, the possibility of classification errors, etc.

In the case of observations, such results were not necessarily predictable. It is not unreasonable to believe that observations should produce scores that are unassociated with student characteristics, since a teacher’s practice can theoretically be assessed by a trained eye using a good protocol in a manner that, in a sense, ignores the audience. The idea is intuitively appealing, but the emerging evidence suggests that this ideal does not apply in the real world (though, as Garrett and Steinberg report, some of the FFT's components are not associated with student characteristics).

One big policy question here is whether to adjust teachers’ scores to account for the observable characteristics of the students they teach (see, for example, Whitehurst et al. 2014). This general approach is a bedrock principle of most test-based growth models in use in evaluations, but, in the case of classroom observations, there are few if any systems that employ it. Such adjustment would attenuate, albeit imperfectly, the potential for observations (and thus evaluations in general) to penalize teachers of lower performing students, but it might also, for instance, result in backlash against what could be perceived as added complexity to a familiar measure.
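For readers who want to see what such an adjustment might look like mechanically, here is a minimal sketch in Python. It is not the method used by Garrett and Steinberg, Whitehurst et al., or any actual evaluation system. The data are simulated, and the function name `adjust_scores`, the 0.3 "sorting" coefficient, and the use of a single classroom characteristic (average prior achievement) are all assumptions made purely for illustration. The idea is simply to regress raw observation scores on classroom characteristics and keep what is left over.

```python
import numpy as np

def adjust_scores(raw_scores, class_traits):
    """Remove the linear association between observation scores and
    classroom characteristics (hypothetical helper, illustration only)."""
    raw = np.asarray(raw_scores, dtype=float)
    traits = np.asarray(class_traits, dtype=float)
    if traits.ndim == 1:
        traits = traits[:, None]
    # Design matrix: an intercept plus the classroom characteristics.
    X = np.column_stack([np.ones(len(raw)), traits])
    coef, *_ = np.linalg.lstsq(X, raw, rcond=None)
    # Residuals, shifted back to the original mean so the scale is familiar.
    return raw - X @ coef + raw.mean()

# Simulated data (all values are assumptions, not estimates from the paper).
rng = np.random.default_rng(0)
n_teachers = 500
prior_achievement = rng.normal(0.0, 1.0, n_teachers)  # class-average prior scores
practice = rng.normal(0.0, 1.0, n_teachers)           # unobserved teaching quality
raw = (2.5 + 0.5 * practice + 0.3 * prior_achievement
       + rng.normal(0.0, 0.3, n_teachers))

adjusted = adjust_scores(raw, prior_achievement)

corr_raw = np.corrcoef(raw, prior_achievement)[0, 1]
corr_adjusted = np.corrcoef(adjusted, prior_achievement)[0, 1]
print(f"raw scores vs. prior achievement:      {corr_raw:.2f}")
print(f"adjusted scores vs. prior achievement: {corr_adjusted:.2f}")
```

In this simulation, teaching quality is unrelated to classroom composition by construction, so the adjustment removes only the sorting component. In real data, if stronger teachers are systematically assigned higher performing students, the same adjustment would also strip out some genuine differences in practice, which is one reason it would be imperfect.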

Classroom observations, whatever their limitations, tend to have more credibility among educators, relative to growth model estimates. This matters a great deal, as the primary purpose of any accountability system is to change behavior, and this is less likely to happen if teachers cease to believe in the measures used. It is difficult to predict how addressing this issue, such as by adjusting observation scores for student characteristics, would play out in the important arena of credibility.

In any case, it would seem time for policymakers and other stakeholders to start paying more attention to the properties and role of classroom observations in new teacher evaluation systems, in addition to those of test-based measures, which thus far have received most of the attention.

I read this paper as saying something different-- that there is a positive but very weak association between observable characteristics of classrooms and student achievement, even growth in student achievement, because the latter is something that teachers and schools have relatively little control over. I linked to the paper briefly as part of this post:

Thank you. As should have long been obvious, the VAM portion and the observation portion of evaluations are both likely to be biased against teachers with classrooms of students from generational poverty who have survived extreme trauma. The double whammy of combining the two is an extreme threat to inner city teachers. For instance, in my classrooms, we often had 40% of our class on IEPs - mostly for conduct disorders or serious emotional disturbances - and nearly a quarter were ELLs. The majority of my students had a felony rap, usually fairly minor but often due to home invasions with a gun or something comparable, and many had full-blown mental illness as opposed to mental health issues. Someone transferred in or out every day and attendance was erratic. These problems are worsened by policies that keep teachers from receiving disciplinary backing. And what happens in years when the students have to attend multiple funerals of classmates and/or family members? But, in places like D.C., evaluators are likely to believe that good teaching looks the same in all sorts of classrooms. I've long asked researchers to conduct an experiment along these lines. First, do the observation according to the standard rubric. Then, brief other evaluators on the type of information I just described. Would they reach the same conclusions on the teachers' instruction? John Thompson
