Evidence From A Teacher Evaluation Pilot Program In Chicago

The majority of U.S. states have adopted new teacher evaluation systems over the past 5-10 years. Although these new systems remain among the most contentious issues in education policy today, there is still only minimal evidence on their impact on student performance or other outcomes. This is largely because good research takes time.

A new article, published in the journal Education Finance and Policy, is among the handful of analyses examining the preliminary impact of teacher evaluation systems. The researchers, Matthew Steinberg and Lauren Sartain, examine the Excellence in Teaching Project (EITP), a pilot program carried out in Chicago Public Schools starting in the 2008-09 school year. A total of 44 elementary schools participated in EITP in the first year (cohort 1), while an additional 49 schools (cohort 2) implemented the new evaluation systems the following year (2009-10). Schools were randomly assigned to participate, which permits the researchers to gauge the impact of the evaluations experimentally.

The results of this study are important in themselves, and they also suggest some more general points about new teacher evaluations and the building body of evidence surrounding them.

First, it bears noting that the teacher evaluations implemented under the EITP consisted entirely of classroom observations using the framework introduced by Charlotte Danielson (1996), which has since been widely adopted in states across the nation. The system did not include measures based on student testing data (e.g., estimates from value-added and other growth models); even if it had, the majority of teachers would not have received these scores anyway, as most teachers are not in tested grades/subjects. Each teacher, regardless of seniority, was formally evaluated twice during the year, both times by the principal, and the process included both pre- and post-observation conferences. Principals received rather extensive training and support for these tasks, but, as discussed below, such support was more limited for principals in cohort 2 schools than for their first-year counterparts.

Also keep in mind that this particular analysis examines the impact of teacher evaluations on overall school performance, rather than on individual teachers' effectiveness. Given the experimental design, however, we can be confident that any observed effects are attributable to the teacher evaluations.

The results show that students in the schools participating in the first year (cohort 1) exhibited higher reading and math test scores at the end of that year than students in cohort 2 schools (and non-EITP schools), although the difference in math was not statistically discernible (i.e., it might be due to random noise). After the second year of the program, the estimated impact remained positive but statistically insignificant in math, while the reading effect was again positive, significant, and educationally meaningful in magnitude.

In both subjects, however, the results were driven largely by schools serving higher-scoring, more advantaged populations. In fact, in the highest-poverty participating schools, the impact of the EITP program was effectively nil. Moreover, cohort 2 schools exhibited little improvement as a result of the pilot evaluation program, a finding discussed a bit further below.

Overall, then, the findings of this study would seem to suggest a few important conclusions and general ideas.

The (still limited) evidence thus far suggests that teacher evaluations using classroom observations alone do have the potential to improve student testing results. The EITP results, while not cut and dried (e.g., the cohort 1 math results are less compelling than those in reading, while the cohort 2 results are weak), are generally positive, and consistent with the results of a different "observation-only" evaluation system (Taylor and Tyler 2012). This is preliminary evidence that carefully implemented classroom observation systems may well prove to be an effective means of improving student performance (at least to the degree that performance can be measured by such standardized tests).

Moreover, such positive effects do not necessarily require that high stakes be attached to the results of the evaluations, or that test-based measures be included in the systems. This is not to imply that attaching stakes or using test-based measures would have attenuated the impact of the EITP; indeed, such elements might have helped (see, for example, Dee and Wyckoff 2015). But there is growing evidence that classroom observations have formative value that can translate into improved student performance on tests, and this is an important finding. Of course, the actual changes in behavior that lead to such improvement (e.g., feedback, collaboration) remain an extremely important topic for future research, as does the question of whether and how much the observed impact persists over time.

The magnitude and distribution of the impact of teacher evaluations may depend heavily upon the observers themselves and the support they receive. This point may sound mundane and obvious, but it is nonetheless important, and it is a striking implication of this study. To reiterate, the results of the EITP pilot indicate that the impact of the program was substantially weaker among cohort 2 schools than among those participating in the first year. There are many reasons why this may be the case, but one particularly compelling possibility is that principals in the first year received considerably more training and ongoing support than their colleagues in the second year.

Such considerations are important in today’s policy environment, in which new teacher evaluations are being or have been implemented en masse, often in rather rushed fashion, in many thousands of schools at the same time. In Chicago, there were only around 100 EITP schools, and yet capacity and support issues still arose.

Such dependence on evaluator training and support can also generate heterogeneity in the impact of new evaluations within districts and states. The aforementioned weak results among EITP schools serving disadvantaged students may reflect differences in the quality of leadership between schools serving different populations. For example, schools serving more disadvantaged populations often have more trouble attracting and retaining leaders, which may seriously compromise the efficacy of evaluation systems that depend so heavily on principal training and on ongoing feedback between principals and teachers.

One cannot overstate the importance of building into new policies the mechanisms needed to evaluate their impact. We are able to evaluate the EITP program, and draw out its implications for similar policies elsewhere, because Chicago district and union officials had the foresight to build directly into the program a means by which its effect could be isolated (i.e., the random assignment of the evaluation “treatment” to schools). This is often not the case elsewhere, which means that program evaluation, while still possible, will remain difficult and its conclusions far more tentative. Given the sheer number of states that have adopted new evaluation systems, and the resources that have been invested in them, this is troubling at best.
