Evaluating The Results Of New Teacher Evaluation Systems

A new working paper by researchers Matthew Kraft and Allison Gilmour presents a useful summary of teacher evaluation results in 19 states, all of which designed and implemented new evaluation systems at some point over the past five years. As with previous evaluation results, the headline result of this paper is that only a small proportion of teachers (2-5 percent) were given the low, “below proficiency” ratings under the new systems, and the vast majority of teachers continue to be rated as satisfactory or better.

Kraft and Gilmour present their results in the context of the “Widget Effect,” a well-known 2009 report by the New Teacher Project showing that the overwhelming majority of teachers in the 12 districts for which they had data received “satisfactory” ratings. The more recent results from Kraft and Gilmour indicate that the adoption of new evaluation systems hasn’t changed this much, or, at least, not enough to satisfy some policymakers and commentators who read the paper.

The paper also presents a set of findings from surveys of and interviews with observers (e.g., principals). These are in many respects more interesting and important results from a research and policy perspective, but let’s nevertheless focus a bit on the findings on the distribution of teachers across rating categories, as they caused a bit of a stir. I have several comments to make about them, but will concentrate on three in particular (all of which, by the way, pertain not to the paper’s discussion, which is cautious and thorough, but rather to some of the reaction to it in our education policy discourse).

Student Sorting And Teacher Classroom Observations

Although value added and other growth models tend to be the focus of debates surrounding new teacher evaluation systems, the widely known but frequently unacknowledged reality is that most teachers don’t teach in the tested grades and subjects, and won’t even receive these test-based scores. The quality and impact of the new systems therefore will depend heavily upon the quality and impact of other measures, primarily classroom observations.

These systems have been in use for decades, and yet, until recently, relatively little was known about their properties, such as their association with student and teacher characteristics, and there are, as yet, only a handful of studies of their impact on teachers’ performance (e.g., Taylor and Tyler 2012). The Measures of Effective Teaching (MET) Project, conducted a few years ago, was a huge step forward in this area, though it was perhaps underappreciated at the time that MET’s contribution lay not just in the (very important) reports it produced, but also in its having collected an extensive dataset for researchers to use going forward. A new paper, just published in Educational Evaluation and Policy Analysis, is among the many analyses that have used or will use MET data to address important questions surrounding teacher evaluation.

The authors, Rachel Garrett and Matthew Steinberg, look at classroom observation scores, specifically those from Charlotte Danielson’s widely employed Framework for Teaching (FFT) protocol. These results are yet another example of how observation scores share most of the widely-cited (statistical) criticisms of value added scores, most notably their sensitivity to which students are assigned to teachers.

Beyond Teacher Quality

Beyond PD: Teacher Professional Learning in High-Performing Systems is a recent report from the Learning First Alliance and the International Center for Benchmarking in Education at the National Center for Education and the Economy. The paper describes practices and policies from four high-performing school systems – British Columbia, Hong Kong, Shanghai, and Singapore – where professional learning is believed to be the primary vehicle for school improvement.

My first reaction was: This sounds great, but where is the ubiquitous discussion of “teacher quality?” Frankly, I was somewhat baffled that a report on school improvement never even mentioned the phrase.* Upon close reading, I found the report to be full of radical (and very good) ideas. It’s not that the report proposed anything that would require an overhaul of the U.S. education system; rather, its ideas were groundbreaking because they did not rely on the typical assumptions about how the youth or the adults in these systems learn and achieve mastery. While things are changing a bit in the U.S. with regard to our understanding of student learning – e.g., we now talk about “deep learning” – we have still not made this transition when it comes to teachers.

In the U.S., a number of unstated but common assumptions about “teacher quality” suffuse the entire school improvement conversation. As researchers have noted (see here and here), instructional effectiveness is implicitly viewed as an attribute of individuals, a quality that exists in a sort of vacuum (or independent of the context of teachers’ work), and which, as a result, teachers can carry with them, across and between schools. Effectiveness also is often perceived as fairly stable: teachers learn their craft within the first few years in the classroom and then plateau,** but, at the end of the day, some teachers have what it takes and others just don’t. So, the general assumption is that a “good teacher” will be effective under any conditions, and the quality of a given school is determined by how many individual “good teachers” it has acquired.

Evidence From A Teacher Evaluation Pilot Program In Chicago

The majority of U.S. states have adopted new teacher evaluation systems over the past 5-10 years. Although these new systems remain among the most contentious issues in education policy today, there is still only minimal evidence on their impact on student performance or other outcomes. This is largely because good research takes time.

A new article, published in the journal Education Finance and Policy, is among the handful of analyses examining the preliminary impact of teacher evaluation systems. The researchers, Matthew Steinberg and Lauren Sartain, take a look at the Excellence in Teaching Project (EITP), a pilot program carried out in Chicago Public Schools starting in the 2008-09 school year. A total of 44 elementary schools participated in EITP in the first year (cohort 1), while an additional 49 schools (cohort 2) implemented the new evaluation systems the following year (2009-10). Participating schools were randomly selected, which permits researchers to gauge the impact of the evaluations experimentally.

The results of this study are important in themselves, and they also suggest some more general points about new teacher evaluations and the building body of evidence surrounding them.

Where Al Shanker Stood: The Importance And Meaning Of NAEP Results

In this New York Times piece, published on July 29, 1990, Al Shanker discusses the results of the National Assessment of Educational Progress (NAEP), and what they suggested about the U.S. education system at the time.

One of the things that has influenced me most strongly to call for radical school reform has been the results of the National Assessment of Educational Progress (NAEP) examinations. These exams have been testing the achievement of our 9, 13 and 17-year olds in a number of basic areas over the past 20 years, and the results have been almost uniformly dismal.

According to NAEP results, no 17-year-olds who are still in school are illiterate and innumerate - that is, all of them can read the words you would find on a cereal box or a billboard, and they can do simple arithmetic. But very few achieve what a reasonable person would call competence in reading, writing or computing.

For example, NAEP's 20-year overview, Crossroads in American Education, indicated that only 2.6 percent of 17-year-olds taking the test could write a good letter to a high school principal about why a rule should be changed. And when I say good, I'm talking about a straightforward presentation of a couple of simple points. Only 5 percent could grasp a paragraph as complicated as the kind you would find in a first-year college textbook. And only 6 percent could solve a multi-step math problem like this one: "Christine borrowed $850 for one year from Friendly Finance Company. If she paid 12% simple interest on the loan, what was the total amount she repaid?"

The Magic Of Multiple Measures

Our guest author today is Cara Jackson, Assistant Director of Research and Evaluation at the Urban Teacher Center.

Teacher evaluation has become a contentious issue in the U.S. Some observers see the primary purpose of these reforms as the identification and removal of ineffective teachers; the popular media as well as politicians and education reform advocates have all played a role in the framing of teacher evaluation as such. But, while removal of ineffective teachers was a criterion under Race to the Top, so too was the creation of evaluation systems to be used for teacher development and support.

I think most people would agree that teacher development and improvement should be the primary purpose, as argued here.  Some empirical evidence supports the efficacy of evaluation for this purpose (see here).  And given the sheer number of teachers we need, declining enrollment in teacher preparation programs, and the difficulty disadvantaged schools have retaining teachers, school principals are probably none too enthusiastic about dismissing teachers, as discussed here.

Of course, to achieve the ambitious goal of improving teaching practice, an evaluation system must be implemented well. Fans of Harry Potter might remember when Dolores Umbridge from the Ministry of Magic takes over as High Inquisitor at Hogwarts and conducts “inspections” of Hogwarts’ teachers in Book 5 of J.K. Rowling’s series. These inspections pretty much demonstrate how not to approach classroom observations: she dictates the timing, fails to provide any indication of what aspects of teaching practice she will be evaluating, interrupts lessons with pointed questions and comments, and evidently does no pre- or post-conferencing with the teachers.

Research On Teacher Evaluation Metrics: The Weaponization Of Correlations

Our guest author today is Cara Jackson, Assistant Director of Research and Evaluation at the Urban Teacher Center.

In recent years, many districts have implemented multiple-measure teacher evaluation systems, partly in response to federal pressure from No Child Left Behind waivers and incentives from the Race to the Top grant program. These systems have not been without controversy, largely owing to the perception – not entirely unfounded – that such systems might be used to penalize teachers. One ongoing controversy in the field of teacher evaluation is whether these measures are sufficiently reliable and valid to be used for high-stakes decisions, such as dismissal or tenure. That is a topic that deserves considerably more attention than a single post; here, I discuss just one of the issues that arises when investigating validity.

The diagram below is a visualization of a multiple-measure evaluation system, one that combines information on teaching practice (e.g. ratings from a classroom observation rubric) with student achievement-based measures (e.g. value-added or student growth percentiles) and student surveys. The system need not be limited to three components; the point is simply that classroom observations are not the sole means of evaluating teachers.

In validating the various components of an evaluation system, researchers often examine their correlation with other components.  To the extent that each component is an attempt to capture something about the teacher’s underlying effectiveness, it’s reasonable to expect that different measurements taken of the same teacher will be positively related.  For example, we might examine whether ratings from a classroom observation rubric are positively correlated with value-added.
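As a rough illustration of the kind of check described above, the sketch below computes a Pearson correlation between two measures taken on the same teachers. The teacher scores are invented purely for illustration and the helper name `pearson_r` is hypothetical; neither comes from any actual evaluation system or study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a measure has no variation")
    return cov / sqrt(var_x * var_y)


# Hypothetical data, invented for illustration only:
# one observation rating and one value-added estimate per teacher.
observation_ratings = [2.1, 2.8, 3.0, 3.4, 2.5, 3.7]
value_added = [-0.10, 0.02, 0.05, 0.12, -0.03, 0.08]

r = pearson_r(observation_ratings, value_added)
print(f"correlation between observation ratings and value-added: {r:.2f}")
```

A positive value would be consistent with the two components capturing something in common. On its own, it says nothing about how strong the relationship needs to be for any particular use of the ratings.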

Empower Teachers To Lead, Encourage Students To Be Curious

Our guest author today is Ashim Shanker, a former English Language Arts teacher in public schools in Tokyo, Japan. Ashim has a Master’s Degree in International Education Policy from Harvard University and is the author of three books, including Don’t Forget to Breathe. Follow him on Twitter at @ashimshanker.

In the 11 years that I was a public school teacher in Japan, I came to view education as a holistic enterprise. Schools in Japan not only imbued students with relevant skills, but also nurtured within them the wherewithal to experience a sense of connection with the larger world, and the exploratory capacity to discover their place within it.

In my language arts classes, I encouraged students to read about current events and human rights issues around the world. I asked them to make lists of the electronics they used, the garments they wore, and the food products they consumed on a daily basis. I then had them research where these products were made and under what labor conditions.

The students gave presentations on child laborers and modern-day slavery. They debated government secrecy laws in Japan and cover-ups in the aftermath of the Fukushima nuclear disaster. They read an essay on self-reliance by Emerson and excerpts on civil disobedience by Thoreau, and I asked them how these two activists might have felt about the actions of groups like Anonymous, or about whistleblowers like Edward Snowden. We discussed the Milgram Experiment and the Stanford Prison Experiment, exploring how obedience and situational role conformity might tip even those with the best of intentions toward acts of cruelty. We talked about bullying, and shared anecdotes of instances in which we might unintentionally have hurt others. There were opportunities for self-reflection, engagement, and character building – attributes that I would like to think foster the empathic foundations for better civic engagement and global citizenship.

Do We Know How To Hold Teacher Preparation Programs Accountable?

This piece is co-authored by Cory Koedel and Matthew Di Carlo. Koedel is an Associate Professor of Economics and Public Policy at the University of Missouri, Columbia.

The United States Department of Education (USED) has proposed regulations requiring states to hold teacher preparation programs accountable for the performance of their graduates. According to the proposal, states must begin assigning ratings to each program within the next 2-3 years, based on outcomes such as graduates’ “value-added” to student test scores, their classroom observation scores, how long they stay in teaching, whether they teach in high-needs schools, and surveys of their principals’ satisfaction.

In the long term, we are very receptive to, and indeed optimistic about, the idea of outcomes-based accountability for teacher preparation programs (TPPs). In the short to medium term, however, we contend that the evidence base underlying the USED regulations is nowhere near sufficient to guide a national effort toward high-stakes TPP accountability.

This is a situation in which the familiar refrain of “it’s imperfect but better than nothing” is false, and rushing into nationwide design and implementation could be quite harmful.

Will Value-Added Reinforce The Walls Of The Egg-Crate School?

Our guest author today is Susan Moore Johnson, Jerome T. Murphy Research Professor in Education at Harvard Graduate School of Education. Johnson directs the Project on the Next Generation of Teachers, which examines how best to recruit, develop, and retain a strong teaching force.

Academic scholars are often dismayed when policymakers pass laws that disregard or misinterpret their research findings. The use of value-added methods (VAMS) in education policy is a case in point.

About a decade ago, researchers reported that teachers are the most important school-level factor in students’ learning, and that their effectiveness varies widely within schools (McCaffrey, Koretz, Lockwood, & Hamilton 2004; Rivkin, Hanushek, & Kain 2005; Rockoff 2004). Many policymakers interpreted these findings to mean that teacher quality rests with the individual rather than the school and that, because some teachers are more effective than others, schools should concentrate on increasing their number of effective teachers.

Based on these assumptions, proponents of VAMS began to argue that schools could be improved substantially if they would only dismiss teachers with low VAMS ratings and replace them with teachers who have average or higher ratings (Hanushek 2009). Although panels of scholars warned against using VAMS to make high-stakes decisions because of their statistical limitations (American Statistical Association, 2014; National Research Council & National Academy of Education, 2010), policymakers in many states and districts moved quickly to do just that, requiring that VAMS scores be used as a substantial component in teacher evaluation.