Value-Added, For The Record

People often ask me for my “bottom line” on using value-added (or other growth model) estimates in teacher evaluations. I’ve written on this topic many times, and while I have in fact given my overall opinion a couple of times, I have avoided expressing it in a strong “yes or no” format. There's a reason for this, and I thought maybe I would write a short piece and explain myself.

My first reaction to the queries about where I stand on value-added is a shot of appreciation that people are interested in my views, followed quickly by an acute rush of humility and reticence. I know think tank people aren’t supposed to say things like this, but when it comes to sweeping, big picture conclusions about the design of new evaluations, I’m not sure my personal opinion is particularly important.

Frankly, given the importance of how people on the ground respond to these types of policies, as well as, of course, their knowledge of how schools operate, I would be more interested in the views of experienced, well-informed teachers and administrators than my own. And I am frequently taken aback by the unadulterated certainty I hear coming from advocates and others about this completely untested policy. That’s why I tend to focus on aspects such as design details and explaining the research – these are things I feel qualified to discuss. (I also, by the way, acknowledge that it’s very easy for me to play armchair policy general when it's not my job or working conditions that might be on the line.)

That said, here’s my general viewpoint, in two parts. First, my sense, based on the available evidence, is that value-added should be given a try in new teacher evaluations.

New Teacher Evaluations Are A Long-Term Investment, Not Test Score Arbitrage

One of the most important things in education policy to keep an eye on is the first round of changes to new teacher evaluation systems. Given all the moving parts and the lack of evidence on how these systems should be designed and their impact, course adjustments along the way are not just inevitable, but absolutely essential.

Changes might be guided by different types of evidence, such as feedback from teachers and administrators or analysis of ratings data. And, of course, human judgment will play a big role. One thing that states and districts should not be doing, however, is assessing their new systems – or making changes to them – based on whether raw overall test scores go up or down within the first few years.

Here’s a little reality check: Even the best-designed, best-implemented new evaluations are unlikely to have an immediate measurable impact on aggregate student performance. Evaluations are an investment, not a quick fix. And they are not risk-free. Their effects will depend on the quality of systems, how current teachers and administrators react to them and how all of this shapes and plays out in the teacher labor market. As I’ve said before, the realistic expectation for overall performance – and this is no guarantee – is that there will be some very small, gradual improvements, unfolding over a period of years and decades.

States and districts that expect anything more risk making poor decisions during these crucial, early phases.

A Look At The Changes To D.C.'s Teacher Evaluation System

D.C. Public Schools (DCPS) recently announced a few significant changes to its teacher evaluation system (called IMPACT), including the alteration of its test-based components, the creation of a new performance category (“developing”), and a few tweaks to the observational component (discussed below). These changes will be effective starting this year.

As with any new evaluation system, a period of adjustment and revision should be expected and encouraged (though it might be preferable if the first round of changes occurs during a phase-in period, prior to stakes becoming attached). Yet, despite all the attention given to the IMPACT system over the past few years, these new changes have not been discussed much beyond a few quick news articles.

I think that’s unfortunate: DCPS is an early adopter of the “new breed” of teacher evaluation policies being rolled out across the nation, and any adjustments to IMPACT’s design – presumably based on results and feedback – could provide valuable lessons for states and districts in earlier phases of the process.

Accordingly, I thought I would take a quick look at three of these changes.

The Weighting Game

A while back, I noted that states and districts should exercise caution in assigning weights (importance) to the components of their teacher evaluation systems before they know what the other components will be. For example, most states that have mandated new evaluation systems specify that growth model estimates count for a certain proportion (usually 40-50 percent) of teachers’ final scores, at least for those in tested grades/subjects. But it’s critical to note that the actual importance of these components will depend in no small part on what else is included in the total evaluation, and how it's incorporated into the system.

In slightly technical terms, this is the distinction between nominal weights (the percentage assigned) and effective weights (the percentage that actually obtains in practice). Consider an extreme hypothetical example: a district implements an evaluation system in which half the final score is value-added and half is observations, but every teacher gets the same observation score. In this case, even though the assigned (nominal) weight for value-added is 50 percent, its actual importance (effective weight) will be 100 percent, since all the variation between teachers’ final scores will be determined by the value-added component.
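To make the arithmetic concrete, here is a minimal sketch of this hypothetical, using made-up numbers (not data from any real district). It treats a component's effective weight as its share of the variance in final scores, which is a reasonable proxy when the components are nearly uncorrelated:

```python
import statistics

# Hypothetical numbers: final score = 0.5 * value-added + 0.5 * observation.
va_scores = [1.0, 2.5, 3.1, 4.0, 2.2, 3.8]    # value-added: wide spread
obs_scores = [3.0, 3.0, 3.1, 3.0, 3.0, 3.1]   # observations: nearly identical

# Effective weight: the share of final-score variance each weighted component
# contributes (these two made-up series are nearly uncorrelated, so variance
# shares are a fair approximation).
var_va = statistics.pvariance([0.5 * v for v in va_scores])
var_obs = statistics.pvariance([0.5 * o for o in obs_scores])
eff_va = var_va / (var_va + var_obs)

print(f"nominal VA weight: 0.50; effective VA weight: {eff_va:.2f}")
```

With these numbers, the effective weight of value-added comes out near 1.0 even though its nominal weight is 0.5, because the observation scores contribute almost no variation to the final score.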

This issue of nominal versus effective weights is very important, and, with a few exceptions, it gets almost no attention. And it’s not just important in teacher evaluations; it’s also relevant to states’ school/district grading systems. So, I think it would be useful to quickly illustrate this concept in the context of Florida’s new district grading system.

Value-Added Versus Observations, Part Two: Validity

In a previous post, I compared value-added (VA) and classroom observations in terms of reliability – the degree to which they are free of error and stable over repeated measurements. But even the most reliable measures aren’t useful unless they are valid – that is, unless they’re measuring what we want them to measure.

Arguments over the validity of teacher performance measures, especially value-added, dominate our discourse on evaluations. There are, in my view, three interrelated issues to keep in mind when discussing the validity of VA and observations. The first is definitional – in a research context, validity is less about a measure itself than the inferences one draws from it. The second point might follow from the first: The validity of VA and observations should be assessed in the context of how they’re being used.

Third and finally, given the difficulties in determining whether either measure is valid in and of itself, as well as the fact that so many states and districts are already moving ahead with new systems, the best approach at this point may be to judge validity in terms of whether the evaluations are improving outcomes. And, unfortunately, there is little indication that this is happening in most places.

Value-Added Versus Observations, Part One: Reliability

Although most new teacher evaluations are still in various phases of pre-implementation, it’s safe to say that classroom observations and/or value-added (VA) scores will be the most heavily weighted components of teachers’ final scores, depending on whether teachers are in tested grades and subjects. One gets the general sense that many - perhaps most - teachers strongly prefer the former (observations, especially peer observations) to the latter (VA).

One of the most common arguments against VA is that the scores are error-prone and unstable over time - i.e., that they are unreliable. And it's true that the scores fluctuate between years (also see here), with much of this instability due to measurement error, rather than “real” performance changes. On a related note, different model specifications and different tests can yield very different results for the same teacher/class.

These findings are very important, and often too casually dismissed by VA supporters, but the issue of reliability is, to varying degrees, endemic to all performance measurement. Indeed, many of the standard reliability-based criticisms of value-added could also be leveled against observations. Since we cannot observe “true” teacher performance, it’s tough to say which is “better” or “worse,” despite the certainty with which both “sides” often present their respective cases. And the fact that both entail some level of measurement error doesn't by itself speak to whether they should be part of evaluations.*

Nevertheless, many states and districts have already made the choice to use both measures, and in these places, the existence of imprecision is less important than how to deal with it. Viewed from this perspective, VA and observations are in many respects more alike than different.

Beware Of Anecdotes In The Value-Added Debate

A recent New York Times "teacher diary" presents the compelling account of a New York City teacher whose value-added rating was in the 6th percentile in 2009 – one of the lowest scores in the city – and in the 96th percentile the following year, one of the highest. Similar articles - for example, about teachers with errors in their rosters or scores that conflict with their colleagues' or principals' opinions - have been published since the release of the city’s teacher data reports (also see here). These accounts provoke a lot of outrage and disbelief, and that makes sense – they can sound absurd.

Stories like these can be useful as illustrations of larger trends and issues - in this case, of the unfairness of publishing the NYC scores, most of which are based on samples that are too small to provide meaningful information. But, in the debate over using these estimates in actual policy, we need to be careful not to focus too much on anecdotes. For every one NYC teacher whose value-added rank changed by more than 90 points between 2009 and 2010, there are almost 100 teachers whose ranks were within 10 points (and percentile ranks overstate the actual size of all these differences). Moreover, even if the models yielded perfect measures of test-based teacher performance, there would still be many implausible fluctuations between years - those unlikely to reflect "real" change - due to nothing more than random error.*

The reliability of value-added estimates, like that of all performance measures (including classroom observations), is an important issue, and is sometimes dismissed by supporters in a cavalier fashion. There are serious concerns here, and no absolute answers. But none of this can be examined or addressed with anecdotes.
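The point that random error alone guarantees some eye-popping swings can be illustrated with a quick simulation (made-up data, not the NYC reports): score the same underlying performance in two "years," each time with added noise, convert to percentile ranks, and count the big jumps alongside the stable ones.

```python
import random

# Hypothetical simulation: underlying performance is identical across both
# years; only random measurement error differs.
random.seed(1)
n = 2000
true_effect = [random.gauss(0, 1) for _ in range(n)]

def noisy_year():
    # One year's estimates: true effect plus independent error.
    return [t + random.gauss(0, 1) for t in true_effect]

def percentile_ranks(scores):
    # Map each score to its percentile rank (0 to 100) within the group.
    order = sorted(range(len(scores)), key=scores.__getitem__)
    ranks = [0.0] * len(scores)
    for position, i in enumerate(order):
        ranks[i] = 100.0 * position / (len(scores) - 1)
    return ranks

r1 = percentile_ranks(noisy_year())
r2 = percentile_ranks(noisy_year())
swings = [abs(a - b) for a, b in zip(r1, r2)]

big = sum(s > 50 for s in swings)      # implausibly large jumps
stable = sum(s <= 10 for s in swings)  # near-stable ranks
print(f"swings over 50 points: {big}; within 10 points: {stable}")
```

In runs of this sketch, a nontrivial handful of "teachers" swing more than 50 percentile points on noise alone, while far more stay within 10 points - which is exactly why a single dramatic anecdote tells us little about the measure as a whole.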

Ready, Aim, Hire: Predicting The Future Performance Of Teacher Candidates

In a previous post, I discussed the idea of “attracting the best candidates” to teaching by reviewing the research on the association between pre-service characteristics and future performance (usually defined in terms of teachers’ estimated effect on test scores once they get into the classroom). In general, this body of work indicates that, while far from futile, it’s extremely difficult to predict who will be an “effective” teacher based on their paper traits, including those that are typically used to define “top candidates,” such as the selectivity of the undergraduate institutions they attended, certification test scores and GPA (see here, here, here and here, for examples).

There is some very limited evidence that other, “non-traditional” measures might help. For example, a working paper, released last year, found a statistically discernible, fairly strong association between first-year math value-added and an index constructed from surveys administered to Teach for America candidates. There was, however, no association in reading (note that the sample was small), and no relationship in either subject during these teachers’ second years.*

A recently published paper – which appears in the peer-reviewed journal Education Finance and Policy, and was originally released as a working paper in 2008 – represents another step forward in this area. The analysis, presented by the respected quartet of Jonah Rockoff, Brian Jacob, Thomas Kane, and Douglas Staiger (RJKS), attempts to look beyond the set of characteristics that researchers are typically constrained (by data availability) to examine.

In short, the results do reveal some meaningful, potentially policy-relevant associations between pre-service characteristics and future outcomes. From a more general perspective, however, they are also a testament to the difficulties inherent in predicting who will be a good teacher based on observable traits.

A Big Open Question: Do Value-Added Estimates Match Up With Teachers' Opinions Of Their Colleagues?

A recent article about the implementation of new teacher evaluations in Tennessee details some of the complicated issues with which state officials, teachers and administrators are dealing in adapting to the new system. One of these issues is somewhat technical – whether the various components of evaluations, most notably principal observations and test-based productivity measures (e.g., value-added), tend to “match up.” That is, whether teachers who score high on one measure tend to do similarly well on the other (see here for more on this issue).

In discussing this type of validation exercise, the article notes:

If they don't match up, the system's usefulness and reliability could come into question, and it could lose credibility among educators.

Value-added and other test-based measures of teacher productivity may have a credibility problem among many (but definitely not all) teachers, but I don’t think it’s due to – or can be helped much by – whether or not these estimates match up with observations or other measures being incorporated into states’ new systems. I’m all for this type of research (see here and here), but I’ve never seen what I think would be an extremely useful study for addressing the credibility issue among teachers: one that looked at the relationship between value-added estimates and teachers’ opinions of each other.

Trial And Error Is Fine, So Long As You Know The Difference

It’s fair to say that improved teacher evaluation is the cornerstone of most current education reform efforts. Although very few people have disagreed on the need to design and implement new evaluation systems, there has been a great deal of disagreement over how best to do so – specifically with regard to the incorporation of test-based measures of teacher productivity (i.e., value-added and other growth model estimates).

The use of these measures has become a polarizing issue. Opponents tend to adamantly object to any degree of incorporation, while many proponents do not consider new evaluations meaningful unless they include test-based measures as a major element (say, at least 40-50 percent). Despite the air of certainty on both sides, this debate has mostly been proceeding based on speculation. The new evaluations are just getting up and running, and there is virtually no evidence as to their effects under actual high-stakes implementation.

For my part, I’ve said many times that I'm receptive to trying value-added as a component in evaluations (see here and here), though I disagree strongly with the details of how it’s being done in most places. But there’s nothing necessarily wrong with divergent opinions over an untested policy intervention, or with trying one. There is, however, something wrong with fully implementing such a policy without adequate field testing, or at least without ensuring that the costs and effects will be carefully evaluated post-implementation. To date, no state or district that I'm aware of has mandated a large-scale, independent evaluation of its new system.*

If this is indeed the case, the breathless, speculative debate happening now will only continue in perpetuity.