Creating A Valid Process For Using Teacher Value-Added Measures

** Reprinted here in the Washington Post

Our guest author today is Douglas N. Harris, associate professor of economics and University Endowed Chair in Public Education at Tulane University in New Orleans. His latest book, Value-Added Measures in Education, provides an excellent, accessible review of the technical and practical issues surrounding these models.

Now that the election is over, the Obama Administration and policymakers nationally can return to governing. Of all the education-related decisions that have to be made, the future of teacher evaluation has to be front and center.

In particular, how should “value-added” measures be used in teacher evaluation? President Obama’s Race to the Top initiative expanded the use of these measures, which attempt to identify how much each teacher contributes to student test scores. In doing so, the initiative embraced and expanded the controversial reliance on standardized tests that started under President Bush’s No Child Left Behind.

In many respects, The Race was well designed. It addresses an important problem - the vast majority of teachers report receiving limited quality feedback on instruction. As a competitive grants program, it was voluntary for states to participate (though involuntary for many districts within those states). The Administration also smartly embraced the idea of multiple measures of teacher performance.

But they also made one decision that I think was a mistake. They encouraged—or required, depending on your vantage point—states to lump value-added or other growth model estimates together with other measures. The raging debate since then has been over what percentage of teachers’ final ratings should be given to value-added versus the other measures. I believe there is a better way to approach this issue, one that focuses on teacher evaluations not as a measure, but rather as a process.

The idea of combining the measures has some advantages. For example, as I wrote in my book about value-added measures, combined measures have greater reliability and probably better validity as well. But there is also one major issue: Teachers by and large do not like or trust value-added measures. There are some good reasons for this: The measures are not very reliable and therefore bounce around from year to year in ways that have nothing to do with actual performance. There is more debate about whether the measures are, in any given year, providing useful information about “true” teacher performance (i.e., whether they are valid).

The larger problem is that policymakers have tended to look at the teacher evaluation problem like measurement experts rather than school leaders. Measurement experts naturally want valid and reliable measures, ones that accurately capture teacher effectiveness. School leaders, on the other hand, can and should be more concerned about whether the entire process leads to valid and reliable conclusions about teacher effectiveness. The process includes measures, but also clear steps, checks and balances, and opportunities to identify and fix evaluation mistakes. It is that process, perhaps as much as the measures themselves, that instills trust in the system among educators. But the idea of combining multiple measures has short-circuited discussion about how the multiple measures, and especially value-added, could be used to create a better process.

One possible process comes from the medical profession. It is common for doctors to “screen” for major diseases, using procedures that identify all the people who do have the disease, but also flag some who do not (the latter being false positives). Those who are positive on the screening test are given another “gold standard” test that is more expensive but almost perfectly accurate. Doctors do not average the screening test together with the gold standard test to create a combined index. Instead, the two pieces are considered in sequence.

Ineffective teachers could be identified the same way.

Value-added measures could become the educational equivalent of screening tests. They are generally inexpensive and somewhat inaccurate. As in medicine, a value-added score, combined with some additional information, should lead us to engage in additional classroom observations to identify truly low-performing teachers and to provide feedback to help those teachers improve. If all else fails, within a reasonable amount of time, after continued observation, administrators could counsel the teacher out or pursue a formal dismissal procedure.

The most obvious problem with this approach is that value-added measures, unlike the medical screening tests, do not capture all potential low-performers. They are statistically noisy, for example, and so many low-performers will get high scores by chance. For this reason, value-added would not be the sole screener. Instead, some other measure could also be used as a screener. If teachers failed on either measure, then that would be a reason for collecting additional information. (This approach also solves another problem discussed later.)
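
The either-measure rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Teacher` and `needs_closer_look`, the score scales, and both cutoffs are hypothetical and do not come from any actual evaluation system.

```python
from dataclasses import dataclass

# Hypothetical cutoffs, chosen only for illustration.
VALUE_ADDED_CUTOFF = -0.5   # value-added on an assumed standardized scale
OBSERVATION_CUTOFF = 2.0    # observation rubric score on an assumed 1-4 scale


@dataclass
class Teacher:
    name: str
    value_added: float        # screening measure 1
    observation_score: float  # screening measure 2


def needs_closer_look(teacher: Teacher) -> bool:
    """Flag a teacher for additional observations if EITHER screener is low.

    A flag is not a rating; it only triggers further information gathering.
    """
    low_value_added = teacher.value_added < VALUE_ADDED_CUTOFF
    low_observation = teacher.observation_score < OBSERVATION_CUTOFF
    return low_value_added or low_observation


teachers = [
    Teacher("A", value_added=0.3, observation_score=3.2),
    Teacher("B", value_added=-0.9, observation_score=3.0),  # low value-added only
    Teacher("C", value_added=0.4, observation_score=1.5),   # low observation only
]
flagged = [t.name for t in teachers if needs_closer_look(t)]
print(flagged)
```

In this sketch the flag never feeds into a final rating. It only decides who receives the additional classroom observations on which any later decision would rest.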

There is a second way in which value-added could be used as a screener – not of teachers, but of their teacher evaluators. To explain how, I need to say more about the “other” measures in an evaluation system. Almost every school system that has moved to alternative teacher evaluations has chosen to also use classroom observations by peers, master teachers, and/or school principals. The Danielson Framework, PLATO, and others are now household names among educators. Classroom observations have many advantages: They allow the observer to take account of the local context. They yield information that is more useful to teachers for improving practice. And we can increase their reliability by observing teachers more often.

The difficulty is that these measures, too, have validity and reliability issues. Two observers can look at the same classroom and see different things. That problem is more likely when the observers vary in their training. Also, some observers might know teachers’ value-added scores and let those color their views during the observations - they might think, “I already know this teacher is not very good so I will give her a low score.”

Value-added measures might actually be used to fix these problems with classroom observations. To see how, note that researchers have found consistent, positive correlations between value-added and classroom observation scores. They are far from perfect correlations (mainly because of statistical noise), but they provide a benchmark against which we can compare (validate, if you will) the scores across individual observers. Inaccurate classroom observation scores would likely show up as low correlations with value-added. Conversely, if observers were having their scores influenced by value-added, then the correlations might be very high, which might also be a red flag.

In these cases, an additional observer might be used to make sure the information is accurate. In other words, value-added can screen the performance of not only teachers, but observers as well. Used in these ways, value-added would be a key part of the system but without being the determining factor in personnel decisions.
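
As a rough sketch of how observers might be screened, the Python below computes, for each observer, the correlation between the observation scores that observer gave and the same teachers' value-added scores, and flags correlations that look unusually low or unusually high. The function names and both thresholds are hypothetical; in practice the benchmark would have to come from the correlations actually seen across observers, and each observer would need to have rated enough teachers in tested classrooms for the correlation to mean anything.

```python
from math import sqrt

# Hypothetical thresholds, for illustration only.
LOW_CORRELATION = 0.1    # below this, the observer's scores may be inaccurate
HIGH_CORRELATION = 0.9   # above this, scores may be influenced by value-added


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Assumes each sequence has some variation (otherwise this divides by zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def screen_observer(observation_scores, value_added_scores):
    """Return 'low', 'high', or 'ok' for one observer's set of teachers."""
    r = pearson(observation_scores, value_added_scores)
    if r < LOW_CORRELATION:
        return "low"    # candidate for a second observer
    if r > HIGH_CORRELATION:
        return "high"   # scores may simply mirror value-added
    return "ok"


# Made-up value-added scores for five teachers rated by each observer.
value_added = [-1.0, -0.5, 0.0, 0.5, 1.0]

typical = screen_observer([2.0, 3.0, 2.5, 3.0, 3.5], value_added)
mirroring = screen_observer([1.0, 1.5, 2.0, 2.5, 3.0], value_added)
unrelated = screen_observer([3.0, 2.0, 3.0, 2.0, 3.0], value_added)
print(typical, mirroring, unrelated)
```

Either flag would only prompt a second observer to take a look. It would not by itself say anything about the teachers involved.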

This screening approach would solve a host of problems.

1. The screening approach maintains the new and important focus on teacher evaluation and the use of student test scores in those systems. The NEA and AFT themselves have been rightly critical of traditional-style evaluation systems because they provide so little useful feedback to teachers. Screening with value-added places the emphasis on formative, feedback-based measures such as observations.
2. The screening approach represents a “feedback loop” in which both value-added and observations are used to ensure that the other is functioning well - i.e., observations are used to verify the identification of low-performing teachers based on value-added (and help them improve), while value-added is used to identify observers whose performance may be lacking. All measures have their flaws and value-added can help address these.
3. The screening approach ensures that value-added measures are never the primary determinants of high-stakes personnel decisions. Rather, in this alternative proposal, value-added would only serve to trigger a closer look at a teacher’s performance, but the actual decisions would be based on classroom observations by experts. These have much greater support among teachers and provide more useful feedback.
4. The screening approach helps schools focus their evaluation resources where they count: On low-performing teachers and low-performing classroom observers. This is crucial in these tough economic and fiscal times, during which schools must allocate resources carefully.
5. The screening approach can be applied to all teachers, not just those in tested grades and subjects. A common criticism of value-added is that it cannot be applied to all teachers. With the approach I am proposing, only the initial screening process would differ (e.g., a single classroom observation that all teachers would receive) and the remainder of the process could be based on a more standard set of measures (additional classroom observations).
6. The screening approach, because it works in all grades and subjects, avoids the unfortunate response, in states such as Florida, of expanding testing to every grade and subject. Teaching to the test is a real problem and expanded testing will make it worse. Value-added could serve to test screeners even of non-tested grades and subjects as long as those same screeners have some teachers in tested classrooms.
7. The screening approach ensures that there is enough information that educational leaders will be able to sleep at night knowing they are making the best possible personnel decisions - that their tough choices will not be overturned by lawsuits alleging arbitrary and capricious firings.

Since I started with a medical analogy, some might want to call this a “triage” approach. This term fits in some ways but not in others. In both cases, the focus is on allocating resources in cost-effective ways. The higher-performing teachers get less attention just as healthier patients do. On the other hand, there is a difference between this approach and medical triage, as the latter entails devoting few resources to those who are least likely to make it. Here, by contrast, part of the point is to collect more information on these struggling teachers so that personnel decisions can be made with confidence and in keeping with legal requirements.

The screening approach certainly wouldn’t solve all the problems with the new teacher evaluation systems. The choice of additional measures beyond value-added, and the implementation of these measures, are critical. So are the ways in which the evaluations are used in personnel decisions.

Value-added measures have played a valuable role in sparking this important debate, but they need not do all the heavy lifting for our reformed teacher evaluation systems. We need not just a number, but a process for identifying low-performing teachers and helping them get better.

- Douglas Harris

Kudos to Dr. Harris for providing a clear approach as to how data can be used to drive improvement, and for acknowledging and addressing the political challenges of implementing teacher evaluation reforms.

This is a slightly more palatable approach, but I reject the medical analogy. Medical tests are so narrow and so simple. The classroom is a thoroughly dynamic combination of dozens of people, and dozens of factors known and unknown - and it differs each year. I am hardly reassured by the fact that economists find enough patterns in the data to rationalize the use of measures that are neither valid nor reliable, and then reassure teachers that everything will be okay. As it applies to my own subject area (HS English), the use of VAM could not even attempt to measure most of what I teach, because the tests are simply that awful - and we can't even look at the tests in order to determine their potential (unlikely) usefulness in improving teaching. Let's dispense with VAM for evaluation and dedicate our time and energy to developing and constantly improving evaluation methods that enjoy greater support among those engaged in the work. After all, none of the top schools or systems in the world got there by using VAM for evaluation.

Something to consider when using two different measures and treating 'failure' (or whatever one wants to call it) as low performance on either measure is that you increase the false positive rate (i.e., a teacher isn't performing poorly despite one poor measurement). If there is a cost to overestimating poor performance (e.g., administrative time, teacher demoralization, etc.), this would seem to be a problem.
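
The arithmetic behind this concern can be sketched as follows. The 10% false positive rates are made-up numbers, and the calculation assumes the two screeners err independently of each other. If their errors are positively correlated, the combined rate would be lower than this.

```python
def either_measure_false_positive_rate(fpr_a: float, fpr_b: float) -> float:
    """False positive rate when a teacher is flagged if EITHER screener flags.

    Assumes the two screeners' errors are independent: a teacher who is not
    low-performing escapes a flag only if both screeners clear them.
    """
    return 1 - (1 - fpr_a) * (1 - fpr_b)


# Hypothetical: each screener wrongly flags 10% of adequate teachers.
combined = either_measure_false_positive_rate(0.10, 0.10)
print(round(combined, 2))
```

Under these assumptions, flagging on either measure wrongly flags close to twice as many adequate teachers as one screener alone.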

Dr. Harris' system is an improvement, but it's mostly for the PR-ish reason that it's easy to understand and doesn't come with accompanying rhetoric designed to scare teachers. As for the key reason of ensuring balance -- #3 on the list -- most current systems do in fact "ensure that value-added measures are never the primary determinants of high-stakes personnel decisions." That's why value-added measures generally make up less than 40% of an evaluation. If you don't perform poorly on the other 60%-90%, you're not going to lose your job.

The fact that this system is seen as a real improvement -- and it is -- shows that much of the hubbub over value-added measures is about politics and not policy.

Eric, I disagree. This is a significant departure and improvement. First, it is used not to rate teachers on a bell curve, which is what most systems do, but to identify those who need the most support for intensive intervention. Right now, I am spending an inordinate amount of my time observing strong teachers who frankly need one yearly observation and perhaps some short drop-ins. Last year, before APPR in NY, we could spend our time working with teachers who really needed support. That is at the heart of this system.
Second, because the number generated from VAM is not part of an evaluation number, it avoids the unintended consequences of intense teaching to the test, or the fear teachers now experience when difficult-to-teach students are assigned to their class.
Third, the structure of many systems places far more weight on the VAM or achievement score portion than appears at first glance. In NY student achievement is 40%--however, because you need 65 points to get out of ineffective, which is "the dismissal zone", that 40% takes on significant weight--if you score ineffective, then you are ineffective overall. Which state has a system where VAM is a true 10%? I know of know.
Doug Harris gets the problem. I assure you, as a longtime principal of an excellent school, this is not about politics. It is about opposition to an evaluation system that will harm school improvement efforts.

Correction... Last sentence paragraph 3. I know of none.

I agree that the system is an improvement -- and said so -- particularly because of the efficiencies you mention regarding not focusing on strong teachers. But in terms of avoiding things like teaching to the test the system is not logically different. Your point seems to be that you don't have to worry about test scores because you will only raise red flags if your observations are bad. But even under standard systems that weight observations enough to give them a "veto" you don't have to worry about test scores if your observations aren't bad.

I think what you're really saying is that Dr. Harris' system is better because it guarantees observations this "veto," and thus ensures that less weight will be placed on test scores. That's a fair point to make, but it has nothing to do with the clever design of the system. My point is that if you take two systems where test scores are not weighted enough to cause a dismissal on their own, the decision over whether to count observations and test scores together or in sequence will have no effect on a teacher's chances of being dismissed. The fact that you think this is the case supports my first point -- using observations and value-added measures in sequence sounds like it's less arbitrary.

Eric--I disagree as well. As Bruce Baker, Matt (I believe), Carol, and others have pointed out, initial weights that have VAMs at less than 50% of an overall eval score can end up with VAMs acting as a sole determinant of effectiveness if the variance in VAMs is much greater than the variance in the other components. Non-VAM achievement measures tend to be worrisome as well because they don't control for the factors outside the control of the teacher (not that VAMs control for all these factors, but at least they try to).

And the convincing empirical basis for the use of value-added measures in educational improvement is? Well?

There are more effective ways to spend the huge and ever-growing amount of money involved in this VAM enterprise, with its dubious or simply useless outcomes. Measurement policies of 'teacher effectiveness' have led nowhere but to demoralization of teachers, and the erosion of the teaching profession.

That is typically *not* in the interest of our students.

We "could" do this, it may even "work" to some degree, but is it the best way to do the initial triage?

Any consideration of that question has to include the human cost to our students of the extensive testing necessary to produce VAM measures that have any worth at all (as well as the financial costs, the costs in staff time...).