Value-Added As A Screening Device: Part II

Our guest author today is Douglas N. Harris, associate professor of economics and University Endowed Chair in Public Education at Tulane University in New Orleans. His latest book, Value-Added Measures in Education, provides an accessible review of the technical and practical issues surrounding these models.

This past November, I wrote a post for this blog about shifting course in the teacher evaluation movement and using value-added as a “screening device.” This means that the measures would be used: (1) to help identify teachers who might be struggling and for whom additional classroom observations (and perhaps other information) should be gathered; and (2) to identify classroom observers who might not be doing an effective job.

Screening takes advantage of the low cost of value-added and of the fact that the estimates are more accurate for assessing general performance patterns across teachers than for judging individuals. At the same time, it avoids the main weaknesses of value-added: the measures are often inaccurate for individual teachers, and they are confusing and not very credible among teachers when used for high-stakes decisions.

I want to thank the many people who responded to the first post. There were three main camps.

One group seemed very supportive of the idea. For example, Carol Burris, a principal who is active in challenging the system in New York State, was among the most vocal supporters.

A second group of commenters were critical mainly of the fact that I did not go further and recommend eliminating all uses of value-added measures in teacher evaluation. I think that would be going too far because the screening approach avoids the models’ main weaknesses, while still serving the useful purpose of pushing forward the Obama Administration’s call for the development of effective teacher evaluation and feedback systems (more on this below).

Of those who wanted to go further in scaling back value-added, some argued that we should curtail student testing more generally. Count me as one who thinks testing is going too far, especially in states like Florida, which have decided to test everybody no matter their age or their course subject. As I wrote in my book on value-added, we should at least find some evidence that this approach to teacher evaluation works in grades and subjects that are already tested before expanding it further. (As an aside, I started writing this post from Asia, where school officials tell me they are continuing to reduce the amount of testing, though their exams remain extremely high-stakes.) In my proposal, however, I was just taking the testing world as it is today and trying to come up with better uses.

There were also some responses that raised more policy questions. Chad Aldeman’s piece on the Education Sector blog suggested that the screening approach I proposed was already possible within the Race to the Top (RTTT), Teacher Incentive Fund (TIF), and ESEA waivers, all of which require that student growth or value-added be a “significant factor.” What constitutes a “significant factor,” however, is unclear. The screening idea could be used in combination with composite measures, but in its “pure form,” on which I’ll focus here, value-added is not part of the composite measure. Rather, the focus is on ensuring an effective evaluation process, but there is no role for value-added in final decisions. It’s certainly not obvious that this is compatible with the “significant factor” language.

I consulted several colleagues with recent experience working in TIF and RTTT. All seemed to agree with Aldeman that the screening idea could fit within the rules, although none came up with any examples where it had been done. This is not surprising because relying on the screening approach in the initial competitive grant proposals would have been a risky strategy.

If you were a state or school district applying for a competitive grant or waiver and you knew only a fraction of the submissions would win, you would do everything you could to make your proposal rise above the rest, not take a chance on an idea that was not yet part of the conversation, and which seemed to contradict the “significant factor” language. This is why, to my knowledge, almost every RTTT and TIF proposal included the composite index approach (see exceptions below). It is also why, in my original post, I wrote that the Department “encouraged—or required, depending on your vantage point—states to lump value-added or other growth model estimates together with other measures.” I still think that’s accurate.

Aldeman made several additional points, and he and I have subsequently had some productive conversations by email. He pointed out, for example, that some of the winning RTTT proposals involved using value-added in a “matrix” in which teachers could not be labeled low-performing if they had high value-added scores (and vice versa for high-performing teachers). This is different from creating a composite index, although I would argue this approach shares the same weaknesses. Many ineffective teachers have high value-added scores and some effective teachers have low value-added scores. This implies that a teacher with a low classroom observation score and a high value-added score would not be labeled ineffective under either the composite approach or the matrix approach; in the composite approach, the low and high scores would average out. Of course, when the two scores do line up, all the approaches yield the same answer, so those are not the relevant cases. For this reason, I do not see the matrix approach as a good solution.
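To make the comparison concrete, here is a minimal sketch of the three labeling rules applied to one teacher. Everything specific in it is an assumption for illustration only: both scores are placed on a hypothetical 1-4 scale, the composite uses equal weights, and the cutoffs (2.0 for "low," 3.0 for "high") are invented rather than drawn from any actual RTTT or TIF plan.

```python
def composite_label(observation, value_added, weight_va=0.5, cutoff=2.0):
    """Composite rule: label from a weighted average of the two scores.
    weight_va and cutoff are hypothetical values."""
    score = weight_va * value_added + (1 - weight_va) * observation
    return "ineffective" if score < cutoff else "not ineffective"


def matrix_label(observation, value_added, low=2.0, high=3.0):
    """Matrix rule: a low observation score cannot produce an
    'ineffective' label when value-added is high."""
    if observation < low and value_added < high:
        return "ineffective"
    return "not ineffective"


def screening_label(observation, low=2.0):
    """Pure screening rule: value-added only triggers extra observations
    upstream, so the final label depends on observation alone."""
    return "ineffective" if observation < low else "not ineffective"


# Mismatched case: low observation score, high value-added score.
obs, va = 1.0, 4.0
print(composite_label(obs, va))  # scores average to 2.5
print(matrix_label(obs, va))     # blocked by the high value-added score
print(screening_label(obs))      # decided by observation only

# Matched case: both scores low, so all three rules agree.
print(composite_label(1.0, 1.0), matrix_label(1.0, 1.0), screening_label(1.0))
```

In this sketch the composite and matrix rules give the same answer in the mismatched case, which is the only case where the choice of rule matters.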

In his blog post, Aldeman also lays out two problems with the screening idea:

[Harris] would have student growth come first. A low rating could lead to closer inspection of a teacher’s classroom practice. But there are two main problems with doing it this way. One is that it doesn’t work as well from a timing standpoint. Student growth scores often come much later than observation results. And two, most teacher evaluation systems historically have not done a good job of differentiating teachers and providing them the feedback they need to improve. If low student growth just leads to the same high evaluation scores, it’s hard to say student growth played a “significant part” in a teacher’s overall rating.

On the first point, it is important to note that value-added measures pose a timing problem no matter how they are used. This is partly because we have to wait for the student test scores and then wait again for the district or outside vendor to create the value-added measures. In addition, some states and districts are following researchers’ advice and using multiple years of test score data for each value-added calculation. For example, a composite measure in November of the current year would be based on a weighted average of last year’s classroom observation and a value-added measure from the year that just ended, averaged with value-added from one or more prior years. Aldeman is correct that this “mismatch” in timing is slightly exacerbated with the screening approach because the value-added from prior years would only trigger additional classroom observations later in the current year. However, the screening approach reduces the role of value-added as a direct factor in personnel decisions, so it is not clear that a slightly larger mismatch is relevant.

On Aldeman’s second point, a major part of my rationale for the screening approach is to avoid exactly the problem he identifies: teachers all receiving the same high ratings when in fact their performance varies, often considerably. If all teachers receive high classroom observation ratings, then in the proposed screening system we would see a very low correlation between value-added and classroom observations, which is a red flag. The need for some type of insurance policy against uniformly high ratings is one reason why I part ways with those who want to get rid of value-added altogether (see above).
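As a rough illustration of that red flag, the sketch below computes, for each observer, the correlation between the observation ratings they gave and the same teachers' value-added estimates, and flags observers whose ratings barely track value-added. The data, the observer names, and the 0.1 threshold are all hypothetical. Treating "no variation in ratings" as an automatic flag is my assumption about how such a check might work, not a description of any existing system.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns None when either series has no variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def flag_observers(ratings_by_observer, min_corr=0.1):
    """Return observers whose ratings track value-added weakly or not at all.

    ratings_by_observer maps an observer name to a list of
    (observation_rating, value_added) pairs, one per teacher.
    min_corr is a hypothetical threshold.
    """
    flagged = []
    for observer, pairs in ratings_by_observer.items():
        observations = [p[0] for p in pairs]
        value_added = [p[1] for p in pairs]
        r = pearson(observations, value_added)
        if r is None or r < min_corr:
            flagged.append(observer)
    return flagged


# Invented data: observer_a gives everyone the same high rating.
example = {
    "observer_a": [(4, 0.3), (4, -0.5), (4, 0.1), (4, -0.2)],
    "observer_b": [(2, -0.4), (3, 0.0), (4, 0.5), (1, -0.6)],
}
print(flag_observers(example))  # ['observer_a']
```

A flag here would only prompt a closer look at the observer's ratings. It would not enter any teacher's final evaluation.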

It is not too late to change course. While the states and districts that won these grants and/or ESEA waivers had a strong incentive to use the composite and matrix approaches in the initial competition, they could now ask for clarification and modifications. The development of these new human capital systems is a long-term proposition, one that will require careful observation and adaptation when new ideas and evidence emerge.

Finally, in another response to my initial post, Bruce Baker mentioned that the screening idea is not entirely new. I started mentioning the idea in my value-added presentations as the RTTT process began, and included the same idea in my 2011 book. No doubt others came up with similar ideas on their own. Bruce indicated that he has the idea in his forthcoming book, and he pointed to a presentation that Steve Glazerman made at Princeton in 2011 (Steve confirmed this and provided me his slides). I know and respect Bruce and Steve and I am glad they are talking about it as well. If you know of other discussions on this topic, please add a comment to this post. The more people write about it, the more people will consider it. Thanks again to all who responded and especially to Chad Aldeman for the productive back and forth.

- Douglas N. Harris

It sounds to me like we may be trying very hard to find some use for what is basically unsound practice. Perhaps the effort to evaluate teachers would be more effective if we left VAM behind and moved on to evaluation methods that have proven to be successful. The issue of teacher training has been largely ignored in my state, Fla., and others too I suspect. As a retired NBCT, I wonder why states care so little for training, and why it is so easy to qualify for a teaching job in the first place. Pay obviously plays a role in that systemic weakness.

First, thank you for all your thought on this - the book, last blog, this blog. All great reading.

Can you comment on where you see VAM versus observations (and if you feel like it, student surveys)? How did you react to seeing the measly correlation of observations with VAM in the MET study, something like 0.2?

As for the origin of the proposal to use VAM as a screening device, it predates any of the citations offered in this blog entry. Consider the following quote:

One can envision VAM results serving as a component of a multidimensional system of evaluation; however, this possibility would have to be evaluated in light of the specifics of such a system and the context of use. In the meantime, one can support the use of VAM results as a first screen in a process to identify schools in need of assistance or teachers in need of targeted professional development. Subsequent stages should involve interviews and/or direct observation since statistical analyses cannot identify the strategies and practices employed by educators. In some districts, VAM results have been used successfully in this manner by principals and teachers in devising individualized improvement strategies.

From Henry Braun and Howard Wainer. (2007). “Value-Added Modeling.” Handbook of Statistics, Vol. 26 (Elsevier), p. 899.

Hi Doug--check out the new observation schedule from the TN DOE. They're now taking teachers' individual growth scores into account when deciding how frequently they should be observed in subsequent observation cycles. The scores are composites, but it's still a move toward a type of "screening" process...Hope this helps!

See this chart: http://team-tn.org/assets/misc/Number%20of%20Observations%20Teacher%20L…

In response to Mike G - It seems pretty clear that value-added measures of student test score growth and classroom observations of teachers (typically using Danielson-based rubrics) are capturing different dimensions of teaching (see Rothstein, J. & Mathis, W.J. (2013). Review of “Have We Identified Effective Teachers?” and “A Composite Estimator of Effective Teaching”: Culminating findings from the Measures of Effective Teaching Project. Boulder, CO: National Education Policy Center. Retrieved [1/31/13] from http://nepc.colorado.edu/thinktank/review-MET-final-2013). I don’t know a good reason why one would expect there to be a high degree of correlation between these measures.

Unfortunately, policymakers and other elected officials, the overwhelming majority of whom lack the training and expertise to understand the myriad methodological issues involved, simplistically ASSUME that teacher value-added scores are the most valid measure of their effectiveness. It doesn’t help that so many academic studies engage in what many have noted is the circular reasoning of demonstrating that prior value-added is the best predictor of effectiveness, measured as current value-added.

Despite being warned of this trap, Pennsylvania’s education policymakers are planning to rate principals, in part, based on the correlation between their classroom evaluations of teachers and the value-added scores the same teachers obtain (so much for the independence of these "multiple" measures). I hope that more and more academics will be turning their attention to the evaluation systems that are live and going live very soon. One would hope that they would be very aggressive in their attempts to correct inappropriate and destructive practices.