Creating A Valid Process For Using Teacher Value-Added Measures


We "could" do this, it may even "work" to some degree, but is the best way to do the initial triage? Any consideration of that question has to include the human cost to our students of the extensive testing necessary to produce VAM measures that have any worth at all (as well as the financial costs, the costs in staff time...).

" The measures are not very reliable and therefore bounce around from year to year in ways that have nothing to do with actual performance." This quote is later followed by step one of the process - "The screening approach maintains the new and important focus on teacher evaluation and the use of student test scores in those systems." How is making anything that is not very reliable a reasonable first step in a teacher's evaluation no matter what weight it is given? There is also this statement - "The most obvious problem with this approach is that value-added measures, unlike the medical screening tests, do not capture all potential low-performers. They are statistically noisy, for example, and so many low-performers will get high scores by chance." Again, if the measures are not reliable, just what percentage of low performers are captured and just how many high performers get low scores by chance? If the we are right to be so critical of traditional style evaluations, how does adding an unreliable component help since neither the measures nor the traditional style evaluations are apparently reliable in identifying low performers. Actually, step one conflates two different ideas, the identification of low performers and useful feedback to teachers. If the traditional style evaluation does not provide useful feedback to a teacher, how do unreliable measures that have nothing to do with performance help? In the guise of seeming reasonable, this screening process in step two tries to give legitimacy to using value added measures in conjunction with observations by suggesting two unreliable processes can somehow provide a "“feedback loop” in which both value-added and observations are used to ensure that the other is functioning well – i.e., observations are used to verify the identification of low-performing teachers based on value-added (and help them improve), while value-added is used to identify observers whose performance may be lacking." 
This presupposes that value-added measures can be tied directly to teacher activities an observer should be able to see. Furthermore, we have now added another layer of complexity, in which a teacher is considered a poor performer because of an unreliable measure, and an observer is possibly considered a poor observer because the observer's feedback is not aligned with that same unreliable measure. Unless we have observers for the observers, how does this situation not bias an observer's judgment toward validating that value-added measure? Maybe value-added measures could be useful as a starting point for reflecting on, conversing about, or focusing observations on current practices, though all of those things could take place without value-added measures. The real question is why step three does not say that the screening approach ensures value-added measures are never used as determinants of high-stakes personnel decisions.
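The reliability question above ("just what percentage of low performers is captured?") can be made concrete with a toy simulation. All numbers here are invented for illustration: true teacher effects are drawn from a standard normal, and the VAM estimate adds noise of equal spread, i.e. a reliability of roughly 0.5, which is in the ballpark sometimes cited for single-year value-added estimates.

```python
import random

random.seed(0)

# Illustrative sketch, not an estimate from real data: each teacher's
# true effect is standard normal; the VAM score adds equal-variance
# noise, giving a measure with reliability ~0.5.
n = 100_000
true_effect = [random.gauss(0, 1) for _ in range(n)]
vam_score = [t + random.gauss(0, 1) for t in true_effect]

# Flag the bottom 20% of VAM scores as "low performers".
cutoff = sorted(vam_score)[n // 5]
true_cutoff = sorted(true_effect)[n // 5]

flagged = [v < cutoff for v in vam_score]
truly_low = [t < true_cutoff for t in true_effect]

caught = sum(f and t for f, t in zip(flagged, truly_low))
print("share of truly low performers flagged:", caught / sum(truly_low))
print("share of flagged teachers not truly low:", 1 - caught / sum(flagged))
```

Under these made-up assumptions, the screen catches only a bit over half of the genuinely low performers, and a comparable share of those it flags are not actually low performers at all, which is exactly the commenter's worry about the screen's first step.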

Ed--My argument is more of a semantic one at this point. The idea of only using VAMs after observations isn’t functionally different from a system where VAMs make up X% of the evaluation, with X being some number low enough that VAMs can’t be solely responsible for dismissal. If in fact there are lots of systems where high test score variance can lead teachers to be dismissed based on VAMs alone when the designers did not intend it that way, then this system does fix that, but I’m skeptical of how often that occurs in practice. That’s why I think point #3 in the post is less a consequence of an original system and more a result of ensuring a low enough VAM component. This matters because at this point it may be easier to convince people to have lower VAM components than to adopt a new type of sequential system.

I agree that the system is an improvement -- and said so -- particularly because of the efficiencies you mention regarding not focusing on strong teachers. But in terms of avoiding things like teaching to the test, the system is not logically different. Your point seems to be that you don't have to worry about test scores because you will only raise red flags if your observations are bad. But even under standard systems that weight observations enough to give them a "veto," you don't have to worry about test scores if your observations aren't bad. I think what you're really saying is that Dr. Harris' system is better because it guarantees observations this "veto," and thus ensures that less weight will be placed on test scores. That's a fair point to make, but it has nothing to do with the clever design of the system. My point is that if you take two systems where test scores are not weighted enough to cause a dismissal on their own, the decision over whether to count observations and test scores together or in sequence will have no effect on a teacher's chances of being dismissed. The fact that you think this is the case supports my first point -- using observations and value-added measures in sequence sounds like it's less arbitrary.

Eric--I disagree as well. As Bruce Baker, Matt (I believe), Carol, and others have pointed out, initial weights that have VAMs at less than 50% of an overall eval score can end up with VAMs acting as a sole determinant of effectiveness if the variance in VAMs is much greater than the variance in the other components. Non-VAM achievement measures tend to be worrisome as well because they don't control for the factors outside the control of the teacher (not that VAMs control for all these factors, but at least they try to).
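The variance point can be sketched numerically. In this hypothetical (the weights and spreads are invented for illustration), VAM gets only 40% of the nominal weight, but because its scores are five times as spread out as observation scores, the composite ends up tracking VAM almost exclusively:

```python
import random
import statistics

random.seed(42)

# Hypothetical illustration: 1000 teachers, VAM nominally weighted 40%,
# observations 60%, but VAM scores have 5x the spread.
n = 1000
vam = [random.gauss(0, 5) for _ in range(n)]   # noisy, wide spread
obs = [random.gauss(0, 1) for _ in range(n)]   # narrow spread
composite = [0.4 * v + 0.6 * o for v, o in zip(vam, obs)]

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.pstdev(x), statistics.pstdev(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) * sx * sy)

print(f"composite vs VAM:          r = {corr(composite, vam):.2f}")
print(f"composite vs observations: r = {corr(composite, obs):.2f}")
```

Despite the 60% observation weight, the composite correlates far more strongly with the VAM component, so VAM effectively determines the ranking -- which is why nominal weights alone don't settle the question unless the components are put on a common scale first.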

I've heard this said many times lately, and it is true: teaching (and education) cannot be simplified into an algorithm. The diversity of any given classroom on any given day, let alone any given year, will always make the evaluation process a messy one. VAM is yet another effort to try to neaten up a process that really defies simplification. It also places on standardized tests yet another weight they were never meant to bear. The push for VAM is part of a larger catch-22 created by NCLB. Most states adopting it ignore the studies citing the ineffectiveness of such measures because, in the larger picture, including VAM will allow them the opportunity to apply for waivers from the more punitive elements of NCLB. It's a sad cycle.

Kudos to Dr. Harris for providing a clear approach as to how data can be used to drive improvement, and for acknowledging and addressing the political challenges of implementing teacher evaluation reforms.

And the convincing empirical basis for the use of value-added measures in educational improvement is? Well? There are more effective ways to spend the huge and ever-growing amount of money involved in this VAM enterprise, with its dubious or simply useless outcomes. Measurement policies of 'teacher effectiveness' have led nowhere but to the demoralization of teachers and the erosion of the teaching profession. That is typically *not* in the interest of our students.

This is a slightly more palatable approach, but I reject the medical analogy. Medical tests are so narrow and so simple. The classroom is a thoroughly dynamic combination of dozens of people, and dozens of factors known and unknown - and it differs each year. I am hardly reassured by the fact that economists find enough patterns in the data to rationalize the use of measures that are neither valid nor reliable, and then reassure teachers that everything will be okay. As it applies to my own subject area (HS English), the use of VAM could not even attempt to measure most of what I teach, because the tests are simply that awful - and we can't even look at the tests in order to determine their potential (unlikely) usefulness in improving teaching. Let's dispense with VAM for evaluation and dedicate our time and energy to developing and constantly improving evaluation methods that enjoy greater support among those engaged in the work. After all, none of the top schools or systems in the world got there by using VAM for evaluation.

Something to consider when using two different measures and treating 'failure' (or whatever one wants to call it) as low performance on either measure is that you increase the false positive rate (i.e., flagging a teacher who is not actually performing poorly because just one of the two measures came up low). If there is a cost to overestimating poor performance (e.g., administrative time, teacher demoralization, etc.), this would seem to be a problem.
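The fail-either arithmetic is simple to check. With made-up numbers -- two independent screens that each wrongly flag an adequate teacher 10% of the time -- the chance of surviving both is 0.9 x 0.9 = 0.81, so the fail-either rule wrongly flags 19%, nearly double either screen alone:

```python
import random

random.seed(1)

# Sketch of the commenter's point with invented rates: two independent
# screens, each flagging an adequate teacher 10% of the time by chance.
p_false = 0.10
trials = 100_000

# A teacher is flagged if EITHER screen comes up low.
either = sum(
    (random.random() < p_false) or (random.random() < p_false)
    for _ in range(trials)
) / trials

print(f"false-flag rate, single screen:    {p_false:.0%}")
print(f"false-flag rate, fail-either rule: {either:.1%}")
```

The sequential approach discussed in the post works the other way: observations must confirm the VAM flag before anything happens, which is effectively a "fail both" rule that lowers false positives at the cost of missing more genuinely low performers.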


