Reign Of Error: The Publication Of Teacher Data Reports In New York City


Oh, and I notice that the DOE bases the teacher's percentiles on the distributions of value-added score averages for teachers with the "same" amount of experience, rather than all teachers. Maybe that is marginally a fairer comparison - but is it? The DOE website seems to say they are moving to a different approach (or the state is?), but I have no clue what it will be. Colorado-style growth percentiles, maybe? Have to keep an eye on them, and for sure, don't turn your back.

Matt, Overall, I think you've done us a great service by digging more deeply into the labels and the meanings of different estimates and calculations. I also think everything Harris Zwerling noted above (in his "incomplete" list) presents insurmountable problems for the use of VAM to evaluate teachers given the available data. Also, if we change some significant variables in the context and conditions surrounding the teacher's work from year to year, then haven't the relevant data been produced in different ways, rendering them much more difficult, if not impossible, to compare, or to blend into a single calculation? And do you know any teachers for whom conditions are held constant from one year to the next? If our goal is to compare teachers, can we assume that changes in conditions each year affect each teacher in the same way? And finally, a stylistic request: a few times above you used the data and labels to say that "the teacher is" average, above, below, etc. I don't think you mean to suggest that these tests actually tell us what the teacher's overall quality or skill might be. "An average teacher" in this situation is more precisely "a teacher whose students produced average test score gains." I think it's important not to let our language and diction drift, lest our thoughts become similarly muddled on this topic.

I think one more thing to consider here might be the psychology of interpreting statistics. Those like yourself who have advanced statistics training have extensive experience in ignoring certain tendencies, or looking at numbers in different ways than most people do. For example, it might be true, and seemingly honest, to say that a teacher scored at the 27th percentile, which, with a 53-point margin of error, is not statistically different from average. But when people read 27 plus or minus 53, they are given the impression of precise measurements (look, they are even precise in their margin of error!). I know that the margin of error as a statistical concept is calculated in a certain way, but that is not how most people understand it. To me, this is similar to reporting way too many decimal places, giving a reader an overly precise estimate. I could say that my car gets 27.567 miles per gallon, with a margin of error of 3.023, or I could say that it gets around 25 or 30 miles per gallon. These might be equivalent to someone with statistics training, but they are not to most people. Anyways, good reading as always.
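A minimal sketch of the point above, using the hypothetical numbers from the comment (a 27th-percentile score with a 53-point margin of error): the interval around the estimate spans the 50th percentile, so the score is not statistically distinguishable from "average," however precise the reported digits look.

```python
# Sketch only; the estimate and margin are the hypothetical figures
# quoted in the comment above, not real NYC DOE numbers.

def interval(estimate, margin, lo=0, hi=100):
    """Clip a percentile interval to the valid 0-100 range."""
    return max(lo, estimate - margin), min(hi, estimate + margin)

score, moe = 27, 53                 # 27th percentile, +/- 53 points
low, high = interval(score, moe)

# The interval contains the 50th percentile, so "average" cannot be
# ruled out despite the precise-looking point estimate.
print(f"Percentile {score}, interval [{low}, {high}]")
print("Statistically different from average?", not (low <= 50 <= high))
```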

This was a very good post on the reliability of the data published, but I don't think it goes far enough to cast a shadow over the numbers themselves. First of all, they are a percentile, so although they do make a comment on teachers relative to each other, they don't actually comment at all on the effectiveness of any teacher. Secondly, because they compare one year's score to the next, they don't even really comment on how effective any one teacher is, but rather, how effective one teacher is relative to the teacher the year before. This means an above-average teacher can be ranked as below average if the teacher who teaches one grade level lower is superb. This will work in reverse also. It's a sad day for education and a sad day for logical thought.
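A small sketch of the "percentiles are purely relative" point above (the gain numbers are invented): if every teacher's value-added score improves by the same amount, every percentile rank stays exactly where it was, so the ranks say nothing about absolute effectiveness.

```python
# Illustration only: percentile ranks are zero-sum across teachers.

def percentile_ranks(scores):
    """Percent of the other scores that each score beats."""
    n = len(scores)
    return [100 * sum(s > other for other in scores) / (n - 1)
            for s in scores]

gains = [0.1, 0.3, 0.5, 0.7, 0.9]       # hypothetical value-added scores
improved = [g + 0.4 for g in gains]     # every teacher improves equally

# Uniform improvement leaves every rank unchanged.
print(percentile_ranks(gains))
print(percentile_ranks(improved))
assert percentile_ranks(gains) == percentile_ranks(improved)
```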

Matt, I too thank you for the helpful, thoughtful, and dare I say it, balanced, treatment of the error issues, and the responses add value as well. A couple of additional questions: What do you think of Aaron Pallas's suggestion about a way to handle error, i.e., for any teacher's score that nominally falls within a particular category, you give it that label only if the confidence limits give it a 90% (or pick your particular % criterion) chance of falling within that band; otherwise, you give it the next higher rating (unless the error range is so high that you go even higher :-)). Of course, that doesn't really confront the arbitrary nature of the choice of bands for "low," "below average," "average," etc., or the educationally ambiguous meaning of those classifications, but it would at least be a further accommodation of the error problem. But on that - a question - I don't really understand how error is attributed to individual teachers' percentile scores - is it really individual? If so, gee - interesting ... maybe because of class size, number of years of data, or -- what? Which brings me to - the percentiles here are referenced to the distribution of the average value-added scores of their students, which are based on discrepancies of each student's actual score from their predicted score, based on regressions involving prior-year scores and a kitchen sink of other variables, including student, class, and school characteristics - right? Are the error bands derived from the error stemming from all those individual regressions as they are summed and averaged and the averages turned into percentiles compared to the distribution of other teachers' averages? Wow! And how do the statistics factor in the noise in the test scores themselves, particularly anywhere beyond the middle of the test score distribution?
The NYC DOE website does go a little way in recognizing the problem of the noise at the upper (and lower) ends, and maybe all the stuff they throw into the equations somehow takes this into account, but boy, it looks like a Rube Goldberg operation. I suppose the short way of asking this is -- where do those confidence intervals of 30 to 50 or more percentile points come from, and is there even more error (not to mention validity questions) hiding behind them? Just askin'. Anyway, great work, Matt. Fritz
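Aaron Pallas's suggestion described in the comment above can be sketched as a simple rule. Everything here is invented for illustration: the band cutoffs, the error figures, and the assumption that the estimate's error is roughly normal; it is not the actual NYC DOE procedure.

```python
import math

def norm_cdf(x, mu, sigma):
    """Normal CDF via the error function (standard library only)."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Hypothetical percentile bands, half-open [lo, hi).
BANDS = [("low", 0, 5), ("below average", 5, 25), ("average", 25, 75),
         ("above average", 75, 95), ("high", 95, 100)]

def pallas_label(estimate, stderr, criterion=0.90):
    """Keep the nominal band label only if the score has at least
    `criterion` probability of truly falling in that band; otherwise
    bump the label up one band."""
    masses = [norm_cdf(hi, estimate, stderr) - norm_cdf(lo, estimate, stderr)
              for _, lo, hi in BANDS]
    nominal = next(i for i, (_, lo, hi) in enumerate(BANDS)
                   if lo <= estimate < hi)
    if masses[nominal] >= criterion:
        return BANDS[nominal][0]
    return BANDS[min(nominal + 1, len(BANDS) - 1)][0]

print(pallas_label(27, 1))    # narrow error: keeps "average"
print(pallas_label(27, 25))   # wide error: bumped to "above average"
```

Note how the wide-error case gets promoted even though its point estimate is identical, which is exactly the asymmetric benefit-of-the-doubt treatment the suggestion describes.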

Hi Harris, David, Cedar and Matthew, Thanks for your comments. It seems very clear that I failed to emphasize sufficiently the point that a discussion about interpreting error margins is a separate issue from the accuracy of the model itself. It’s ironic that I’ve discussed this issue in too many previous posts to count, but not in this one, where it was so clearly important, given the fact that the databases are being viewed by the public. So, I’m sorry about that. That said, I would make three other related points. First, this issue cuts both ways: Just as accounting for error doesn’t imply accuracy in the educational sense, the large margins are not necessarily evidence that the models are useless, as some people have implied. Even if, hypothetically, value-added models were perfect, the estimates would still be highly imprecise using samples of 10-20 students. Second, correct interpretation, a big part of which is sample size and the treatment of error, can actually mitigate (but not even close to eliminate) some of the inaccuracy stemming from systematic bias and other shortcomings in the models themselves. Third, and most importantly, I’m horrified by the publication of these data, period. Yet my narrow focus on interpretation in this post is in part motivated by a kind of pragmatism about what’s going on outside the databases. While the loudest voices are busy arguing that value-added is either the magic ingredient to improving education or so inaccurate as to be absurd, the estimates are *already* being used – or will soon be used – in more than half of U.S. states, and that number is still growing. Like it or not, this is happening. And, in my view, the important details such as error margins – which are pretty basic, I think – are being largely ignored, sometimes in a flagrant manner.
We may not all agree on whether these models have any role to play in policy (and I’m not personally opposed), but that argument will not help the tens of thousands of teachers who are already being evaluated. Thanks again, MD P.S. Cedar – The manner in which I go about interpreting and explaining statistics is the least of my psychological issues.
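The first point in the reply above (even a perfect model is imprecise with 10-20 students) can be illustrated with a quick simulation. All numbers are invented: a teacher whose true effect is exactly average, with individual student score gains varying around it.

```python
import random
import statistics

random.seed(1)
TRUE_EFFECT = 0.0      # hypothetical teacher who is exactly average
STUDENT_SD = 1.0       # spread of individual student score gains

def estimate_spread(class_size, trials=2000):
    """Std. dev. of the class-average estimate of the (fixed) true
    effect across many simulated classes of the given size."""
    estimates = [statistics.fmean(random.gauss(TRUE_EFFECT, STUDENT_SD)
                                  for _ in range(class_size))
                 for _ in range(trials)]
    return statistics.stdev(estimates)

# Even with a "perfect" model, small classes give noisy estimates:
# the spread shrinks roughly like 1/sqrt(class size).
print(f"class of  15: estimate sd ~ {estimate_spread(15):.2f}")
print(f"class of 100: estimate sd ~ {estimate_spread(100):.2f}")
```

With 15 students the estimate's standard deviation is around 0.26 of a student-level standard deviation; with 100 students it drops to about 0.10, which is why single-classroom samples carry such wide error margins.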

Matthew, Thanks for this breakdown! Hope you or others can also address issues of sample size (especially in schools with high student turnover), and of the quality/methodology used to produce the ELA & math assessments themselves -- if teachers are being judged on their students' scores, someone outside the testing provider/NYSED needs to ensure the tests are fair. However, if the test shielding that occurred in 2011 continues, there is no way for a truly independent observer to evaluate the validity of the assessment vehicles, which makes for a very precarious foundation of an already structurally questionable evaluation system...

Matthew- Thanks very much for this very informative post. You have done a great service pointing out the importance of error margins in the interpretation of these scores. However, as I have followed your commentary (and Bruce Baker's) on the limitations of value-added assessment, it seems that there are many other methodological concerns that should also be raised as part of this discussion. Let me begin by noting that I have no knowledge of the specific VAMs used by NYC to calculate these results. If possible, I would like you to comment on the other methodological caveats that should be considered when one attempts to evaluate the ultimate meaning of these scores, i.e., not only their precision, but also their underlying validity. Here’s my incomplete list.

1. Does the VAM used to calculate the results plausibly meet its required assumptions? Did the contractor test this? (See Harris, Sass, and Semykina, “Value-Added Models and the Measurement of Teacher Productivity,” CALDER Working Paper No. 54.)
2. Was the VAM properly specified? (e.g., Did the VAM control for summer learning and tutoring, and test for various interactions, e.g., between class size and behavioral disabilities?)
3. What specification tests were performed? How did they affect the categorization of teachers as effective or ineffective?
4. How was missing data handled?
5. How did the contractors handle team teaching or other forms of joint teaching for the purposes of attributing the test score results?
6. Did they use appropriate statistical methods to analyze the test scores? (For example, did the VAM provider use regression techniques even if the math and reading tests were not plausibly scored at an interval level?)
7. When referring back to the original tests, particularly ELA, does the range of teacher effects detected cover an educationally meaningful range of test performance?
8. To what degree would the test results differ if different outcome tests were used?
9. Did the VAM provider test for sorting bias?

My overall point is that measurement precision is very important, but let’s also consider the other reasons why these VAM results should be read with great caution. Thanks again. Harris


This web site and the information contained herein are provided as a service to those who are interested in the work of the Albert Shanker Institute (ASI). ASI makes no warranties, either express or implied, concerning the information contained on or linked from this web site. The visitor uses the information provided herein at his/her own risk. ASI, its officers, board members, agents, and employees specifically disclaim any and all liability for damages which may result from the utilization of the information provided herein. The content in the Shanker Blog may not necessarily reflect the views or official policy positions of ASI or any related entity or organization.