Reign Of Error: The Publication Of Teacher Data Reports In New York City
Late last week and over the weekend, New York City newspapers, including the New York Times and Wall Street Journal, published the value-added scores (teacher data reports) for thousands of the city’s teachers. Prior to this release, I and others argued that the newspapers should present margins of error along with the estimates. To their credit, both papers did so.
In the Times’ version, for example, each individual teacher’s value-added score (converted to a percentile rank) is presented graphically, for math and reading, in both 2010 and over a teacher’s “career” (averaged across previous years), along with the margins of error. In addition, both papers provided descriptions and warnings about the imprecision in the results. So, while the decision to publish was still, in my personal view, a terrible mistake, the papers at least made a good faith attempt to highlight the imprecision.
That said, they also published data from the city that use teachers’ value-added scores to label them as one of five categories: low, below average, average, above average or high. The Times did this only at the school level (i.e., the percent of each school’s teachers that are “above average” or “high”), while the Journal actually labeled each individual teacher. Presumably, most people who view the databases, particularly the Journal's, will rely heavily on these categorical ratings, as they are easier to understand than percentile ranks surrounded by error margins. The inherent problems with these ratings are what I’d like to discuss, as they illustrate important concepts about estimation error and what can be done about it.
First, let’s quickly summarize the imprecision associated with the NYC value-added scores, using the raw datasets from the city. It has been heavily reported that the average confidence interval for these estimates – the range within which we can be confident the “true estimate” falls - is 35 percentile points in math and 53 in English Language Arts (ELA). But this oversimplifies the situation somewhat, as the overall average masks quite a bit of variation by data availability. Take a look at the graph below, which shows how the average confidence interval varies by the number of years of data available, which is really just a proxy for sample size (see the figure's notes).
When you’re looking at the single-year teacher estimates (in this case, for 2009-10), the average spread is a pretty striking 46 percentile points in math and 62 in ELA. Furthermore, even with five years of data, the intervals are still quite large – about 30 points in math and 48 in ELA. (There is, however, quite a bit of improvement with additional years. The ranges are reduced by around 25 percent in both subjects when you use five years of data compared with one.)
Now, opponents of value-added have expressed a great deal of outrage about these high levels of imprecision, and the intervals are indeed extremely wide – which is one major reason why these estimates have absolutely no business being published in an online database. But, as I’ve discussed before, one major, frequently ignored point about the error – whether in a newspaper or an evaluation system – is that the problem lies less with how much there is than with how you go about addressing it.
It’s true that, even with multiple years of data, the estimates are still very imprecise. But, no matter how much data you have, if you pay attention to the error margins, you can, at least to some degree, use this information to ensure that you’re drawing defensible conclusions based on the available information. If you don’t, you can't.
This can be illustrated by taking a look at the categories that the city (and the Journal) uses to label teachers (or, in the case of the Times, schools).
Here’s how teachers are rated: low (0-4th percentile); below average (5-24); average (25-74); above average (75-94); and high (95-99).
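For concreteness, the city's cutpoints can be sketched as a simple mapping (the category names and percentile ranges are the city's scheme as described above; the function itself is just my illustration, not the city's actual code):

```python
def nyc_category(percentile):
    """Map a value-added percentile rank (0-99) to the NYC label.

    Cutpoints follow the city's five-category scheme; this function is
    an illustrative sketch, not the department's implementation.
    """
    if percentile <= 4:
        return "low"
    elif percentile <= 24:
        return "below average"
    elif percentile <= 74:
        return "average"
    elif percentile <= 94:
        return "above average"
    else:
        return "high"
```

Note that the same cutpoints apply to every teacher, regardless of how precise his or her estimate is – a point that will matter below.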
To understand the rocky relationship between value-added margins of error and these categories, first take a look at the Times’ “sample graph” below.
This is supposed to be a sample of one teacher’s results. This particular hypothetical teacher’s value-added score was at the 50th percentile, with a margin of error of plus or minus roughly 30 percentile points. What this tells you is that we can have a high level of confidence that this teacher’s “true estimate” is somewhere between the 20th and 80th percentile (that’s the confidence interval for this teacher), although it is more likely to be closer to 50 than 20 or 80.
One shorthand way to see whether teachers’ scores are, accounting for error, average, above average or below average is to see whether their confidence intervals overlap with the average (the 50th percentile, which is actually the median, but that’s a semantic point in these data).
Let’s say we have a teacher with a value-added score at the 60th percentile, plus or minus 20 points, making for a confidence interval of 40-80. This crosses the average/median “border," since the 50th percentile is within the confidence interval. So, while this teacher, based on his or her raw percentile rank, may appear to be above the average, he or she cannot really be statistically distinguished from average (because the estimate is too imprecise for us to reject the possibility that random error is the only reason for the difference).
Now, let’s say we have another teacher, scoring at the 60th percentile, but with a much smaller error margin of plus or minus 5 points. In this case, the teacher is statistically above average, since the confidence interval (55-65) doesn’t cross the average/median border.
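The two worked examples above can be sketched in code (a minimal illustration of the overlap rule; the function and its name are mine, not the city's or the papers' method):

```python
def ci_classification(percentile, margin, average=50):
    """Classify a score relative to the average, accounting for error.

    A teacher is statistically above (or below) average only if the
    entire confidence interval sits above (or below) the 50th
    percentile; otherwise the score is indistinguishable from average.
    """
    low, high = percentile - margin, percentile + margin
    if low > average:
        return "above average"
    if high < average:
        return "below average"
    return "not distinguishable from average"

# The two hypothetical teachers: same score, different precision.
ci_classification(60, 20)  # CI 40-80 crosses 50 -> "not distinguishable from average"
ci_classification(60, 5)   # CI 55-65 sits above 50 -> "above average"
```

The same percentile rank yields different conclusions depending on the margin of error, which is the whole point.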
The big point here is that each teacher’s actual score (e.g., 60th percentile) must be interpreted differently, using the error margins, which vary widely between teachers. The fact that our first teacher had a wide margin of error (20 percentile points) did not preclude accuracy, since we used that information to interpret the estimate, and the only conclusion we could reach was the less-than-exciting but accurate characterization of the teacher as not distinguishable from average.
There’s always error, but if you pay attention to each teacher’s spread individually, you can try to interpret their scores more rigorously (though it's of course no guarantee).
Somewhat in contradiction with that principle, the NYC education department’s dataset sets its own uniform definitions of average (the five categories mentioned above), which are reported by the Times (at the school-level) and Journal (for each teacher). The problem is that the cutpoints are the same for all teachers – i.e., they largely ignore variability in margins of error.*
As a consequence, a lot of these ratings present misleading information.
Let’s see how this works out with the real data. For the sake of this illustration, I will combine the “high” and “above average” categories, as well as the “low” and “below average” categories. This means our examination of the NYC scheme’s "accuracy" – at least by the alternative standard accounting for error margins – is more lenient, since we’re ignoring any discrepancies among teachers assigned to categories at the very top and very bottom of the distribution.
When you rate teachers using the NYC scheme, about half of the city’s teachers receive “average," about a quarter receive “below average” or “low," and the other quarter receive “above average” or “high." This of course makes perfect sense – by definition, about half of teachers would come in between the 25th and 75th percentiles, so about half would be coded as “average."
Now, let’s look at the distribution of teacher ratings when you actually account for the confidence intervals – i.e., when you define average, above average, and below average in terms of whether estimates are statistically different from the average (whether the confidence intervals overlap with the average/median).
Since sample size is a big factor here, I once again present the distribution two different ways for each subject. One (“multi-year”) is based on teachers’ “career” estimates, limited to those that include between two and five years of data. The other is based on their 2010 (single-year) estimates alone. In general, the former will be more precise (i.e., will yield a narrower confidence interval) than the latter.
These figures tell us the percentages of teachers’ scores that we can be confident are above average (green) or below average (red), as well as those that we cannot say are different from average (yellow). In many respects, only the estimates that are statistically above or below average are “actionable” – i.e., they’re the only ones that actually provide any information that might be used in policy (or by parents). All other teachers’ scores are best interpreted as no different from average – not in the negative sense of that term, but rather in the sense that the available data do not permit us to make a (test-based) determination that they are anything other than normally competent professionals.
There are three findings to highlight in the graph. First, because the number of years of data makes a big difference in precision (as shown in the first graph), it also influences how many teachers can actually be identified as above or below average (by the standard I’m using). In math, when you use the multi-year estimates (i.e., those based on 2-5 years of data), the distribution is quite similar to the modified NYC categories, with roughly half of teachers classified as indistinguishable from the average (this actually surprised me a bit, as I thought it would be higher).
When you use only one year of data, however, almost two-thirds cannot be distinguished from average. The pattern in ELA is similar: the proportion indistinguishable from average falls from 82.1 percent with a single year of data to 69.6 percent with multiple years.
In other words, when you have more years of data, more teachers’ estimates are precise enough, one way or the other, to draw conclusions. This illustrates the importance of sample size (years of data) in determining how much useful information these estimates can provide (and the potential risk of ignoring it).
The second thing that stands out from the graph, which you may have already noticed, is the difference between math and ELA. Even when you use multiple years of data, only about three in ten ELA teachers’ estimates are discernibly above or below average, whereas the proportion in math is much higher. This tells you that value-added scores in ELA are subject to higher levels of imprecision than those in math, and that even when there are several years of data, most teachers’ ELA estimates are statistically indistinguishable from the average.
This is important, not only for designing evaluation systems (discussed here), but also for interpreting the estimates in the newspapers’ databases. If you don't pay attention to error, the same percentile rank can mean something quite different in math versus ELA in terms of precision, even across groups of teachers (e.g., schools). And, beyond the databases, the manner in which the estimates are incorporated along with other measures into systems such as evaluations - for example, whether you use percentile ranks or actual scores (usually, in standard deviations) - is very important, and can influence the final results.
The third and final takeaway here – and the one that’s most relevant to the public databases – is that the breakdowns in the graph are in most cases quite different from the breakdowns using the NYC system (in which, again, about half of teachers are “average," and about a quarter are coded as “above-" and “below average”).
In other words, the NYC categories might in many individual cases be considered misleading. Many teachers who are labeled as “above average” or “below average” by the city are not necessarily statistically different from the average, once the error margins are accounted for. Conversely, in some cases, teachers who are rated average in the NYC system are actually statistically above or below average – i.e., their percentile ranks are above or below 50, and their confidence intervals don’t overlap with the average/median.
In the graph below, I present the percent of teachers whose NYC ratings would be different under the scheme that I use – that is, they are coded in the teacher data reports as “above average” or “below average” when they are (by my ratings, based on confidence intervals) no different from average, or vice-versa.
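The comparison I'm making can be sketched as follows (the teacher data here are made up for illustration; the functions and names are mine, and the collapsed three-bin scheme follows the simplification described above):

```python
def collapsed_nyc_label(percentile):
    """NYC scheme collapsed to three bins, per the illustration above:
    'high'/'above average' merged, and 'low'/'below average' merged."""
    if percentile <= 24:
        return "below average"
    if percentile <= 74:
        return "average"
    return "above average"

def ci_label(percentile, margin, average=50):
    """Label based on whether the confidence interval crosses the median."""
    if percentile - margin > average:
        return "above average"
    if percentile + margin < average:
        return "below average"
    return "average"

# Hypothetical teachers: (percentile rank, margin of error).
teachers = [(80, 35), (60, 5), (30, 30), (70, 10)]

# Count teachers whose fixed-cutpoint label disagrees with the
# error-aware label.
discrepant = sum(
    1 for p, m in teachers if collapsed_nyc_label(p) != ci_label(p, m)
)
```

In this toy example, the teacher at the 80th percentile with a 35-point margin is "above average" under the fixed cutpoints but indistinguishable from average once the error is considered, while the teacher at the 60th percentile with a 5-point margin is "average" under the cutpoints but statistically above average – discrepancies in both directions.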
The discrepancies actually aren’t terrible in math – about eight percent with multiple years of data, and about 13 percent when you use a single year (2010). But they’re still noteworthy. About one in ten teachers is given a label – good or bad – that is based on what might be regarded as a misinterpretation of statistical estimates.
In ELA, on the other hand, the differences are rather high – one in five teachers gets a different rating using multiple years of data, compared with almost 30 percent based on the single-year estimates.
And it’s important to reiterate that I’m not even counting the estimates categorized as “low” (0-4th percentile) or “high” (95-99th percentile). If I included those, the rates in the graph would be higher.
(Side note: At least in the older versions of the NYC teacher data reports [sample here], there is actually an asterisk placed next to any teacher’s categorical rating that is accompanied by a particularly wide margin of error, along with a warning to “interpret with caution." I’m not sure whether this practice continued in the more recent versions (I hope it did), but, either way, the asterisks do not appear in the newspaper databases, which I assume is because they are not included in the datasets that the papers received.)
I feel it was irresponsible to publish the NYC categories. I can give a pass to the Times, since they didn’t categorize each teacher individually (only across schools). The Journal, on the other hand, did choose to label each teacher, even using estimates that are based on only one year of data – most of which are, especially in ELA, just too imprecise to be meaningful, regardless of whether the error margins were presented alongside them. These categorical ratings, in my opinion, made a bad situation worse.
That said, you might argue that I’m being unfair here. Both papers did, to their credit, present and explain the error margins. Moreover, in choosing an alternative categorization scheme, the newspapers would have had to impose their own judgment on data that they likely felt a responsibility to provide to the public as they were received. Put differently, if the city felt the categorical ratings were important and valid, then it might be seen as poor journalistic practice, perhaps even unethical, to alter them.
That’s all fair enough – this is just my opinion, and the importance of this issue pales in comparison with the (in my view, wrong) decision to publish in the first place. But I would point out that the Times did make the choice to withhold the categories for individual teachers, so at least one newspaper felt omitting them was the right thing to do.
(On a related note, the Los Angeles Times value-added database, released in 2010, also applied labels to individual teachers, but they actually required that teachers have at least 60 students before publishing their names. Had the NYC papers applied the LA Times’ standard, they would have withheld almost 30 percent of teachers’ multi-year scores, and virtually all of their single-year scores.)
In any case, if there’s a bright spot to this whole affair, it is the fact that value-added error margins are finally getting the attention they are due.
And, no matter what you think of the published categorical ratings (or the decision to publish the database), all the issues discussed above – the need to account for error margins, the importance of sample size, the difference in results between subjects, etc. – carry very important implications for the use of value-added and other growth model estimates in evaluations, compensation, and other policy contexts. Although many systems are still on the drawing board, most of the new evaluations now being rolled out make the same "mistakes" as are reflected in the newspaper databases.**
So, I’m also hoping that more people will start asking the big question: If the error margins are important to consider when publishing these estimates in the newspaper, why are they still being ignored in most actual teacher evaluations?
In a future post, I’ll take a look at other aspects of the NYC data.
- Matt Di Carlo
* It’s not quite accurate to say that these categories completely ignore the error margins, since they code teachers as average if they’re within 25 percentile points of the median (25-75). In other words, they don’t just call a teacher “below average” if he or she has a score below 50. They do allow for some spread, but it’s just the same for all teachers (the extremely ill-advised alternative is to code teachers as above average if they’re above 50, below if they’re below). One might also argue that there’s a qualitative difference between, on the one hand, the city categorizing its teachers in reports that only administrators and teachers will see, and, on the other hand, publishing these ratings in the newspaper.
** In fairness, there are some (but very few) exceptions, such as Hillsborough County, Florida, which requires that teachers have at least 60 students for their estimates to count in formal evaluations. In addition, some of the models being used, such as D.C.’s, use a statistical technique called “shrinking," by which estimates are adjusted according to sample size. If any readers know of other exceptions, please leave a comment.
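For readers unfamiliar with the idea, shrinkage pulls noisy estimates toward the overall mean, more strongly when they rest on fewer students. The sketch below is a generic empirical-Bayes-style illustration of that idea, not the actual model D.C. (or any district) uses; the function name and the tuning constant k are my own assumptions:

```python
def shrink(score, n_students, prior_mean=50.0, k=30.0):
    """Generic sketch of shrinkage toward the mean.

    The weight n/(n+k) is a common empirical-Bayes form: estimates
    based on many students keep most of their raw value, while those
    based on few students are pulled toward the overall mean. The
    constant k is a tuning parameter, not a value from any real system.
    """
    weight = n_students / (n_students + k)
    return prior_mean + weight * (score - prior_mean)

# A raw score of 80 based on only 10 students is pulled most of the
# way back toward 50; the same score based on 300 students barely moves.
shrink(80, 10)   # -> 57.5
shrink(80, 300)  # -> roughly 77.3
```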