Our Annual Testing Data Charade
Every year, around this time, states and districts throughout the nation release their official testing results. Schools are closed and reputations are made or broken by these data. But this annual tradition is, in some places, becoming a charade.
Most states and districts release two types of assessment data every year (by student subgroup, school, and grade): average scores (“scale scores”), and the percentage of students who meet the standards to be labeled proficient, advanced, basic, and below basic. The latter type – the rates – are of course derived from the scores – that is, they tell us the proportion of students whose scale score was above the minimum necessary to be considered proficient, advanced, etc.
Both types of data are cross-sectional. They don’t follow individual students over time, but rather give a “snapshot” of aggregate performance among two different groups of students (for example, third graders in 2010 compared with third graders in 2011). Calling the change in these results “progress” or “gains” is inaccurate; they are cohort changes, and might just as well be chalked up to differences in the characteristics of the students (especially when changes are small). Even averaged across an entire school or district, there can be huge differences in the groups compared between years – not only is there often considerable student mobility in and out of schools/districts, but every year, a new cohort enters at the lowest tested grade, while a whole other cohort exits at the highest tested grade (except for those retained).
For these reasons, any comparisons between years must be done with extreme caution, but the most common way – simply comparing proficiency rates between years – is in many respects the worst. A closer look at this year’s New York City results illustrates this perfectly.
In New York State, the tests are designed such that scale scores are comparable (albeit imperfectly) between years within the same grade and subject. In other words, a given math score among fourth graders in 2010 represents the same estimated “amount of knowledge” as the same score among fourth graders in 2011. So, for example, a score of 650 on the third grade math exam in 2010 represents a performance level that is comparable to a 650 on the third grade math exam in 2011 (though, again, it’s two different groups of students).
Now, averages are almost always potentially misleading. The average scale score for a given grade does provide a sense of how the “typical student” performed, but it also masks a lot of information about the distribution of scores – e.g., a small group of particularly high- or low-performing students can skew the overall average, especially when samples are small.
Similarly, when you “convert” those scores into proficiency rates, you also lose a tremendous amount of data, which is especially problematic when comparing rates across years. For instance, a cohort of low-scoring students who are nowhere near the proficiency cutoff (above or below) can be compared to another cohort with much higher (or lower) scores the next year, but it might not budge the proficiency rate one bit. As a result, proficiency rates can increase while average scores decrease, and vice-versa (as I explained in this post).
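To make this concrete, here is a minimal sketch with entirely hypothetical scores (the cutoff of 650 and all score values below are invented for illustration): two cohorts in which the average score falls, yet the proficiency rate rises, because the second cohort’s scores bunch up just above the cutoff.

```python
# Hypothetical scale scores for two cohorts of the same grade.
# The cutoff and all values are invented for illustration only.
CUTOFF = 650

cohort_2010 = [640, 640, 640, 700, 700]
cohort_2011 = [560, 655, 655, 655, 655]

def mean(scores):
    return sum(scores) / len(scores)

def proficiency_rate(scores, cutoff=CUTOFF):
    # Percent of students scoring at or above the cutoff.
    return 100 * sum(s >= cutoff for s in scores) / len(scores)

# The average score FALLS between years ...
print(mean(cohort_2010), mean(cohort_2011))  # 664.0 636.0

# ... while the proficiency rate RISES, because several students
# moved from just below the cutoff to just above it, and one
# low scorer (who was never near the cutoff) fell much further.
print(proficiency_rate(cohort_2010), proficiency_rate(cohort_2011))  # 40.0 80.0
```

The two measures diverge because the rate only registers movement across the cutoff, while the average registers every point of change, wherever it occurs in the distribution.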
Let’s see how this played out in NYC this year.
The two graphs below present the simple change in both scores (blue bars) and proficiency rates (red bars) for each grade (the first graph is math results, the second is English Language Arts [ELA]). The data are from the NYC Department of Education. (Two notes: When there are what appear to be missing bars in the graph, this means that there was no change at all; and a couple of the very small changes in the scale scores within the graphs may not be statistically significant.)
There seems to be some conflicting “evidence” here. For example, the average math score among third graders actually decreased five points, which means that the “amount of knowledge” that the typical third grader had in 2011 was actually lower than in 2010. But the proficiency rate increased half of one percentage point. Similarly, there was a six percentage point increase in the proficiency rate for eighth graders, but the scale scores were totally flat (no change at all).
There’s even more “conflict” in the ELA results. In two grades (third and fifth) there were small increases in the proficiency rate accompanied by decreases in the average score. So, as with the third graders in math, the “amount of knowledge” of the average student in these two grades was lower in 2011 than in 2010, but the proficiency rates make it appear as if there was real "improvement" (albeit between two totally different groups of students).
Overall, not counting the combined averages for grades 3-8 (the topmost bars in each graph, which may not be comparable between years, since they combine different grades), there are 12 sets of results in the graphs above. In five of these, the proficiency rate and scale scores either moved in opposite directions, or one moved and the other was flat.
Now, it bears mentioning that the “definitions” of proficiency changed in New York State between 2010 and 2011. More specifically, the minimum score above which students are proficient increased slightly in all grades and subjects, which means that increases in the average score might not be accompanied by increases in the proficiency rate (since the bar was lower in 2010). For example, a third grader taking the ELA test in 2010 and receiving a score of 662 would be deemed “proficient." But, in 2011, that bar was raised to 663, which means that any student scoring exactly 662 in 2011 would have been proficient in 2010 but not in 2011.
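A toy calculation (with the real cutoffs from the example above, but an invented cohort of scores) shows how raising the bar by even a single point can move the rate while the scores themselves stay put:

```python
# Third-grade ELA proficiency cutoffs cited in the text.
CUTOFF_2010 = 662
CUTOFF_2011 = 663

def proficiency_rate(scores, cutoff):
    # Percent of students scoring at or above the cutoff.
    return 100 * sum(s >= cutoff for s in scores) / len(scores)

# A single hypothetical student scoring exactly 662:
print(662 >= CUTOFF_2010)  # True  -- proficient under the 2010 bar
print(662 >= CUTOFF_2011)  # False -- the same score falls short in 2011

# The same hypothetical cohort of scores, judged against each year's bar:
cohort = [655, 662, 662, 670, 680]
print(proficiency_rate(cohort, CUTOFF_2010))  # 80.0
print(proficiency_rate(cohort, CUTOFF_2011))  # 40.0
```

Identical scores, identical students, yet a 40-point swing in the rate, driven entirely by where the line is drawn (the swing here is exaggerated by the tiny invented sample, but the mechanism is the same at any scale).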
The city’s press release about the results suggested that the “growth” in proficiency rates between 2010 and 2011 was not as high as it should have been because the bar was raised. This may be true (though, again, it's not really "growth"). It’s impossible to tell, using the publicly available data, how many students were “affected” by this change (i.e., would have “passed” this year if the definitions hadn’t changed), but it’s likely that some students were.
This does not, however, change the fact that both the math and ELA scale scores across most grades were either flat or lower in 2011. On the whole, the “typical student” in New York City schools demonstrated roughly the same level of “knowledge” of the subject matter in 2011 as in 2010. Nevertheless, the NYC press release characterized the small increases in proficiency as “continued progress," and Mayor Bloomberg called the city’s results “dramatic” vis-à-vis those in the rest of the state. Both those statements are, at best, unsupported.
The city’s results are also indicative of how changes in the proficiency rate between years are, by themselves, a terrible measure of performance changes. Like scale scores, they are cross-sectional – and therefore measure the performance of two different groups of students. The rates, moreover, can be criminally misleading if not interpreted carefully (especially when the changes are small). But you wouldn’t know that from reading the annual parade of press releases and news stories, both in NYC and elsewhere, calling small increases in proficiency “progress."
To be clear, my point here isn’t to say that we should abandon (or even scale back) the public examination of testing data. Nor am I saying that we shouldn’t use proficiency rates, which are useful in that they are easier than scale scores for parents and the public to interpret. Rather, what I’m saying is that our understanding of what these data mean, especially between years, is usually insufficient, and that simplistic media coverage, as well as misleading, politically motivated “rollouts” on the part of many districts, exacerbates this misunderstanding.
At the very least, any newspaper story, press release, or other public account that heralds changes in proficiency rates (especially small changes) without at least examining the grade-by-grade change in actual scores may very well be hiding more than it reveals. It is very important to look at both (whenever possible).
Actually, given the limitations of cross-sectional data, both scores and rates, I cannot help but wonder why more states and districts do not perform and release longitudinal analyses – the kind that follow students over time – as this would provide a much better idea of “real” test-based progress (though the scores would have to be normalized so as to allow comparability between grades, the way they are for use in value-added models). It’s also worth noting that a few states and districts, including Washington, D.C., release only the rates and not the scores.
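For what it’s worth, the within-grade normalization mentioned above is usually just standardization: convert each grade’s raw scores to z-scores (mean 0, standard deviation 1), so that a student’s position relative to grade-level peers sits on a common scale across grades and years. A minimal sketch, with invented scores:

```python
import statistics

# Hypothetical raw scale scores by grade; the scales differ across
# grades, so the raw numbers are not directly comparable.
scores_by_grade = {
    3: [650, 660, 670, 680, 690],
    4: [700, 715, 730, 745, 760],
}

def standardize(scores):
    """Convert raw scores to z-scores (mean 0, SD 1) within a grade,
    the kind of normalization used in value-added models."""
    mu = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    return [(s - mu) / sd for s in scores]

# After standardization, both grades sit on the same scale, so a
# student's relative standing can be compared (and tracked) across grades.
z3 = standardize(scores_by_grade[3])
z4 = standardize(scores_by_grade[4])
```

This is only the first step of a longitudinal analysis, of course; the real gain comes from linking each student’s standardized score in one year to that same student’s score the next.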
Finally, these misconceptions would be much less of a problem if we didn’t live in a nation where these measures were used to literally make or break the careers of administrators, and to advocate forcefully for the closing or opening of entire schools. Changes in proficiency rates (and scale scores) can provide useful information, but, even when interpreted properly, they are far too limited to play the high-stakes role we let them play.