Five Recommendations For Reporting On (Or Just Interpreting) State Test Scores

From my experience, education reporters are smart, knowledgeable, and attentive to detail. That said, the bulk of the stories about testing data – in big cities and suburbs, in this year and in previous years – could be better.

Listen, I know it’s unreasonable to expect every reporter and editor to address every little detail when they try to write accessible copy about complicated issues, such as test data interpretation. Moreover, I fully acknowledge that some of the errors to which I object – such as calling proficiency rates “scores” – are well within tolerable limits, and that news stories need not interpret data in the same way as researchers. Nevertheless, no matter what you think about the role of test scores in our public discourse, it is in everyone’s interest that the coverage of them be reliable. And there are a few mostly easy suggestions that I think would help a great deal.

Below are five such recommendations. They are of course not meant to be an exhaustive list, but rather a quick compilation of points, all of which I’ve discussed in previous posts, and all of which might also be useful to non-journalists.

Look at both scale scores and proficiency rates: Some states (including D.C.) don’t release scale scores, but most of them do. Scores and rates often move in opposite directions. If you don’t look at both, you risk misleading your readers (frankly, it’s best to rely on the scores for presenting trends). In addition, check whether any changes (in scores or rates) are disproportionately concentrated in a small number of grades or student subgroups. (Side note: EWA is on top of this one.)
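The divergence between scores and rates is easy to see with made-up numbers. In this toy example (all values hypothetical, including the cutoff), the average scale score falls from one year to the next even as the proficiency rate rises, because more students land just above the cutoff:

```python
# Hypothetical scale scores for one grade in two years;
# the proficiency cutoff is set (arbitrarily) at 300.
CUTOFF = 300

year1 = [250, 280, 295, 295, 310, 340, 380, 400]
year2 = [260, 290, 301, 302, 303, 305, 310, 320]

def mean(scores):
    return sum(scores) / len(scores)

def proficiency_rate(scores, cutoff=CUTOFF):
    return sum(s >= cutoff for s in scores) / len(scores)

# The rate rises (0.50 -> 0.75) because students cluster just above
# the cutoff, yet the average score falls (318.75 -> 298.875).
assert proficiency_rate(year2) > proficiency_rate(year1)
assert mean(year2) < mean(year1)
```

A story reporting only the rate here would announce improvement; a story reporting only the average would announce decline. Neither alone tells the whole story.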

Changes in proficiency rates, especially small changes, should not be taken at face value: In general, rate changes can tell you whether a larger proportion of tested students scored above the (often somewhat-arbitrarily-defined) proficiency cutoff in one year compared with another, but that’s a very different statement from saying that the average student improved. Because most of the data are cross-sectional, states’ annual test score results entail a lot of sampling error (differences in the students being compared), not to mention all the other issues with proficiency rates (e.g., changes in rates depend a great deal on the clustering of students around the cutoff point) and the tests themselves. As a result, it’s often difficult to know, based on rate changes, whether there was “real” improvement. If you must report on the rates, exercise caution. For instance, it’s best to regard very small changes between years (say, 1-2 percentage points, depending on sample size) as essentially flat (i.e., insufficient for conclusions about improvement). Also keep in mind that rates tend to fluctuate – up one year, down the next. Finally, once again, the scores themselves, rather than the rates, are much better for presenting trends.
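A quick simulation shows how much rates can bounce around when nothing real has changed (all parameters here are invented). Two cohorts of tested students are drawn from the identical score distribution, so any difference in their proficiency rates is pure noise from which students happened to be tested:

```python
import random

random.seed(42)  # reproducible draws
CUTOFF = 300

def proficiency_rate(scores, cutoff=CUTOFF):
    """Percent of scores at or above the cutoff."""
    return 100 * sum(s >= cutoff for s in scores) / len(scores)

def cohort(n=500, mu=305, sigma=40):
    # One year's tested students, drawn from a fixed (hypothetical)
    # score distribution -- i.e., no underlying change in performance.
    return [random.gauss(mu, sigma) for _ in range(n)]

# Compare 1,000 pairs of "years" with identical underlying performance.
diffs = [abs(proficiency_rate(cohort()) - proficiency_rate(cohort()))
         for _ in range(1000)]

# Apparent year-to-year "changes" of 1-2+ percentage points are routine
# here even though, by construction, nothing actually changed.
assert max(diffs) > 2
```

With a few hundred tested students per comparison, a one- or two-point rate "change" is well within the range this kind of noise produces on its own.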

Changes in rates or scores are not necessarily due to school improvements (and they certainly cannot be used as evidence for or against any policy or individual): Test scores by themselves measure student performance (albeit imperfectly), not school performance. Changes might be caused by any number of factors, many of which have nothing to do with schools (e.g., error, parental involvement, economic circumstances, etc.). If a given change in rates/scores is substantial in magnitude and shared across grades and student subgroups, it is plausible that some (but not all) of it was due to an increase in school effectiveness. It is almost never valid, however, to attribute a change to particular policies or individuals. Districts and elected officials will inevitably try to make these causal claims. In my opinion, they should be ignored, or the claims should at least be identified as pure speculation.

Comparing average scores/rates between schools or districts is not comparing their “performance”: Unlike the other “guidelines” above, this one is about the scores/rates themselves, rather than changes in them. If one school or district has a higher average score or proficiency rate than another, this doesn’t mean it is higher-performing, nor does a lower score/rate signal lower effectiveness. The variation in average scores or rates is largely a function of student characteristics, and schools/districts vary widely in the students they serve. These comparisons – for example, comparing a district’s results to the state average – can be useful, but be careful to frame them in terms of student and not school performance.

If you want to get an approximate idea of schools’ relative performance, wait until states release their value-added or growth model results: Many states employ value-added or other growth models that, interpreted cautiously, provide defensible approximations of actual school effectiveness, in that they make some attempt to control for extraneous variables, such as student characteristics. When possible, it’s much better to wait for these results, which, unlike raw state testing data, are at least designed to measure school performance (relative to comparable schools in the state or district).
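The core logic of these models can be sketched in a few lines (a toy illustration with invented numbers, not any state's actual model, and real models control for far more than one variable): predict each school's current average from its students' prior performance, and read the residual, i.e., how far the school landed above or below expectations, as a rough effectiveness signal:

```python
# Toy sketch of the logic behind growth/value-added models (invented
# numbers; real models adjust for many more student characteristics).
schools = {
    "A": {"prior": 280, "current": 310},
    "B": {"prior": 320, "current": 330},
    "C": {"prior": 300, "current": 295},
}

# Least-squares fit of current average on prior average.
xs = [s["prior"] for s in schools.values()]
ys = [s["current"] for s in schools.values()]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

# Residual = actual minus expected, given where students started.
# Positive: beat expectations; negative: fell short of them.
for name, s in schools.items():
    expected = intercept + slope * s["prior"]
    print(name, round(s["current"] - expected, 1))
```

Note that school A posts a lower raw average than school B (310 vs. 330), yet it beats its expectation by the same margin, which is exactly the distinction raw averages miss.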


At this point, you might be asking whether I think the annual test data are useless. Quite the contrary. In fact, the irony of the annual test results frenzy is that it often misses the value of the data, which is to show how students are doing (at least inasmuch as tests can measure that).

This is, needless to say, important information, yet student performance is sometimes lost in a barrage of misinterpretations about the performance of schools, policies, and individuals. If we’re really interested in taking a sober look at how kids are doing, the annual release of testing results is one opportunity. Let’s not miss it.

- Matt Di Carlo