Three Important Distinctions In How We Talk About Test Scores

In education discussions and articles, people (myself included) often say “achievement” when referring to test scores, or “student learning” when talking about changes in those scores. These words reflect implicit judgments to some degree (e.g., that the test scores actually measure learning or achievement). Every once in a while, it’s useful to remind ourselves that scores from even the best student assessments are imperfect measures of learning. But this is so widely understood - certainly in the education policy world, and I would say among the public as well - that the euphemisms are generally tolerated.

And then there are a few common terms or phrases that, in my personal opinion, are not so harmless. I’d like to quickly discuss three of them (all of which I’ve talked about before). All three appear many times every day in newspapers, blogs, and regular discussions. To criticize their use may seem like semantic nitpicking to some people, but I would argue that these distinctions are substantively important and may not be so widely acknowledged, especially among people who aren’t heavily engaged in education policy (e.g., average newspaper readers).

So, here they are, in no particular order.

In virtually all public testing data, trends in performance are not “gains” or “progress.” When you tell the public that a school or district’s students made “gains” or “progress,” you’re clearly implying that there was improvement. But you can’t measure improvement unless you have at least two data points for the same students – i.e., each student’s test scores in one year are compared with that same student’s scores in previous years. If you’re tracking the average height of your tomato plants, and the shortest one dies overnight, you wouldn’t say that there had been “progress” or “gains,” just because the average height of your plants suddenly increased.

Similarly, almost all testing trend data that are available to the public don’t actually follow the same set of students over time (i.e., they are cross-sectional). In some cases, such as NAEP, you’re comparing a sample of fourth and eighth graders in one year with a different cohort of fourth and eighth graders two years earlier. In other cases, such as the results of state tests across an entire school, there’s more overlap – many students remain in the sample between years – but there’s also a lot of churn. In addition to student mobility within and across districts, which is often high and certainly non-random, students at the highest tested grade leave the schools (unless they’re held back), while whole new cohorts of students enter the samples at the lowest tested grade (in middle schools serving grades seven and eight, this means that half the sample turns over every year).

So, whether it’s NAEP or state tests, you’re comparing two different groups of students over time. Often, those differences cannot be captured by standard education variables (e.g., lunch program eligibility), but are large enough to affect the results, especially in smaller schools (smaller samples are more prone to sampling error). Calling the differences between years “gains/progress” or “losses” therefore gives a false impression; at least in part, they are neither – reflecting nothing more than variations between the cohorts being compared.
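The tomato-plant analogy can be made concrete with a few invented numbers. The sketch below is only an illustration: the plant labels and heights are made up, and the "matched" comparison simply restricts the calculation to plants present on both days.

```python
from statistics import mean

# Hypothetical plant heights in centimeters (invented numbers, not real data).
heights_day1 = {"a": 10, "b": 30, "c": 32, "d": 34}

# Plant "a" (the shortest) dies overnight; no surviving plant grows at all.
heights_day2 = {k: v for k, v in heights_day1.items() if k != "a"}

# Cross-sectional comparison: average of whoever is present each day.
cross_sectional_change = mean(heights_day2.values()) - mean(heights_day1.values())

# Matched comparison: follow only the plants measured on both days.
survivors = heights_day1.keys() & heights_day2.keys()
matched_change = mean(heights_day2[k] - heights_day1[k] for k in survivors)

print(f"Change in the average: {cross_sectional_change:+.1f} cm")
print(f"Average growth of the same plants: {matched_change:+.1f} cm")
```

The average rises even though nothing grew, which is the sense in which a change between two cross-sectional averages can reflect who is in the sample rather than any improvement.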

Proficiency rates are not “scores.” Proficiency or other cutpoint-based rates (e.g., percent advanced) are one huge step removed from test scores. They indicate how many students scored above a certain line. The choice of this line can be somewhat arbitrary, reflecting value judgments and, often, political considerations as to the definition of “proficient” or “advanced.” Without question, the rates are an accessible way to summarize the actual scale scores, which aren’t very meaningful to most people. But they are interpretations of scores, and severely limited ones at that.*

Rates can vary widely, using the exact same set of scores, depending on where the bar is set. In addition, all these rates tell you is whether students were above or below the designated line – not how far above it or below it they might be. Thus, the actual test scores of two groups of students might be very different even though they have the same proficiency rate, and scores and rates can move in opposite directions between years.
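A small made-up example may help show how this can happen. Everything here is an assumption for illustration: the scale scores, the cut scores of 200 and 250, and the convention that "proficient" means scoring at or above the cut.

```python
from statistics import mean

def proficiency_rate(scores, cut):
    """Share of students scoring at or above the cut score (assumed convention)."""
    return sum(s >= cut for s in scores) / len(scores)

# Invented scale scores for one hypothetical school in two different years.
year1 = [180, 190, 248, 249, 300, 310]
year2 = [150, 160, 250, 251, 260, 265]

CUT = 250  # hypothetical "proficient" line

for label, scores in (("Year 1", year1), ("Year 2", year2)):
    print(f"{label}: rate = {proficiency_rate(scores, CUT):.0%}, "
          f"average score = {mean(scores):.1f}")

# The same Year 1 scores produce a different rate if the bar is moved.
print(f"Year 1 rate with the cut at 200: {proficiency_rate(year1, 200):.0%}")
```

In this sketch, two students nudge just over the line while the lowest scorers fall further behind, so the rate goes up while the average goes down. Moving the line from 250 to 200 also changes the Year 1 rate without changing a single score.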

To mitigate the risk of misinterpretation, comparisons of proficiency rates (whether between schools/districts or over time) should be accompanied by comparisons of average scale scores whenever possible. At the very least, the two should not be conflated.**

Schools with high average test scores are not necessarily “high-performing,” while schools with lower scores are not necessarily “low-performing.” As we all know, tests don’t measure the performance of schools. They measure (however imperfectly) the performance of students. One can of course use student performance to assess that of schools, but not with simple average scores.

Roughly speaking, you might define a high-performing school as one that provides high-quality instruction. Raw average test scores by themselves can’t tell you about that, since the scores also reflect starting points over which schools have no control, and you can’t separate the progress (school effect) from the starting points. For example, even the most effective school, providing the best instruction and generating large gains, might still have relatively low scores due to nothing more than the fact that the students it serves have low scores upon entry, and they only attend the school for a few years at most. Conversely, a school with very high scores might provide poor instruction, simply maintaining (or even decreasing) the already stellar performance levels of the students it serves.

We very clearly recognize this reality in how we evaluate teachers. We would never judge teachers’ performance based on how highly their students score at the end of the year, because some teachers’ students were higher-scoring than others’ at the beginning of the year.

Instead, to the degree that school (and teacher) effectiveness can be assessed using testing data, doing so requires growth measures, as these gauge (albeit imprecisely) whether students are making progress, independent of where they started out and other confounding factors. There’s a big difference between a high-performing school and a school that serves high-performing students; it’s important not to confuse them.
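To illustrate the difference between levels and growth, here is a rough sketch with two hypothetical schools. The scores are invented, I'm assuming entry and exit scores sit on a comparable scale (often not true in practice), and a simple average of each student's change is only a crude stand-in for the growth models actually used, which also adjust for other factors.

```python
from statistics import mean

# Invented (entry score, exit score) pairs for the SAME students at each school.
schools = {
    "School A": [(200, 230), (210, 240), (220, 250)],  # low entry, large gains
    "School B": [(300, 300), (310, 305), (320, 315)],  # high entry, flat or falling
}

summary = {}
for name, students in schools.items():
    summary[name] = {
        # Level: what a simple average of end-of-period scores would report.
        "average_exit_score": mean(exit_ for _, exit_ in students),
        # Growth: average change for the same students (a simplified gain score).
        "average_gain": mean(exit_ - entry for entry, exit_ in students),
    }

for name, stats in summary.items():
    print(f"{name}: average exit score = {stats['average_exit_score']:.1f}, "
          f"average gain = {stats['average_gain']:+.1f}")
```

Ranked by average score, School B looks "high-performing"; ranked by how much its students progressed, School A does. Neither number is a perfect measure, but they answer different questions.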

- Matt Di Carlo

*****

* Although this doesn’t affect the point about the distinction between scores and rates, it’s fair to argue that scale scores also reflect value judgments and interpretations, as the process by which they are calculated is laden with assumptions – e.g., about the comparability of content on different tests.

** Average scores, of course, also have their strengths and weaknesses. Like all summary statistics, they hide a lot of the variation. And, unlike rates, they don’t provide much indication as to whether the score is “high” or “low” by some absolute standard (thus making them very difficult to interpret), and they are usually not comparable between grades. But they are a better measure of the performance of the “typical student,” and as such are critical for a more complete portrayal of testing results, especially viewed over time.


Great stuff. It's amazing how often even experienced observers seem to forget these crucial points.