Schools Aren't The Only Reason Test Scores Change
In all my many posts about the interpretation of state testing data, it seems that I may have failed to articulate one major implication, which is almost always ignored in the news coverage of the release of annual testing data. That is: raw, unadjusted changes in student test scores are not by themselves very good measures of schools' test-based effectiveness.
In other words, schools can have a substantial impact on performance, but student test scores also increase, decrease or remain flat for reasons that have little or nothing to do with schools. The first, most basic reason is error. There is measurement error in all test scores - for various reasons, students taking the same test twice will get different scores, even if their "knowledge" remains constant. Also, as I've discussed many times, there is extra imprecision when using cross-sectional data, since each year's results come from a different group of students. Often, changes in scores or rates, especially when they’re small in magnitude and/or based on smaller samples (e.g., individual schools), do not represent actual progress (see here and here). Finally, even when changes are "real," they are also influenced by a variety of non-schooling inputs, such as parental education levels, families' economic circumstances, parental involvement, etc. These factors don't just influence how highly students score; they are also associated with progress (that's why value-added models exist).
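To make the point about error and sample size concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the function name `simulated_changes`, the noise level of 15 scale points, and the school sizes of 25 and 400 tested students. True achievement is held perfectly flat, so any year-to-year change in a school's average is pure measurement error.

```python
import random
import statistics

def simulated_changes(n_students, n_schools=500, noise_sd=15.0, seed=0):
    """Change in each school's mean score between two years when nothing real changes.

    Each observed score is a constant true score (set to 0 here) plus random
    measurement error with standard deviation noise_sd (a hypothetical value).
    """
    rng = random.Random(seed)
    changes = []
    for _ in range(n_schools):
        year1 = statistics.fmean([rng.gauss(0.0, noise_sd) for _ in range(n_students)])
        year2 = statistics.fmean([rng.gauss(0.0, noise_sd) for _ in range(n_students)])
        changes.append(year2 - year1)
    return changes

# Spread of the purely spurious "changes" for small versus large schools.
spread_small = statistics.pstdev(simulated_changes(n_students=25))
spread_large = statistics.pstdev(simulated_changes(n_students=400))

print(f"typical spurious change, 25 students:  {spread_small:.2f} points")
print(f"typical spurious change, 400 students: {spread_large:.2f} points")
```

Under these assumptions, the small schools show noticeably larger swings than the large ones even though no school actually improved or declined, which is why small changes at individual schools deserve extra caution.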
Thus, to the degree that test scores are a valid measure of student performance, and changes in those scores a valid measure of student learning, schools aren’t the only suitors at the dance. We should stop judging school or district performance by comparing unadjusted scores or rates between years.
The limitations of these measures are well established - e.g., see this Kane/Staiger paper on volatility in test scores and this 1998 article about proficiency rates. As a more concrete example, last year, Mathematica released a report that compared three types of school performance measures for the same group of schools over a three-year period:
- Successive cohort indicators (e.g., comparing fourth graders one year to fourth graders the next year);
- Average gain or same cohort change (e.g., cohorts are followed over time, but there's no attempt to control for non-school factors);
- Estimates from schoolwide value-added models (the same group of students is compared in successive years, with controls for student characteristics).
It appears that the average gain indicator [type 2 in the list above] is potentially unfair, but can be adjusted, while the successive cohort indicator [type 1] has potential to be quite misleading. If so, then why are they in such widespread use? The reason these indicators are so entrenched in the practice of education policy is a matter for speculation but probably depends on cost and burden as well as on the fact that many student assessments were not designed for evaluation of education interventions. Most were designed instead for diagnosing problems and documenting achievements of individual students without regard to how such achievements were produced.
In other words, test scores alone measure student performance, not school performance. We can still use them to approximate the latter, but it must be done correctly, and that requirement is routinely ignored, both in our public debate and in policy.
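The difference between the first two indicator types in the list above can be sketched with a toy example. The records, student IDs and function names below are all invented for illustration, and the sketch assumes scores sit on a scale that is comparable across grades. It does not attempt the third type (value-added), which requires a statistical model with controls.

```python
from statistics import fmean

# Hypothetical records: (student_id, year, grade, scale_score)
records = [
    ("a", 2010, 4, 400), ("b", 2010, 4, 410), ("c", 2010, 4, 420),
    # The same three students, one year and one grade later.
    ("a", 2011, 5, 415), ("b", 2011, 5, 425), ("c", 2011, 5, 435),
    # A different group of students now sits in fourth grade.
    ("d", 2011, 4, 390), ("e", 2011, 4, 400), ("f", 2011, 4, 410),
]

def mean_score(rows, year, grade):
    """Average score of everyone tested in a given year and grade."""
    return fmean([s for (_, y, g, s) in rows if y == year and g == grade])

def successive_cohort_change(rows, grade, year1, year2):
    """Type 1: same grade, different students, compared across years."""
    return mean_score(rows, year2, grade) - mean_score(rows, year1, grade)

def same_cohort_gain(rows, grade, year1, year2):
    """Type 2: average gain for students tested in both years."""
    before = {sid: s for (sid, y, g, s) in rows if y == year1 and g == grade}
    after = {sid: s for (sid, y, g, s) in rows if y == year2 and g == grade + 1}
    matched = before.keys() & after.keys()
    return fmean([after[sid] - before[sid] for sid in matched])

cohort_change = successive_cohort_change(records, grade=4, year1=2010, year2=2011)
average_gain = same_cohort_gain(records, grade=4, year1=2010, year2=2011)

print("successive cohort change:", cohort_change)  # -10.0
print("same cohort average gain:", average_gain)   # 15.0
```

In this made-up school, the fourth grade "declined" by 10 points only because a lower-scoring group of students arrived, while the students who were actually followed gained 15 points each.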
Here’s one more quick illustration. The graph below compares two measures used in Ohio. Each dot is a school (there are a few thousand in all, and they are very bunched up in the middle, but that’s not a big problem for our purposes here).
The vertical axis is the change in schools' “performance index” between 2010 and 2011. This is a weighted index based on the number of students who are below basic, basic, proficient and advanced. It is very similar to changes in proficiency rates, except it also incorporates the other cutpoint-based categories. The horizontal axis is each school’s 2011 rating on the state’s value-added model, which assesses schools according to whether they are below, above or meeting expectations for student growth. So, these are two "change-oriented" measures, one of them (the performance index) based on raw changes in cross-sectional cutpoint rates, the other (value-added) calculated using longitudinal data and controlling for student characteristics (e.g., prior achievement).
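As a rough sketch of how a weighted index of this kind behaves, the example below uses invented category weights and invented student counts. The weights are placeholders, not Ohio's actual values, and `performance_index` is a hypothetical helper. The point is only to show that such an index can move even when the proficiency rate does not.

```python
# Hypothetical weights per performance category (NOT the state's real weights).
WEIGHTS = {"below_basic": 0.3, "basic": 0.6, "proficient": 1.0, "advanced": 1.2}

def performance_index(counts):
    """Weighted share of students across categories, scaled by 100."""
    total = sum(counts.values())
    return 100 * sum(WEIGHTS[cat] * n for cat, n in counts.items()) / total

def proficiency_rate(counts):
    """Percent of students at or above the proficient cutpoint."""
    total = sum(counts.values())
    return 100 * (counts["proficient"] + counts["advanced"]) / total

# Invented counts for one school in two consecutive years.
year1 = {"below_basic": 20, "basic": 20, "proficient": 40, "advanced": 20}
year2 = {"below_basic": 10, "basic": 30, "proficient": 50, "advanced": 10}

index_change = performance_index(year2) - performance_index(year1)
rate_change = proficiency_rate(year2) - proficiency_rate(year1)

print(f"index: {performance_index(year1):.1f} -> {performance_index(year2):.1f}")
print(f"proficiency rate: {proficiency_rate(year1):.1f}% -> {proficiency_rate(year2):.1f}%")
```

With these numbers the index rises by about one point while the proficiency rate stays at 60 percent, because students shifted between categories on either side of the proficient cutpoint. Either way, both figures are still raw cross-sectional changes with no adjustment for who the students are.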
Clearly, these two measures are in many cases sending different messages about school performance. Roughly half of the schools rated “below expectations” by the value-added model actually saw an increase in their performance index (the dots above the red horizontal line), while a bunch of the schools rated “above expectations” by the model actually saw a decrease in their index between 2010 and 2011. In the majority value-added group ("met expectations"), there's a fairly even spread of increases and decreases.
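The comparison described above amounts to a simple cross-tabulation of value-added rating against the direction of the index change. Here is a minimal sketch of that tabulation using ten invented schools. The data are made up purely to show the mechanics and do not reproduce the Ohio results.

```python
from collections import Counter

# Invented (value_added_rating, index_change) pairs for ten hypothetical schools.
schools = [
    ("below", 1.2), ("below", -0.8), ("below", 0.5), ("below", -2.1),
    ("met", 0.3), ("met", -0.4),
    ("above", 2.0), ("above", -1.1), ("above", 0.9), ("above", 1.5),
]

def direction(change):
    """Classify a raw index change as up, down or flat."""
    if change > 0:
        return "up"
    if change < 0:
        return "down"
    return "flat"

# Count schools in each (rating, direction) cell.
table = Counter((rating, direction(change)) for rating, change in schools)

for rating in ("below", "met", "above"):
    row = {d: table[(rating, d)] for d in ("up", "flat", "down")}
    print(rating, row)
```

Cells such as ("below", "up") and ("above", "down") are the ones where the two measures disagree about a school.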
If you were judging Ohio schools based on the change in the performance index alone, there’s a good chance you would reach a different conclusion than that provided by the value-added model. And value-added models, for all their limitations, are at least designed to address the issues discussed above, such as the fact that non-school factors influence testing gains.
So, in general, tests can be used to approximate school effects, but the manner in which we usually do so (assuming that small changes in cross-sectional proficiency are not only "real," but entirely attributable to school performance) can be extremely misleading. And the frequent attempts to attribute increases to specific policies are borderline absurd.
This doesn’t mean that the raw test results released annually are useless – far from it, in fact. They provide a snapshot of student performance in any given year. That’s incredibly important information, and yet we largely ignore it, preferring instead to misinterpret these results in order to cast judgment on schools, districts and policy interventions.
Scores change for a variety of reasons. Schools are one very important player in this dynamic, but they’re not the only one.
- Matt Di Carlo
It would be interesting to weight the markers by the population size of the schools to see whether those on the extreme bounds of the x-axis were small schools (or to use small multiples by school size). This may explain some of the spillover on the positive/negative x-axis for low/high VA schools.
The issue of small schools and large longitudinal variation is discussed a bit in this Kane and Staiger piece and more accessibly on Marginal Revolution a few years ago.