Education policymaking and debates are under constant threat from an improbable assailant: short-term changes in cross-sectional proficiency rates.
The use of rate changes is still proliferating rapidly at all levels of our education system. These measures, which play an important role in the provisions of No Child Left Behind, are already prominent components of many states’ core accountability systems (e.g., California), while several others will be using some version of them in their new, high-stakes school/district “grading systems.” New York State is awarding millions in competitive grants, with almost half the criteria based on rate changes. District consultants issue reports recommending widespread school closures and reconstitutions based on these measures. And, most recently, U.S. Secretary of Education Arne Duncan used proficiency rate increases as “preliminary evidence” supporting the School Improvement Grants program.
Meanwhile, on the public discourse front, district officials and other national leaders use rate changes to “prove” that their preferred reforms are working (or are needed), while their critics argue the opposite. Similarly, entire charter school sectors are judged, up or down, by whether their raw, unadjusted rates increase or decrease.
So, what’s the problem? In short, it’s that year-to-year changes in proficiency rates are not valid evidence of school or policy effects. These measures cannot do the job we’re having them do, even on a limited basis. This really has to stop.
The literature is replete with warnings and detailed expositions of these measures' limitations. Let's just quickly recap the major points, with links to some relevant evidence and previous posts.
- Proficiency rates may be a useful way to present information accessibly to parents and the public, but they can be highly misleading measures of student performance, as they only tell you how many test-takers are above a given (often somewhat arbitrary) cutpoint. The problems are especially salient when the rates are viewed over time – rates can increase while average scores decrease (and vice versa), and rate changes are heavily dependent on the choice of cutpoint and the distribution of cohorts' scores around it. They are really not appropriate for evaluating schools or policies, even using the best analytical approaches (for just two among dozens of examples of additional research on this topic, see this published 2008 paper and this one from 2003);
- The data are (almost always) cross-sectional, and they mask changes in the sample of students taking the test, especially at the school and district levels, where samples are smaller (note that this issue can apply to both rates and actual scores; for more, see this Mathematica report and this 2002 published article);
- Most of the change in raw proficiency rates between years is transitory – i.e., it is not due to the quality of a school or the efficacy of a policy, but rather to random error, sampling variation (see the second bullet) or factors, such as students’ circumstances and characteristics, that are outside of schools’ control (see this paper analyzing Colorado data, this one on North Carolina and our quick analysis of California data).
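To make the first bullet concrete, here is a minimal sketch with made-up numbers. The two six-student cohorts, the cutpoints of 50, 60 and 80, and the helper name `proficiency_rate` are all hypothetical and chosen only for illustration; they do not come from any real test. The point is simply that the same pair of score distributions can show a rate increase, no change or a rate decrease depending on where the cutpoint sits, even while the average score falls.

```python
from statistics import mean


def proficiency_rate(scores, cutpoint):
    """Share of test-takers scoring at or above the cutpoint."""
    return sum(s >= cutpoint for s in scores) / len(scores)


# Hypothetical scores for two different cohorts tested in consecutive years.
year1 = [30, 30, 48, 48, 90, 90]
year2 = [20, 20, 52, 52, 70, 70]

# Change in the average score between the two years (negative here).
mean_change = mean(year2) - mean(year1)

# Change in the proficiency rate under three hypothetical cutpoints.
rate_changes = {
    cut: proficiency_rate(year2, cut) - proficiency_rate(year1, cut)
    for cut in (50, 60, 80)
}

print(f"Change in average score: {mean_change:+.1f}")
for cut, change in rate_changes.items():
    print(f"Cutpoint {cut}: rate change {change:+.0%}")
```

In this toy example, two students who were just below a cutpoint of 50 end up just above it, so the rate rises even though the lowest and highest scorers did worse. Move the cutpoint to 60 and the rate is flat; move it to 80 and the rate drops.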
In sum, typical raw rate changes reflect the rather severe limitations of both cross-sectional data and cutpoint-based measures, as well as the more general fact that test performance varies for reasons other than the quality of schooling. In other words, they don't even necessarily tell us whether students actually made testing progress, to say nothing of the degree to which schools or specific policies were responsible for those changes (or lack thereof).
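The sampling issue can also be illustrated with a rough simulation. Everything in it is an assumption made for the sake of the sketch: each student independently has a 50 percent chance of scoring proficient, that chance never changes, and the cohort sizes of 60 and 3,000 are arbitrary stand-ins for a small school and a larger district. Because nothing about "school quality" differs between years, any rate change the simulation produces is pure sampling variation.

```python
import random
from statistics import mean


def simulated_rate(rng, n_students, p_proficient):
    """Proficiency rate for one randomly drawn cohort of test-takers."""
    hits = sum(rng.random() < p_proficient for _ in range(n_students))
    return hits / n_students


def mean_abs_rate_change(n_students, p_proficient=0.5, n_trials=500, seed=0):
    """Average absolute year-to-year rate change when nothing real changes."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    changes = []
    for _ in range(n_trials):
        # Two different cohorts drawn with the same underlying chance
        # of proficiency, as in cross-sectional data.
        year1 = simulated_rate(rng, n_students, p_proficient)
        year2 = simulated_rate(rng, n_students, p_proficient)
        changes.append(abs(year2 - year1))
    return mean(changes)


small_school = mean_abs_rate_change(n_students=60)
large_district = mean_abs_rate_change(n_students=3000)

print(f"Small school (60 tested): {small_school:.1%} average change")
print(f"Large district (3,000 tested): {large_district:.1%} average change")
```

Under these assumptions, the small school's rate typically moves by several percentage points from one year to the next with no change at all in underlying performance, while the larger sample moves far less. Real data are messier than this, but the direction of the problem is the same.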
The only proper way to assess the effect of schools/policies on test scores is multivariate analysis of longitudinal testing data (actual scores, not rates) that controls, to the degree possible, for confounding factors that can influence results. In the case of policy evaluation, random assignment is of course preferable, but often not feasible. This takes time, but policies and schools should not be judged based on short-term outcomes anyway, whether test-based or otherwise. It also requires investment, but that's the price of good information.
If these kinds of systems and capabilities are not in place, they should be. In the meantime, unless interpreted with extreme caution, simple rate changes are not an acceptable alternative.
- Matt Di Carlo