When You Hear Claims That Policies Are Working, Read The Fine Print

When I point out that raw changes in state proficiency rates or NAEP scores are not valid evidence that a policy or set of policies is “working,” I often get the following response: “Oh Matt, we can’t have a randomized trial or peer-reviewed article for everything. We have to make decisions and draw conclusions based on imperfect information sometimes.”

This statement is obviously true. In this case, however, it's also a straw man. There’s a huge middle ground between the highest-quality research and the kind of speculation that often drives our education debate. I’m not saying we always need experiments or highly complex analyses to guide policy decisions (though such evidence is generally preferable and sometimes necessary). The point, rather, is that we shouldn’t draw conclusions based on evidence that doesn't support those conclusions.

This, unfortunately, happens all the time. In fact, many of the more prominent advocates in education today make their cases based largely on raw changes in outcomes immediately after (or sometimes even before) their preferred policies were implemented (also see here, here, here, here, here, and here). In order to illustrate the monumental assumptions upon which these and similar claims ride, I thought it might be fun to break them down quickly, in a highly simplified fashion. So, here are the four “requirements” that must be met in order to attribute raw test score changes to a specific policy (note that most of this can be applied not only to claims that policies are working, but also to claims that they're not working because scores or rates are flat):

  1. The changes in test scores are "statistically real." Before you can even begin to think about what caused a change in test scores, you need to make sure there actually was a change in test scores. Even the best tests entail measurement error – for example, a student taking the same test twice with the same amount of “knowledge” might get different scores. This means that small or even moderate year-to-year changes in scores might be due to this statistical noise, rather than “real” differences in performance between cohorts. As with political polls, the conventional way to check on this is to use error margins. Sometimes this is pretty easy (e.g., NAEP’s data tool performs significance tests), and sometimes it’s harder (e.g., digging through technical reports). In either case, it’s important to check, especially when changes (or samples) are small. And this all becomes even more complicated when looking at proficiency rates, which add an extra layer of imprecision by converting every student’s score into a “yes/no” outcome. In general, whenever possible, it’s best to focus on the scores and not the rates, and disregard small changes in either. (A back-of-the-envelope version of this error-margin check is sketched in the first code example after this list.)
  2. The changes aren’t due to differences between cohorts. This is another general source of imprecision, one that you've heard me discuss before. Put simply, there's a difference between changes in average scores and actual progress. Most public testing data are cross-sectional – that is, they don’t follow students over time. Since changes in student scores tend to be small, even minor differences in the sample of students taking the test can distort overall changes a great deal. Without student-level data, this is difficult to check. A decent approach, in addition to the helpful but very limited exercise of breaking down the data by student subgroup, is to simply disregard modest changes in average scores or rates, especially when they’re based on small samples (e.g., schools). However, even when changes are larger, keep in mind that part of them may be due to this sample variation. (The second sketch after this list shows how a shift in the mix of test takers can move the average all by itself.)
  3. The changes, even if they’re "real," are due to school factors. Even if changes in test scores are not due to statistical noise and are based on longitudinal data, one cannot automatically assume that the changes are entirely due to improvements (or degradation) in the quality of schooling. There are many reasons why student performance might change, and not all of them are school-related (e.g., shifts in students’ social or economic circumstances). Teasing out the contribution of schools isn't easy. This doesn't mean you have to give up, but it does necessitate caution. In general, if you have a large, well-documented change in scores, and no reason to believe there has been a massive shift in the circumstances of students and their families (e.g., a recession), it’s fair to assume that much of the score change is due to schools, but part of it is not. (But don't forget, when talking about accountability-related policies, that test scores are also subject to various forms of inflation.)
  4. The changes, even if they’re “real” and due to school-related factors, are attributable to a specific policy or set of policies. When you "add up" the imprecision in numbers 1-3, you're dealing with a lot of uncertainty about whether there was a change in performance at all, to say nothing of how it might be explained. But the biggest step still lies ahead: attributing the change to specific policies. If only things were so easy. School systems are complex and dynamic. Policy changes interact with existing policies, and their effects vary with the students, staff, and communities where they are implemented. The effects often take years to start showing up in aggregate test scores (if they show up at all). Approximating the impact of a specific policy doesn’t necessarily require experimental methods, but you'll need a lot more than the coincidence of cohort changes in test scores with implementation in a single location. Frankly, unless you make it abundantly clear that you're just speculating, it's questionable to even attempt these causal claims without detailed analysis (and, even with the analysis, serious caution is in order). In most cases, your best bet is to review the high-quality evidence on the policies from other locations. This doesn't make for very good talking points, and effects can vary by context, but it's far better than the needle-in-the-causal-haystack approach, which is often little more than a guess, and not a very good one at that.
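
To make point 1 concrete, here is a minimal sketch of the error-margin check, written in Python. Everything in it is hypothetical: the averages and standard errors are made-up numbers rather than figures from any actual NAEP release, and 1.96 is simply the conventional multiplier for a 95 percent confidence level.

```python
from math import sqrt

# Hypothetical averages and standard errors, for illustration only
# (not taken from any actual NAEP release).
mean_y1, se_y1 = 239.0, 0.9   # year 1 average scale score and its standard error
mean_y2, se_y2 = 241.0, 1.0   # year 2 average scale score and its standard error

# The change, and the standard error of the difference between two
# independent estimates.
diff = mean_y2 - mean_y1
se_diff = sqrt(se_y1 ** 2 + se_y2 ** 2)

# Approximate 95 percent margin of error for the change.
margin = 1.96 * se_diff

print(f"Change: {diff:+.1f} points, margin of error: +/-{margin:.1f}")
if abs(diff) > margin:
    print("Larger than the margin of error -- statistically discernible.")
else:
    print("Within the margin of error -- it could easily be noise.")
```

In this toy example, a two-point gain comes with a margin of error of roughly 2.6 points, so it would not clear the bar on its own.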
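Point 2 can be illustrated the same way. The sketch below simulates two cohorts of test takers drawn from two hypothetical subgroups; no individual student improves, but the second cohort contains a somewhat larger share of the higher-scoring subgroup. All of the numbers (subgroup means, shares, sample sizes) are invented for the example.

```python
import random

random.seed(1)  # fixed seed so the toy example is reproducible

def simulate_cohort(n, share_high):
    """Draw a cohort of n test takers from two hypothetical subgroups."""
    scores = []
    for _ in range(n):
        if random.random() < share_high:
            scores.append(random.gauss(260, 30))  # higher-scoring subgroup
        else:
            scores.append(random.gauss(230, 30))  # lower-scoring subgroup
    return scores

# Nothing about schooling changes between years; the only difference is the
# mix of test takers (e.g., due to enrollment or demographic shifts).
year1 = simulate_cohort(500, share_high=0.40)
year2 = simulate_cohort(500, share_high=0.50)

print(f"Year 1 average: {sum(year1) / len(year1):.1f}")
print(f"Year 2 average: {sum(year2) / len(year2):.1f}")
```

In expectation, the cross-sectional average rises by about three points here, all of it attributable to composition rather than progress.
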
So, there you go. If you speculate that a given policy affected raw testing outcomes, these are the assumptions you are making (in addition to the usual caveats about the limitations of standardized tests). Some of them (3 and 4 in particular) cannot be fully addressed, and that’s perfectly fine and normal. There’s human judgment involved, and certainty is not possible.

My point is this: More often than not, none of them is addressed. People simply assume them all away, especially when the conclusions support their policy preferences.

This is not only misleading and conducive to bad policy; it's also self-defeating. Putting forth these arguments is a way of digging your own evidentiary grave. Every time you say a policy “worked” because rates or scores increased in the years after it was implemented, you are implicitly endorsing a method that may very well lead to the opposite conclusion about that policy in a future year. Moreover, if a coincident increase in test scores is enough to justify a policy, then almost any policy can eventually be justified (or discredited).

That’s the problem with bad evidence – you can’t count on it.

- Matt Di Carlo