There is currently a flurry of debate focused on the question of whether “NCLB worked.” This question, which surfaces regularly in the education field, has been particularly salient in recent weeks, as Congress holds hearings on reauthorizing the law.
Any time there is a spell of “did NCLB work?” activity, one can hear and read numerous attempts to use simple NAEP changes in order to assess its impact. Individuals and organizations, including both supporters and detractors of the law, attempt to make their cases by presenting trends in scores, parsing subgroup estimates, and so on. These efforts, though typically well-intentioned, do not, of course, tell us much of anything about the law’s impact. One can use simple, unadjusted NAEP changes to prove or disprove any policy argument, and the reason is that they are not valid evidence of an intervention’s effects. There’s more to policy analysis than subtraction.
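To make the “subtraction” point concrete, here is a minimal sketch in Python. Every score in it is invented for illustration and is not a NAEP result. It contrasts a raw before/after change with a difference-in-differences comparison, which is one common adjustment and not necessarily the design used in any study cited in this post. The names `GroupScores` and `difference_in_differences` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GroupScores:
    """Average scale scores for one group of students, before and after a policy."""
    before: float
    after: float

    def raw_change(self) -> float:
        # Simple subtraction: what most "did it work?" arguments rely on.
        return self.after - self.before


def difference_in_differences(exposed: GroupScores, comparison: GroupScores) -> float:
    # Subtract the comparison group's change, which stands in for what
    # would have happened to the exposed group without the policy.
    return exposed.raw_change() - comparison.raw_change()


# Hypothetical numbers, NOT real data.
exposed = GroupScores(before=230.0, after=238.0)      # group newly subject to the policy
comparison = GroupScores(before=232.0, after=237.0)   # group whose policy did not change

raw = exposed.raw_change()
did = difference_in_differences(exposed, comparison)

print(f"Raw change for the exposed group: {raw:+.1f} points")
print(f"Change net of the comparison group: {did:+.1f} points")
```

In this made-up case, most of the raw gain also shows up in the comparison group, so attributing all of it to the policy would overstate the effect. Even the adjusted figure is only as credible as the assumption that the two groups would otherwise have followed similar trends.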
But it’s not just the inappropriate use of evidence that makes these “did NCLB work?” debates frustrating and, often, unproductive. It is also the fact that NCLB really cannot be judged in simple, binary terms. It is a complex, national policy with considerable inter-state variation in design/implementation and various types of effects, intended and unintended. This is not a situation that lends itself to clear-cut yes/no answers to the “did it work?” question.
For one thing, if you’re interested in how accountability policies affect testing outcomes, a more fruitful approach, given what’s available, is not to focus solely on NCLB per se, but rather to survey the research on whether (test-based) school accountability in general has a track record of effectiveness. And this question requires a nuanced answer. These policies come in many different forms. They differ, for example, in terms of the type of testing measures for which stakeholders are held accountable, the focus of accountability (e.g., subgroups), the flexibility granted nested entities (e.g., districts within states) in design/implementation, and incentive structures (e.g., closure, monetary bonuses, voucher threats, etc.). In fact, seemingly subtle differences in how states implemented NCLB itself played a larger role than actual student achievement in determining which schools made AYP (Davidson et al. 2013).
Still, there is a pretty decent body of evidence on the impact of test-based accountability systems, NCLB and otherwise (e.g., Carnoy and Loeb 2002; Jacob 2004; Hanushek and Raymond 2005; Ballou and Springer 2008; Ladd and Lauen 2010; Rockoff and Turner 2010; Dee and Jacob 2011; Winters and Cowen 2012; Rouse et al. 2013). Most of these analyses rely on test scores as the outcome of interest (though often tests other than the NCLB assessments administered by states). In addition, most are not national in scope, but rather state- or district-level policy evaluations (a bunch from the pre-NCLB era). But they provide a fairly solid, steadily expanding base of evidence about test-based accountability (also see Figlio and Loeb 2011 for a general review of the literature on school accountability).
Yet even the best of these studies are sometimes used by advocates in an oversimplified fashion. For example, an excellent analysis by Thomas Dee and Brian Jacob (2011), both highly respected economists, is one of the relatively few national analyses of the impact of NCLB proper. It is often cited to argue that “NCLB worked.” What Dee and Jacob find is that the implementation of the policy was associated with statistically discernible and large improvements in fourth grade math scores, and more moderate increases in eighth grade math scores (concentrated largely among lower-performing students). They also, however, find no estimated improvement at all in reading in either grade.
Now, to be clear, the findings for math achievement are positive and, among fourth graders, uniformly large in magnitude. This is meaningful and important, and it speaks directly to the impact of the policy (and remember that seemingly minor annual impacts can accumulate over a student’s years in school, and can have a big influence on many individual students). Still, at the very least, the fact that there was no estimated improvement at all in reading in either grade hardly lends itself to the unqualified conclusion that “NCLB worked.” The results can, however, inform a number of very important questions, such as those about the distribution of accountability effects across students and between subjects, how accountability interacts with other educational policies, etc. (Dee and Jacob themselves provide a nuanced discussion of these and other issues in the article). But these questions too often get lost in the “did NCLB work?” back-and-forth.
That said, in reviewing the research overall at this point, at least those studies looking at short-term testing outcomes, it’s fair to draw a few general conclusions:
- The introduction of test-based accountability systems tends to have moderate positive estimated effects on short-term testing outcomes in math, and typically smaller (and sometimes nil) effects in reading;
- There is scarce evidence that test-based accountability policies have a negative impact on short-term student testing outcomes;
- Results from the vast majority of evaluations of test-based accountability policies suffer from an unavoidable but nonetheless important limitation: It is very difficult to isolate, and there is mixed evidence regarding, the policies and practices that led to the outcomes.
The impact of these policies on testing outcomes is certainly important, but it is far from the only relevant question (though it is, unfortunately, the one that gets virtually all the attention in our public discourse surrounding school accountability). For example, as noted in the third bullet above, whether policies had measurable impacts on observable outcomes cannot be fully assessed without attention to why these effects came about – i.e., the changes in policies and practices that gave rise to them.
Accountability policies are fundamentally about changing behavior, and there is evidence that accountability systems do sometimes lead to undesirable behavior, such as reclassifying students as learning disabled to exempt them from testing (e.g., Jacob 2004), cheating (e.g., Jacob and Levitt 2003) or shifting attention to students who are close to proficiency thresholds (e.g., Neal and Schanzenbach 2010). But there are also studies suggesting that improvement sometimes stems from more strategic, seemingly desirable practices, such as increasing instructional time, investing in teacher training, and reorganizing the learning environment (e.g., Chiang 2009; Dee et al. 2012; Rouse et al. 2013). Both "types" of changes likely co-exist in many schools and districts.
In any case, it is exceedingly important to focus on what led to changes in outcomes, positive or negative, and not just on whether such changes can be found. Identifying and understanding these causal factors could not be more important for assessing and improving educational accountability policy, particularly in the U.S., where testing data increasingly dominate so many important components of the education system.
Furthermore, needless to say, test-based accountability systems can influence a variety of outcomes beyond short-term test scores (usually in math and reading). For instance, these policies can influence, for better or worse, the recruitment and retention (and morale) of teachers and administrators, the quality of curriculum, and, not least, longer-term outcomes such as college attainment. It is seriously unwise to judge test-based accountability based solely (or, perhaps, even predominantly) on short-term testing outcomes (see, for example, Deming et al. 2013).
(Side note: There is also a timing issue here. It is only now that the first cohorts of students are graduating after having spent their entire K-12 careers in the NCLB era. Policies often take years to show observable effects, and good research takes time on top of that. Ten to twelve years may seem like an eternity, but NCLB was a large and complex policy, and, most likely, a large chunk of the best research is yet to come.)
So, there is a good amount of strong evidence out there about test-based accountability policies, including NCLB, and using that evidence should be an ongoing priority for the education field. But we should use it wisely - not to make quick-and-dirty, soundbite-style arguments about whether huge national policies “worked,” but rather to address the important underlying questions, including (but not limited to): how most effectively to design valid and reliable performance measures, both test and non-test; how to calibrate properly the incentives attached to these measures; how different accountability policies lead to changes in policies and practices among stakeholders; and which of these policies and practices do and do not generate improvement (again, hopefully defined using test and non-test outcomes, short- and longer-term).
Because the answers to these questions, even when there is evidence available, can be inconsistent or even conflicting, this kind of effort is often slow and not conducive to certainty, and that can be very frustrating. But frustration and good policymaking usually go hand in hand.