The Arcane Rules That Drive Outcomes Under NCLB

** Reprinted here in the Washington Post

A big part of successful policy making is unyielding attention to detail (an argument that regular readers of this blog hear often). Choices about design and implementation that may seem unimportant can play a substantial role in determining how policies play out in practice.

A new paper, co-authored by Elizabeth Davidson, Randall Reback, Jonah Rockoff and Heather Schwartz, and presented at last month’s annual conference of The Association for Education Finance and Policy, illustrates this principle vividly, and on a grand scale, with an analysis of outcomes in all 50 states during the early years of NCLB.

After a terrific summary of the law's rules and implementation challenges, as well as some quick descriptive statistics, the paper's main analysis is a straightforward examination of why the proportion of schools meeting AYP varied quite a bit between states. For instance, in 2003, the first year of results, 32 percent of U.S. schools failed to make AYP, but the proportion ranged from one percent in Iowa to over 80 percent in Florida.

Surprisingly, the results suggest that this variation had little to do with differences in student performance. Rather, the big factors were subtle differences in the rather arcane rules that each state chose during the implementation process. These decisions received little attention, yet they had a dramatic impact on the outcomes of NCLB during this time period.

As is so often the case when using the small samples of state-level datasets, the authors are limited in their modeling options (though it bears mentioning that these data were very difficult to assemble, and, by the way, are available to the public). Instead, they identify five factors that made the most substantial contributions to the variation in AYP outcomes:

  1. Deviation from NCLB rules. During the early years of NCLB, a few states didn't quite follow the law. (Note that this is the only one of the five factors that has been largely rectified.) In at least one case, such failure was due to simple human error - Iowa’s one percent AYP failure rate in 2003 seems partially to have been a result of a leave of absence taken by the staff member responsible for the data, who suffered an injury. In other cases, states bent the guidelines set forth in the legislation. Texas, for instance, petitioned the U.S. Department of Education for flexibility on a rule that permitted a maximum of one percent of a school’s special education students to use alternative assessments. The petition was turned down, but the state went ahead with the plan anyway and, as a result, 22 percent of Texas schools that would have failed to make AYP in the first year actually made it.
  2. "Generosity" of confidence intervals. As is fairly well known, if just one of a school's "accountable subgroups" (e.g., low-income students, students with disabilities) fails to meet proficiency targets (or "safe harbor"), that entire school does not make AYP. In order to account for the inevitable fact that, in some schools, these subgroups would consist of very few tested students, NCLB allowed states to apply "confidence intervals." Basically, these adjustments meant that smaller subgroups (i.e., those consisting of fewer tested students in a given school) would be required to meet lower targets. However, states were given flexibility in how much "leeway" they granted via these confidence intervals, and a few specified none at all. Florida, for example, did not use them, and thus a fairly large group of schools that would have made AYP had this rule been applied did not do so.
  3. Different targets across grade levels. States had the option of either setting the same proficiency targets for all grades or letting their targets vary by grade (and subject). Using the former system (the same targets for all grades) basically meant that schools serving particular grade configurations would have an advantage in making AYP (if their starting rates were higher) whereas others would have a disadvantage (if their starting rates were lower). For example, Pennsylvania set uniform targets, but its high schools’ starting rates were much lower, on average, than those of its elementary schools. The end result was that 27 percent of the state's high schools failed to make AYP in 2004, compared with just 7 percent of elementary schools.
  4. Number of “accountable subgroups” and minimum sample size. As mentioned above, NCLB required schools to be held accountable for the performance of student subgroups. But states were given flexibility not only in how many subgroups they chose (and which ones), but also in setting minimum sample sizes for these subgroups to be “included” in AYP calculations. For example, schools with only a handful of students with disabilities in a given year could be exempted from having this subgroup count at all. As a rule, states that chose to include fewer subgroups in AYP, or set higher sample size requirements for their inclusion, tended to have lower failure rates, all else being equal. Once again, states varied in the choices they made, and this influenced their results. 
  5. Definition of “continuous enrollment.” Finally, states had to specify the rules by which mobile students (e.g., transfers) were or were not counted toward schools’ AYP calculations. Some states set more stringent enrollment requirements than others, which meant that they excluded more students from being counted in their testing results. For instance, Wisconsin’s rules excluded students who were not enrolled in late September of 2003 (the tests were administered in November 2003). Thus, fairly large proportions of students who took the test were not counted. To the degree that excluded students' performance differed from that of their "continuously enrolled" peers, these choices affected failure rates.
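
To make the mechanics of the confidence interval and minimum sample size rules a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the school, its counts, the 50 percent target, and especially the adjustment formula (a one-sided normal approximation that lowers the target for smaller groups). States used their own formulas, and the paper should be consulted for the actual rules.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist
from typing import List, Optional


@dataclass
class Subgroup:
    name: str
    n_tested: int      # assumed to count only "continuously enrolled" students
    n_proficient: int


def effective_target(target: float, n: int, ci_level: Optional[float]) -> float:
    """Target a group must reach after the (assumed) confidence interval adjustment.

    ci_level=None means the state applies no adjustment at all.
    """
    if ci_level is None:
        return target
    z = NormalDist().inv_cdf(ci_level)            # one-sided critical value
    margin = z * sqrt(target * (1 - target) / n)  # larger for smaller groups
    return max(0.0, target - margin)


def makes_ayp(groups: List[Subgroup], target: float, min_n: int,
              ci_level: Optional[float] = None) -> bool:
    """A school makes AYP only if every counted subgroup meets its target."""
    for g in groups:
        if g.n_tested < max(min_n, 1):
            continue  # below the minimum sample size: group is not counted
        rate = g.n_proficient / g.n_tested
        if rate < effective_target(target, g.n_tested, ci_level):
            return False  # one failing subgroup fails the whole school
    return True


# A made-up school, evaluated under four hypothetical state rule sets.
school = [
    Subgroup("all students", 200, 120),
    Subgroup("low-income", 60, 31),
    Subgroup("students with disabilities", 25, 9),
]

for ci_level in (None, 0.99):
    for min_n in (10, 30):
        result = makes_ayp(school, target=0.50, min_n=min_n, ci_level=ci_level)
        print(f"confidence level={ci_level}, minimum n={min_n}: makes AYP = {result}")
```

In this invented case, the same school with the same test results fails only under the rule set with no confidence interval and the smaller minimum sample size; either a generous interval or a higher minimum size is enough to flip the outcome.
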

Now, it’s important to note that all of these rules interacted with each other (as well as with other rules and factors, such as school and student characteristics) to produce outcomes.

For instance, 23 states opted to use the most generous confidence intervals (which, by itself, "inflates" AYP rates), but these states did not have appreciably higher AYP rates, on average. The reason is that they also tended to choose other options (e.g., minimum subgroup sample size requirements) and/or they exhibited characteristics (e.g., tested grades, student characteristics) that, on the whole, "cancelled out" the differences.

From this perspective, it would be fascinating to look at the results of this paper in terms of general strategies and principles reflected in the decisions states made. Such coherence may not be easy to decipher - some choices increased failure rates, whereas others were certain to lower them. States varied in their decision-making structures and resources, and they only had a relatively short period of time to devise their plans (which often had to be reconciled with pre-existing accountability policies).

Many state-level NCLB configurations ended up being complex, sometimes inconsistent webs of rules that reflected varying incentives and priorities. Making things worse, the ESEA waivers that most states have submitted will only result in more heterogeneity (see this paper by Morgan Polikoff and colleagues).

This is not quite how accountability systems are supposed to work, and, as the authors of this paper note, it illustrates the ever-present difficulty of finding the right balance between standardization and flexibility in policy design/implementation.

(It should be lost on no one that these issues are extremely relevant to the current efforts in over 30 states to install new principal and/or teacher evaluations.)

There are a couple of other quick takeaways from this paper. First, after roughly a decade of NCLB, most everyone seems to have a strong opinion on whether and why it worked, but, in many respects, the truth is that we still have a lot to learn about how the law shook out, to say nothing of its impact on school performance and other outcomes (see here and here). Building a body of good research often takes time, particularly in the case of landmark policies such as NCLB.

Finally, and perhaps most importantly, accountability systems can play a productive role in education, but this analysis demonstrates very clearly that, when it comes to the design and implementation of these systems, details matter. Seemingly trivial choices can have drastic effects on measured outcomes.

In short, phrases such as “we need to hold schools accountable” sound wonderful, but what really matters is how we do so.

- Matt Di Carlo

This will happen again under CCSS, if implemented, as the states will get to set the cut scores. How is that different from what happened under NCLB... what am I missing?