Where Al Shanker Stood: The Importance And Meaning Of NAEP Results

In this New York Times piece, published on July 29, 1990, Al Shanker discusses the results of the National Assessment of Educational Progress (NAEP), and what they suggested about the U.S. education system at the time.

One of the things that has influenced me most strongly to call for radical school reform has been the results of the National Assessment of Educational Progress (NAEP) examinations. These exams have been testing the achievement of our 9-, 13- and 17-year-olds in a number of basic areas over the past 20 years, and the results have been almost uniformly dismal.

According to NAEP results, no 17-year-olds who are still in school are illiterate and innumerate - that is, all of them can read the words you would find on a cereal box or a billboard, and they can do simple arithmetic. But very few achieve what a reasonable person would call competence in reading, writing or computing.

For example, NAEP's 20-year overview, Crossroads in American Education, indicated that only 2.6 percent of 17-year-olds taking the test could write a good letter to a high school principal about why a rule should be changed. And when I say good, I'm talking about a straightforward presentation of a couple of simple points. Only 5 percent could grasp a paragraph as complicated as the kind you would find in a first-year college textbook. And only 6 percent could solve a multi-step math problem like this one: "Christine borrowed $850 for one year from Friendly Finance Company. If she paid 12% simple interest on the loan, what was the total amount she repaid?"
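
For readers who want to check it, the arithmetic involves two steps – computing the interest and then adding it back to the principal. Here is a quick worked sketch (my own illustration, in Python for concreteness; it is not part of NAEP's materials):

```python
# Simple interest for one year: interest = principal * rate
principal = 850
rate = 0.12                  # 12% simple interest

interest = principal * rate  # $102
total_repaid = principal + interest

print(total_repaid)          # 952.0 -> Christine repaid $952
```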

When Checking Under The Hood Of Overall Test Score Increases, Use Multiple Tools

When looking at changes in testing results between years, many people are (justifiably) interested in comparing those changes for different student subgroups, such as those defined by race/ethnicity or income (subsidized lunch eligibility). The basic idea is to see whether increases are shared between traditionally advantaged and disadvantaged groups (and, often, to monitor achievement gaps).

Sometimes, people take this a step further by using the subgroup breakdowns as a crude check on whether cross-sectional score changes are due to changes in the sample of students taking the test. The logic is as follows: if the increases show up among both relatively advantaged and more disadvantaged subgroups, then the overall increase cannot be attributed to a change in the backgrounds of students taking the test, since the subgroups exhibited the same pattern. (For reasons discussed here many times before, this is a severely limited approach.)

Whether testing data are cross-sectional or longitudinal, these subgroup breakdowns are certainly important and necessary, but it's wise to keep in mind that standard variables, such as eligibility for free and reduced-price lunches (FRL), are imperfect proxies for student background (actually, FRL rates aren't even such a great proxy for income). In fact, one might reach different conclusions depending on which variables are chosen. To illustrate this, let’s take a look at results from the Trial Urban District Assessment (TUDA) for the District of Columbia Public Schools between 2011 and 2013, in which there was a large overall score change that received a great deal of media attention, and break the changes down by different characteristics.
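
Before digging into the actual numbers, it may help to see why composition matters. Here is a minimal sketch (in Python, using hypothetical figures rather than real DCPS/TUDA results) of how an overall average can rise substantially even when every subgroup improves only slightly, simply because the mix of test takers shifts between years:

```python
# Hypothetical numbers, NOT actual DCPS/TUDA results.
# Each entry: (share of test takers, average scale score)
year1 = {"FRL-eligible": (0.70, 230.0), "Not FRL-eligible": (0.30, 260.0)}
year2 = {"FRL-eligible": (0.60, 231.0), "Not FRL-eligible": (0.40, 261.0)}

def overall(groups):
    """Overall average as the share-weighted mean of subgroup averages."""
    return sum(share * score for share, score in groups.values())

change = overall(year2) - overall(year1)

# Split the overall change into within-group score changes (weighted at
# year-2 shares) and a composition effect from shifting group shares
# (weighted at year-1 scores). The two pieces sum to the overall change.
within = sum(year2[g][0] * (year2[g][1] - year1[g][1]) for g in year1)
composition = sum((year2[g][0] - year1[g][0]) * year1[g][1] for g in year1)

print(f"Overall change:        {change:+.1f} points")       # +4.0
print(f"Within-group changes:  {within:+.1f} points")        # +1.0
print(f"Composition shift:     {composition:+.1f} points")   # +3.0
```

In this toy example, each subgroup gains a single point, yet the overall average rises by four points; most of the increase reflects the shift in group shares rather than improvement within groups. That is exactly the kind of pattern a single subgroup breakdown can miss, which is why it pays to check more than one variable.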

Select Your Conclusions, Apply Data

The recent release of the National Assessment of Educational Progress (NAEP) and the companion Trial Urban District Assessment (TUDA) was predictably exploited by advocates to argue for their policy preferences. This is a blatant misuse of the data for many reasons that I have discussed here many times before, and I will not repeat them.

I do, however, want to very quickly illustrate the emptiness of this pseudo-empirical approach – finding cross-sectional cohort increases in states/districts that have recently enacted policies you support, and then using the increases as evidence that the policies “work." For example, the recent TUDA results for the District of Columbia Public Schools (DCPS), where scores increased in all four grade/subject combinations, were immediately seized upon by supporters of the reforms that have been enacted by DCPS as clear-cut evidence of those policies' triumph. The celebrators included the usual advocates, but also DCPS Chancellor Kaya Henderson and U.S. Secretary of Education Arne Duncan (there was even a brief mention by President Obama in his State of the Union speech).

My immediate reaction to this bad evidence was simple (though perhaps slightly juvenile) – find a district that had similar results under a different policy environment. It was, as usual, pretty easy: Los Angeles Unified School District (LAUSD).

Being Kevin Huffman

In a post earlier this week, I noted how several state and local education leaders, advocates and especially the editorial boards of major newspapers used the recently released NAEP results inappropriately – i.e., to argue that recent reforms in places such as Tennessee and D.C. are “working." I also discussed how this illustrates a larger phenomenon in which many people seem to expect education policies to generate immediate, measurable results in terms of aggregate student test scores, which I argued is both unrealistic and dangerous.

Mike G. from Boston, a friend whose comments I always appreciate, agrees with me, but asks a question that I think gets to the pragmatic heart of the matter. He wonders whether individuals in high-level education positions have any alternative. For instance, Mike asks, what would I suggest to Kevin Huffman, who is the head of Tennessee’s education department? Insofar as Huffman’s opponents “would use any data…to bash him if it’s trending down," would I advise him to forgo using the data in his favor when they show improvement?*

I have never held any important high-level leadership positions. My political experience and skills are (and I’m being charitable here) underdeveloped, and I have no doubt many more seasoned folks in education would disagree with me. But my answer is: Yes, I would advise him to forgo using the data in this manner. Here’s why.

The Ever-Changing NAEP Sample

The results of the latest National Assessment of Educational Progress long-term trend tests (NAEP-LTT) were released last week. The data compare the reading and math scores of 9-, 13- and 17-year-olds at various points since the early 1970s. This is an important way to monitor how these age cohorts’ performance changes over the long term.

Overall, there is ongoing improvement in scores among 9- and 13-year-olds, in reading and especially math, though the trend is inconsistent and increases have slowed somewhat in recent years. The scores for 17-year-olds, in contrast, are relatively flat.

These data, of course, are cross-sectional – i.e., they don’t follow students over time, but rather compare children in the three age groups with their predecessors from previous years. This means that changes in average scores might be driven by differences, observable or unobservable, between cohorts. One of the simple graphs in this report, which doesn't present a single test score, illustrates that rather vividly.

Which State Has "The Best Schools?"

** Reprinted here in the Washington Post

I’ve written many times about how absolute performance levels – how highly students score – are not by themselves valid indicators of school quality, since, most basically, they don’t account for the fact that students enter the schooling system at different levels. One of the most blatant (and common) manifestations of this mistake is when people use NAEP results to determine the quality of a state's schools.

For instance, you’ll often hear that Massachusetts has the “best” schools in the U.S. and Mississippi the “worst," with both claims based solely on average scores on the NAEP (though, technically, Massachusetts public school students' scores are statistically tied with at least one other state on two of the four main NAEP exams, while Mississippi's rankings vary a bit by grade/subject, and its scores are also not statistically different from several other states').
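
As an aside, "statistically tied" here means that the gap between two states' average scores is small relative to the sampling error of those averages. A rough sketch of that comparison, with hypothetical scores and standard errors (NAEP's own published comparisons also adjust for multiple comparisons, so treat this as illustrative only):

```python
import math

def significantly_different(score_a, se_a, score_b, se_b, z=1.96):
    """Rough two-sample z-test for a difference between two NAEP averages.
    Illustrative only; the official comparisons are more involved."""
    diff = score_a - score_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)  # combined standard error
    return abs(diff) > z * se_diff

# Hypothetical example: two states one scale point apart, with ~1-point SEs
print(significantly_different(300.0, 1.1, 299.0, 1.0))  # False -> "statistically tied"
print(significantly_different(300.0, 1.1, 292.0, 1.0))  # True  -> a real difference
```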

But we all know that these two states are very different in terms of basic characteristics such as income, parental education, etc. Any assessment of educational quality, whether at the state or local level, is necessarily complicated, and ignoring differences between students precludes any meaningful comparisons of school effectiveness. Schooling quality is important, but it cannot be assessed by sorting and ranking raw test scores in a spreadsheet.

Guessing About NAEP Results

Every two years, the release of data from the National Assessment of Educational Progress (NAEP) generates a wave of research and commentary trying to explain short- and long-term trends. For instance, there have been a bunch of recent attempts to “explain” an increase in aggregate NAEP scores during the late 1990s and 2000s. Some analyses postulate that the accountability provisions of NCLB were responsible, while more recent arguments have focused on the “effect” (or lack thereof) of newer market-based reforms – for example, looking to NAEP data to “prove” or “disprove” the idea that changes in teacher personnel and other policies have (or have not) generated “gains” in student test scores.

The basic idea here is that, for every increase or decrease in cross-sectional NAEP scores over a given period of time (both for all students and especially for subgroups such as minority and low-income students), there must be “something” in our education system that explains it. In many (but not all) cases, these discussions consist of little more than speculation. Discernible trends in NAEP test score data are almost certainly due to a combination of factors, and it’s unlikely that one policy or set of policies is dominant enough to be identified as “the one." Now, there’s nothing necessarily wrong with speculation, so long as it is clearly identified as such, and conclusions presented accordingly. But I find it curious that some people involved with these speculative arguments seem a bit too willing to assume that schooling factors – rather than changes in cohorts’ circumstances outside of school – are the primary driver of NAEP trends.

So, let me try a little bit of illustrative speculation of my own: I might argue that changes in the economic conditions of American schoolchildren and their families are the most compelling explanation for changes in NAEP.

NAEP Shifting

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

Tomorrow, the education world will get the results of the 2011 National Assessment of Educational Progress (NAEP), often referred to as the “nation’s report card." The findings – reading and math scores among a representative sample of fourth and eighth graders – will drive at least part of the debate for the next two years, when the next round comes out.

I’m going to make a prediction, one that is definitely a generalization, but is hardly uncommon in policy debates: People on all “sides” will interpret the results favorably no matter how they turn out.

If NAEP scores are positive – i.e., overall scores rise by a statistically significant margin, and/or there are encouraging increases among key subgroups such as low performers or low-income students – supporters of market-based reform will say that their preferred policies are working. They’ll claim that the era of test-based accountability, which began with the enactment of No Child Left Behind ten years ago, has produced real results. Market reform skeptics, on the other hand, will say that virtually none of the policies that reformers are pushing for, such as test-based teacher evaluations and merit pay, were in force in more than a handful of locations between 2009 and 2011. Therefore, they’ll claim, the NAEP progress shows that the system is working without these changes.

If the NAEP results are not encouraging – i.e., overall progress is flat (or negative), and there are no strong gains among key subgroups – the market-based crowd will use the occasion to argue that the “status quo” isn’t producing results, and they will strengthen their call for policies like new evaluations and merit pay. Skeptics, in contrast, will claim that NCLB and standardized test-based accountability were failures from the get-go. Some will even use the NAEP results to advocate for the wholesale elimination of standardized testing.

The Legend Of Last Fall

The subject of Michelle Rhee’s teaching record has recently received a lot of attention. While the controversy has been interesting, it could also be argued that it’s relatively unimportant. The evidence that she exaggerated her teaching prowess is, after all, inconclusive (though highly suggestive). A little resume inflation from a job 20 years ago might be overlooked, so long as Rhee’s current claims about her more recent record are accurate. But are they?

On Rhee’s new website, her official bio – in effect, her resume today (or at least her cover letter) – contains a few sentences about her record as chancellor of D.C. Public Schools (DCPS), under the header "Driving Unprecedented Growth in the D.C. Public Schools." There, her test-based accomplishments are characterized as follows:

Under her leadership, the worst performing school district in the country became the only major city system to see double-digit growth in both their state reading and state math scores in seventh, eighth and tenth grades over three years.

This time, we can presume that the statement has been vetted thoroughly, using all the tools of data collection and analysis available to Rhee during her tenure at the helm of DCPS.

But the statement is false.

Michelle Rhee's Testing Legacy: An Open Question

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post.

Michelle Rhee’s resignation and departure have, predictably, provoked a flurry of conflicting reactions. Yet virtually all of them, from opponents and supporters alike, seem to assume that her tenure at the helm of the D.C. Public Schools (DCPS) helped to boost student test scores dramatically. She and D.C. Mayor Adrian Fenty made similar claims themselves in the Wall Street Journal (WSJ) just last week.

Hardly anybody, regardless of their opinion about Michelle Rhee, thinks that test scores alone are an adequate indicator of student success. But, in no small part because of her own emphasis on them, that is how this debate has unfolded. Her aim was to raise scores and, with few exceptions (also here and here), even those who objected to her “abrasive” style and controversial policies seem to believe that she succeeded wildly in the testing area.

This conclusion is premature. A review of the record shows that Michelle Rhee’s test score “legacy” is an open question. 

There are three main points to consider: