## Where Al Shanker Stood: The Importance And Meaning Of NAEP Results

In this New York Times piece, published on July 29, 1990, Al Shanker discusses the results of the National Assessment of Educational Progress (NAEP), and what they suggested about the U.S. education system at the time.

One of the things that has influenced me most strongly to call for radical school reform has been the results of the National Assessment of Educational Progress (NAEP) examinations. These exams have been testing the achievement of our 9-, 13- and 17-year-olds in a number of basic areas over the past 20 years, and the results have been almost uniformly dismal.

According to NAEP results, no 17-year-olds who are still in school are illiterate and innumerate - that is, all of them can read the words you would find on a cereal box or a billboard, and they can do simple arithmetic. But very few achieve what a reasonable person would call competence in reading, writing or computing.

For example, NAEP's 20-year overview, Crossroads in American Education, indicated that only 2.6 percent of 17-year-olds taking the test could write a good letter to a high school principal about why a rule should be changed. And when I say good, I'm talking about a straightforward presentation of a couple of simple points. Only 5 percent could grasp a paragraph as complicated as the kind you would find in a first-year college textbook. And only 6 percent could solve a multi-step math problem like this one: "Christine borrowed \$850 for one year from Friendly Finance Company. If she paid 12% simple interest on the loan, what was the total amount she repaid?"
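For reference, the problem can be worked in two steps. This assumes the standard simple-interest formula (interest = principal × rate × time), which the column itself does not spell out:

$$
\text{interest} = 850 \times 0.12 \times 1 = 102, \qquad \text{total repaid} = 850 + 102 = 952
$$

So Christine repaid \$952 in total.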

## Trust: The Foundation Of Student Achievement

When sharing with me the results of some tests, my doctor once said, "You are a scientist, you know a single piece of data can't provide all the answers or suffice to make a diagnosis. We can't look at a single number in isolation, we need to look at all results in combination." Was my doctor suggesting that I ignore that piece of information we had? No. Was my doctor deemphasizing the result? No. He simply said that we needed additional evidence to make informed decisions. This is, of course, correct.

In education, however, it is frequently implied or even stated directly that the bottom line when it comes to school performance is student test scores, whereas any other outcomes, such as cooperation between staff or a supportive learning environment, are ultimately "soft" and, at best, of secondary importance. This test-based, individual-focused position is viewed as serious, rigorous, and data driven. Deviation from it -- e.g., equal emphasis on additional, systemic aspects of schools and the people in them -- is sometimes derided as an evidence-free mindset. Now, granted, few people are “purely” in one camp or the other. Most probably see themselves as pragmatists, and, as such, somewhere in between: Test scores are probably not all that matters, but since the rest seems so difficult to measure, we might as well focus on "hard data" and hope for the best.

Why this narrow focus on individual measures such as student test scores or teacher quality? I am sure there are many reasons, but one is probably a lack of familiarity with the growing research showing that we must go beyond the individual teacher and student and examine the social-organizational aspects of schools, which are associated (most likely causally) with student achievement. In other words, all the factors that skeptics and pragmatists might consider a distraction or a luxury are actually relevant for the one thing we all care about: student achievement. Moreover, increasing focus on these factors might actually help us understand what’s really important: not simply whether testing results went up or down, but why or why not.

## Is The Social Side Of Education Touchy Feely?

That's right, measuring social and organizational aspects of schools is just... well, "touchy feely." We all intuitively grasp that social relations are important in our work environments, that having mentors on the job can make a world of difference, that knowing how to work with colleagues matters to the quality of the end product, that innovation and improvement rely on the sharing of ideas, that having a good relationship with supervisors influences both engagement and performance, and so on.

I could go on, but I don't have to; we all just know these things. But is there hard evidence, other than common sense and our personal experiences? Behaviors such as collaboration and interaction or qualities like trust are difficult to quantify. In the end, is it possible that they are just 'soft' and that, even if they’re important (and they are), they just don't belong in policy conversations?

Wrong.

In this post, I review three distinct methodological approaches that researchers have used to understand social-organizational aspects of schools. Specifically, I selected studies that examine the relationship between aspects of teachers' social-organizational environments and their students' achievement growth. I focus both on the methods and on the substantive findings. This is because I think some basic sense of how researchers look at complex constructs like trust or collegiality can deepen our understanding of this work and lead us to embrace its implications for policy and practice more fully.

## Charter Schools, Special Education Students, And Test-Based Accountability

Opponents often argue that charter schools tend to serve a disproportionately low number of special education students. And, while there may be exceptions and certainly a great deal of variation, that argument is essentially accurate. Regardless of why this is the case (and there is plenty of contentious debate about that), some charter school supporters have acknowledged that it may be a problem insofar as charters are viewed as a large scale alternative to regular public schools.

For example, Robin Lake, writing for the Center for Reinventing Public Education, takes issue with her fellow charter supporters who assert that “we cannot expect every school to be all things to every child.” She argues instead that schools, regardless of their governance structures, should never “send the soft message that kids with significant differences are not welcome,” or treat them as if “they are somebody else’s problem.” Rather, Ms. Lake calls upon charter school operators to take up the banner of serving the most vulnerable and challenging students and “work for systemic special education solutions.”

These are, needless to say, noble thoughts, with which many charter opponents and supporters can agree. Still, there is a somewhat more technocratic but perhaps more actionable issue lurking beneath the surface here: Put simply, until test-based accountability systems in the U.S. are redesigned such that they stop penalizing schools for the students they serve, rather than their effectiveness in serving those students, there will be a rather strong disincentive for charters to focus aggressively on serving special education students. Moreover, whatever accountability disadvantage may be faced by regular public schools that serve higher proportions of special education students pales in comparison with that faced by all schools, charter and regular public, located in higher-poverty areas. In this sense, then, addressing this problem is something that charter supporters and opponents should be doing together.

## The Big Story About Gender Gaps In Test Scores

The OECD recently published a report about differences in test scores between boys and girls on the Programme for International Student Assessment (PISA), which is a test of 15-year-olds conducted every three years in multiple subjects. The main summary finding is that, in most nations, girls are significantly less likely than boys to score below the “proficient” threshold in all three subjects (math, reading and science). The report also includes survey items and other outcomes.

First, it is interesting to me how discussions of these gender gaps differ from those about gaps between income or ethnicity groups. Specifically, when we talk about gender gaps, we interpret them properly – as gaps in measured performance between groups of students. Discussions of gaps between groups defined in terms of income or ethnicity, on the other hand, are almost always framed in terms of school performance.

This is partially because schools in the U.S. are segregated by income and ethnicity, but not really by gender, and also because some folks have a tendency to overestimate the degree to which income- and ethnicity-based achievement gaps stem from systematic variation in schooling inputs, whereas in reality they are more a function of non-school factors (though, of course, schools matter, and differences in school quality reinforce the non-school-based impact). That said, returning to the findings of this report, I was slightly concerned with how, in some cases, they were reported in the media.

## The Status Fallacy: New York State Edition

A recent New York Times story addresses directly New York Governor Andrew Cuomo’s suggestion, in his annual “State of the State” speech, that New York schools are in a state of crisis and "need dramatic reform." The article’s general conclusion is that the “data suggest otherwise.”

There are a bunch of important points raised in the article, but most of the piece is really just discussing student rather than school performance. Simple statistics about how highly students score on tests – i.e., “status measures” – tell you virtually nothing about the effectiveness of the schools those students attend, since, among other reasons, they don’t account for the fact that many students enter the system at low levels. How much students in a school know in a given year is very different from how much they learned over the course of that year.
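The distinction between status and growth can be made concrete with a minimal sketch. The two schools and all of the scores below are hypothetical, invented purely for illustration, and the function names `status` and `growth` are labels chosen here rather than terms from any accountability system:

```python
# Hypothetical scores for illustration only; these are not real data.
from statistics import mean

# Each tuple is (start_of_year_score, end_of_year_score) for one student.
school_a = [(70, 72), (75, 76), (80, 82)]  # students enter at high levels
school_b = [(40, 50), (45, 56), (50, 59)]  # students enter at low levels


def status(students):
    """Average end-of-year score: how much students know."""
    return mean(end for _, end in students)


def growth(students):
    """Average gain over the year: how much students learned."""
    return mean(end - start for start, end in students)


for name, school in [("A", school_a), ("B", school_b)]:
    print(name, round(status(school), 1), round(growth(school), 1))
```

In this made-up example, School A looks far better on status while School B's students gained much more during the year, which is exactly the information that a status measure hides.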

I (and many others) have written about this “status fallacy” dozens of times (see our resources page), not because I enjoy repeating myself (I don’t), but rather because I am continually amazed at just how insidious it is, and how much of an impact it has on education policy and debate in the U.S. And it feels like every time I see signs that things might be changing for the better, there is an incident, such as Governor Cuomo’s speech, that makes me question how much progress there really has been at the highest levels.

## Actual Growth Measures Make A Big Difference When Measuring Growth

As a frequent critic of how states and districts present and interpret their annual testing results, I am also obliged (and indeed quite happy) to note when there is progress.

Recently, I happened to be browsing through New York City’s presentation of their 2014 testing results, and to my great surprise, on slide number four, I found proficiency rate changes between 2013 and 2014 among students who were in the sample in both years (which they call “matched changes”). As it turns out, last year, for the first time, New York State as a whole began publishing these "matched" year-to-year proficiency rate changes for all schools and districts. This is an excellent policy. As we’ve discussed here many times, NCLB-style proficiency rate changes, which compare overall rates of all students, many of whom are only in the tested sample in one of the years, are usually portrayed as “growth” or “progress.” They are not. They compare different groups of students, and, as we’ll see, this can have a substantial impact on the conclusions one reaches from the data. Limiting the sample to students who were tested in both years, though not perfect, at least permits one to measure actual growth per se, and provides a much better idea of whether students are progressing over time.

This is an encouraging sign that New York State is taking steps to improve the quality and interpretation of their testing data. And, just to prove that no good deed goes unpunished, let’s see what we can learn using the new “matched” data – specifically, by seeing how often the matched (longitudinal) and unmatched (cross-sectional) changes lead to different conclusions about student “growth” in schools.
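To see how the two kinds of changes can diverge, here is a small sketch under stated assumptions. The student records are hypothetical (not New York data), and `unmatched_change` and `matched_change` are illustrative names, not the state's actual calculation:

```python
# Hypothetical records for one school; these are not real data.
# Each dict maps a student id to True if that student scored proficient.
year1 = {"s1": False, "s2": False, "s3": True, "s4": True}
year2 = {"s1": True, "s2": False, "s3": True, "s5": False, "s6": False}


def rate(results):
    """Proficiency rate, in percent, among the students in `results`."""
    return 100 * sum(results.values()) / len(results)


def unmatched_change(y1, y2):
    """Cross-sectional change: everyone tested in each year."""
    return rate(y2) - rate(y1)


def matched_change(y1, y2):
    """Longitudinal change: only students tested in both years."""
    both = y1.keys() & y2.keys()
    return rate({s: y2[s] for s in both}) - rate({s: y1[s] for s in both})


print(unmatched_change(year1, year2))
print(round(matched_change(year1, year2), 1))
```

In this invented case the unmatched rate falls by 10 points, yet among the three students tested in both years the rate rises by about 33 points. The difference comes entirely from who entered and left the tested sample.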

## Sample Size And Volatility In School Accountability Systems

It is generally well-known that sample size has an important effect on measurement and, therefore, incentives in test-based school accountability systems.

Within a given class or school, for example, there may be students who are sick on testing day, or get distracted by a noisy peer, or just have a bad day. Larger samples attenuate the degree to which unusual results among individual students (or classes) can influence results overall. In addition, schools draw their students from a population (e.g., a neighborhood). Even if the characteristics of the neighborhood from which the students come stay relatively stable, the pool of students entering the school (or tested sample) can vary substantially from one year to the next, particularly when that pool is small.

Classes and schools tend to be quite small, and test scores vary far more between- than within-student (i.e., over time). As a result, testing results often exhibit a great deal of nonpersistent variation (Kane and Staiger 2002). In other words, much of the variation in test scores between schools, and over time, is fleeting, and this problem is particularly pronounced in smaller schools. One very simple, though not original, way to illustrate this relationship is to compare the results for smaller and larger schools.
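The sampling part of this argument can also be sketched with a short simulation. Everything here is an assumption made for illustration: scores are drawn from a normal distribution with an arbitrary mean and spread, every simulated school draws from the same population, and the school sizes (20 and 500 tested students) are made up:

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the sketch is reproducible


def simulate_school_means(n_students, n_schools=200, mu=50.0, sigma=10.0):
    """Return the mean score of each simulated school.

    All schools draw students from the same population, so any spread
    in school means is sampling noise, not a difference in quality.
    """
    return [
        mean(random.gauss(mu, sigma) for _ in range(n_students))
        for _ in range(n_schools)
    ]


small = simulate_school_means(n_students=20)
large = simulate_school_means(n_students=500)

# Spread of school averages across otherwise identical schools.
spread_small = pstdev(small)
spread_large = pstdev(large)
print(f"small schools: {spread_small:.2f}, large schools: {spread_large:.2f}")
```

Because the simulated schools are identical by construction, the wider spread among the small schools reflects chance alone. An accountability system that ranks schools by a single year's average would therefore tend to place small schools at both the top and the bottom more often.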

## The Debate And Evidence On The Impact Of NCLB

There is currently a flurry of debate focused on the question of whether “NCLB worked.” This question, which surfaces regularly in the education field, has been particularly salient in recent weeks, as Congress holds hearings on reauthorizing the law.

Any time there is a spell of “did NCLB work?” activity, one can hear and read numerous attempts to use simple NAEP changes in order to assess its impact. Individuals and organizations, including both supporters and detractors of the law, attempt to make their cases by presenting trends in scores, parsing subgroup estimates, and so on. These efforts, though typically well-intentioned, do not, of course, tell us much of anything about the law’s impact. One can use simple, unadjusted NAEP changes to prove or disprove any policy argument. And the reason is that they are not valid evidence of an intervention's effects. There’s more to policy analysis than subtraction.

But it’s not just the inappropriate use of evidence that makes these “did NCLB work?” debates frustrating and, often, unproductive. It is also the fact that NCLB really cannot be judged in simple, binary terms. It is a complex, national policy with considerable inter-state variation in design/implementation and various types of effects, intended and unintended. This is not a situation that lends itself to clear cut yes/no answers to the “did it work?” question.

## The Persistent Misidentification Of "Low Performing Schools"

In education, we hear the terms “failing school” and “low-performing school” quite frequently. Usually, they are used in soundbite-style catchphrases such as, “We can’t keep students trapped in ‘failing schools.’” Sometimes, however, they are used to refer to a specific group of schools in a given state or district that are identified as “failing” or “low-performing” as part of a state or federal law or program (e.g., waivers, SIG). There is, of course, interstate variation in these policies, but one common definition is that schools are “failing/low-performing” if their proficiency rates are in the bottom five percent statewide.

Putting aside the (important) issues with judging schools based solely on standardized testing results, low proficiency rates (or low average scores) tell you virtually nothing about whether or not a school is “failing.” As we’ve discussed here many times, students enter their schools performing at different levels, and schools cannot control the students they serve, only how much progress those students make while they’re in attendance (see here for more).

From this perspective, then, there may be many schools that are labeled “failing” or “low performing” but are actually of above average effectiveness in raising test scores. And, making things worse, virtually all of these will be schools that serve the most disadvantaged students. If that’s true, it’s difficult to think of anything more ill-advised than closing these schools, or even labeling them as “low performing.” Let’s take a quick, illustrative look at this possibility using the “bottom five percent” criterion, and data from Colorado in 2013-14 (note that this simple analysis is similar to what I did in this post, but this one is a little more specific; also see Glazerman and Potamites 2011; Ladd and Lauen 2010; and especially Chingos and West 2015).
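The logic of that check can be outlined in a few lines. This is only a sketch: the 20 schools and their numbers are hypothetical (not the Colorado data), "growth" stands in for whatever growth measure a state reports, and `flag_bottom` and `misidentified` are names invented for this example:

```python
from statistics import median

# Hypothetical (proficiency_rate, growth_score) pairs; these are not real data.
pairs = [
    (12, 58), (18, 41), (25, 52), (31, 47), (36, 60),
    (40, 39), (44, 50), (48, 55), (52, 44), (55, 49),
    (58, 51), (61, 46), (64, 53), (68, 48), (71, 42),
    (75, 57), (79, 45), (83, 54), (88, 43), (93, 56),
]
schools = [
    {"name": f"school_{i}", "proficiency": p, "growth": g}
    for i, (p, g) in enumerate(pairs)
]


def flag_bottom(schools, fraction=0.05):
    """Schools with the lowest proficiency rates (at least one school)."""
    k = max(1, round(len(schools) * fraction))
    return sorted(schools, key=lambda s: s["proficiency"])[:k]


def misidentified(schools, fraction=0.05):
    """Flagged schools whose growth is above the median for all schools."""
    cutoff = median(s["growth"] for s in schools)
    return [s for s in flag_bottom(schools, fraction) if s["growth"] > cutoff]


print([s["name"] for s in misidentified(schools)])
```

In this invented set, the one school flagged by the bottom five percent rule has the lowest proficiency rate but above-median growth, so a status-only rule would label it "low performing" despite its students making relatively strong progress.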