The Status Fallacy: New York State Edition

A recent New York Times story directly addresses New York Governor Andrew Cuomo’s suggestion, in his annual “State of the State” speech, that New York schools are in a state of crisis and "need dramatic reform." The article’s general conclusion is that the “data suggest otherwise.”

There are a bunch of important points raised in the article, but most of the piece is really just discussing student rather than school performance. Simple statistics about how highly students score on tests – i.e., “status measures” – tell you virtually nothing about the effectiveness of the schools those students attend, since, among other reasons, they don’t account for the fact that many students enter the system at low levels. How much students in a school know in a given year is very different from how much they learned over the course of that year.

I (and many others) have written about this “status fallacy” dozens of times (see our resources page), not because I enjoy repeating myself (I don’t), but rather because I am continually amazed just how insidious it is, and how much of an impact it has on education policy and debate in the U.S. And it feels like every time I see signs that things might be changing for the better, there is an incident, such as Governor Cuomo’s speech, that makes me question how much progress there really has been at the highest levels.
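To make the status/growth distinction concrete, here is a minimal sketch. The two schools and all of their scores are hypothetical, invented purely for illustration, and the "growth" measure is just a simple average gain, far cruder than the growth models states actually use.

```python
from statistics import mean

# Hypothetical data for illustration only (not real schools or scores).
# Each school maps to (start-of-year, end-of-year) scores for the SAME students.
schools = {
    "School A": [(70, 72), (75, 76), (80, 82)],
    "School B": [(40, 50), (45, 56), (50, 59)],
}

def status(pairs):
    """Average end-of-year score: how much students know."""
    return mean(end for _, end in pairs)

def growth(pairs):
    """Average gain among the same students: how much they learned."""
    return mean(end - start for start, end in pairs)

for name, pairs in schools.items():
    print(f"{name}: status={status(pairs):.1f}, growth={growth(pairs):.1f}")
```

In this made-up example, School A looks far better on status, while School B's students gain much more over the year. A status measure alone would rank the two schools in the opposite order from a growth measure.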

Actual Growth Measures Make A Big Difference When Measuring Growth

As a frequent critic of how states and districts present and interpret their annual testing results, I am also obliged (and indeed quite happy) to note when there is progress.

Recently, I happened to be browsing through New York City’s presentation of their 2014 testing results, and to my great surprise, on slide number four, I found proficiency rate changes between 2013 and 2014 among students who were in the sample in both years (which they call “matched changes”). As it turns out, last year, for the first time, New York State as a whole began publishing these "matched" year-to-year proficiency rate changes for all schools and districts. This is an excellent policy. As we’ve discussed here many times, NCLB-style proficiency rate changes, which compare overall rates of all students, many of whom are only in the tested sample in one of the years, are usually portrayed as “growth” or “progress.” They are not. They compare different groups of students, and, as we’ll see, this can have a substantial impact on the conclusions one reaches from the data. Limiting the sample to students who were tested in both years, though not perfect, at least permits one to measure actual growth per se, and provides a much better idea of whether students are progressing over time.

This is an encouraging sign that New York State is taking steps to improve the quality and interpretation of their testing data. And, just to prove that no good deed goes unpunished, let’s see what we can learn using the new “matched” data – specifically, by seeing how often the matched (longitudinal) and unmatched (cross-sectional) changes lead to different conclusions about student “growth” in schools.
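The difference between matched and unmatched changes can be sketched in a few lines. The student records below are hypothetical, and the function names are mine, not New York State's; this is only meant to show the logic of restricting the sample to students tested in both years.

```python
# Hypothetical records for illustration only: student id -> proficient?
year1 = {"s1": False, "s2": False, "s3": True, "s4": True}
year2 = {"s1": True, "s2": False, "s3": True, "s5": False, "s6": False}

def rate(records):
    """Percent of students in `records` who are proficient."""
    return 100 * sum(records.values()) / len(records)

def unmatched_change(y1, y2):
    """Cross-sectional change: everyone tested in each year."""
    return rate(y2) - rate(y1)

def matched_change(y1, y2):
    """Longitudinal change: only students tested in both years."""
    both = y1.keys() & y2.keys()
    before = rate({s: y1[s] for s in both})
    after = rate({s: y2[s] for s in both})
    return after - before

print(f"unmatched: {unmatched_change(year1, year2):+.1f} points")
print(f"matched:   {matched_change(year1, year2):+.1f} points")
```

With these invented records, the unmatched rate falls (because one proficient student left and two non-proficient students arrived), while the matched rate rises. Same school, same tests, opposite conclusions about "growth."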

Sample Size And Volatility In School Accountability Systems

It is generally well-known that sample size has an important effect on measurement and, therefore, incentives in test-based school accountability systems.

Within a given class or school, for example, there may be students who are sick on testing day, or get distracted by a noisy peer, or just have a bad day. Larger samples attenuate the degree to which unusual results among individual students (or classes) can influence results overall. In addition, schools draw their students from a population (e.g., a neighborhood). Even if the characteristics of the neighborhood from which the students come stay relatively stable, the pool of students entering the school (or tested sample) can vary substantially from one year to the next, particularly when that pool is small.

Classes and schools tend to be quite small, and test scores vary far more between- than within-student (i.e., over time). As a result, testing results often exhibit a great deal of nonpersistent variation (Kane and Staiger 2002). In other words, much of the difference in test scores between schools, and over time, is fleeting, and this problem is particularly pronounced in smaller schools. One very simple, though not original, way to illustrate this relationship is to compare the results for smaller and larger schools.
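The basic sampling logic can also be illustrated with a quick simulation, sketched here under strong simplifying assumptions: every school has exactly the same "true" average, student scores are drawn from a normal distribution, and the school sizes, mean, and standard deviation are all made-up numbers. This is not the approach in Kane and Staiger (2002), just a toy version of the same idea.

```python
import random
from statistics import mean, pstdev

def simulate_school_means(n_students, n_years, rng,
                          true_mean=50.0, student_sd=15.0):
    """Yearly average scores for a school whose true performance never changes.

    All parameter values are hypothetical; each year is a fresh random
    draw of `n_students` students from the same population.
    """
    return [
        mean(rng.gauss(true_mean, student_sd) for _ in range(n_students))
        for _ in range(n_years)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
small = simulate_school_means(n_students=20, n_years=200, rng=rng)
large = simulate_school_means(n_students=500, n_years=200, rng=rng)

# Spread of yearly averages: pure noise, since nothing "real" changed.
print(f"small school (n=20):  sd of yearly means = {pstdev(small):.2f}")
print(f"large school (n=500): sd of yearly means = {pstdev(large):.2f}")
```

Neither simulated school improves or declines, yet the small school's yearly averages bounce around far more than the large school's. An accountability system that reads those swings as changes in performance will reward and punish small schools largely for noise.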

The Debate And Evidence On The Impact Of NCLB

There is currently a flurry of debate focused on the question of whether “NCLB worked.” This question, which surfaces regularly in the education field, has been particularly salient in recent weeks, as Congress holds hearings on reauthorizing the law.

Any time there is a spell of “did NCLB work?” activity, one can hear and read numerous attempts to use simple NAEP changes in order to assess its impact. Individuals and organizations, including both supporters and detractors of the law, attempt to make their cases by presenting trends in scores, parsing subgroup estimates, and so on. These efforts, though typically well-intentioned, do not, of course, tell us much of anything about the law’s impact. One can use simple, unadjusted NAEP changes to prove or disprove any policy argument. And the reason is that they are not valid evidence of an intervention's effects. There’s more to policy analysis than subtraction.

But it’s not just the inappropriate use of evidence that makes these “did NCLB work?” debates frustrating and, often, unproductive. It is also the fact that NCLB really cannot be judged in simple, binary terms. It is a complex, national policy with considerable inter-state variation in design/implementation and various types of effects, intended and unintended. This is not a situation that lends itself to clear-cut yes/no answers to the “did it work?” question.

The Persistent Misidentification Of "Low Performing Schools"

In education, we hear the terms “failing school” and “low-performing school” quite frequently. Usually, they are used in soundbite-style catchphrases such as, “We can’t keep students trapped in ‘failing schools.’” Sometimes, however, they are used to refer to a specific group of schools in a given state or district that are identified as “failing” or “low-performing” as part of a state or federal law or program (e.g., waivers, SIG). There is, of course, interstate variation in these policies, but one common definition is that schools are “failing/low-performing” if their proficiency rates are in the bottom five percent statewide.

Putting aside the (important) issues with judging schools based solely on standardized testing results, low proficiency rates (or low average scores) tell you virtually nothing about whether or not a school is “failing.” As we’ve discussed here many times, students enter their schools performing at different levels, and schools cannot control the students they serve, only how much progress those students make while they’re in attendance (see here for more).

From this perspective, then, there may be many schools that are labeled “failing” or “low performing” but are actually of above average effectiveness in raising test scores. And, making things worse, virtually all of these will be schools that serve the most disadvantaged students. If that’s true, it’s difficult to think of anything more ill-advised than closing these schools, or even labeling them as “low performing.” Let’s take a quick, illustrative look at this possibility using the “bottom five percent” criterion, and data from Colorado in 2013-14 (note that this simple analysis is similar to what I did in this post, but this one is a little more specific; also see Glazerman and Potamites 2011; Ladd and Lauen 2010; and especially Chingos and West 2015).
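The logic of that check is simple enough to sketch. The twenty schools below, along with their proficiency rates and growth scores, are entirely hypothetical (they are not the Colorado data), and the helper names are my own. The point is only to show how a school can land in the "bottom five percent" on proficiency while sitting above the median on growth.

```python
from statistics import median

# Hypothetical rows for illustration only: (school, proficiency rate, growth score).
rows = [
    ("S01", 8, 61), ("S02", 14, 39), ("S03", 19, 55), ("S04", 23, 48),
    ("S05", 27, 42), ("S06", 31, 57), ("S07", 36, 50), ("S08", 40, 45),
    ("S09", 44, 52), ("S10", 49, 47), ("S11", 53, 49), ("S12", 57, 54),
    ("S13", 61, 43), ("S14", 65, 51), ("S15", 69, 46), ("S16", 73, 56),
    ("S17", 78, 44), ("S18", 82, 53), ("S19", 87, 48), ("S20", 93, 50),
]

def flag_lowest(rows, share=0.05):
    """Names of the schools with the lowest proficiency rates (bottom `share`)."""
    n_flagged = max(1, round(share * len(rows)))
    ranked = sorted(rows, key=lambda r: r[1])
    return [name for name, _, _ in ranked[:n_flagged]]

def flagged_but_above_median_growth(rows, share=0.05):
    """Flagged schools whose growth score exceeds the median across all schools."""
    flagged = set(flag_lowest(rows, share))
    cutoff = median(growth for _, _, growth in rows)
    return [name for name, _, growth in rows if name in flagged and growth > cutoff]

print("flagged as low-performing:", flag_lowest(rows))
print("of those, above-median growth:", flagged_but_above_median_growth(rows))
```

In this invented list, the one school flagged by the proficiency criterion also has the highest growth score. Real data will not be this tidy, but the question to ask of any "bottom five percent" list is the same: how many of the flagged schools would look average or better on growth?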

Fixing Our Broken System Of Testing And Accountability: The Reauthorization Of ESEA

** Reprinted here in the Washington Post

Our guest author today is Stephen Lazar, a founding teacher at Harvest Collegiate High School in New York City, where he teaches Social Studies. A National Board certified teacher, he blogs at Outside the Cave. Stephen is also one of the organizers of Insightful Social Studies, a grassroots campaign of teachers to reform the newly proposed New York State Social Studies standards. The following is Steve’s testimony this morning in front of the Senate HELP committee’s hearing on ESEA reauthorization.

Sen. Lamar Alexander, Sen. Patty Murray and distinguished members of the Senate Committee on Health, Education, Labor and Pensions, it is my honor to testify before you today on the reauthorization of the Elementary and Secondary Education Act (ESEA), and to share with you the perspective of a classroom teacher on how the ESEA should address the issue of testing and assessment.

I am a proud New York City public high school teacher. Currently, I teach both English and U.S. history to 11th-grade students at Harvest Collegiate High School in Manhattan, a school I helped found with a group of teachers three years ago. I also serve as our dean of Academic Progress, overseeing our school’s assessment system and supporting student learning schoolwide. My students, who are listening to us now—and who I need to remind to study for their test tomorrow—represent the full diversity of New York City. Over 70 percent receive free or reduced-price lunch; 75 percent are black and/or Latino; 25 percent have special education needs; and the overwhelming majority are immigrants or the children of immigrants.

PISA And TIMSS: A Distinction Without A Difference?

Our guest author today is William Schmidt, a University Distinguished Professor and co-director of the Education Policy Center at Michigan State University. He is also a member of the Shanker Institute board of directors.

Every year or two, the mass media is full of stories on the latest iterations of one of the two major international large scale assessments, the Trends in International Mathematics and Science Study (TIMSS) and the Program for International Student Assessment (PISA). What perplexes many is that the results of these two tests -- both well-established and run by respectable, experienced organizations -- suggest different conclusions about the state of U.S. mathematics education. Generally speaking, U.S. students do better on the TIMSS and poorly on the PISA, relative to their peers in other nations. Depending on their personal preferences, policy advocates can simply choose whichever test result is convenient to press their argument, leaving the general public without clear guidance.

Now, in one sense, the differences between the tests are more apparent than real. One reason why the U.S. ranks better on the TIMSS than the PISA is that the two tests sample students from different sets of countries. The PISA has many more wealthy countries, whose students tend to do better – hence, the U.S.’s lower ranking. It turns out that when looking at only the countries that participated in both the TIMSS and the PISA we find similar country rankings. There are also some differences in statistical sampling, but these are fairly minor.

A Descriptive Analysis Of The 2014 D.C. Charter School Ratings

The District of Columbia Public Charter School Board (PCSB) recently released the 2014 results of their “Performance Management Framework” (PMF), which is the rating system that the PCSB uses for its schools.

Very quick background: This system sorts schools into one of three “tiers,” with Tier 1 being the highest-performing, as measured by the system, and Tier 3 being the lowest. The ratings are based on a weighted combination of four types of factors -- progress, achievement, gateway, and leading -- which are described in detail in the first footnote.* As discussed in a previous post, the PCSB system, in my opinion, is better than many others out there, since growth measures play a fairly prominent role in the ratings, and, as a result, the final scores are only moderately correlated with key student characteristics such as subsidized lunch eligibility.** In addition, the PCSB is quite diligent about making the PMF results accessible to parents and other stakeholders, and, for the record, I have found the staff very open to sharing data and answering questions.

That said, PCSB's big message this year was that schools’ ratings are improving over time, and that, as a result, a substantially larger proportion of DC charter students are attending top-rated schools. This was reported uncritically by several media outlets, including this story in the Washington Post. It is also based on a somewhat questionable use of the data. Let’s take a very simple look at the PMF dataset, first to examine this claim and then, more importantly, to see what we can learn about the PMF and DC charter schools in 2013 and 2014.

Rethinking The Use Of Simple Achievement Gap Measures In School Accountability Systems

So-called achievement gaps – the differences in average test performance among student subgroups, usually defined in terms of ethnicity or income – are important measures. They demonstrate persistent inequality of educational outcomes and economic opportunities between different members of our society.

So long as these gaps remain, it means that historically lower-performing subgroups (e.g., low-income students or ethnic minorities) are less likely to gain access to higher education, good jobs, and political voice. We should monitor these gaps; try to identify all the factors that affect them, for good and for ill; and endeavor to narrow them using every appropriate policy lever – both inside and outside of the educational system.

Achievement gaps have also, however, taken on a very different role over the past 10 or so years. The sizes of gaps, and extent of “gap closing,” are routinely used by reporters and advocates to judge the performance of schools, school districts, and states. In addition, gaps and gap trends are employed directly in formal accountability systems (e.g., states’ school grading systems), in which they are conceptualized as performance measures.

Although simple measures of the magnitude of or changes in achievement gaps are potentially very useful in several different contexts, they are poor gauges of school performance, and shouldn’t be the basis for high-stakes rewards and punishments in any accountability system.
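One reason is plain arithmetic: a gap is a difference between two averages, so it can shrink for good or bad reasons. The sketch below uses hypothetical subgroup averages (not real data) to show a gap "closing" while both groups lose ground.

```python
# Hypothetical subgroup average scores for illustration only.
year1 = {"higher_scoring_group": 80, "lower_scoring_group": 60}
year2 = {"higher_scoring_group": 74, "lower_scoring_group": 58}

def gap(averages):
    """Simple achievement gap: difference between the two subgroup averages."""
    return averages["higher_scoring_group"] - averages["lower_scoring_group"]

def gap_change(before, after):
    """Negative values mean the gap narrowed."""
    return gap(after) - gap(before)

print("gap in year 1:", gap(year1))
print("gap in year 2:", gap(year2))
print("change in gap:", gap_change(year1, year2))
for group in year1:
    print(f"{group}: {year2[group] - year1[group]:+d} points")
```

Here the gap narrows by four points, which an accountability system might reward as "gap closing," even though the lower-scoring group's average fell and the narrowing comes entirely from the higher-scoring group falling faster.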

The Bewildering Arguments Underlying Florida's Fight Over ELL Test Scores

The State of Florida is currently engaged in a policy tussle of sorts with the U.S. Department of Education (USED) over Florida’s accountability system. To make a long story short, last spring, Florida passed a law saying that the test scores of English language learners (ELLs) would only count toward schools’ accountability grades (and teacher evaluations) once the ELL students had been in the system for at least two years. This runs up against federal law, which requires that ELLs’ scores be counted after only one year, and USED has indicated that it’s not willing to budge on this requirement. In response, Florida is considering legal action.

This conflict might seem incredibly inane (unless you’re in one of the affected schools, of course). Beneath the surface, though, this is actually kind of an amazing story.

Put simply, Florida’s argument against USED's policy of counting ELL scores after just one year is a perfect example of the reason why most of the state's core accountability measures (not to mention those of NCLB as a whole) are so inappropriate: they judge schools’ performance based largely on where their students’ scores end up without paying any attention to where they start out.