Burden Of Proof, Benefit Of Assumption

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

Michelle Rhee, the controversial former chancellor of D.C. public schools, is a lightning rod. Her confrontational style has made her many friends as well as enemies. As is usually the case, people’s reaction to her approach in no small part depends on whether or not they support her policy positions.

I try to be open-minded toward people with whom I don’t often agree, and I can certainly accept that people operate in different ways. Honestly, I have no doubt as to Ms. Rhee’s sincere belief in what she’s doing; and, even if I think she could go about it differently, I respect her willingness to absorb so much negative reaction in order to try to get it done.

What I find disturbing is how she continues to try to build her reputation and advance her goals based on interpretations of testing results that are insulting to the public’s intelligence.

NAEP Shifting

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

Tomorrow, the education world will get the results of the 2011 National Assessment of Educational Progress (NAEP), often referred to as the “nation’s report card.” The findings – reading and math scores among a representative sample of fourth and eighth graders – will drive at least part of the debate for the next two years, until the next round comes out.

I’m going to make a prediction, one that is definitely a generalization, but is hardly uncommon in policy debates: People on all “sides” will interpret the results favorably no matter how they turn out.

If NAEP scores are positive – i.e., overall scores rise by a statistically significant margin, and/or there are encouraging increases among key subgroups such as low performers or low-income students – supporters of market-based reform will say that their preferred policies are working. They’ll claim that the era of test-based accountability, which began with the enactment of No Child Left Behind ten years ago, has produced real results. Market reform skeptics, on the other hand, will say that virtually none of the policies for which reformers are pushing, such as test-based teacher evaluations and merit pay, was in force in more than a handful of locations between 2009 and 2011. Therefore, they’ll claim, the NAEP progress shows that the system is working without these changes.

If the NAEP results are not encouraging – i.e., overall progress is flat (or negative), and there are no strong gains among key subgroups – the market-based crowd will use the occasion to argue that the “status quo” isn’t producing results, and they will strengthen their call for policies like new evaluations and merit pay. Skeptics, in contrast, will claim that NCLB and standardized test-based accountability were failures from the get-go. Some will even use the NAEP results to advocate for the wholesale elimination of standardized testing.

Trouble In Paradise

According to the principles of market-based education reform, there’s at least one large, urban public school district operating at max power: District of Columbia Public Schools.

For the past 2-3 years, DCPS has been a reformer’s paradise. The district has a new evaluation system (IMPACT), which it designed itself. The system includes heavily weighted value-added estimates (50 percent for teachers in tested grades/subjects), and the results of teachers’ evaluations are used every year to fire the teachers who receive the lowest rating, or who receive the second-lowest rating for two consecutive years. “Ineffective teachers” are being weeded out – no hearing, no due process, no nothing.
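For illustration, here is a minimal sketch of how rules like these might be written down in code. Only the 50 percent value-added weight and the dismissal rule come from the description above; the second component, the rating labels, and the cutoffs are hypothetical placeholders, not the actual IMPACT formula.

```python
# Hypothetical sketch of an IMPACT-style evaluation rule; everything except the
# 50 percent value-added weight and the dismissal logic is an invented placeholder.

def composite_score(value_added, other_measures):
    """Weighted composite for a teacher in a tested grade/subject (0-100 scale assumed)."""
    return 0.5 * value_added + 0.5 * other_measures

def rating(score):
    """Map a composite score to an ordinal rating (labels and cutoffs are made up)."""
    if score >= 80:
        return "highest"
    elif score >= 60:
        return "middle"
    elif score >= 40:
        return "second lowest"
    return "lowest"

def dismissed(this_year, last_year):
    """Dismissal rule described in the post: the lowest rating once, or the
    second-lowest rating two years in a row."""
    return this_year == "lowest" or (this_year == last_year == "second lowest")

print(rating(composite_score(value_added=35, other_measures=40)))  # -> lowest
print(dismissed("second lowest", "second lowest"))                 # -> True
```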

Furthermore, these evaluation scores are also used to award performance bonuses, and very large ones at that – up to $25,000. This should, so the logic goes, be attracting high-achieving people to DCPS, and keeping them around after they arrive. And, finally, as a result of many years of growth, the city has among the largest charter school sectors in the nation, with almost half of public school students attending charters. Theoretically, this competition should be upping the game of all schools, charter and regular public alike.

Basically, almost everything that market-based reformers think needs to happen has been the reality in DCPS for the past 2-3 years. And the staff has been transformed too. The majority of principals, and a huge proportion of teachers, were hired during the tenure of either Michelle Rhee or her successor, Kaya Henderson.

The district should be in overdrive right about now. Is it?

The Education Reporter's Dilemma

I’ve written so many posts about the misinterpretation of testing data in news stories that I’m starting to annoy myself. For example, I’ve shown that year-to-year changes in testing results might be attributable to the fact that, each year, a different set of students takes the test. I’ve discussed the fact that proficiency rates are not test scores – they only tell you the proportion of students above a given line – and that the rates and actual scores can move in opposite directions (see this simple illustration). And I’ve pleaded with journalists, most of whom I like and respect, to write with care about these issues (and, I should note, many of them do so).
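To make that last point concrete, here is a quick illustration with made-up numbers (not results from any actual district). Because the proficiency rate only counts how many students clear a single cutoff, it can rise even while the average score falls.

```python
# Illustrative only: two hypothetical groups of test takers and a cutoff of 300.
cutoff = 300
year1 = [240, 290, 295, 310, 390]   # made-up scale scores
year2 = [250, 301, 302, 305, 315]

def average(scores):
    return sum(scores) / len(scores)

def proficiency_rate(scores, cutoff):
    return 100 * sum(s >= cutoff for s in scores) / len(scores)

print(average(year1), proficiency_rate(year1, cutoff))  # 305.0  40.0
print(average(year2), proficiency_rate(year2, cutoff))  # 294.6  80.0
# The average score fell (305 to 294.6) while the proficiency rate rose (40% to 80%).
```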

Yet here I am, back on my soapbox again. This time the culprit is the recent release of SAT testing data, which generated dozens of error-plagued stories from newspapers and organizations. Like virtually all public testing data, the SAT results are cross-sectional – each year, the test is taken by a different group of students. This means that demographic changes in the sample of test takers influence the results. This problem is even more acute in the case of the SAT, since the test is voluntary. Despite the best efforts of the College Board (see their press release), a slew of stories improperly equated the decline in average SAT scores since the previous year with an overall decline in student performance – treating it as confirmation of educational malaise (in fairness, there were many exceptions).
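To see how the composition of test takers can drive the headline number, consider a toy example (the group shares and scores below are invented, not actual SAT data). If every subgroup’s average rises, but a growing share of test takers comes from a historically lower-scoring group, the overall average can still fall.

```python
# Toy composition effect; all numbers are invented.
# Each tuple is (number of test takers, average score) for a subgroup.

def overall_average(groups):
    total = sum(n for n, _ in groups)
    return sum(n * avg for n, avg in groups) / total

year1 = [(800, 1550), (200, 1300)]   # the lower-scoring group is a small share
year2 = [(800, 1560), (500, 1310)]   # both subgroups improve, but its share grows

print(round(overall_average(year1), 1))  # 1500.0
print(round(overall_average(year2), 1))  # 1463.8
# Both subgroup averages went up, yet the overall average "declined" by about 36
# points, purely because the mix of test takers changed.
```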

I’ve come to think that there’s a fundamental problem here: When you interpret testing data properly, you don’t have much of a story.

What Are "Middle Class Schools"?

An organization called “The Third Way” released a report last week, in which they present descriptive data on what they call “middle class schools.” The primary conclusion of their analysis is that “middle class schools” aren’t “making the grade,” and that they are “falling short on their most basic 21st century mission: To prepare kids to get a college degree.” They also argue that “middle class schools” are largely ignored in our debate and policymaking, and we need a “second phase of school reform” in order to address this deficit.

The Wall Street Journal swallowed the report whole, running a story presenting Third Way’s findings under the headline “Middle class schools fail to make the grade.”

To be clear, I think that our education policy debates do focus on lower-income schools to a degree that sometimes ignores those closer to the middle of the distribution. So, it’s definitely worthwhile to take a look at “middle class schools’” performance and how it can be improved. In other words, I’m very receptive to the underlying purpose of the report.

That said, this analysis consists mostly of arbitrary measurement and flawed, vague interpretations. As a result, it actually offers little meaningful insight.

How Cross-Sectional Are Cross-Sectional Testing Data?

In several posts, I’ve complained about how, in our public discourse, we misinterpret changes in proficiency rates (or actual test scores) as “gains” or “progress,” when they actually represent cohort changes – that is, they are performance snapshots for different groups of students who are potentially quite dissimilar.

For example, the most common way testing results are presented in news coverage and press releases is to compare year-to-year results across entire schools or districts – e.g., the overall proficiency rate across all grades in one year compared with the next. One reason why the two groups of students being compared (the first versus the second year) are different is obvious. In most districts, tests are only administered to students in grades 3-8. As a result, the eighth graders who take the test in Year 1 will not take it in Year 2, as they will have moved on to the ninth grade (unless they are retained). At the same time, a new cohort of third graders will take the test in Year 2 despite not having been tested in Year 1 (because they were in second grade). That’s a large amount of inherent “turnover” between years (this same situation applies when results are averaged for elementary and secondary grades). Variations in cohort performance can generate the illusion of “real” change in performance, positive or negative.

But there’s another big cause of incomparability between years: Student mobility. Students move in and out of districts every year. In urban areas, mobility is particularly high. And, in many places, this mobility includes students who move to charter schools, which are often run as separate school districts.

I think we all know intuitively about these issues, but I’m not sure many people realize just how different the group of tested students across an entire district can be in one year compared with the next. In order to give an idea of this magnitude, we might do a rough calculation for the District of Columbia Public Schools (DCPS).
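Here is a sketch of that kind of back-of-the-envelope calculation. To be clear, the enrollment and mobility figures below are placeholders rather than actual DCPS numbers; the point is only to show how quickly grade-band entry and exit, plus ordinary mobility, add up.

```python
# Back-of-the-envelope cohort turnover; all figures are hypothetical placeholders.
tested_grades = 6            # grades 3-8
tested_students = 18000      # tested students in a given year (made up)
annual_mobility_rate = 0.15  # share of tested students entering or leaving the district (made up)

# One grade's worth of students leaves the tested range each year (8th graders moving
# on to 9th), and a new grade's worth enters (incoming 3rd graders).
grade_band_turnover = tested_students / tested_grades          # = 3,000 on each side

# Ordinary mobility adds further non-overlap (rough; ignores any double counting).
mobility = annual_mobility_rate * tested_students              # = 2,700

share_not_in_both_years = (grade_band_turnover + mobility) / tested_students
print(f"Roughly {share_not_in_both_years:.0%} of either year's tested group "
      "is missing from the other year's group.")
# -> Roughly 32%, under these made-up assumptions
```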

Our Annual Testing Data Charade

Every year, around this time, states and districts throughout the nation release their official testing results. Schools are closed and reputations are made or broken by these data. But this annual tradition is, in some places, becoming a charade.

Most states and districts release two types of assessment data every year (by student subgroup, school and grade): Average scores (“scale scores”); and the percent of students who meet the standards to be labeled proficient, advanced, basic and below basic. The latter type – the rates – are of course derived from the scores – that is, they tell us the proportion of students whose scale score was above the minimum necessary to be considered proficient, advanced, etc.

Both types of data are cross-sectional. They don’t follow individual students over time, but rather give a “snapshot” of aggregate performance among two different groups of students (for example, third graders in 2010 compared with third graders in 2011). Calling the change in these results “progress” or “gains” is inaccurate; they are cohort changes, and might just as well be chalked up to differences in the characteristics of the students (especially when changes are small). Even averaged across an entire school or district, there can be huge differences in the groups compared between years – not only is there often considerable student mobility in and out of schools/districts, but every year, a new cohort enters at the lowest tested grade, while a whole other cohort exits at the highest tested grade (except for those retained).
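A toy example (with invented scores and an invented cut score) shows how easily such a “gain” can appear. Even if every continuing student’s score is identical in both years, the district-wide proficiency rate can jump simply because the incoming cohort happens to outscore the outgoing one.

```python
# Invented scale scores; the proficiency cut score is 300.
cut = 300

continuing = [280, 310, 320, 295]     # students tested in both years, scores unchanged
outgoing_8th = [250, 260, 270, 280]   # leave the tested range after Year 1
incoming_3rd = [305, 315, 290, 310]   # enter the tested range in Year 2

def rate(scores):
    return 100 * sum(s >= cut for s in scores) / len(scores)

year1 = continuing + outgoing_8th
year2 = continuing + incoming_3rd
print(rate(year1), rate(year2))   # 25.0  62.5
# The proficiency rate jumps by 37.5 points even though no continuing student's
# score changed; the entire "gain" is a difference between cohorts.
```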

For these reasons, any comparisons between years must be done with extreme caution, but the most common way – simply comparing proficiency rates between years – is in many respects the worst. A closer look at this year’s New York City results illustrates this perfectly.

Melodramatic

At a press conference earlier this week, New York City Mayor Michael Bloomberg announced the city’s 2011 test results. Wall Street Journal reporter Lisa Fleisher, who was on the scene, tweeted Mayor Bloomberg’s remarks. According to Fleisher, the mayor claimed that there was a “dramatic difference” between his city’s testing progress from 2010 to 2011 and that of the rest of the state.

Putting aside the fact that the results do not measure “progress” per se, but rather cohort changes – a comparison of cross-sectional data that measures the aggregate performance of two different groups of students – I must say that I was a little astounded by this claim. Fleisher was also kind enough to tweet a photograph that the mayor put on the screen in order to illustrate the “dramatic difference” in gains between NYC students and their non-NYC counterparts across the state. Here it is:

If Gifted And Talented Programs Don't Boost Scores, Should We Eliminate Them?

In education policy debates, the phrase “what works” is sometimes used to mean “what increases test scores.” Among those of us who believe that testing data have a productive role to play in education policy (even if we disagree on the details of that role), there is a constant struggle to interpret test-based evidence properly and put it in context. This effort to craft and maintain a framework for using assessment data productively is very important but, despite the careless claims of some public figures, it is also extremely difficult.

Equally important and difficult is the need to apply that framework consistently. For instance, a recent working paper from the National Bureau of Economic Research (NBER) looked at the question of whether gifted and talented (GT) programs boost student achievement. The researchers found that GT programs (and magnet schools as well) have little discernible impact on students’ test score gains. Another recent NBER paper reached the same conclusion about the highly selective “exam schools” in New York and Boston. Now, it’s certainly true that high-quality research on the test-based effect of these programs is still somewhat scarce, and these are only two (as yet unpublished) analyses, but their conclusions are worth noting.

Still, let’s speculate for a moment: Let’s say that, over the next few years, several other good studies also reached the same conclusion. Would anyone, based on this evidence, be calling for the elimination of GT programs? I doubt it. Yet, if we faithfully applied the standards by which we sometimes judge other policy interventions, we would have to make a case for getting rid of GT.

Atlanta: Bellwether Or Whistleblower For Test-Driven Reform?

Early in the life of No Child Left Behind, one amateur but insightful futurist on the Shanker Institute Board remarked to me: “Well, if you tie teacher pay, labeling failing schools, and evaluations of teachers and principals all to student test results – guess what? – you’ll get student test results. But some 20 years down the road, when these kids get out of high school, we may discover they don’t know anything.”

The quip did not necessarily suggest that we were headed for massive cheating scandals. Nor did it mean that students should never be assessed to find out how well they were learning what had been taught. It was just a warning that the incentives to produce score results would produce them – one way or another – whether or not those results reflected any real learning. Meaning, in this case, that a system that defines success narrowly in terms of test score gains will, at minimum, invite exaggerated claims and, at worst, encourage corruption.

An important report released this spring should have brought some U.S. education “reformers” up short as they pursue policies based on test-based incentives. Instead, Incentives and Test-Based Accountability in Education, by the National Research Council (NRC), was received as little more than a blip on their screens. A serious research review, the report looked at “15 test-based incentive programs, including large scale policies of NCLB, its predecessors, and state high school exit exams as well as a number of experiments and programs carried out in the United States and other countries.” Its conclusion: “Despite using them [test-based incentives] for several decades, policymakers and educators do not yet know how to consistently generate positive effects on achievement and to improve education.”

In other words, given the methods we are now using to grant performance pay, design evaluation plans, or fix low performing schools, these incentives don’t work. Moreover, looking at recent education history, they haven’t worked for quite a long time.