The Real Charter School Experiment

The New York Times reports on a pilot program in Houston, called the "Apollo 20 Program," in which some of the district’s regular public schools are "mimicking" the practices of high-performing charter schools. According to the Times article, the pilot schools seek to replicate five of the practices commonly used by high-flying charters: extended school time; extensive tutoring; more selective hiring of principals and teachers; “data-driven” instruction, including frequent diagnostic quizzing; and a “no excuses” culture of high expectations.

In theory, this pilot program is a good idea, since a primary mission of charter schools should be to serve as testing grounds for new policies and practices that could help improve all schools. More than a decade of evidence has made it very clear that there’s nothing about "charterness" that makes a school successful – and indeed, only a handful get excellent results. So instead of arguing along the tired old pro-/anti-charter lines, we should, like Houston, be asking why these schools excel and working to see if we can use this information productively.

I’ll be watching to see how the pilot schools end up doing. I’m also hoping that the analysis (the program is being overseen by Harvard’s EdLabs) includes some effort to separate out the effects of each of the five replicated practices. If so, I’m guessing that we will find that the difference between high- and low-performing urban schools depends more than anything else on two factors: time and money.

How Cross-Sectional Are Cross-Sectional Testing Data?

In several posts, I’ve complained about how, in our public discourse, we misinterpret changes in proficiency rates (or actual test scores) as “gains” or “progress," when they actually represent cohort changes—that is, they are performance snapshots for different groups of students who are potentially quite dissimilar.

For example, the most common way testing results are presented in news coverage and press releases is to present year-to-year testing results across entire schools or districts – e.g., the overall proficiency rate across all grades in one year compared with the next. One reason why the two groups of students being compared (the first versus the second year) are different is obvious. In most districts, tests are only administered to students in grades 3-8. As a result, the eighth graders who take the test in Year 1 will not take it in Year 2, as they will have moved on to the ninth grade (unless they are retained). At the same time, a new cohort of third graders will take the test in Year 2 despite not having been tested in Year 1 (because they were in second grade). That’s a large amount of inherent “turnover” between years (this same situation applies when results are averaged for elementary and secondary grades). Variations in cohort performance can generate the illusion of "real" change in performance, positive or negative.
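To make that arithmetic concrete, here is a minimal sketch in Python. All enrollment counts and proficiency rates are invented for illustration; the point is only to show how pure cohort replacement can move a district-wide rate even when no student's performance changes.

```python
# Hypothetical illustration (all numbers invented, not real district data):
# two school years in which every continuing cohort performs identically,
# yet the overall rate "declines" because of cohort turnover.

def overall_rate(cohorts):
    """Enrollment-weighted proficiency rate across tested grades."""
    tested = sum(n for n, _ in cohorts)
    proficient = sum(n * rate for n, rate in cohorts)
    return proficient / tested

# (students tested, proficiency rate) for grades 3-8
year1 = [(500, 0.55), (500, 0.60), (500, 0.62), (500, 0.64), (500, 0.66), (500, 0.70)]
# Year 2: each cohort keeps its exact Year 1 rate as it moves up a grade,
# but a weaker incoming third-grade cohort replaces the departing eighth graders.
year2 = [(500, 0.48), (500, 0.55), (500, 0.60), (500, 0.62), (500, 0.64), (500, 0.66)]

print(f"Year 1 overall: {overall_rate(year1):.1%}")  # Year 1 overall: 62.8%
print(f"Year 2 overall: {overall_rate(year2):.1%}")  # Year 2 overall: 59.2%
# The apparent 3.6-point "decline" is pure cohort replacement: no student got worse.
```

The same mechanism can just as easily manufacture an apparent gain, if the incoming cohort happens to be stronger than the one that aged out.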

But there’s another big cause of incomparability between years: Student mobility. Students move in and out of districts every year. In urban areas, mobility is particularly high. And, in many places, this mobility includes students who move to charter schools, which are often run as separate school districts.

I think we all know intuitively about these issues, but I’m not sure many people realize just how different the group of tested students across an entire district can be in one year compared with the next. In order to give an idea of this magnitude, we might do a rough calculation for the District of Columbia Public Schools (DCPS).
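As a sketch of what such a back-of-the-envelope calculation looks like, the snippet below uses entirely hypothetical figures (the per-grade counts and the mobility rate are my own assumptions for illustration, not DCPS numbers) to estimate the share of one year's tested students who were not in the prior year's tested pool.

```python
# Rough turnover sketch (all figures hypothetical, for illustration only).
tested_per_grade = 3000        # assumed students tested per grade, grades 3-8
grades = 6
total_tested = tested_per_grade * grades

entering_3rd = tested_per_grade     # new cohort tested in Year 2
exiting_8th = tested_per_grade      # cohort that ages out after Year 1
mobility_rate = 0.15                # assumed share of continuing students who move in/out

# Students in Year 2's tested pool who were not in Year 1's:
grade_turnover = entering_3rd
movers = (total_tested - exiting_8th) * mobility_rate
different = grade_turnover + movers
print(f"Share of Year 2 test-takers not tested in Year 1: {different / total_tested:.0%}")
```

Even with these modest assumed inputs, roughly three in ten tested students differ between the two years, before accounting for retention or exits to charters.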

Predicaments Of Reform

Our guest author today is David K. Cohen, John Dewey Collegiate Professor of Education and professor of public policy at the University of Michigan, and a member of the Shanker Institute’s board of directors. This is a response to Michael Petrilli, who recently published a post on the Fordham Institute’s blog that referred to Cohen’s new book.

Dear Mike:

Thank you for considering my book Teaching And Its Predicaments (Harvard University Press, 2011), and for your intelligent discussion of the issues. I write to continue the conversation. 

You are right to say that I see the incoherence of U.S. public education as a barrier to more quality and less inequality, but I do not "look longingly" at Asia or Finland, let alone take them as models for what Americans should do to improve schools. 

In my 2009 book (The Ordeal Of Equality: Did Federal Regulation Fix The Schools?), Susan L. Moffitt and I recounted the great difficulties that the "top-down" approach to coherence, with which you associate my work, encountered as Title I of the 1965 ESEA was refashioned to leverage much greater central influence on schooling. Susan and I concluded that increased federal regulation had not fixed the schools, and had caused some real damage along with some important constructive effects. We did not see central coherence as The Answer.

Quality Control, When You Don't Know The Product

Last week, New York State’s Supreme Court issued an important ruling on the state’s teacher evaluations. The aspect of the ruling that got the most attention was the proportion of evaluations – or “weight” – that could be assigned to measures based on state assessments (in the form of estimates from value-added models). Specifically, the Court ruled that these measures can only comprise 20 percent of a teacher’s evaluation, compared with the option of up to 40 percent for which Governor Cuomo and others were pushing. Under the decision, the other 20 percent must consist entirely of alternative test-based measures (e.g., local assessments).
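Under the weighting scheme described in the ruling, a teacher's composite score would be assembled roughly as follows. This is a minimal sketch: the component scores are invented, and "observations" is my stand-in label for the remaining 60 percent, which the ruling does not itemize here.

```python
# Sketch of the weighting scheme described in the ruling (hypothetical scores).
weights = {"state_vam": 0.20, "local_assessments": 0.20, "observations": 0.60}
scores = {"state_vam": 70, "local_assessments": 80, "observations": 90}  # 0-100, invented

composite = round(sum(weights[k] * scores[k] for k in weights), 2)
print(composite)  # 84.0
```

The dispute between the 20 and 40 percent options amounts to changing the `state_vam` weight (and shrinking the non-test remainder accordingly).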

Joe Williams, head of Democrats for Education Reform, one of the flagship organizations of the market-based reform movement, called the ruling “a slap in the face” and “a huge win for the teachers unions." He characterized the policy impact as follows: “A mediocre teacher evaluation just got even weaker."

This statement illustrates perfectly the strange reasoning that seems to be driving our debate about evaluations.

Charter And Regular Public School Performance In "Ohio 8" Districts, 2010-11

Every year, the state of Ohio releases an enormous amount of district- and school-level performance data. Since Ohio has one of the largest charter school populations in the nation, the data provide an opportunity to examine performance differences between charters and regular public schools in the state.

Ohio’s charters are concentrated largely in the urban “Ohio 8” districts (sometimes called the “Big 8”): Akron; Canton; Cincinnati; Cleveland; Columbus; Dayton; Toledo; and Youngstown. Charter coverage varies considerably between the “Ohio 8” districts, but it is, on average, about 20 percent, compared with roughly five percent across the whole state. I will therefore limit my quick analysis to these districts.

Let’s start with the measure that gets the most attention in the state: Overall “report card grades." Schools (and districts) can receive one of six possible ratings: Academic emergency; academic watch; continuous improvement; effective; excellent; and excellent with distinction.

These ratings represent a weighted combination of four measures. Two of them measure performance “growth," while the other two measure “absolute” performance levels. The growth measures are AYP (yes or no), and value-added (whether schools meet, exceed, or come in below the growth expectations set by the state’s value-added model). The first “absolute” performance measure is the state’s “performance index," which is calculated based on the percentage of a school’s students who fall into the four NCLB categories of advanced, proficient, basic and below basic. The second is the number of “state standards” that schools meet as a percentage of the number of standards for which they are “eligible." For example, the state requires 75 percent proficiency in all the grade/subject tests that a given school administers, and schools are “awarded” a “standard met” for each grade/subject in which three-quarters of their students score above the proficiency cutoff (state standards also include targets for attendance and a couple of other non-test outcomes).
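The "standards met" arithmetic described above can be sketched as follows. This covers the test-based standards only (the attendance and other non-test standards are omitted), and the grade/subject proficiency rates are invented for illustration.

```python
# Sketch of Ohio's "state standards" measure as described above: a school
# earns a standard for each grade/subject test in which at least 75 percent
# of its students score proficient or above (rates below are hypothetical).
THRESHOLD = 0.75

proficiency_by_test = {
    "grade4_math": 0.81,
    "grade4_reading": 0.77,
    "grade5_math": 0.69,
    "grade5_reading": 0.74,
}

met = sum(rate >= THRESHOLD for rate in proficiency_by_test.values())
eligible = len(proficiency_by_test)
print(f"Standards met: {met}/{eligible} = {met / eligible:.0%}")
```

Note the all-or-nothing character of the measure: a grade/subject at 74 percent proficiency earns nothing, while one at 76 percent earns full credit.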

The graph below presents the raw breakdown in report card ratings for charter and regular public schools.

What Americans Think About Teachers Versus What They're Hearing

The recent Gallup/PDK education survey found that 71 percent of Americans surveyed “have trust and confidence in the men and women who are teaching children in public schools.” Although this finding received a fair amount of media attention, it is not at all surprising. Polls have long indicated that teachers are among the most trusted professionals in the U.S., up there with doctors, nurses and firefighters.

(Side note: The teaching profession also ranks among the most prestigious U.S. occupations – in both analyses of survey data as well as in polls [though see here for an argument that occupational prestige scores are obsolete].)

What was rather surprising, on the other hand, was the Gallup/PDK result for the question about what people are hearing about teachers in the news media. Respondents were asked, “Generally speaking, do you hear more good stories or bad stories about teachers in the news media?”

Over two-thirds (68 percent) said they heard more bad stories than good ones. A little over a quarter (28 percent) said the opposite.

Certainty And Good Policymaking Don't Mix

The use of value-added and other growth model estimates in teacher evaluations has probably been the most controversial and oft-discussed issue in education policy over the past few years.

Many people (including a large proportion of teachers) are opposed to using student test scores in their evaluations, as they feel that the measures are not valid or reliable, and that they will incentivize perverse behavior, such as cheating or competition between teachers. Advocates, on the other hand, argue that student performance is a vital part of teachers’ performance evaluations, and that the growth model estimates, while imperfect, represent the best available option.

I am sympathetic to both views. In fact, in my opinion, there are only two unsupportable positions in this debate: Certainty that using these measures in evaluations will work; and certainty that it won’t. Unfortunately, that’s often how the debate has proceeded – two deeply entrenched sides convinced of their absolutist positions, and resolved that any nuance in or compromise of their views will only preclude the success of their efforts. You’re with them or against them. The problem is that it’s the nuance (the details) that determines policy effects.

Let’s be clear about something: I'm not aware of a shred of evidence – not a shred – that the use of growth model estimates in teacher evaluations improves performance of either teachers or students.

Our Annual Testing Data Charade

Every year, around this time, states and districts throughout the nation release their official testing results. Schools are closed and reputations are made or broken by these data. But this annual tradition is, in some places, becoming a charade.

Most states and districts release two types of assessment data every year (by student subgroup, school and grade): Average scores (“scale scores”); and the percent of students who meet the standards to be labeled proficient, advanced, basic and below basic. The latter type – the rates – are of course derived from the scores – that is, they tell us the proportion of students whose scale score was above the minimum necessary to be considered proficient, advanced, etc.
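As a quick illustration of that derivation, here is a minimal sketch (the cut scores and student scale scores are invented): each rate is simply the share of students at or above the relevant cut score.

```python
# Sketch of how category rates are derived from scale scores: each rate is
# the share of students at or above a cut score (all values hypothetical).
cuts = {"basic": 600, "proficient": 650, "advanced": 700}
scale_scores = [588, 612, 641, 652, 660, 675, 688, 701, 715, 730]

for label, cut in cuts.items():
    rate = sum(s >= cut for s in scale_scores) / len(scale_scores)
    print(f"{label} or above: {rate:.0%}")
```

The conversion is one-way: the rates discard information, since a student scoring one point above the cut and one scoring fifty points above it count identically.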

Both types of data are cross-sectional. They don’t follow individual students over time, but rather give a “snapshot” of aggregate performance among two different groups of students (for example, third graders in 2010 compared with third graders in 2011). Calling the change in these results “progress” or “gains” is inaccurate; they are cohort changes, and might just as well be chalked up to differences in the characteristics of the students (especially when changes are small). Even averaged across an entire school or district, there can be huge differences in the groups compared between years – not only is there often considerable student mobility in and out of schools/districts, but every year, a new cohort enters at the lowest tested grade, while a whole other cohort exits at the highest tested grade (except for those retained).

For these reasons, any comparisons between years must be made with extreme caution, but the most common approach (simply comparing proficiency rates between years) is in many respects the worst. A closer look at this year’s New York City results illustrates this perfectly.

Teachers' Preparation Routes And Policy Views

In a previous post, I lamented the scarcity of survey data measuring what teachers think of different education policy reforms. A couple of weeks ago, the National Center for Education Information (NCEI) released the results of their teacher survey (conducted every five years), which provides a useful snapshot of teachers’ opinions toward different policies (albeit not at the level of detail that one might wish).

There are too many interesting results to review in one post, and I encourage you to take a look at the full set yourself. There was, however, one thing about the survey tabulations that I found particularly striking, and that was the high degree to which policy opinions differed between traditionally-certified teachers and those who entered teaching through alternative certification (alt-cert).

In the figure below, I reproduce data from the NCEI report’s battery of questions about whether teachers think different policies would “improve education." Respondents are divided by preparation route – traditional and alternative.

Test-Based Teacher Evaluations Are The Status Quo

We talk a lot about the “status quo” in our education debates. For instance, there is a common argument that the failure to use evidence of “student learning” (in practice, usually defined in terms of test scores) in teacher evaluations represents the “status quo” in this (very important) area.

Now, the implication that “anything is better than the status quo” is a rather massive fallacy in public policy, as it assumes that the benefits of any alternative will outweigh its costs, and that there is no chance the replacement policy will have a negative impact (almost always an unsafe assumption). But, in the case of teacher evaluations, the “status quo” is no longer what people seem to think.

Not counting Puerto Rico and Hawaii, the ten largest school districts in the U.S. are (in order): New York City; Los Angeles; Chicago; Dade County (FL); Clark County (NV); Broward County (FL); Houston; Hillsborough (FL); Orange County (FL); and Palm Beach County (FL). Together, they serve about eight percent of all K-12 public school students in the U.S., and over one in ten of the nation’s low-income children.

Although details vary, every single one of them is either currently using test-based measures of effectiveness in its evaluations, or is in the process of designing/implementing these systems (most due to statewide legislation).