The Structural Curve In Indiana's New School Grading System

The State of Indiana has received a great deal of attention for its education reform efforts, and it recently announced the details, as well as the first round of results, of its new “A-F” school grading system. As in many other states, the grades for elementary and middle schools are based entirely on math and reading test scores.

It is probably the most rudimentary scoring system I’ve seen yet – almost painfully so. Such simplicity carries both potential advantages (easier for stakeholders to understand) and disadvantages (school performance is complex and not always amenable to rudimentary calculation).

In addition, unlike the other systems that I have reviewed here, this one does not rely on explicit “weights” (i.e., specific percentages are not assigned to each component). Rather, there’s a rubric that combines absolute performance (passage rates) with proportions drawn from growth models (a few other states use similar schemes, but I haven’t reviewed any of them).

On the whole, though, it’s a somewhat simplistic variation on the general approach most other states are taking – but with a few twists.
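To give a concrete sense of what a passage-rate-plus-growth rubric looks like, here is a minimal sketch in code. The letter-grade bands, growth thresholds, and function names are invented purely for illustration; they are not Indiana’s actual rules.

```python
# A hypothetical sketch of a rubric-style grading scheme that combines an
# absolute passage rate with growth-based adjustments. All cutoffs and bonus
# rules below are assumptions for illustration, not Indiana's actual rubric.

def base_grade(passage_rate):
    """Assign a letter grade from the percent of students passing (hypothetical bands)."""
    if passage_rate >= 90:
        return "A"
    elif passage_rate >= 80:
        return "B"
    elif passage_rate >= 70:
        return "C"
    elif passage_rate >= 60:
        return "D"
    return "F"

def adjust_for_growth(grade, pct_high_growth, pct_low_growth):
    """Bump the base grade up or down one letter based on the share of students
    showing high or low growth (again, purely illustrative thresholds)."""
    order = ["F", "D", "C", "B", "A"]
    i = order.index(grade)
    if pct_high_growth >= 40 and i < len(order) - 1:
        i += 1  # many students showing high growth: raise the grade one letter
    elif pct_low_growth >= 40 and i > 0:
        i -= 1  # many students showing low growth: lower the grade one letter
    return order[i]

# Example: a school with a 72% passage rate, 45% high growth, 10% low growth
print(adjust_for_growth(base_grade(72), 45, 10))  # "B" under these assumed rules
```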

The Data-Driven Education Movement


In the education community, many proclaim themselves to be “completely data-driven.” Data-Driven Decision Making (DDDM) has been a buzz phrase for a while now, and it continues to be a badge many wear with pride. And yet, every time I hear it, I cringe.

Let me explain. During my first year in graduate school, I was taught that excessive attention to quantitative data impedes – rather than aids – in-depth understanding of social phenomena. In other words, explanations cannot simply be cranked out of statistical analyses without a precursor theory of some kind; the attempt to do so – a.k.a. “variable sociology” – constitutes a major obstacle to the advancement of knowledge.

I am no longer in graduate school, so part of me says: okay, I know what data-driven means in education. But then, at times, I still think: no, really – what does “data-driven” actually mean in this context?

Which State Has "The Best Schools?"


I’ve written many times about how absolute performance levels – how highly students score – are not by themselves valid indicators of school quality, since, most basically, they don’t account for the fact that students enter the schooling system at different levels. One of the most blatant (and common) manifestations of this mistake is when people use NAEP results to determine the quality of a state's schools.

For instance, you’ll often hear that Massachusetts has the “best” schools in the U.S. and Mississippi the “worst,” with both claims based solely on average scores on the NAEP (though, technically, Massachusetts public school students' scores are statistically tied with at least one other state on two of the four main NAEP exams, while Mississippi's rankings vary a bit by grade/subject, and its scores are also not statistically different from several other states').

But we all know that these two states are very different in terms of basic characteristics such as income, parental education, etc. Any assessment of educational quality, whether at the state or local level, is necessarily complicated, and ignoring differences between students precludes any meaningful comparisons of school effectiveness. Schooling quality is important, but it cannot be assessed by sorting and ranking raw test scores in a spreadsheet.
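A toy calculation makes the point. The numbers below are entirely invented, but they show how two hypothetical states can produce identical growth while a raw-score ranking simply reflects where their students started.

```python
# A toy illustration (hypothetical numbers) of why ranking states by raw
# average scores says little about school effectiveness: "State A" starts far
# ahead of "State B", but both produce exactly the same average growth.

states = {
    "State A": {"entering_score": 240, "current_score": 252},
    "State B": {"entering_score": 215, "current_score": 227},
}

# Ranked by raw current score, State A looks "better"...
by_level = sorted(states, key=lambda s: states[s]["current_score"], reverse=True)

# ...but both states added the same 12 points, so the level-based ranking
# reflects where students started, not what schools contributed.
by_growth = sorted(
    states,
    key=lambda s: states[s]["current_score"] - states[s]["entering_score"],
    reverse=True,
)

print("By level: ", by_level)   # ['State A', 'State B']
print("By growth:", by_growth)  # both gained 12 points -- a tie
```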

The Stability And Fairness Of New York City's School Ratings

New York City has just released the new round of results from its school rating system (they’re called “progress reports”). The system relies considerably more on student growth (60 out of 100 points) than on absolute performance (25 points), and there are efforts to partially adjust most of the measures via peer group comparisons.*

All of this means that, compared with many other systems around the U.S., the city’s ratings are geared more toward measuring the test-based performance of schools, rather than that of the students they serve.
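For readers who like to see the arithmetic, here is a bare-bones sketch of a point-based composite along the lines described above. The way each component is converted to points (percentiles within a peer group) is an assumption for illustration, not the city’s actual methodology.

```python
# A simplified sketch of a point-based composite: 60 points for growth, 25 for
# absolute performance, and the remaining 15 for other measures. The scaling
# of each component to points is assumed for illustration only.

def progress_report_score(growth_pctile, performance_pctile, other_pctile):
    """Each input is a 0-100 percentile relative to a school's peer group;
    returns a 0-100 composite under the assumed point allocation."""
    return (
        0.60 * growth_pctile          # student growth: 60 of 100 points
        + 0.25 * performance_pctile   # absolute performance: 25 points
        + 0.15 * other_pctile         # remaining measures: 15 points
    )

# Example: strong growth, middling absolute performance
print(round(progress_report_score(growth_pctile=80, performance_pctile=45, other_pctile=60), 1))
```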

The ratings are high-stakes. Schools receiving low grades – a D or F in any given year, or a C for three consecutive years – enter a review process by which they might be closed. The number of schools meeting these criteria jumped considerably this year.

There is plenty of controversy surrounding the NYC ratings, much of it pertaining to two important features of the system. Both are worth discussing briefly, as they also apply to systems in other states.

Our Not-So-College-Ready Annual Discussion Of SAT Results

Every year, around this time, the College Board publicizes its SAT results, and hundreds of newspapers, blogs, and television stations run stories suggesting that trends in the aggregate scores are, by themselves, a meaningful indicator of U.S. school quality. They’re not.

Everyone knows that the vast majority of the students who take the SAT in a given year didn’t take the test the previous year – i.e., the data are cross-sectional. Everyone also knows that participation is voluntary (as it is for the ACT), that the number of students taking the test has been increasing for many years, and that current test-takers differ from their predecessors on measurable characteristics. That means we cannot use the raw results to draw strong conclusions about changes in the performance of the typical student, and certainly not about the effectiveness of schools, whether nationally or in a given state or district. This is common sense.
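A quick bit of made-up arithmetic shows why. In the example below, neither group of test-takers scores any lower in the second year, yet the aggregate average drops simply because the pool expands toward lower-scoring groups.

```python
# A toy example (invented numbers) of how a changing test-taking pool can pull
# down the aggregate average even when no group's performance declines.

def mean_score(groups):
    """groups: list of (number_of_test_takers, average_score) tuples."""
    total_students = sum(n for n, _ in groups)
    return sum(n * avg for n, avg in groups) / total_students

year_1 = [(800_000, 1550), (200_000, 1400)]   # smaller, higher-scoring pool
year_2 = [(800_000, 1550), (500_000, 1400)]   # same group averages, larger pool

print(round(mean_score(year_1)))  # 1520
print(round(mean_score(year_2)))  # ~1492 -- lower, though neither group changed
```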

Unfortunately, the College Board plays a role in stoking the apparent confusion – or, at least, it could do much more to prevent it. Consider the headline of this year’s press release:

Does It Matter How We Measure Schools' Test-Based Performance?

In education policy debates, we like the “big picture.” We love to say things like “hold schools accountable” and “set high expectations.” Far less frequent are substantive discussions about the details of accountability systems, yet it’s these details that make or break policy. The technical specs just aren’t that sexy. But even the best ideas with the sexiest catchphrases won’t improve things a bit unless they’re designed and executed well.

In this vein, I want to recommend a very interesting CALDER working paper by Mark Ehlert, Cory Koedel, Eric Parsons and Michael Podgursky. The paper takes a quick look at one of these extremely important yet frequently under-discussed details in school (and teacher) accountability systems: the choice of growth model.

When value-added or other growth models come up in our debates, they’re usually discussed en masse, as if they’re all the same. They’re not. It's well-known (though perhaps overstated) that different models can, in many cases, lead to different conclusions for the same school or teacher. This paper, which focuses on school-level models but might easily be extended to teacher evaluations as well, helps illustrate this point in a policy-relevant manner.
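Here is a stylized example, using synthetic data, of the general phenomenon (it is not a reproduction of the models in the paper): a simple average-gain model and a model that adjusts for prior scores can rank the same schools quite differently.

```python
# A stylized comparison of two growth models on the same synthetic data:
# (1) average raw gain per school, and (2) a model that adjusts current scores
# for prior scores via a pooled regression, then averages each school's
# residuals. School names, sample sizes, and the data-generating process are
# all invented.

import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical schools with different entering (prior) score levels
schools = {"School X": 220, "School Y": 250, "School Z": 280}
records = []
for school, mean_prior in schools.items():
    prior = rng.normal(mean_prior, 15, size=200)
    # Built-in regression to the mean: higher-scoring students gain slightly less
    current = prior + 12 - 0.05 * (prior - 250) + rng.normal(0, 8, size=200)
    records.append((school, prior, current))

all_prior = np.concatenate([p for _, p, _ in records])
all_current = np.concatenate([c for _, _, c in records])

# Model 1: average raw gain -- under this data-generating process it
# systematically favors the school with the lowest-scoring intake
raw_gains = {s: float(np.mean(c - p)) for s, p, c in records}

# Model 2: regress current on prior scores, then average residuals by school --
# here it rates the three schools as roughly equivalent
slope, intercept = np.polyfit(all_prior, all_current, 1)
adjusted = {s: float(np.mean(c - (intercept + slope * p))) for s, p, c in records}

print("Ranking by raw gains:      ", sorted(raw_gains, key=raw_gains.get, reverse=True))
print("Ranking by adjusted growth:", sorted(adjusted, key=adjusted.get, reverse=True))
```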

Who's Afraid of Virginia's Proficiency Targets?

The accountability provisions in Virginia’s original application for “ESEA flexibility” (or "waiver") have received a great deal of criticism (see here, here, here and here). Most of this criticism focused on the Commonwealth's expectation levels, as described in “annual measurable objectives” (AMOs) – i.e., the statewide proficiency rates that its students are expected to achieve at the completion of each of the next five years, with separate targets established for subgroups such as those defined by race (black, Hispanic, Asian, white), income (subsidized lunch eligibility), limited English proficiency (LEP), and special education.

Last week, in response to the criticism, Virginia agreed to amend its application, though it’s not yet clear exactly how the new rates will be calculated (only that lower-performing subgroups will be expected to make faster progress).
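For a sense of how such targets are typically constructed, here is a sketch based on one common approach among waiver states – cutting each subgroup’s non-proficient share in half over six years, in equal annual steps. This is purely illustrative; it is not Virginia’s announced method.

```python
# A hypothetical AMO calculation: halve the non-proficient share over six
# years in equal annual increments. Lower-performing subgroups therefore face
# larger annual percentage-point gains. Illustrative only.

def amo_targets(current_rate, years=6):
    """Annual proficiency-rate targets that eliminate half of the
    non-proficient share over `years`, in equal steps (assumed rule)."""
    annual_step = (100 - current_rate) / 2 / years
    return [round(current_rate + annual_step * (y + 1), 1) for y in range(years)]

# Two hypothetical subgroups: one starting at 80% proficient, one at 50%
print(amo_targets(80))  # smaller annual steps
print(amo_targets(50))  # larger annual steps -- faster expected progress
```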

In the meantime, I think it’s useful to review a few of the main criticisms that have been made over the past week or two and what they mean. The actual table containing the AMOs is pasted below (for math only; reading AMOs will be released after this year, since there’s a new test).

Five Recommendations For Reporting On (Or Just Interpreting) State Test Scores

In my experience, education reporters are smart, knowledgeable, and attentive to detail. That said, the bulk of the stories about testing data – in big cities and suburbs, this year and in previous years – could be better.

Listen, I know it’s unreasonable to expect every reporter and editor to address every little detail when they try to write accessible copy about complicated issues, such as test data interpretation. Moreover, I fully acknowledge that some of the errors to which I object – such as calling proficiency rates “scores” – are well within tolerable limits, and that news stories need not interpret data in the same way as researchers. Nevertheless, no matter what you think about the role of test scores in our public discourse, it is in everyone’s interest that the coverage of them be reliable. And there are a few mostly easy suggestions that I think would help a great deal.

Below are five such recommendations. They are of course not meant to be an exhaustive list, but rather a quick compilation of points, all of which I’ve discussed in previous posts, and all of which might also be useful to non-journalists.

Large Political Stones, Methodological Glass Houses

Earlier this summer, the New York City Independent Budget Office (IBO) presented findings from a longitudinal analysis of NYC student performance. That is, they followed a cohort of over 45,000 students from third grade in 2005-06 through 2009-10 (though most results are 2005-06 to 2008-09, since the state changed its definition of proficiency in 2009-10).

The IBO then simply calculated the proportion of these students who improved, declined or stayed the same in terms of the state’s cutpoint-based categories (e.g., Level 1 ["below basic" in NCLB parlance], Level 2 [basic], Level 3 [proficient], Level 4 [advanced]), with additional breakdowns by subgroup and other variables.

The short version of the results is that almost two-thirds of these students remained in the same performance level over this time period – for instance, students who scored at Level 2 (basic) in third grade in 2006 tended to stay at that level through 2009, students at the “proficient” level remained there, and so on. About 30 percent moved up a category over that time (e.g., from Level 1 to Level 2).
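The underlying tabulation is straightforward. Here is a minimal sketch using hypothetical data (the IBO’s actual files and breakdowns are, of course, far richer).

```python
# A minimal sketch of the tabulation described above: given each student's
# performance level in the first and last year, compute the share who moved
# up, moved down, or stayed in the same category. Column names and data are
# hypothetical.

import pandas as pd

cohort = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5, 6],
    "level_2006": [2, 2, 1, 3, 3, 4],   # state performance level in grade 3
    "level_2009": [2, 3, 2, 3, 2, 4],   # performance level three years later
})

change = cohort["level_2009"] - cohort["level_2006"]
summary = pd.Series({
    "moved up":   (change > 0).mean(),
    "stayed":     (change == 0).mean(),
    "moved down": (change < 0).mean(),
})
print(summary.round(2))

# A full transition matrix shows *which* categories students moved between
print(pd.crosstab(cohort["level_2006"], cohort["level_2009"], normalize="index").round(2))
```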

The response from the NYC Department of Education (NYCDOE) was somewhat remarkable. It takes a minute to explain why, so bear with me.

The Louisiana Voucher Accountability Sweepstakes

The situation with vouchers in Louisiana is obviously quite complicated, and there are strong opinions on both sides of the issue, but I’d like to comment quickly on the new “accountability” provision. It's a great example of how, too often, people focus on the concept of accountability and ignore how it is actually implemented in policy.

Quick and dirty background: Louisiana will be allowing students to receive vouchers (tuition to attend private schools) if their public schools are sufficiently low-performing, according to their “school performance score” (SPS). As discussed here, the SPS is based primarily on how highly students score, rather than on whether they’re making progress, and thus tells you relatively little about the actual effectiveness of schools per se. For instance, vouchers will go mostly to students in schools serving larger proportions of disadvantaged students, even if many of those schools are producing large gains (though such progress cannot be assessed adequately using year-to-year changes in the SPS, which, due in part to its reliance on cross-sectional proficiency rates, are extremely volatile).
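To see why cross-sectional rates bounce around, consider a small simulation in which a school’s underlying effectiveness never changes, yet its proficiency rate swings by several points from year to year simply because each cohort of students is different. All numbers are invented.

```python
# A small simulation of why year-to-year changes in a measure built largely
# from cross-sectional proficiency rates are noisy: the school's underlying
# effectiveness is held constant, yet the rate varies across cohorts.
# Hypothetical parameters throughout.

import random

random.seed(1)

def cohort_proficiency_rate(n_students=60, true_rate=0.55):
    """Percent proficient for one cohort drawn from a school whose 'true'
    proficiency probability never changes."""
    passed = sum(random.random() < true_rate for _ in range(n_students))
    return 100 * passed / n_students

rates = [round(cohort_proficiency_rate(), 1) for _ in range(5)]
changes = [round(b - a, 1) for a, b in zip(rates, rates[1:])]

print("Annual rates:        ", rates)
print("Year-to-year changes:", changes)  # swings of several points with no real change
```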

Now, here's where things get really messy: In an attempt to demonstrate that they are holding the voucher-accepting private schools accountable, Louisiana officials have decided that they will make these private schools ineligible for the program if their performance is too low (after at least two years of participation in the program). That might be a good idea if the state measured school performance in a defensible manner. It doesn't.