Growth And Consequences In New York City's School Rating System

In a New York Times article a couple of weeks ago, reporter Michael Winerip discusses New York City’s school report card grades, with a focus on an issue that I have raised many times – the role of absolute performance measures (i.e., how highly students score) in these systems, versus that of growth measures (i.e., whether students are making progress).

Winerip uses the example of two schools – P.S. 30 and P.S. 179 – one of which (P.S. 30) received an A on this year’s report card, while the other (P.S. 179) received an F. These two schools have somewhat similar student populations, at least so far as can be determined using standard education variables, and their students are very roughly comparable in terms of absolute performance (e.g., proficiency rates). The basic reason why one received an A and the other an F is that P.S. 179 received a very low growth score, and growth is heavily weighted in the NYC grade system (representing 60 out of 100 points for elementary and middle schools).
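Just to make the weighting concrete, here is a minimal sketch of how a growth-heavy composite plays out. The split of the remaining 40 points and both schools' subscores are illustrative assumptions, not the city's actual formula or these schools' actual data.

```python
# A growth-heavy composite in the spirit of the NYC progress reports, where
# growth is worth 60 of 100 points. The split of the remaining 40 points and
# the two schools' subscores are assumed for illustration only.

WEIGHTS = {"growth": 60, "performance": 25, "environment": 15}

def composite(subscores):
    """Each subscore is the share (0-1) of that category's possible points earned."""
    return sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS)

# Two hypothetical schools with identical absolute performance but very
# different growth results end up far apart on the 100-point scale.
school_a = {"growth": 0.80, "performance": 0.55, "environment": 0.60}
school_b = {"growth": 0.15, "performance": 0.55, "environment": 0.60}

print(composite(school_a))  # 70.75 - enough for a high grade
print(composite(school_b))  # 31.75 - a failing grade
```

In other words, with growth carrying 60 percent of the weight, two schools that look similar on proficiency can easily land at opposite ends of the grade distribution.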

I have argued previously that unadjusted absolute performance measures such as proficiency rates are inappropriate for test-based assessments of schools' effectiveness, since they tell you almost nothing about the quality of instruction schools provide, and that growth measures are the better option, albeit one with issues of its own (e.g., greater instability) that must be used responsibly. In this sense, the weighting of the NYC grading system is much more defensible than that of most of its counterparts across the nation, at least in my view.

But the system is also an example of how details matter – each school’s growth portion is calculated using an unconventional, somewhat questionable approach, one that is, as yet, difficult to treat with a whole lot of confidence.

The Allure Of Teacher Quality

Those following education know that policy focused on "teacher quality" has been by far the dominant paradigm for improving schools over the past few years. Some (but not nearly all) components of this all-hands-on-deck effort are perplexing to many teachers, and have generated quite a bit of pushback. No matter one’s opinion of this approach, however, what drives it is the tantalizing allure of variation in teacher quality.

Fueled by the ever-increasing availability of detailed test score datasets linking teachers to students, the research literature on teachers’ test-based effectiveness has grown rapidly, in both size and sophistication. Analysis after analysis finds that, all else being equal, the variation in teachers’ estimated effects on students' test growth – the difference between the “top” and “bottom” teachers – is very large. In any given year, some teachers’ students make huge progress, others’ very little. Even if part of this estimated variation is attributable to confounding factors, the discrepancies are still larger than those associated with almost any other measured "input" within the purview of education policy. The underlying assumption here is that “true” teacher quality varies to a degree that is at least somewhat comparable in magnitude to the spread of the test-based estimates.

Perhaps that's the case, but it does not, by itself, help much. The key question is whether and how we can measure teacher performance at the individual level and, more importantly, influence the distribution – that is, raise the ceiling, the middle and/or the floor. The variation hangs out there like a drug to which we’re addicted, but haven’t really figured out how to administer. If there were some way to harness it efficiently, the potential benefits could be considerable. Current education policy is, in large part, an effort to do anything and everything to figure this out. And, as might be expected given the magnitude of the task, progress has been slow.

Value-Added Versus Observations, Part Two: Validity

In a previous post, I compared value-added (VA) and classroom observations in terms of reliability – the degree to which they are free of error and stable over repeated measurements. But even the most reliable measures aren’t useful unless they are valid – that is, unless they’re measuring what we want them to measure.

Arguments over the validity of teacher performance measures, especially value-added, dominate our discourse on evaluations. There are, in my view, three interrelated issues to keep in mind when discussing the validity of VA and observations. The first is definitional – in a research context, validity is less about a measure itself than the inferences one draws from it. The second point might follow from the first: The validity of VA and observations should be assessed in the context of how they’re being used.

Third and finally, given the difficulties in determining whether either measure is valid in and of itself, as well as the fact that so many states and districts are already moving ahead with new systems, the best approach at this point may be to judge validity in terms of whether the evaluations are improving outcomes. And, unfortunately, there is little indication that this is happening in most places.

Value-Added Versus Observations, Part One: Reliability

Although most new teacher evaluations are still in various phases of pre-implementation, it’s safe to say that classroom observations and/or value-added (VA) scores will be the most heavily weighted components of teachers’ final scores, depending on whether teachers are in tested grades and subjects. One gets the general sense that many - perhaps most - teachers strongly prefer the former (observations, especially peer observations) over the latter (VA).

One of the most common arguments against VA is that the scores are error-prone and unstable over time - i.e., that they are unreliable. And it's true that the scores fluctuate between years (also see here), with much of this instability due to measurement error, rather than “real” performance changes. On a related note, different model specifications and different tests can yield very different results for the same teacher/class.
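As a rough illustration of how measurement error alone can produce this kind of instability, consider the small simulation below. The assumption that half of a single year's variance is noise is purely illustrative, not an estimate from any particular dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 10_000

# Assume, for illustration, that half the variance in a single year's
# value-added estimate is stable "true" performance and half is noise
# (sampling error, test measurement error, classroom composition, etc.).
true_effect = rng.normal(0, 1, n_teachers)
year1 = true_effect + rng.normal(0, 1, n_teachers)
year2 = true_effect + rng.normal(0, 1, n_teachers)

# The year-to-year correlation of the estimates is pulled down toward the
# signal share of the variance (about 0.5 here), even though every teacher's
# underlying performance is perfectly stable by construction.
print(np.corrcoef(year1, year2)[0, 1])  # roughly 0.5
```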

These findings are very important, and often too casually dismissed by VA supporters, but the issue of reliability is, to varying degrees, endemic to all performance measurement. Actually, many of the standard reliability-based criticisms of value-added could also be leveled against observations. Since we cannot observe “true” teacher performance, it’s tough to say which is “better” or “worse," despite the certainty with which both “sides” often present their respective cases. And, the fact that both entail some level of measurement error doesn't by itself speak to whether they should be part of evaluations.*

Nevertheless, many states and districts have already made the choice to use both measures, and in these places, the existence of imprecision is less important than how to deal with it. Viewed from this perspective, VA and observations are in many respects more alike than different.

Measuring Journalist Quality

Journalists play an essential role in our society. They are charged with informing the public, a vital function in a representative democracy. Yet, year after year, large pockets of the electorate remain poorly-informed on both foreign and domestic affairs. For a long time, commentators have blamed any number of different culprits for this problem, including poverty, education, increasing work hours and the rapid proliferation of entertainment media.

There is no doubt that these and other factors matter a great deal. Recently, however, there is growing evidence that the factors shaping the degree to which people are informed about current events include not only social and economic conditions, but journalist quality as well. Put simply, better journalists produce better stories, which in turn attract more readers. On the whole, the U.S. journalist community is world class. But there is, as always, a tremendous amount of underlying variation. It’s likely that improving the overall quality of reporters would not only result in higher-quality information, but also bring in more readers. Both outcomes would contribute to a better-informed, more active electorate.

We at the Shanker Institute feel that it is time to start a public conversation about this issue. We have requested and received datasets documenting the story-by-story readership of the websites of U.S. newspapers, large and small. We are using these data in statistical models that we call “Readers-Added Models," or “RAMs."
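Purely for illustration, a bare-bones version of such a model might look like the sketch below. The data, variable names and specification are all invented; journalist fixed effects simply play the role that teacher effects play in a value-added model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_stories, n_journalists = 400, 20

# Hypothetical story-level data: the journalist who wrote each story, a
# couple of controls, and log pageviews built from those pieces plus noise.
journalist = rng.integers(0, n_journalists, n_stories)
journalist_skill = rng.normal(0, 0.5, n_journalists)   # the "RAM" effects
word_count = rng.normal(800, 200, n_stories)
front_page = rng.integers(0, 2, n_stories)
log_pageviews = (6.0 + journalist_skill[journalist]
                 + 0.001 * word_count + 0.8 * front_page
                 + rng.normal(0, 0.7, n_stories))

stories = pd.DataFrame({
    "log_pageviews": log_pageviews,
    "journalist": journalist.astype(str),
    "word_count": word_count,
    "front_page": front_page,
})

# Journalist fixed effects stand in for teacher effects; the controls stand
# in for students' prior achievement and characteristics.
ram = smf.ols("log_pageviews ~ C(journalist) + word_count + front_page",
              data=stories).fit()
print(ram.params.filter(like="journalist").sort_values())
```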

Ohio's New School Rating System: Different Results, Same Flawed Methods

Without question, designing school and district rating systems is a difficult task, and Ohio was somewhat ahead of the curve in attempting to do so (and they're also great about releasing a ton of data every year). As part of its application for ESEA waivers, the state recently announced a newly-designed version of its long-standing system, with the changes slated to go into effect in 2014-15. State officials told reporters that the new scheme is a “more accurate reflection of … true [school and district] quality."

In reality, however, despite its best intentions, what Ohio has done is perpetuate a troubled system by making less-than-substantive changes that seem to serve the primary purpose of giving lower grades to more schools in order for the results to square with preconceptions about the distribution of “true quality." It’s not a better system in terms of measurement - both the new and old schemes consist of mostly the same inappropriate components, and the ratings differentiate schools based largely on student characteristics rather than school performance.

So, whether or not the aggregate results seem more plausible is not particularly important, since the manner in which they're calculated is still deeply flawed. And demonstrating this is very easy.
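For instance, one quick way to make the point is to correlate the ratings (and their components) with a poverty proxy. In the sketch below, the file and column names are hypothetical stand-ins for Ohio's actual data files.

```python
import pandas as pd

# Hypothetical file and column names; Ohio's public files use their own layout.
ratings = pd.read_csv("ohio_district_ratings.csv")    # e.g., performance_index, grade_points
demo = pd.read_csv("ohio_district_demographics.csv")  # e.g., pct_econ_disadvantaged

merged = ratings.merge(demo, on="district_irn")

# A strong negative correlation between the absolute-performance components
# and the share of economically disadvantaged students suggests the ratings
# are largely sorting districts by whom they serve, not how well they teach.
print(merged[["performance_index", "grade_points"]]
      .corrwith(merged["pct_econ_disadvantaged"]))
```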

Learning From Teach For America

There is a small but growing body of evidence about the (usually test-based) effectiveness of teachers from Teach for America (TFA), an extremely selective program that trains and places new teachers in mostly higher needs schools and districts. Rather than review this literature paper-by-paper, which has already been done by others (see here and here), I’ll just give you the super-short summary of the higher-quality analyses, and quickly discuss what I think it means.*

The evidence on TFA teachers focuses mostly on comparing their effect on test score growth vis-à-vis other groups of teachers who entered the profession via traditional certification (or through other alternative routes). This is no easy task, and the findings do vary quite a bit by study, as well as by the group to which TFA corps members are compared (e.g., new or more experienced teachers). One can quibble endlessly over the methodological details (and I’m all for that), and this area is still underdeveloped, but a fair summary of these papers is that TFA teachers are no more or less effective than comparable peers in terms of reading tests, and sometimes but not always more effective in math (the differences, whether positive or negative, tend to be small and/or only surface after 2-3 years). Overall, the evidence thus far suggests that TFA teachers perform comparably, at least in terms of test-based outcomes.

Somewhat in contrast with these findings, TFA has been the subject of both intensive criticism and fawning praise. I don’t want to engage this debate directly, except to say that there has to be some middle ground on which a program that brings talented young people into the field of education is not such a divisive issue. I do, however, want to make a wider point specifically about the evidence on TFA teachers – what it might suggest about the current push to “attract the best people” to the profession.

Beware Of Anecdotes In The Value-Added Debate

A recent New York Times "teacher diary" presents the compelling account of a New York City teacher whose value-added rating was at the 6th percentile in 2009 – one of the lowest scores in the city – and at the 96th percentile the following year, one of the highest. Similar articles - for example, about teachers with errors in their rosters or scores that conflict with their colleagues'/principals' opinions - have been published since the release of the city’s teacher data reports (also see here). These accounts provoke a lot of outrage and disbelief, and that makes sense – they can sound absurd.

Stories like these can be useful as illustrations of larger trends and issues - in this case, of the unfairness of publishing the NYC scores, most of which are based on samples that are too small to provide meaningful information. But, in the debate over using these estimates in actual policy, we need to be careful not to focus too much on anecdotes. For every one NYC teacher whose value-added rank changed by more than 90 points between 2009 and 2010, there are almost 100 teachers whose ranks were within 10 points (and percentile ranks overstate the actual size of all these differences). Moreover, even if the models yielded perfect measures of test-based teacher performance, there would still be many implausible fluctuations between years - those that are unlikely to be "real" change - due to nothing more than random error.*
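A toy simulation helps show how random error alone generates both patterns – a handful of enormous swings alongside a large majority of modest ones. The signal-to-noise split and the number of teachers below are assumptions for illustration, not estimates from the city's reports.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(1)
n = 18_000  # roughly the scale of the NYC reports; an assumed figure

# Assume stable "true" performance plus independent noise in each year.
true_effect = rng.normal(0, 1, n)
year1 = true_effect + rng.normal(0, 1, n)
year2 = true_effect + rng.normal(0, 1, n)

pct1 = 100 * rankdata(year1) / n
pct2 = 100 * rankdata(year2) / n
swing = np.abs(pct1 - pct2)

# Even with no change in any teacher's underlying performance, a few ranks
# swing dramatically while most move comparatively little.
print((swing > 90).mean())   # very rare
print((swing <= 10).mean())  # common
```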

The reliability of value-added estimates, like that of all performance measures (including classroom observations), is an important issue, and is sometimes dismissed by supporters in a cavalier fashion. There are serious concerns here, and no absolute answers. But none of this can be examined or addressed with anecdotes.

Dispatches From The Nexus Of Bad Research And Bad Journalism

In a recent story, the New York Daily News uses the recently-released teacher data reports (TDRs) to “prove” that the city’s charter school teachers are better than their counterparts in regular public schools. The headline announces boldly: New York City charter schools have a higher percentage of better teachers than public schools (it has since been changed to: "Charters outshine public schools").

Taking things even further, within the article itself, the reporters note, “The newly released records indicate charters have higher performing teachers than regular public schools."

So, not only are they equating words like “better” with value-added scores, but they’re also obviously comfortable drawing conclusions about these traits based on the TDR data.

The article is a pretty remarkable display of both poor journalism and poor research. The reporters not only attempted to do something they couldn’t do, but they did it badly to boot. It’s unfortunate to have to waste one’s time addressing this kind of thing, but, no matter your opinion on charter schools, it's a good example of how not to use the data that the Daily News and other newspapers released to the public.

Revisiting The "5-10 Percent Solution"

In a post over a year ago, I discussed the common argument that dismissing the “bottom 5-10 percent" of teachers would increase U.S. test scores to the level of high-performing nations. This argument is based on a calculation by economist Eric Hanushek, which suggests that dismissing the lowest-scoring teachers based on their math value-added scores would, over a period of around ten years (when the first cohort of students would have gone through the schooling system without the “bottom” teachers), increase U.S. math scores dramatically – perhaps to the level of high-performing nations such as Canada or Finland.*
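To see the basic mechanics behind projections like this, here is a stylized sketch of the logic – not Hanushek's actual simulation – in which the size of a teacher-level standard deviation is an assumed, illustrative parameter.

```python
from scipy.stats import norm

# Stylized logic of "dismiss the bottom X percent": if teacher effects are
# roughly normal and the lowest-scoring teachers are replaced with average
# ones, the mean effect rises by the (absolute) mean of the removed tail
# times the share removed. The 0.15 student-SD value for one teacher-level
# SD is an assumed parameter, not Hanushek's actual figure.
share_removed = 0.10
teacher_sd_in_student_sd = 0.15

cutoff = norm.ppf(share_removed)                # z-score at the 10th percentile
tail_mean = -norm.pdf(cutoff) / share_removed   # mean z within the bottom 10%
gain_per_year = share_removed * (-tail_mean) * teacher_sd_in_student_sd

print(round(gain_per_year, 3))  # ~0.026 student SD per year, before accounting
                                # for measurement error, turnover, etc.
```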

This argument is, to say the least, controversial, and it evokes the full spectrum of reactions. In my opinion, it's best seen as a policy-relevant illustration of the wide variation in test-based teacher effects, one that might suggest a potential course of action but can't really tell us how it would turn out in practice. To highlight this point, I want to take a look at one issue mentioned in that previous post – that is, how the instability of value-added scores over time (which Hanushek’s simulation doesn’t address directly) might affect the projected benefits of this type of intervention, and how this in turn might temper one's view of those huge projections.

One (admittedly crude) way to do this is to use the newly-released New York City value-added data, and look at 2010 outcomes for the “bottom 10 percent” of math teachers in 2009.
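That check might look something like the sketch below; the file name and column names are hypothetical stand-ins for however the public data files are actually laid out.

```python
import pandas as pd

# Hypothetical file and column names; the released NYC teacher data reports
# use their own layout, so treat this purely as a sketch of the check.
va = pd.read_csv("nyc_math_value_added.csv")  # teacher_id, year, percentile

wide = va.pivot(index="teacher_id", columns="year", values="percentile")
wide = wide.dropna(subset=[2009, 2010])   # teachers with scores in both years

bottom_2009 = wide[wide[2009] <= 10]      # the 2009 "bottom 10 percent"

# Where did those teachers land a year later? If the scores were highly
# stable, they would cluster near the bottom of the 2010 distribution too.
print(bottom_2009[2010].describe())
print((bottom_2009[2010] <= 10).mean())   # share still in the bottom decile
```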