How Not To Improve New Teacher Evaluation Systems

One of the more interesting recurring education stories over the past couple of years has been the release of results from several states’ and districts’ new teacher evaluation systems, including those from New York, Indiana, Minneapolis, Michigan and Florida. In most of these instances, the primary focus has been on the distribution of teachers across ratings categories. Specifically, there seems to be a pattern emerging, in which the vast majority of teachers receive one of the higher ratings, whereas very few receive the lowest ratings.

This has prompted some advocates, and even some high-level officials, essentially to deem as failures the new systems, since their results suggest that the vast majority of teachers are “effective” or better. As I have written before, this issue cuts both ways. On the one hand, the results coming out of some states and districts seem problematic, and these systems may need adjustment. On the other hand, there is a danger here: States may respond by making rash, ill-advised changes in order to achieve “differentiation for the sake of differentiation,” and the changes may end up undermining the credibility and threatening the validity of the systems on which these states have spent so much time and money.

Granted, whether and how to alter new evaluations are difficult decisions, and there is no tried and true playbook. That said, New York Governor Andrew Cuomo’s proposals provide a stunning example of how not to approach these changes. To see why, let’s look at some sound general principles for improving teacher evaluation systems based on the first rounds of results, and how they compare with the New York approach.*

The Status Fallacy: New York State Edition

A recent New York Times story addresses directly New York Governor Andrew Cuomo’s suggestion, in his annual “State of the State” speech, that New York schools are in a state of crisis and "need dramatic reform." The article’s general conclusion is that the “data suggest otherwise.”

There are a bunch of important points raised in the article, but most of the piece is really just discussing student rather than school performance. Simple statistics about how highly students score on tests – i.e., “status measures” – tell you virtually nothing about the effectiveness of the schools those students attend, since, among other reasons, they don’t account for the fact that many students enter the system at low levels. How much students in a school know in a given year is very different from how much they learned over the course of that year.
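
To make the distinction concrete, here is a minimal sketch. The two schools and all scores are invented for illustration, and a simple fall-to-spring gain is used as a rough stand-in for "how much students learned" (actual growth models are considerably more involved).

```python
from statistics import mean

# Hypothetical (fall_score, spring_score) pairs for two made-up schools.
school_a = [(80, 82), (85, 86), (90, 91)]  # students enter high, gain little
school_b = [(40, 52), (45, 58), (50, 61)]  # students enter low, gain a lot


def status(students):
    """Average end-of-year score: how much students know in a given year."""
    return mean(spring for _, spring in students)


def growth(students):
    """Average fall-to-spring gain: a crude proxy for how much they learned."""
    return mean(spring - fall for fall, spring in students)


for name, students in (("A", school_a), ("B", school_b)):
    print(name, round(status(students), 1), round(growth(students), 1))
```

In this made-up case, School A looks far better on the status measure, while School B's students show much larger gains over the year, which is exactly the information a status measure leaves out.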

I (and many others) have written about this “status fallacy” dozens of times (see our resources page), not because I enjoy repeating myself (I don’t), but rather because I am continually amazed just how insidious it is, and how much of an impact it has on education policy and debate in the U.S. And it feels like every time I see signs that things might be changing for the better, there is an incident, such as Governor Cuomo’s speech, that makes me question how much progress there really has been at the highest levels.

The Increasing Academic Ability Of New York Teachers

For many years now, a common talking point in education circles has been that U.S. public school teachers are disproportionately drawn from the “bottom third” of college graduates, and that we have to “attract better candidates” in order to improve the distribution of teacher quality. We discussed the basis for this “bottom third” claim in this post, and I will not repeat the points here, except to summarize that “bottom third” teachers (based on SAT/ACT scores) were indeed somewhat overrepresented nationally, although the magnitudes of such differences vary by cohort and other characteristics.

A very recent article in the journal Educational Researcher addresses this issue head-on (a full working version of the article is available here). It is written by Hamilton Lankford, Susanna Loeb, Andrew McEachin, Luke Miller and James Wyckoff. The authors analyze SAT scores of New York State teachers over a 25-year period (between 1985 and 2009). Their main finding is that these SAT scores, after a long-term decline, improved between 2000 and 2009 among all certified teachers, with the increases being especially large among incoming (new) teachers, and among teachers in high-poverty schools. For example, the proportion of incoming New York teachers whose SAT scores were in the top third has increased by over 10 percentage points, while the proportion with scores in the bottom third has decreased by a similar amount (these figures define “top third” and “bottom third” in terms of New York State public school students who took the SAT between 1979 and 2008).

This is an important study that bears heavily on the current debate over improving the teacher labor supply, and there are a few important points about it worth discussing briefly.

New York Public Schools And Governor Andrew Cuomo: An Essay, In List Form

A point-by-point commentary on Governor Andrew Cuomo’s newly-announced education plan.*

  1. New York State now has most racially and economically segregated schools in the nation, worse than Mississippi.
  2. New York is violating Campaign for Fiscal Equity ruling of highest state court to provide full, equitable funding to high poverty schools.
  3. As a result, New York State owes $6 billion it had promised to school districts with concentrations of poverty.
  4. One would think that a Democratic Governor would be focused on correcting such educational injustices.  But not Andrew Cuomo.
  5. Cuomo is proposing tax credits (aka vouchers) that would divert funds and resources from underfunded public schools to private schools.
  6. Poor and working class kids, students of color who attend public schools would be hurt.
  7. Cuomo is 1st ever Democratic Governor to propose tax credits for private schools, says conservative Checker Finn.
  8. League of Women Voters, Civil Liberties Union, school boards association, superintendents association, teachers union all opposed to Cuomo’s tax credit scheme.
  9. The problem with our public schools, Cuomo says, is teachers.
  10. Teachers think: how convenient that Cuomo, who ignores his responsibilities regarding school segregation and funding, blames us.

The Great Teacher Evaluation Evaluation: New York Edition

A couple of weeks ago, the New York State Education Department (NYSED) released data from the first year of the state's new teacher and principal evaluation system (called the “Annual Professional Performance Review," or APPR). In what has become a familiar pattern, this prompted a wave of criticism from advocates, much of it focused on the proportion of teachers in the state who received the lowest ratings.

To be clear, evaluation systems that produce non-credible results should be examined and improved, and that includes those that put implausible proportions of teachers in the highest and lowest categories. Much of the commentary surrounding this and other issues has been thoughtful and measured. As usual, though, there have been some oversimplified reactions, as exemplified by this piece on the APPR results from Students First NY (SFNY).

SFNY notes what it considers to be the low proportion of teachers rated “ineffective," and points out that there was more differentiation across rating categories for the state growth measure (worth 20 percent of teachers’ final scores), compared with the local “student learning” measure (20 percent) and the classroom observation components (60 percent). Based on this, they conclude that New York’s "state test is the only reliable measure of teacher performance" (they are actually talking about validity, not reliability, but we’ll let that go). Again, this argument is not representative of the commentary surrounding the APPR results, but let’s use it as a springboard for making a few points, most of which are not particularly original. (UPDATE: After publication of this post, SFNY changed the headline of their piece from "the only reliable measure of teacher performance" to "the most reliable measure of teacher performance.")

The Thrill Of Success, The Agony Of Measurement

** Reprinted here in the Washington Post

The recent release of the latest New York State testing results created a little public relations coup for the controversial Success Academies charter chain, which operates over 20 schools in New York City, and is seeking to expand.

Shortly after the release of the data, the New York Post published a laudatory article noting that seven of the Success Academies had overall proficiency rates that were among the highest in the state, and arguing that the schools “live up to their name." The Daily News followed up by publishing an op-ed that compares the Success Academies' combined 94 percent math proficiency rate to the overall city rate of 35 percent, and uses that to argue that the chain should be allowed to expand because its students “aced the test” (this is not really what high proficiency rates mean, but fair enough).

On the one hand, this is great news, and a wonderfully impressive showing by these students. On the other, decidedly less sensational hand, it's also another example of the use of absolute performance indicators (e.g., proficiency rates) as measures of school rather than student performance, despite the fact that they are not particularly useful for the former purpose since, among other reasons, they do not account for where students start out upon entry to the school. I personally don't care whether Success Academy gets good or bad press. I do, however, believe that how one gauges effectiveness, test-based or otherwise, is important, even if one reaches the same conclusion using different measures.

New York City: The Mississippi Of The Twenty-First Century?

Last month saw the publication of a new report, New York State’s Extreme School Segregation, produced by UCLA’s highly regarded Civil Rights Project. It confirmed what New York educators have suspected for some time: our schools are now the most racially segregated schools in the United States. New York’s African-American and Latino students experience “the highest concentration in intensely-segregated public schools (less than 10% white enrollment), the lowest exposure to white students, and the most uneven distribution with white students across schools."

Driving the statewide numbers were schools in New York City, particularly charter schools. Inside New York City, “the vast majority of the charter schools were intensely segregated," the report concluded, significantly worse in this regard “than the record for public schools."

New York State’s Extreme School Segregation provides a window into the intersection of race and class in the city’s schools. As a rule, the city’s racially integrated schools are middle-class schools, in which middle-class white, Asian, African-American and Latino students all experience the educational benefits of racial diversity. By contrast, the city’s racially segregated public schools are generally segregated by both race and class: extreme school segregation involves high concentrations of African-American and Latino students living in poverty.

Data Driving: At The Intersection Of Arbitrary And Meaningful

In his State of the City address last month, New York City Mayor Michael Bloomberg made some brief comments about the upcoming adoption of new assessments aligned with the Common Core State Standards (CCSS), including the following statement:

But no matter where the definition of proficiency is arbitrarily set on the new tests, I expect that our students’ progress will continue outpacing the rest of the State’s[,] the only meaningful measurement of progress we have.

On the surface, this may seem like just a little bit of healthy bravado. But there are a few things about this single sentence that struck me, and it also helps to illustrate an important point about the relationship between standards and testing results.

The Stability And Fairness Of New York City's School Ratings

New York City has just released the new round of results from its school rating system (they're called “progress reports"). It relies considerably more on student growth (60 out of 100 points) than absolute performance (25 points), and there are efforts to partially adjust most of the measures via peer group comparisons.*

All of this indicates that the city's system focuses more on school than on student test-based performance, compared with many other systems around the U.S.

The ratings are high-stakes. Schools receiving low grades – a D or F in any given year, or a C for three consecutive years – enter a review process by which they might be closed. The number of schools meeting these criteria jumped considerably this year.
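
The trigger described above can be written as a small rule. This is only a sketch of the criteria as stated in this post, not the city's actual procedure: the function name is hypothetical, and it assumes that "three consecutive years" means a C in each of the three most recent years.

```python
def enters_review(grades):
    """Return True if a school's grade history meets the review criteria.

    grades: letter grades ordered oldest first, most recent last.
    Assumption: the rule is checked against the latest year's grade and,
    for the C rule, the three most recent years.
    """
    if not grades:
        return False
    # A D or F in the most recent year is enough on its own.
    if grades[-1] in ("D", "F"):
        return True
    # Otherwise, look for a C in each of the last three years.
    return len(grades) >= 3 and all(g == "C" for g in grades[-3:])


print(enters_review(["A", "B", "F"]))
print(enters_review(["C", "C", "C"]))
print(enters_review(["B", "C", "C"]))
```

Under these assumptions, the first two histories meet the criteria and the third does not, since it has only two consecutive C grades.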

There is plenty of controversy to go around about the NYC ratings, much of it pertaining to two important features of the system. They’re worth discussing briefly, as they are also applicable to systems in other states.

Large Political Stones, Methodological Glass Houses

Earlier this summer, the New York City Independent Budget Office (IBO) presented findings from a longitudinal analysis of NYC student performance. That is, they followed a cohort of over 45,000 students from third grade in 2005-06 through 2009-10 (though most results are 2005-06 to 2008-09, since the state changed its definition of proficiency in 2009-10).

The IBO then simply calculated the proportion of these students who improved, declined or stayed the same in terms of the state’s cutpoint-based categories (e.g., Level 1 ["below basic" in NCLB parlance], Level 2 [basic], Level 3 [proficient], Level 4 [advanced]), with additional breakdowns by subgroup and other variables.

The short version of the results is that almost two-thirds of these students remained constant in their performance level over this time period – for instance, students who scored at Level 2 (basic) in third grade in 2006 tended to stay at that level through 2009; students at the “proficient” level remained there, and so on. About 30 percent increased a category over that time (e.g., going from Level 1 to Level 2).
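
The kind of tabulation described here can be sketched in a few lines. The ten students below are invented and do not reproduce the IBO's data; the function name is also hypothetical. Each pair holds a student's performance level at the start and end of the period.

```python
from collections import Counter

# Hypothetical (start_level, end_level) pairs on the state's 1-4 scale.
cohort = [(2, 2), (1, 2), (3, 3), (2, 2), (3, 2),
          (1, 1), (2, 3), (4, 4), (3, 3), (2, 2)]


def transition_shares(pairs):
    """Share of students whose level category improved, declined or stayed the same."""
    counts = Counter(
        "improved" if end > start else "declined" if end < start else "same"
        for start, end in pairs
    )
    total = len(pairs)
    return {label: counts[label] / total
            for label in ("improved", "declined", "same")}


print(transition_shares(cohort))
```

Note that this only registers movement across cutpoints: a student whose score rises or falls while staying inside the same level is counted as "same."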

The response from the NYC Department of Education (NYCDOE) was somewhat remarkable. It takes a minute to explain why, so bear with me.