When The Legend Becomes Fact, Print The Fact Sheet

The New Teacher Project (TNTP) just released a “fact sheet” on value-added (VA) analysis. I’m all for efforts to clarify complex topics such as VA, and, without question, there is a great deal of misinformation floating around on this subject, both “pro-” and “anti-.”

The fact sheet presents five sets of “myths and facts.” Three of the “myths” seem somewhat unnecessary: that there’s no research behind VA; that teachers will be evaluated based solely on test scores; and that VA is useless because it’s not perfect. Almost nobody believes or makes these arguments (at least in my experience). But I guess it never hurts to clarify.

In contrast, the other two are very common arguments, but they are not myths. They are serious issues with concrete policy implications. If there are any myths, they're in the "facts" column.

Making (Up) The Grade In Ohio

In a post last week over at Flypaper, the Fordham Institute’s Terry Ryan took a “frank look” at the ratings of the handful of Ohio charter schools that Fordham’s Ohio branch manages. He noted that the Fordham schools didn’t make a particularly strong showing, ranking 24th among the state’s 47 charter authorizers in terms of the aggregate “performance index” among the schools it authorizes. Mr. Ryan took the opportunity to offer a few valid explanations as to why Fordham ranked in the middle of the charter authorizer pack, such as the fact that the state’s “dropout recovery schools,” which accept especially hard-to-serve students who have left public schools, aren’t included (which would likely bump up Fordham’s relative ranking).

Mr. Ryan doth protest too little. His primary argument, which he touches on but does not flesh out, should be that Ohio’s performance index is more a measure of student characteristics than of any defensible concept of school effectiveness. By itself, it reveals relatively little about the “quality” of schools operated by Ohio’s charter authorizers.
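
To see why, it helps to recall how a status-based index of this kind is calculated: each student is assigned points according to his or her achievement level in a single year, and the school’s score is essentially the average. The sketch below illustrates that logic with hypothetical point values and scaling (not Ohio’s actual formula); nothing in the calculation reflects how much students learned during the year, which is why the result tends to track student characteristics.

    # Minimal sketch of a status-based "performance index" (hypothetical point
    # values and scaling, not Ohio's actual formula). Each student contributes
    # points based solely on his or her achievement level in a single year.
    LEVEL_POINTS = {
        "advanced":   1.2,
        "proficient": 1.0,
        "basic":      0.6,
        "limited":    0.3,
    }

    def performance_index(student_levels):
        """Average points per student, scaled so the maximum is 120."""
        points = sum(LEVEL_POINTS[level] for level in student_levels)
        return 100 * points / len(student_levels)

    # The index depends only on where students score, not on how much they grew;
    # two schools whose students learned the same amount can score very differently.
    school_a = ["advanced"] * 60 + ["proficient"] * 30 + ["basic"] * 10
    school_b = ["proficient"] * 30 + ["basic"] * 40 + ["limited"] * 30
    print(performance_index(school_a))  # 108.0 -- higher-scoring student body
    print(performance_index(school_b))  # 63.0  -- lower-scoring student body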

But the limitations of measures like the performance index, which are discussed below (and in the post linked above), have implications far beyond Ohio’s charter authorizers. The primary means by which Ohio assesses school/district performance is the state’s overall “report card” grades, which are composite ratings composed of multiple test-based measures, including the performance index. Unfortunately, these ratings are also not a particularly useful measure of school effectiveness. Not only are the grades unstable between years, but they also rely too heavily on test-based measures, including the index, that fail to account for student characteristics. While any attempt to measure school performance using testing data is subject to imprecision, Ohio’s effort falls short.

The Stability Of Ohio's School Value-Added Ratings And Why It Matters

I have discussed before how most testing data released to the public are cross-sectional, and how comparing them between years entails the comparison of two different groups of students. One way to address these issues is to calculate and release school- and district-level value-added scores.

Value-added estimates are not only longitudinal (i.e., they follow students over time), but the models also go a long way toward accounting for differences in the characteristics of students between schools and districts. Put simply, these models calculate “expectations” for student test score gains based on student (and sometimes school) characteristics, which are then used to gauge whether schools’ students did better or worse than expected.
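
In broad strokes (and glossing over the many variations among actual models), the logic can be sketched as a regression of current scores on prior scores and student characteristics; a school’s value-added is then how far its students’ actual scores land above or below the model’s expectations. The code below is a minimal illustration of that logic, not any state’s actual model; the simulated data, the single poverty indicator, and the use of ordinary least squares are all simplifying assumptions.

    import numpy as np

    # Minimal sketch of the value-added logic (not any state's actual model).
    # Step 1: form an "expectation" for each student's current score by
    # regressing it on the prior score and a hypothetical characteristic.
    rng = np.random.default_rng(0)
    n = 1000
    prior = rng.normal(50, 10, n)            # prior-year test score
    poverty = rng.binomial(1, 0.4, n)        # hypothetical student characteristic
    school = rng.integers(0, 20, n)          # assignment to one of 20 schools
    current = 5 + 0.9 * prior - 3 * poverty + rng.normal(0, 5, n)

    X = np.column_stack([np.ones(n), prior, poverty])
    beta, *_ = np.linalg.lstsq(X, current, rcond=None)
    expected = X @ beta                      # the model's expectation per student

    # Step 2: a school's value-added estimate is the average gap between its
    # students' actual and expected scores.
    residual = current - expected
    value_added = {s: residual[school == s].mean() for s in range(20)}

Real models add more controls, multiple years of data, and statistical shrinkage, but the actual-versus-expected comparison is the core idea.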

Ohio is among the few states that release school- and district-level value-added estimates (though this number will probably increase very soon). These results are also used in high-stakes decisions, as they are a major component of Ohio’s “report card” grades for schools, which can be used to close or sanction specific schools. So, I thought it might be useful to take a look at these data and their stability over the past two years. In other words, what proportion of the schools that receive a given rating in one year will get that same rating the next year?
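
Measuring that kind of stability is straightforward once two years of ratings sit side by side: cross-tabulate year one against year two and compute, for each rating, the share of schools that stayed put. A minimal sketch follows; the column names, rating labels, and toy data are assumptions for illustration, not Ohio’s actual file layout.

    import pandas as pd

    # Hypothetical two-year file of school value-added ratings; names and labels
    # are placeholders, not Ohio's actual data layout.
    ratings = pd.DataFrame({
        "school_id": [1, 2, 3, 4, 5, 6],
        "rating_y1": ["Above", "Met", "Below", "Met", "Above", "Below"],
        "rating_y2": ["Met", "Met", "Below", "Above", "Above", "Met"],
    })

    # Transition table: rows are year-one ratings, columns are year-two ratings.
    transition = pd.crosstab(ratings["rating_y1"], ratings["rating_y2"])

    # For each year-one rating, the share of schools receiving the same rating
    # again in year two -- the stability question posed above.
    same_rating = ratings["rating_y1"] == ratings["rating_y2"]
    stability = same_rating.groupby(ratings["rating_y1"]).mean()
    print(transition)
    print(stability)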

In Ohio, Charter School Expansion By Income, Not Performance

For over a decade, Ohio law has dictated where charter schools can open. Expansion was unlimited in Lucas County (the “pilot district” for charters) and in the “Ohio 8” urban districts (Akron, Canton, Cincinnati, Cleveland, Columbus, Dayton, Toledo, and Youngstown). But, in any given year, charters could open up in any other district that was classified as a “challenged district,” as measured by whether the district received a state “report card” rating of “academic watch” or “academic emergency.” This is a performance-based standard.

Under this system, there was of course very rapid charter proliferation in Lucas County and the “Ohio 8” districts. Only a small number of other districts (around 20-30 per year) “met” the performance-based standard. On the whole, the state’s current charter law was supposed to “open up” districts to charter schools when those districts were not doing well.

Starting next year, the state is adding a fourth criterion: Any district with a “performance index” in the bottom five percent statewide will also be open for charter expansion. Although this may seem like a logical addition, in reality the change offends basic principles of both fairness and educational measurement.

Quality Control, When You Don't Know The Product

Last week, New York State’s Supreme Court issued an important ruling on the state’s teacher evaluations. The aspect of the ruling that got the most attention was the proportion of evaluations – or “weight” – that could be assigned to measures based on state assessments (in the form of estimates from value-added models). Specifically, the Court ruled that these measures can comprise only 20 percent of a teacher’s evaluation, compared with the 40 percent maximum for which Governor Cuomo and others were pushing. Under the decision, the other 20 percent of the test-based portion must consist entirely of alternative measures (e.g., local assessments).

Joe Williams, head of Democrats for Education Reform, one of the flagship organizations of the market-based reform movement, called the ruling “a slap in the face” and “a huge win for the teachers unions.” He characterized the policy impact as follows: “A mediocre teacher evaluation just got even weaker.”

This statement illustrates perfectly the strange reasoning that seems to be driving our debate about evaluations.

Certainty And Good Policymaking Don't Mix

The use of value-added and other types of growth model estimates in teacher evaluations has probably been the most controversial and oft-discussed issue in education policy over the past few years.

Many people (including a large proportion of teachers) are opposed to using student test scores in their evaluations, feeling that the measures are neither valid nor reliable, and that their use will incentivize perverse behavior, such as cheating or competition between teachers. Advocates, on the other hand, argue that student performance is a vital part of teachers’ performance evaluations, and that the growth model estimates, while imperfect, represent the best available option.

I am sympathetic to both views. In fact, in my opinion, there are only two unsupportable positions in this debate: certainty that using these measures in evaluations will work, and certainty that it won’t. Unfortunately, that’s often how the debate has proceeded – two deeply entrenched sides convinced of their absolutist positions, and resolved that any nuance in, or compromise of, their views will only preclude the success of their efforts. You’re with them or against them. The problem is that it is the nuance – the details – that determines policy effects.

Let’s be clear about something: I'm not aware of a shred of evidence – not a shred – that the use of growth model estimates in teacher evaluations improves performance of either teachers or students.

Test-Based Teacher Evaluations Are The Status Quo

We talk a lot about the “status quo” in our education debates. For instance, there is a common argument that the failure to use evidence of “student learning” (in practice, usually defined in terms of test scores) in teacher evaluations represents the “status quo” in this (very important) area.

Now, the implication that “anything is better than the status quo” is a rather massive fallacy in public policy, as it assumes that the benefits of alternatives will outweigh their costs, and that there is no chance the replacement policy will have a negative impact (almost always an unsafe assumption). But, in the case of teacher evaluations, the “status quo” is no longer what people seem to think.

Not counting Puerto Rico and Hawaii, the ten largest school districts in the U.S. are (in order): New York City; Los Angeles; Chicago; Dade County (FL); Clark County (NV); Broward County (FL); Houston; Hillsborough (FL); Orange County (FL); and Palm Beach County (FL). Together, they serve about eight percent of all K-12 public school students in the U.S., and over one in ten of the nation’s low-income children.

Although details vary, every single one of them is either currently using test-based measures of effectiveness in its evaluations or in the process of designing and implementing such systems (most due to statewide legislation).

Attracting The "Best Candidates" To Teaching

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

One of the few issues that all sides in the education debate agree upon is the desirability of attracting “better people” into the teaching profession. While this certainly includes the possibility of using policy to lure career-switchers, most of the focus is on attracting “top” candidates right out of college or graduate school.

The common metric used to identify these “top” candidates is their pre-service (especially college) characteristics and performance. Most commonly, people call for attracting teachers from the “top third” of graduating classes, an outcome frequently cited as the norm in high-performing nations such as Finland. Now, it bears noting that “attracting better people,” like “improving teacher quality,” is a policy goal, not a concrete policy proposal – it tells us what we want, not how to get it. And how to make teaching more enticing for “top” candidates is still very much an open question (as is the equally important question of how to improve the performance of existing teachers).

In order to answer that question, we need to have some idea of whom we’re pursuing – who are these “top” candidates, and what do they want? I sometimes worry that our conception of this group – in terms of the “top third” and similar constructions – doesn’t quite square with the evidence, and that this misconception might actually be misguiding rather than focusing our policy discussions.

Teacher Evaluations: Don't Begin Assembly Until You Have All The Parts

** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

Over the past year or two, roughly 15-20 states have passed, or are considering, legislation calling for the overhaul of teacher evaluation systems. The central feature of most of these laws is a mandate to incorporate measures of student test score growth, in most cases specifying a minimum percentage of a teacher’s total score that must be based on these estimates.

There’s some variation across states, but the percentages are all quite high. For example, Florida and Colorado both require that at least 50 percent of an evaluation must be based on growth measures, while New York mandates a minimum of 40 percent. These laws also vary in terms of other specifics, such as the degree to which the growth measure proportion must be based on state tests (rather than other assessments), how much flexibility districts have in designing their systems, and how teachers in untested grades and subjects are evaluated. But they all share that defining feature of mandating a minimum proportion – or “weight” – that must be attached to a test-based estimate of teacher effects (at least for those teachers in tested grades and subjects).
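
Mechanically, these mandates amount to fixing one coefficient in a weighted composite before the rest of the system is designed. Here is a minimal sketch of such a composite; the component names and the 50/30/20 split are hypothetical, not any particular state’s formula.

    # Sketch of a weighted evaluation composite with a mandated growth weight.
    # Component names and the 50/30/20 split are hypothetical, not a real statute.
    WEIGHTS = {
        "growth_estimate": 0.50,   # the statutorily mandated minimum weight
        "observations":    0.30,
        "other_measures":  0.20,
    }

    def composite_score(components):
        """Weighted sum of component scores, each assumed to be on a 0-100 scale."""
        return sum(WEIGHTS[name] * score for name, score in components.items())

    teacher = {"growth_estimate": 45, "observations": 80, "other_measures": 75}
    print(composite_score(teacher))  # 45*0.5 + 80*0.3 + 75*0.2 = 61.5

The point is simply that the growth weight is locked in by statute before the other components even exist.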

Unfortunately, this is typical of the misguided manner in which many lawmakers (and the advocates advising them) have approached the difficult task of overhauling teacher evaluation systems. For instance, I have discussed previously the failure of most systems to account for random error. The weighting issue is another important example, and it violates a basic rule of designing performance assessment systems: You should exercise extreme caution in pre-deciding the importance of any one component until you know what the other components will be. Put simply, you should have all the parts in front of you before you begin the assembly process.

Value-Added In Teacher Evaluations: Built To Fail

With all the controversy and acrimonious debate surrounding the use of value-added models in teacher evaluation, few seem to be paying much attention to the implementation details in those states and districts that are already moving ahead. This is unfortunate, because most new evaluation systems that use value-added estimates are literally being designed to fail.

Much of the criticism of value-added (VA) focuses on systematic bias, such as that stemming from non-random classroom assignment (also here). But the truth is that most of the imprecision of value-added estimates stems from random error. Months ago, I lamented the fact that most states and districts incorporating value-added estimates into their teacher evaluations were not making any effort to account for this error. Everyone knows that there is a great deal of imprecision in value-added ratings, but few policymakers seem to realize that there are relatively easy ways to mitigate the problem.

This is the height of foolishness. Policy is details. The manner in which one uses value-added estimates is just as important as – perhaps even more important than – the properties of the models themselves. By ignoring error when incorporating these estimates into evaluation systems, policymakers virtually guarantee that most teachers will receive incorrect ratings. Let me explain.
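
To make the error problem concrete: every value-added estimate comes with a standard error, and one of the “relatively easy” mitigation strategies mentioned above is to classify a teacher as above or below average only when the estimate’s confidence interval excludes zero. The sketch below illustrates that approach; the numbers and the two-standard-error rule are illustrative assumptions, not a description of any state’s actual system.

    # Sketch: consult each estimate's standard error before assigning a rating.
    # The estimates, standard errors, and the roughly 95 percent (two standard
    # error) rule are illustrative assumptions, not an actual state's procedure.
    teachers = [
        {"name": "A", "va": 0.15, "se": 0.20},
        {"name": "B", "va": 0.15, "se": 0.05},
        {"name": "C", "va": -0.30, "se": 0.10},
    ]

    for t in teachers:
        lower = t["va"] - 2 * t["se"]
        upper = t["va"] + 2 * t["se"]
        if lower > 0:
            label = "above average"
        elif upper < 0:
            label = "below average"
        else:
            label = "not distinguishable from average"  # interval includes zero
        print(t["name"], label)

Note that teachers A and B have identical estimates, yet only B can be confidently labeled above average; ignoring the error term would treat the two as interchangeable.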