The Sensitive Task Of Sorting Value-Added Scores

The New Teacher Project’s (TNTP) recent report on teacher retention, called “The Irreplaceables,” garnered quite a bit of media attention. In a discussion of this report, I argued, among other things, that the label “irreplaceable” greatly exaggerates what the underlying definitions actually capture (definitions which, by the way, varied among the four districts included in the analysis). In general, TNTP’s definitions are better described as “probably above average in at least one subject” (and this distinction matters for how one interprets the results).

I’d like to elaborate a bit on this issue – that is, how to categorize teachers’ growth model estimates, which one might do, for example, when incorporating them into a final evaluation score. This choice, which receives virtually no discussion in TNTP’s report, is always a judgment call to some degree, but it’s an important one for accountability policies. Many states and districts are drawing those very lines between teachers (and schools), and attaching consequences and rewards to the outcomes.

Let’s take a very quick look, using the publicly released 2010 “teacher data reports” from New York City (there are details about the data in the first footnote*). Keep in mind that these are just value-added estimates, and are thus, at best, incomplete measures of the performance of teachers (importantly, however, the discussion below is not specific to growth models; it can apply to many different types of performance measures).

Cheating, Honestly

Whatever one thinks of the heavy reliance on standardized tests in U.S. public education, one of the things on which there is wide agreement is that cheating must be prevented, and investigated when there’s evidence it might have occurred.

For anyone familiar with test-based accountability, recent cheating scandals in Atlanta, Washington, D.C., Philadelphia and elsewhere are unlikely to have been surprising. There has always been cheating, and it can take many forms, ranging from explicit answer-changing to subtle coaching on test day. One cannot say with any certainty how widespread cheating is, but there is every reason to believe that high-stakes testing increases the likelihood that it will happen. The first step toward addressing that problem is to recognize it.

A district, state or nation that is unable or unwilling to acknowledge the possibility of cheating, do everything possible to prevent it, and face up to it when evidence suggests it has occurred, is ill-equipped to rely on test-based accountability policies. 

Selective Schools In New Orleans

Charter schools in New Orleans, LA (NOLA) receive a great deal of attention, in no small part because they serve a larger proportion of public school students than do charters in any other major U.S. city. Less discussed, however, is the prevalence of NOLA’s “selective schools” (elsewhere, they are sometimes called “exam schools”). These schools maintain criteria for admission and/or retention, based on academic and other qualifications (often grades and/or standardized test scores).

At least six of NOLA’s almost 90 public schools are selective – one high school, four (P)K-8 schools and one serving grades K-12. When you add up their total enrollment, around one in eight NOLA students attends one of these schools.*

Although I couldn’t find recent summary data on the prevalence of selective schools in urban districts around the U.S., this is almost certainly an extremely high proportion (for instance, selective schools in New York City and Chicago, which are mostly secondary schools, serve only a tiny fraction of students in those cities).

Creating A Valid Process For Using Teacher Value-Added Measures

** Reprinted here in the Washington Post

Our guest author today is Douglas N. Harris, associate professor of economics and University Endowed Chair in Public Education at Tulane University in New Orleans. His latest book, Value-Added Measures in Education, provides an excellent, accessible review of the technical and practical issues surrounding these models. 

Now that the election is over, the Obama Administration and policymakers nationally can return to governing. Of all the education-related decisions that have to be made, the future of teacher evaluation has to be front and center.

In particular, how should “value-added” measures be used in teacher evaluation? President Obama’s Race to the Top initiative expanded the use of these measures, which attempt to identify how much each teacher contributes to student test scores. In doing so, the initiative embraced and expanded the controversial reliance on standardized tests that started under President Bush’s No Child Left Behind.

In many respects, the Race was well designed. It addressed an important problem – the vast majority of teachers report receiving little high-quality feedback on their instruction. As a competitive grants program, it was voluntary for states to participate (though involuntary for many districts within those states). The Administration also smartly embraced the idea of multiple measures of teacher performance.

But they also made one decision that I think was a mistake. They encouraged – or required, depending on your vantage point – states to lump value-added or other growth model estimates together with other measures. The raging debate since then has been over what percentage of teachers’ final ratings should be given to value-added versus the other measures. I believe there is a better way to approach this issue, one that focuses on teacher evaluations not as a measure, but rather as a process.
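
As a purely hypothetical illustration of what that percentage debate is about: a final rating under these systems is typically a weighted average of component scores, and the choice of weights can move the result substantially. The component names, scores and weights below are invented for the example; they are not any state’s actual formula:

```python
# Hypothetical sketch of a weighted composite evaluation rating.
# Component names, scores and weights are invented for illustration.

def final_rating(components, weights):
    """Combine component scores (here on a 1-4 scale) into one rating."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(components[name] * w for name, w in weights.items())

scores = {"value_added": 2.0, "observations": 3.5, "surveys": 3.0}

# The same teacher looks quite different as the value-added weight grows,
# with the remaining weight split evenly between the other two measures.
for va_weight in (0.20, 0.35, 0.50):
    rest = (1.0 - va_weight) / 2
    weights = {"value_added": va_weight, "observations": rest, "surveys": rest}
    print(f"value-added weight {va_weight:.0%}: rating "
          f"{final_rating(scores, weights):.2f}")
```

The same hypothetical teacher rates 3.00 with a 20 percent value-added weight but 2.62 with a 50 percent weight – which is precisely why the percentage question has been so contentious.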

Describing, Explaining And Affecting Teacher Retention In D.C.

The New Teacher Project (TNTP) has released a new report on teacher retention in D.C. Public Schools (DCPS). It is a spinoff of their “The Irreplaceables” report, which was released a few months ago, and which is discussed in this post. The four (unnamed) districts from that report are also used in this one, and their results are compared with those from DCPS.

I want to look quickly at this new supplemental analysis, not to rehash the issues I raised about “The Irreplaceables,” but rather because of DCPS’s potential importance as a field test site for a host of policy reform ideas. Indeed, most of the core market-based reform policies have been in place in D.C. for several years, including teacher evaluations in which test-based measures are the dominant component, automatic dismissals based on those ratings, large performance bonuses, mutual consent for excessed teachers and a huge charter sector. There are many people itching to render a sweeping verdict, positive or negative, on these reforms, most often based on pre-existing beliefs rather than solid evidence.

Although I will take issue with a couple of the conclusions offered in this report, I'm not going to review it systematically. I think research on retention is important, and it’s difficult to produce reports with original analysis, while very easy to pick them apart. Instead, I’m going to list a couple of findings in the report that I think are worth examining, mostly because they speak to larger issues.

Annual Measurable Objections

As states continue to finalize their applications for ESEA/NCLB “flexibility” (or “waivers”), controversy has arisen in some places over how these plans set proficiency goals, both overall and for demographic subgroups (see our previous post about the situation in Virginia).

One of the underlying rationales for allowing states to establish new targets (called “annual measurable objectives,” or AMOs) is that the “100 percent” proficiency goals of NCLB were unrealistic. Accordingly, some (but not all) of the new plans have set 2017-18 absolute proficiency goals that are considerably below 100 percent, and/or lower for some subgroups relative to others. This shift has generated pushback from advocates, most recently in Florida, who believe that lowering state targets is tantamount to encouraging or accepting failure.
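
For context, one common option under the federal waiver guidance was to set AMOs that cut the percentage of non-proficient students in half over six years, in equal annual increments. A minimal sketch of that arithmetic, using hypothetical baseline rates, shows why subgroup targets end up differing:

```python
# Hypothetical sketch of "half the gap in six years" AMO trajectories,
# one common option under the ESEA waiver guidance. The baseline
# proficiency rates below are invented for illustration.

def amo_targets(baseline, years=6):
    """Annual targets that halve the non-proficient share in `years` years."""
    goal = baseline + (100.0 - baseline) / 2  # endpoint of the trajectory
    step = (goal - baseline) / years          # equal annual increments
    return [round(baseline + step * y, 1) for y in range(1, years + 1)]

for name, baseline in (("subgroup A", 80.0), ("subgroup B", 40.0)):
    print(name, amo_targets(baseline))
```

A subgroup starting at 80 percent proficient ends up with a 2017-18 target of 90 percent, while one starting at 40 percent ends up with a target of 70 percent – the same formula, applied evenhandedly, produces the differing endpoint targets that have drawn so much criticism.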

I acknowledge the central role of goals in any accountability system, but I would like to humbly suggest that this controversy, over where and how states set proficiency targets for 2017-18, may be misguided. There are four reasons why I think this is the case (and one silver lining if it is).

NCLB And The Institutionalization Of Data Interpretation

It is a gross understatement to say that the No Child Left Behind (NCLB) law was, is – and will continue to be – a controversial piece of legislation. Although opinion tends toward the negative, there are certain features, such as a focus on student subgroup data, that many people support. And it’s difficult to make generalizations about whether the law’s impact on U.S. public education was “good” or “bad” by some absolute standard.

The one thing I would say about NCLB is that it has helped to institutionalize the improper interpretation of testing data.

Most of the attention to the methodological shortcomings of the law focuses on “adequate yearly progress” (AYP) – the crude requirement that all schools must make “adequate progress” toward the goal of 100 percent proficiency by 2014. And AYP is indeed an inept measure. But the problems are actually much deeper than AYP.

Rather, it’s the underlying methods and assumptions of NCLB (including AYP) that have had a persistent, negative impact on the way we interpret testing data.

Assessing Ourselves To Death

** Reprinted here in the Washington Post

I have two points to make. The first is something that I think everyone knows: Educational outcomes, such as graduation and test scores, are signals of or proxies for the traits that lead to success in life, not the cause of that success.

For example, it is well-documented that high school graduates earn more, on average, than non-graduates. Thus, one often hears arguments that increasing graduation rates will drastically improve students’ future prospects, and the performance of the economy overall. Well, not exactly.

The piece of paper, of course, only goes so far. Rather, the benefits of graduation arise because graduates are more likely to possess the skills – including the critical non-cognitive sort – that make people good employees (and, on a highly related note, because employers know that, and use credentials to screen applicants).

We could very easily increase the graduation rate by easing requirements, but this wouldn’t do much to help kids advance in the labor market. They might get a few more calls for interviews, but over the long haul, they’d still be at a tremendous disadvantage if they lacked the required skills and work habits.

Are Charter Caps Keeping Great Schools From Opening?

** Reprinted here in the Washington Post

Charter school “caps” are state-imposed limits on the size or growth of charter sectors. Currently, around 25 states set caps on schools or enrollment, with wide variation in terms of specifics: Some states simply set a cap on the number of schools (or charters in force); others limit annual growth; and still others specify caps on both growth and size (there are also a few places that cap proportional spending, coverage by individual operators and other dimensions).

A great many charter school advocates strongly support lifting these restrictions, arguing that caps prevent the opening of high-quality schools. This is, of course, an oversimplification at best, as lifting caps could just as easily lead to the proliferation of unsuccessful charters. If the charter school experiment has taught us anything, it’s that these schools are anything but sure bets – and even the tiny handful of highly successful models, such as KIPP, are no exception.*

Overall, the only direct impact of charter caps is to limit the potential size or growth of a state’s charter school sector. Assessing their implications for quality, on the other hand, is complicated, and there is every reason to believe that the impact of caps, and thus the basis of arguments for lifting them, varies by context – including the size and quality of states’ current sectors, as well as the criteria by which low-performing charters are closed and new ones are authorized. 

New Teacher Evaluations Are A Long-Term Investment, Not Test Score Arbitrage

One of the most important things in education policy to keep an eye on is the first round of changes to new teacher evaluation systems. Given all the moving parts – and the lack of evidence on how these systems should be designed and what their impact will be – course adjustments along the way are not just inevitable, but absolutely essential.

Changes might be guided by different types of evidence, such as feedback from teachers and administrators or analysis of ratings data. And, of course, human judgment will play a big role. One thing that states and districts should not be doing, however, is assessing their new systems – or making changes to them – based on whether raw overall test scores go up or down within the first few years.

Here’s a little reality check: Even the best-designed, best-implemented new evaluations are unlikely to have an immediate, measurable impact on aggregate student performance. Evaluations are an investment, not a quick fix. And they are not risk-free. Their effects will depend on the quality of the systems, how current teachers and administrators react to them, and how all of this shapes – and plays out in – the teacher labor market. As I’ve said before, the realistic expectation for overall performance – and this is no guarantee – is that there will be some very small, gradual improvements, unfolding over a period of years and decades.

States and districts that expect anything more risk making poor decisions during these crucial, early phases.