Value-Added Versus Observations, Part Two: Validity

In a previous post, I compared value-added (VA) and classroom observations in terms of reliability – the degree to which they are free of error and stable over repeated measurements. But even the most reliable measures aren’t useful unless they are valid – that is, unless they’re measuring what we want them to measure.

Arguments over the validity of teacher performance measures, especially value-added, dominate our discourse on evaluations. There are, in my view, three interrelated issues to keep in mind when discussing the validity of VA and observations. The first is definitional – in a research context, validity is less about a measure itself than about the inferences one draws from it. The second point might follow from the first: The validity of VA and observations should be assessed in the context of how they’re being used.

Third and finally, given the difficulties in determining whether either measure is valid in and of itself, as well as the fact that so many states and districts are already moving ahead with new systems, the best approach at this point may be to judge validity in terms of whether the evaluations are improving outcomes. And, unfortunately, there is little indication that this is happening in most places.

Value-Added Versus Observations, Part One: Reliability

Although most new teacher evaluations are still in various phases of pre-implementation, it’s safe to say that classroom observations and/or value-added (VA) scores will be the most heavily weighted components of teachers’ final scores, depending on whether teachers are in tested grades and subjects. One gets the general sense that many - perhaps most - teachers strongly prefer the former (observations, especially peer observations) over the latter (VA).

One of the most common arguments against VA is that the scores are error-prone and unstable over time - i.e., that they are unreliable. And it's true that the scores fluctuate between years (also see here), with much of this instability due to measurement error, rather than “real” performance changes. On a related note, different model specifications and different tests can yield very different results for the same teacher/class.
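
To get a feel for how measurement error alone can produce this kind of instability, consider a bare-bones simulation (purely illustrative, and not calibrated to any particular state's model): every teacher's underlying performance is held perfectly constant, each year's estimate simply adds independent noise, and the year-to-year correlation of the estimates still comes out far below 1.

```python
import numpy as np

rng = np.random.default_rng(0)

n_teachers = 10_000
true_sd = 1.0    # spread of the stable "true" teacher effects (arbitrary units)
noise_sd = 1.0   # year-specific estimation error, assumed equal to the true spread

# Underlying performance is identical in both years -- zero "real" change
true_effect = rng.normal(0, true_sd, n_teachers)

# Each year's value-added estimate = stable true effect + independent error
est_year1 = true_effect + rng.normal(0, noise_sd, n_teachers)
est_year2 = true_effect + rng.normal(0, noise_sd, n_teachers)

# The estimates correlate only moderately across years, despite no real change
r = np.corrcoef(est_year1, est_year2)[0, 1]
print(f"year-to-year correlation of estimates: {r:.2f}")  # about 0.5 under these assumptions
```

With noise as large as the true spread (an assumption chosen only for simplicity), roughly half of the variation in any single year's estimates is error, which is why estimates bounce around even when nothing about the teachers has changed.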

These findings about instability are very important, and often too casually dismissed by VA supporters, but the issue of reliability is, to varying degrees, endemic to all performance measurement. Actually, many of the standard reliability-based criticisms of value-added could also be leveled against observations. Since we cannot observe “true” teacher performance, it’s tough to say which measure is “better” or “worse,” despite the certainty with which both “sides” often present their respective cases. And the fact that both entail some level of measurement error doesn't by itself speak to whether they should be part of evaluations.*

Nevertheless, many states and districts have already made the choice to use both measures, and in these places, the existence of imprecision is less important than how to deal with it. Viewed from this perspective, VA and observations are in many respects more alike than different.

If Your Evidence Is Changes In Proficiency Rates, You Probably Don't Have Much Evidence

Education policymaking and debates are under constant threat from an improbable assailant: Short-term changes in cross-sectional proficiency rates.

The use of rate changes is still proliferating rapidly at all levels of our education system. These measures, which play an important role in the provisions of No Child Left Behind, are already prominent components of many states’ core accountability systems (e.g., California), while several others will be using some version of them in their new, high-stakes school/district “grading systems.” New York State is awarding millions in competitive grants, with almost half the criteria based on rate changes. District consultants issue reports recommending widespread school closures and reconstitutions based on these measures. And, most recently, U.S. Secretary of Education Arne Duncan used proficiency rate increases as “preliminary evidence” supporting the School Improvement Grants program.

Meanwhile, on the public discourse front, district officials and other national leaders use rate changes to “prove” that their preferred reforms are working (or are needed), while their critics argue the opposite. Similarly, entire charter school sectors are judged, up or down, by whether their raw, unadjusted rates increase or decrease.

So, what’s the problem? In short, it’s that year-to-year changes in proficiency rates are not valid evidence of school or policy effects. These measures cannot do the job we’re asking them to do, even on a limited basis. This really has to stop.
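
To see why, here is a deliberately simple, entirely hypothetical illustration (all of the numbers are invented): a school's contribution to its students is identical in two consecutive years, but the incoming cohorts differ slightly, and the proficiency rate moves anyway.

```python
import numpy as np

rng = np.random.default_rng(1)

cutoff = 50.0         # hypothetical proficiency cutoff
school_effect = 5.0   # the school's contribution -- identical in both years
n_students = 100      # small cohorts, as in many real schools

# Two different groups of students, with slightly different starting points
cohort_year1 = rng.normal(48, 10, n_students)  # incoming achievement, year 1
cohort_year2 = rng.normal(45, 10, n_students)  # a somewhat lower-scoring incoming group

rate_year1 = np.mean(cohort_year1 + school_effect >= cutoff)
rate_year2 = np.mean(cohort_year2 + school_effect >= cutoff)

print(f"year 1 proficiency rate: {rate_year1:.0%}")
print(f"year 2 proficiency rate: {rate_year2:.0%}")
# The rate falls even though the school's effect is unchanged: the "change" mostly
# reflects comparing two different groups of students, plus sampling noise.
```

In other words, a cross-sectional rate change bundles together cohort composition, sampling error and the location of the proficiency cutoff with anything the school or policy actually did.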

The Uses (And Abuses?) Of Student Data

Knewton, a technology firm founded in 2008, has developed an “adaptive learning platform” that received significant media attention (also here, here, here and here), as well as funding and recognition early last fall and, again, in February this year (here and here). Although the firm is not alone in the adaptive learning game – e.g., Dreambox, Carnegie Learning – Knewton’s partnership with Pearson puts the company in a whole different league.

Adaptive learning takes advantage of student-generated information; thus, important questions about data use and ownership need to be brought to the forefront of the technology debate.

Adaptive learning software adjusts the presentation of educational content to students’ needs, based on students’ prior responses to that content. In the world of research, those prior responses would be treated as data. To the extent that adaptive learning is a mechanism for collecting information about learners, questions about privacy, confidentiality and ownership should be addressed.
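
For readers unfamiliar with the mechanics, the sketch below is a toy version of the generic logic behind adaptive systems - it is emphatically not Knewton's (proprietary) algorithm, and every name and number in it is made up. The point is simply that the software selects content based on a running estimate of the student, updates that estimate after each response, and logs every interaction along the way; that log is exactly the kind of student-generated data at issue here.

```python
# Toy sketch of generic adaptive-learning logic (illustrative only; all names
# and numbers are hypothetical, and this is NOT any vendor's actual algorithm).

item_bank = [{"id": k, "difficulty": d} for k, d in enumerate([-2, -1, 0, 1, 2])]

def next_item(estimate, answered):
    """Choose the unanswered item whose difficulty is closest to the current estimate."""
    remaining = [item for item in item_bank if item["id"] not in answered]
    return min(remaining, key=lambda item: abs(item["difficulty"] - estimate))

def update(estimate, correct, step=0.5):
    """Nudge the ability estimate up after a correct answer, down after an incorrect one."""
    return estimate + step if correct else estimate - step

# Simulated session with a made-up response pattern; note that every response
# is stored -- this log is the "student data" whose ownership is in question.
estimate, answered, log = 0.0, set(), []
for was_correct in [True, False, True]:
    item = next_item(estimate, answered)
    estimate = update(estimate, was_correct)
    answered.add(item["id"])
    log.append({"item": item["id"], "correct": was_correct, "estimate": estimate})

print(log)
```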

Learning From Teach For America

There is a small but growing body of evidence about the (usually test-based) effectiveness of teachers from Teach for America (TFA), an extremely selective program that trains and places new teachers in mostly higher-needs schools and districts. Rather than review this literature paper-by-paper, which has already been done by others (see here and here), I’ll just give you the super-short summary of the higher-quality analyses, and quickly discuss what I think it means.*

The evidence on TFA teachers focuses mostly on comparing their effect on test score growth vis-à-vis other groups of teachers who entered the profession via traditional certification (or through other alternative routes). This is no easy task, and the findings do vary quite a bit by study, as well as by the group to which TFA corps members are compared (e.g., new or more experienced teachers). One can quibble endlessly over the methodological details (and I’m all for that), and this area is still underdeveloped, but a fair summary of these papers is that TFA teachers are no more or less effective than comparable peers in terms of reading tests, and sometimes but not always more effective in math (the differences, whether positive or negative, tend to be small and/or only surface after 2-3 years). Overall, the evidence thus far suggests that TFA teachers perform comparably, at least in terms of test-based outcomes.

Somewhat in contrast with these findings, TFA has been the subject of both intense criticism and fawning praise. I don’t want to engage this debate directly, except to say that there has to be some middle ground on which a program that brings talented young people into the field of education is not such a divisive issue. I do, however, want to make a wider point specifically about the evidence on TFA teachers – what it might suggest about the current focus on “attracting the best people” to the profession.

Beware Of Anecdotes In The Value-Added Debate

A recent New York Times "teacher diary" presents the compelling account of a New York City teacher whose value-added rating was in the 6th percentile in 2009 – one of the lowest scores in the city – and in the 96th percentile the following year, one of the highest. Similar articles - for example, about teachers with errors in their rosters or scores that conflict with their colleagues'/principals' opinions - have been published since the release of the city’s teacher data reports (also see here). These accounts provoke a lot of outrage and disbelief, and that makes sense – they can sound absurd.

Stories like these can be useful as illustrations of larger trends and issues - in this case, of the unfairness of publishing the NYC scores, most of which are based on samples that are too small to provide meaningful information. But, in the debate over using these estimates in actual policy, we need to be careful not to focus too much on anecdotes. For every one NYC teacher whose value-added rank changed by more than 90 points between 2009 and 2010, there are almost 100 teachers whose ranks changed by fewer than 10 points (and percentile ranks overstate the actual size of all these differences). Moreover, even if the models yielded perfect measures of test-based teacher performance, there would still be many implausible fluctuations between years - those that are unlikely to be "real" change - due to nothing more than random error.*
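
A quick illustrative simulation makes the point (the numbers are invented, chosen only so that noise accounts for about half of the variance in a single year's estimate, and nothing here is calibrated to the actual NYC model): give every teacher identical "true" performance in both years, add random error, convert to percentile ranks, and a small number of enormous swings appear alongside a much larger number of modest ones.

```python
import numpy as np

rng = np.random.default_rng(2)

def pct_rank(x):
    """Percentile rank (0-100) of each value within the group."""
    return (x.argsort().argsort() + 1) / len(x) * 100

n_teachers = 18_000                    # roughly the scale of the NYC reports
true = rng.normal(0, 1, n_teachers)    # stable "true" performance, same in both years
noise_sd = 1.0                         # error assumed as large as the true spread

ranks_2009 = pct_rank(true + rng.normal(0, noise_sd, n_teachers))
ranks_2010 = pct_rank(true + rng.normal(0, noise_sd, n_teachers))
swing = np.abs(ranks_2010 - ranks_2009)

print(f"share moving fewer than 10 percentile points: {np.mean(swing < 10):.1%}")
print(f"share moving 90 or more percentile points:    {np.mean(swing >= 90):.3%}")
# Even with zero real change in performance, random error alone produces a small
# number of extreme reversals alongside a much larger number of modest changes.
```

The exact shares depend entirely on how much error one assumes; the point is only that extreme, seemingly absurd reversals are what random error looks like in the tails of a very large distribution.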

The reliability of value-added estimates, like that of all performance measures (including classroom observations), is an important issue, and is sometimes dismissed by supporters in a cavalier fashion. There are serious concerns here, and no absolute answers. But none of this can be examined or addressed with anecdotes.

Dispatches From The Nexus Of Bad Research And Bad Journalism

In a recent story, the New York Daily News uses the recently released teacher data reports (TDRs) to “prove” that the city’s charter school teachers are better than their counterparts in regular public schools. The headline announces boldly: “New York City charter schools have a higher percentage of better teachers than public schools” (it has since been changed to “Charters outshine public schools”).

Taking things even further, within the article itself, the reporters note, “The newly released records indicate charters have higher performing teachers than regular public schools.”

So, not only are the reporters equating words like “better” with value-added scores, but they’re obviously comfortable drawing conclusions about teacher quality based on the TDR data.

The article is a pretty remarkable display of both poor journalism and poor research. The reporters not only attempted to do something they couldn’t do, but they did it badly to boot. It’s unfortunate to have to waste one’s time addressing this kind of thing, but, no matter your opinion on charter schools, it's a good example of how not to use the data that the Daily News and other newspapers released to the public.

Ready, Aim, Hire: Predicting The Future Performance Of Teacher Candidates

In a previous post, I discussed the idea of “attracting the best candidates” to teaching by reviewing the research on the association between pre-service characteristics and future performance (usually defined in terms of teachers’ estimated effect on test scores once they get into the classroom). In general, this body of work indicates that, while far from futile, it’s extremely difficult to predict who will be an “effective” teacher based on their paper traits, including those that are typically used to define “top candidates,” such as the selectivity of the undergraduate institutions they attend, certification test scores and GPA (see here, here, here and here, for examples).

There is some very limited evidence that other, “non-traditional” measures might help. For example, a working paper released last year found a statistically discernible, fairly strong association between first-year math value-added and an index constructed from surveys administered to Teach for America candidates. There was, however, no association in reading (note that the sample was small), and no relationship in either subject during these teachers’ second year.*

A recently published paper – which appears in the peer-reviewed journal Education Finance and Policy, and was originally released as a working paper in 2008 – represents another step forward in this area. The analysis, presented by the respected quartet of Jonah Rockoff, Brian Jacob, Thomas Kane, and Douglas Staiger (RJKS), attempts to look beyond the set of characteristics that researchers are typically constrained (by data availability) to examine.

In short, the results do reveal some meaningful, potentially policy-relevant associations between pre-service characteristics and future outcomes. From a more general perspective, however, they are also a testament to the difficulties inherent in predicting who will be a good teacher based on observable traits.

Do Value-Added Models “Control For Poverty”?

There is some controversy over the fact that Florida’s recently-announced value-added model (one of a class often called “covariate adjustment models”), which will be used to determine merit pay bonuses and other high-stakes decisions, doesn’t include a direct measure of poverty.

Personally, I support adding a direct income proxy to these models, if for no other reason than to avoid this type of debate (and to facilitate the disaggregation of results for instructional purposes). It does bear pointing out, however, that the measure that’s almost always used as a proxy for income/poverty – students’ eligibility for free/reduced-price lunch – is terrible as a poverty (or income) gauge. It tells you only whether a student’s family has earnings below (or above) a given threshold (usually 185 percent of the poverty line), and this masks most of the variation among both eligible and non-eligible students. For example, families with incomes of $5,000 and $20,000 might both be coded as eligible, while families earning $40,000 and $400,000 are both coded as not eligible. A lot of hugely important information gets ignored this way, especially when the vast majority of students are (or are not) eligible, as is the case in many schools and districts.

That said, it’s not quite accurate to assert that Florida’s model and similar models “don’t control for poverty.” The model may not include a direct income measure, but it does control for prior achievement (a student’s test score in the previous year[s]). And a student’s prior test score is probably a better proxy for income than whether or not they’re eligible for free/reduced-price lunch.
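
For readers unfamiliar with the terminology, a covariate adjustment model is, at bottom, a regression of students' current scores on their prior scores and other controls, with teacher indicators picking up what is left over. The sketch below is a deliberately stripped-down, generic version run on synthetic data - it is not Florida's actual specification - but it shows concretely where the prior score enters and where a direct income proxy such as FRL status would (or would not) be added.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: 3 teachers x 30 students each (all numbers are invented)
n_per, n_teachers = 30, 3
teacher_id = np.repeat(np.arange(n_teachers), n_per)
prior_score = rng.normal(50, 10, n_per * n_teachers)         # last year's test score
frl = (rng.random(n_per * n_teachers) < 0.5).astype(float)   # free/reduced-price lunch flag
true_teacher_effect = np.array([-2.0, 0.0, 2.0])[teacher_id]
current_score = (10 + 0.8 * prior_score - 1.5 * frl
                 + true_teacher_effect + rng.normal(0, 5, n_per * n_teachers))

# Covariate adjustment model: regress current scores on prior scores (and, if
# desired, an income proxy like FRL), plus teacher indicators; the coefficients
# on the teacher indicators are the "value-added" estimates.
teacher_dummies = (teacher_id[:, None] == np.arange(n_teachers)).astype(float)
X = np.column_stack([np.ones(n_per * n_teachers), prior_score, frl,
                     teacher_dummies[:, 1:]])                # teacher 0 is the reference
coef, *_ = np.linalg.lstsq(X, current_score, rcond=None)
print("estimated effects of teachers 1 and 2, relative to teacher 0:", coef[3:].round(2))
```

Dropping the frl column removes the direct income control, but the prior score column remains, and prior achievement absorbs much of what an income measure would have captured anyway.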

Even more importantly, however, the key issue about bias is not whether the models “control for poverty,” but rather whether they control for the range of factors – school and non-school – that are known to affect student test score growth, independent of teachers’ performance. Income is only one part of this issue, which is relevant to all teachers, regardless of the characteristics of the students that they teach.

Guessing About NAEP Results

Every two years, the release of data from the National Assessment of Educational Progress (NAEP) generates a wave of research and commentary trying to explain short- and long-term trends. For instance, there have been a bunch of recent attempts to “explain” an increase in aggregate NAEP scores during the late 1990s and 2000s. Some analyses postulate that the accountability provisions of NCLB were responsible, while more recent arguments have focused on the “effect” (or lack thereof) of newer market-based reforms – for example, looking to NAEP data to “prove” or “disprove” the idea that changes in teacher personnel and other policies have (or have not) generated “gains” in student test scores.

The basic idea here is that, for every increase or decrease in cross-sectional NAEP scores over a given period of time (both for all students and especially for subgroups such as minority and low-income students), there must be “something” in our education system that explains it. In many (but not all) cases, these discussions consist of little more than speculation. Discernible trends in NAEP test score data are almost certainly due to a combination of factors, and it’s unlikely that any one policy or set of policies is dominant enough to be identified as “the one.” Now, there’s nothing necessarily wrong with speculation, so long as it is clearly identified as such and its conclusions are presented accordingly. But I find it curious that some people involved with these speculative arguments seem a bit too willing to assume that schooling factors – rather than changes in cohorts’ circumstances outside of school – are the primary driver of NAEP trends.

So, let me try a little bit of illustrative speculation of my own: I might argue that changes in the economic conditions of American schoolchildren and their families are the most compelling explanation for changes in NAEP scores.