Matching Up Teacher Value-Added Between Different Tests

The U.S. Department of Education has released a very short, readable report on the comparability of value-added estimates using two different tests in Indiana – one of them norm-referenced (the Measures of Academic Progress test, or MAP), and the other criterion-referenced (the Indiana Statewide Testing for Educational Progress Plus, or ISTEP+, which is also the state’s official test for NCLB purposes).

The research design here is straightforward: Fourth and fifth grade students in 46 schools across 10 districts in Indiana took both tests, their teachers’ value-added scores were calculated, and the scores were compared. Since both sets of scores were based on the same students and teachers, this allows a direct comparison of teachers’ value-added estimates across the two tests. The results are not surprising, and they square with similar prior studies (see here, here, here, for example): The estimates based on the two tests are moderately correlated. Depending on the grade/subject, the correlations are between 0.4 and 0.7. If you’re not used to interpreting correlation coefficients, consider that only around one-third of teachers were in the same quintile (fifth) on both tests, and another 40 or so percent were one quintile higher or lower. So, most teachers were within one quintile of their other score, about a quarter moved two or more quintiles, and a small percentage moved from top to bottom or vice-versa.
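To make these figures a bit more concrete, here is a minimal simulation (in Python; the correlation, sample size, and names are illustrative assumptions, not the report’s data or model). It draws pairs of scores correlated at roughly the middle of the 0.4-0.7 range and tabulates how often the two quintile assignments agree; the printed proportions land in the same general ballpark as the figures above.

```python
import numpy as np

# Illustrative only: simulate pairs of teacher value-added scores with an
# assumed correlation; these are not the report's data.
rng = np.random.default_rng(0)
n_teachers = 100_000          # large n so the proportions are stable
rho = 0.55                    # roughly the middle of the 0.4-0.7 range

# Draw correlated standard-normal score pairs (e.g., MAP-based vs. ISTEP+-based).
cov = [[1.0, rho], [rho, 1.0]]
scores = rng.multivariate_normal([0.0, 0.0], cov, size=n_teachers)

# Assign quintiles (1 = bottom fifth, 5 = top fifth) on each test separately.
def quintile(x):
    return np.searchsorted(np.quantile(x, [0.2, 0.4, 0.6, 0.8]), x) + 1

q1, q2 = quintile(scores[:, 0]), quintile(scores[:, 1])
moves = np.abs(q1 - q2)

print("same quintile:         ", round((moves == 0).mean(), 2))
print("one quintile apart:    ", round((moves == 1).mean(), 2))
print("two or more apart:     ", round((moves >= 2).mean(), 2))
print("top-to-bottom or vice: ", round((moves == 4).mean(), 2))
```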

Although, as mentioned above, these findings are in line with prior research, it is worth remembering why this “instability” occurs (and what can be done about it).

Contrary to what many might believe, the primary reason is probably not differences in content between the tests (see Appendix C of the report). Content no doubt plays a role, but the bigger cause of the moderate correlations is the same thing that drives instability between years on the same test: Value-added estimates are imprecise, and so there is a limit to how well they can “match up” with each other, whether between tests or between years.

In fact, even if value-added were a perfectly “accurate” measure of teacher effectiveness, there would still be a great deal of instability due to nothing more than this error (which is itself largely a result of the countless factors that can affect student testing performance).
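A quick back-of-the-envelope sketch makes the point (the variances below are assumptions chosen for illustration, not quantities estimated in the report): give every teacher a fixed “true” effect, add independent error to it twice, and the two resulting sets of estimates agree only moderately, even though nothing about the teachers has changed.

```python
import numpy as np

# Illustrative sketch (assumed variances): even if each teacher's "true" effect
# were fixed and captured perfectly on average, each estimate still carries
# error from the particular students tested on a particular occasion.
rng = np.random.default_rng(1)
n_teachers = 50_000

true_effect = rng.normal(0.0, 1.0, n_teachers)   # the same teachers both times
noise_sd = 1.0                                   # error comparable in size to the signal

# Two independent noisy estimates of the *same* underlying effects
# (think: two tests, or two years).
estimate_a = true_effect + rng.normal(0.0, noise_sd, n_teachers)
estimate_b = true_effect + rng.normal(0.0, noise_sd, n_teachers)

# Expected correlation is var(true) / (var(true) + var(noise)) = 0.5 here,
# purely because of the error term.
print(round(np.corrcoef(estimate_a, estimate_b)[0, 1], 2))
```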

That the instability is inevitable does not, of course, mitigate the concerns about it. The fact that teacher value-added estimates using two different tests match up only modestly is a serious issue (though those who express such concerns too often fail to mention that all measures worth their salt, including classroom observations, are unstable over time and vary between observation protocols).

Nor does it mean that states and districts are powerless in the face of imprecision. There are a few ways to reduce error and thus ensure a more faithful interpretation of the estimates. One is pooling data across multiple years (when possible). This boosts the sample size and reduces the likelihood, for example, that a teacher’s scores were biased by something that happened on testing day (there is also a technique called “shrinkage,” by which the estimates are adjusted based on sample size).
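For readers curious what “shrinkage” looks like mechanically, here is a bare-bones, empirical-Bayes-style sketch (the function name, prior variance, and standard errors are hypothetical illustrations, not the procedure used in the report or in any particular state): each estimate is pulled toward the average, and the pull is stronger when the estimate’s standard error is larger.

```python
import numpy as np

# A bare-bones sketch of "shrinkage" (empirical-Bayes style). The prior
# variance and standard errors below are made-up inputs for illustration.
def shrink(estimates, standard_errors, prior_variance, prior_mean=0.0):
    """Pull each noisy estimate toward the average (here, 0), more strongly
    when its standard error is large, e.g., when it rests on few students."""
    estimates = np.asarray(estimates, dtype=float)
    se = np.asarray(standard_errors, dtype=float)
    weight = prior_variance / (prior_variance + se ** 2)   # between 0 and 1
    return prior_mean + weight * (estimates - prior_mean)

# Two hypothetical teachers with the same raw estimate (0.30 above average),
# but the second taught far fewer students, so her estimate is noisier.
raw = [0.30, 0.30]
se = [0.10, 0.40]
print(shrink(raw, se, prior_variance=0.04))   # -> [0.24, 0.06]
# The noisier estimate is pulled much closer to the average.
```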

A second means of reducing error, put simply, is to address it directly when incorporating value-added estimates into teacher evaluations and other personnel policies. As I’ve discussed many times before, each estimate is accompanied by a margin of error, which is a statistical measure of the confidence we can have in its precision. If that margin of error is larger than the difference between the estimate and the average, whether the estimate is higher or lower, we cannot have much confidence that the teacher is truly above or below average (i.e., the difference is not statistically significant).

One very simple way for districts to incorporate this idea is to code teachers’ scores using their margins of error, treating any estimate that is not statistically distinguishable from the average as average.
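As a concrete illustration of what such a coding rule might look like (a hypothetical sketch, not any district’s actual policy), the logic is just a comparison of each estimate to its margin of error:

```python
def code_with_margin_of_error(estimate, margin_of_error):
    """Hypothetical coding rule: only call a teacher above or below average
    when the estimate differs from the average (0) by more than its margin
    of error; otherwise treat the score as average."""
    if estimate - margin_of_error > 0:
        return "above average"
    if estimate + margin_of_error < 0:
        return "below average"
    return "average (not statistically distinguishable)"

# Illustrative values only.
print(code_with_margin_of_error(0.30, 0.10))   # above average
print(code_with_margin_of_error(0.30, 0.40))   # average: the margin swamps the estimate
print(code_with_margin_of_error(-0.50, 0.20))  # below average
```

Anything inside the margin of error is treated as average, which is in the spirit of the coding approach the report evaluates, though the report’s exact procedure may differ.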

These two approaches would help ensure that teachers with large margins of error were not being “penalized,” although there are costs to consider. Most notably, requiring multiple years of data would limit the number of teachers who are “eligible” for the estimates. Similarly, using margins of error would entail the forfeiture of information (i.e., less differentiation among teachers’ scores). For example, in practice, the majority of teachers’ estimates are statistically no different from the average (on the fall-to-fall reading comparisons in the report, it’s a pretty striking 97-98 percent, though it is typically more like 65-80 percent). The report discussed above finds that coding teachers’ scores using the margins of error reduces “instability” to a relatively negligible level.

Among those states and districts that have already decided to incorporate value-added, a few have chosen to employ some version of these “fixes.” But a rather disturbing number are simply using unadjusted estimates based on one year of data (and some are doing so while, ironically, also requiring multiple classroom observations, which is itself a way of boosting sample size).

Given the trade-offs, it is not possible to argue that there is one specific “correct” approach (and not using value-added at all is of course an option favored by many). These systems are all pretty new, and it's difficult to know how they will work (including how teachers will respond to them) without more field testing.

In my view, however, it would be nice to see a larger group of states/districts taking a more active role in addressing the imprecision of value-added estimates. And the issue certainly receives too little attention.

- Matt Di Carlo