Do We Know How To Hold Teacher Preparation Programs Accountable?

This piece is co-authored by Cory Koedel and Matthew Di Carlo. Koedel is an Associate Professor of Economics and Public Policy at the University of Missouri, Columbia.

The United States Department of Education (USED) has proposed regulations requiring states to hold teacher preparation programs accountable for the performance of their graduates. According to the proposal, states must begin assigning ratings to each program within the next 2-3 years, based on outcomes such as graduates’ “value-added” to student test scores, their classroom observation scores, how long they stay in teaching, whether they teach in high-needs schools, and surveys of their principals’ satisfaction.

In the long term, we are very receptive to, and indeed optimistic about, the idea of outcomes-based accountability for teacher preparation programs (TPPs). In the short to medium term, however, we contend that the evidence base underlying the USED regulations is nowhere near sufficient to guide a national effort toward high-stakes TPP accountability.

This is a situation in which the familiar refrain of “it’s imperfect but better than nothing” is false, and rushing into nationwide design and implementation could be quite harmful.

Over the past 5-10 years, there has been a widespread, albeit still controversial, effort to improve evaluation systems for individual teachers, including the incorporation of test-based productivity measures (e.g., value-added). This is the culmination of decades of rigorous research on the reliability and validity of the measures constituting the new systems.

The idea of using similar measures, and value-added in particular, to gauge the effectiveness of TPPs may seem like a common sense extension of what we’re already doing, but in reality it is fraught with complications.

The critical issue is that the pool of annual graduates from most teacher preparation programs is very small. This means that measures of TPP performance are extremely imprecise, statistically speaking. Several recent studies have shown that we simply cannot detect differences between the vast majority of TPPs with a reasonable degree of confidence. And this issue applies not only to test-based measures, but to virtually every one of the types of outcomes required by the USED regulations. All of them depend on tracking outcomes for small numbers of graduates who get jobs in K-12 schools.
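To see why small cohorts are such a problem, consider a back-of-the-envelope calculation. The Python sketch below is purely illustrative: the spread of individual graduates' value-added (0.20 student test-score standard deviations) and the cohort size (25 graduates observed teaching) are assumptions chosen for demonstration, not estimates from any study.

```python
# A back-of-the-envelope sketch of the precision problem. The numbers
# here are illustrative assumptions, not estimates from any study.
import math

sd_graduates = 0.20   # assumed SD of individual graduates' value-added,
                      # in student test-score standard deviation units
cohort_size = 25      # assumed graduates per year observed teaching in K-12

# The standard error of the program's estimated average shrinks
# only with the square root of the cohort size.
se = sd_graduates / math.sqrt(cohort_size)
ci_width = 2 * 1.96 * se   # approximate width of a 95% confidence interval

print(f"standard error of program average: {se:.3f}")   # 0.040
print(f"approximate 95% CI width: {ci_width:.2f} SDs")  # 0.16
# An interval roughly 0.16 SDs wide can easily exceed the true gaps
# between most programs, leaving them statistically indistinguishable.
```

Under these assumptions, the uncertainty around a single program's estimate is wide enough to swallow plausible differences between programs, and shrinking it meaningfully would require cohorts several times larger than most programs produce.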

Compounding this problem, the differences that currently exist between TPPs in the average test-based effectiveness of their graduates are not particularly large. Efforts to rank programs will almost surely produce an unacceptably high number of cases in which two programs with similar “true” performance receive very different rankings.
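A simple simulation makes the ranking problem concrete. In the sketch below, the parameters are again invented for illustration: true program differences with a standard deviation of 0.02 and estimation noise with a standard error of 0.04, so that noise dominates signal, as the research on small graduate cohorts suggests it often will.

```python
# An illustrative simulation of ranking instability. All parameters
# are invented for demonstration, not empirical estimates.
import numpy as np

rng = np.random.default_rng(42)

n_programs = 100
true_sd = 0.02    # assumed SD of true program effects: small differences
noise_se = 0.04   # assumed standard error of each program's estimate

true_effects = rng.normal(0.0, true_sd, n_programs)
estimates = true_effects + rng.normal(0.0, noise_se, n_programs)

# Rank programs from best (1) to worst (n) on true and estimated effects.
true_rank = (-true_effects).argsort().argsort() + 1
est_rank = (-estimates).argsort().argsort() + 1

# Rank correlation: with this much noise it comes out well below 1.
print("rank correlation:", round(np.corrcoef(true_rank, est_rank)[0, 1], 2))

# How often does a truly middling program (middle 60% of true quality)
# receive an extreme estimated ranking (top or bottom quintile)?
middling = (true_rank > 20) & (true_rank <= 80)
extreme = (est_rank <= 20) | (est_rank > 80)
share = (middling & extreme).sum() / middling.sum()
print(f"middling programs ranked in an extreme quintile: {share:.0%}")
```

When noise of this magnitude swamps true differences, a substantial share of genuinely average programs lands in the apparent top or bottom of the distribution in any given year, which is exactly the scenario that would make high-stakes rankings misleading.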

Moreover, once again, this concern is not limited to test-based value-added measures. It is more than plausible that some or all of the other outcomes required by USED’s regulations do not differ much between graduates of different TPPs. The unfortunate reality is that we don’t know whether this is the case, for one simple reason: there is very little research on these measures in the TPP context. Pre-selecting accountability indicators now, without even this most basic empirical groundwork, is disconcerting and ill-advised.

Granted, our reasons for caution in this endeavor do not make for the best talking points: it is tough to win an argument with appeals to statistical imprecision, and “wait for more research” isn’t exactly a thrilling rallying cry.

But this does not make them any less important. A system designed around measures that cannot convey the information required will not be successful, and has the potential to do real harm. For example, programs might lose federal funding based on rankings, or administrators and professors might attempt to replicate the practices of highly-ranked programs, even if the rankings do not reflect real differences in efficacy – they would be “chasing (statistical) noise.”

To be clear, the issues we raise here are serious, but not insurmountable. The recent expansion of data availability in K-12 education has greatly accelerated our ability to find solutions to these types of problems. Thus, our primary concern is not with outcomes-based TPP accountability, which we support in principle, but rather with USED pre-selecting the specific measures that states must use, and requiring states to design their initial systems within just a few short years.

This implementation schedule is startling to the point of seeming reckless. It would seem to imply that we already know what outcomes are important for TPP accountability, and that we are able to construct informative measures of these outcomes. Neither is true at this time.

Despite the push by USED, spending bills recently drafted by the House and Senate appropriations committees would, among other things, prevent the Department from issuing teacher-preparation accountability rules for now. Although this aspect of the spending bills likely has more to do with governance philosophy (i.e., whether to regulate or legislate) than with the substance of the USED teacher preparation regulations, in this case we believe the end result would be beneficial.

In particular, the delay would allow more time for additional research and experimentation in this area. States could be encouraged to innovate and try out different metrics, including those proposed by the USED. Only after such a learning period, and a better understanding of what will and will not work, does it make sense to develop a national strategy. Policy is no different from great teaching: both require preparation.