Measurement And Incentives In The USED Teacher Preparation Regulations

Late last year, the U.S. Department of Education (USED) released a set of regulations, the primary purpose of which is to require states to design formal systems of accountability for teacher preparation (TP) programs. Specifically, states are required to evaluate annually the programs operating within their boundaries, and assign performance ratings. Importantly, the regulations specify that programs receiving low ratings should face possible consequences, such as the loss of federal funding.

The USED regulations on TP accountability put forth several outcomes that states are to employ in their ratings, including: student outcomes (e.g., test-based effectiveness of graduates); employment outcomes (e.g., placement/retention); and surveys (e.g., satisfaction among graduates/employers). USED proposes that states have their initial designs completed by the end of this year, and start generating ratings in 2017-18.

As was the case with the previous generation of teacher evaluations, teacher preparation is an area in which there is widespread agreement about the need for improvement. And formal high-stakes accountability systems can (even should) be a part of that at some point. Right now, however, requiring all states to begin assigning performance ratings to these programs, and imposing high-stakes accountability for those ratings within a few years, is premature. The available measures have very serious problems, and the research on them is in its relative infancy. If we cannot reliably distinguish between programs in terms of their effectiveness, it is ill-advised to hold them formally accountable for that effectiveness. The primary rationale for the current focus on teacher quality and evaluations was established over decades of good research. We are nowhere near that point for TP programs. This is one of those circumstances in which the familiar refrain of “it’s imperfect but better than nothing” is false, and potentially dangerous.

Let’s take a look at several reasons why this is the case, starting with a discussion of TP value-added measures, which will, most likely, end up being the major component of most states’ systems. (Side note: Bruce Baker has been writing about the issue of TP accountability for several years - see, for example, here, here, here, and here).

The existing research in this area suggests that there are only small (and imprecisely estimated) differences between TP programs. The current push to reform teacher evaluation systems, and to improve teacher quality in general, was motivated largely by the impressive body of evidence that teachers’ test-based effectiveness varies widely. USED relies heavily on this finding to justify their TP regulations. The problem is that the existing work on TPs' test-based effectiveness, as it relates to formal accountability, is in many respects distinct from that focused on teacher-level effectiveness. Moreover, while this body of evidence on TP programs is just starting to accumulate, it suggests that measured differences in the value-added of graduates of different programs may be rather small.

In the press materials accompanying the regulations, USED cites as support for its TP plan an analysis of graduates of TP programs in Washington State who end up teaching in elementary schools (Goldhaber et al. 2013), which finds that there are a handful of meaningfully different programs at the tails of the TP value-added distribution, but also that the vast majority of programs produce teachers whose average effectiveness, once they reach the classroom, is not substantially different from that of teachers trained out of state.

Moreover, in a paper forthcoming in Education Finance and Policy, Koedel et al. find even less meaningful and/or detectable variation between programs in Missouri. Instead, they conclude that most of the variation between teachers is found within rather than between programs, and that value-added models are, at least at present, of little practical utility to administrators (also see Koedel and Parsons 2014 for further discussion of these findings and implications). And researchers have reached roughly similar conclusions from analyses of states such as North Carolina (Henry et al. 2013), Louisiana (Gansle 2010), Florida (Mihaly et al. 2013), and Texas (Von Hippel et al. 2014). In short, while there are exceptions (e.g., Boyd et al. 2009, though see Koedel and Parsons 2014), the current evidence does not support the idea that there are large, statistically discernible differences between most TP programs, at least in terms of test-based productivity among graduates.

As a result, two programs might receive very different rankings, but still have effect sizes that are estimated imprecisely and not very different (from each other, and from the state average). This is a huge problem in an accountability system in which performance ratings are attached to high-stakes consequences. These findings do not, of course, mean that all TP programs are the same in terms of quality. And it is entirely possible that programs in states that have not yet been studied will yield different results. In the meantime, however, the current evidence suggests, rather clearly, that available measures are not yet equipped to detect that variation (in part, perhaps, because it may actually be rather minor).
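
To make the ranking problem concrete, here is a minimal simulation sketch. It is not drawn from any of the studies cited above, and every number in it (the number of programs, graduates per program, and the sizes of the true between-program differences and the noise) is a hypothetical assumption; the point is simply that when true differences between programs are small relative to estimation noise, confidence intervals overlap heavily and the same programs can receive very different rankings from one year's data to the next.

```python
# Hypothetical illustration: small true between-program differences plus large
# within-program noise produce imprecise, unstable program rankings.
import numpy as np

rng = np.random.default_rng(0)

n_programs = 20          # hypothetical number of TP programs in a state
grads_per_program = 30   # hypothetical graduates observed per program
true_sd_between = 0.03   # small true between-program variation (student SD units)
noise_sd_within = 0.20   # much larger within-program / estimation noise

true_effects = rng.normal(0.0, true_sd_between, n_programs)

def estimate_once():
    """One 'year' of data: each program's mean value-added and its standard error."""
    means, ses = [], []
    for effect in true_effects:
        grads = rng.normal(effect, noise_sd_within, grads_per_program)
        means.append(grads.mean())
        ses.append(grads.std(ddof=1) / np.sqrt(grads_per_program))
    return np.array(means), np.array(ses)

means_y1, ses_y1 = estimate_once()
means_y2, _ = estimate_once()

rank_y1 = np.argsort(np.argsort(-means_y1)) + 1  # 1 = highest-rated program
rank_y2 = np.argsort(np.argsort(-means_y2)) + 1

print("program  true    year-1 estimate (95% CI)   rank y1  rank y2")
for i in range(n_programs):
    lo, hi = means_y1[i] - 1.96 * ses_y1[i], means_y1[i] + 1.96 * ses_y1[i]
    print(f"{i:7d}  {true_effects[i]:+.3f}  {means_y1[i]:+.3f} ({lo:+.3f}, {hi:+.3f})"
          f"   {rank_y1[i]:5d}   {rank_y2[i]:5d}")
```

Under these assumptions, nearly every program's interval covers zero and overlaps most other programs' estimates, yet the rankings themselves look decisive, and many programs move a large number of positions between the two simulated years.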

TP value-added cannot separate selection from actual program effects. In addition to the lack of measured variation between programs, assessing the effectiveness of TP programs based on their graduates' (eventual) test-based productivity entails another big problem. As discussed in this post, some programs attract better candidates than others, and test-based productivity measures cannot separate the degree to which it was the pre-existing "ability" of applicants, rather than attending a specific program, that is responsible for the differences in effectiveness. Put differently, a great new teacher might have been just as great had she attended a different school. 
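
A similarly minimal sketch, again with purely hypothetical numbers, shows the selection problem: if three programs add exactly the same value but attract applicants of different average ability, their graduates' average value-added will still differ, and a rating based on that average will attribute the applicant differences to the programs.

```python
# Hypothetical illustration: identical program effects, different applicant pools.
import numpy as np

rng = np.random.default_rng(1)

program_effect = 0.0                      # assume every program adds the same value
applicant_ability = [0.10, 0.00, -0.10]   # hypothetical selection differences
grads_per_program = 200

for name, ability in zip("ABC", applicant_ability):
    # Graduate effectiveness = applicant ability + (identical) program effect + noise
    va = ability + program_effect + rng.normal(0.0, 0.15, grads_per_program)
    print(f"Program {name}: mean graduate value-added = {va.mean():+.3f}")
```

In this sketch, Program A looks meaningfully better than Program C even though, by construction, neither program did anything differently.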

For some users of the ratings, this may not be an issue - for example, principals who are looking to hire teachers may not care whether the ratings reflect program or selection effects, since they are interested solely in hiring the best people. In other contexts, however, including a formal high-stakes accountability system in which programs’ funding and reputations ride on the ratings, this is a very serious problem.

(There is also concern that teachers’ effectiveness depends to some extent on factors such as school context, quality, or “fit,” which means that TP value-added might be confounded to some degree by where their graduates end up teaching [see Mihaly et al. 2013]).

The alternative measures suggested by USED suffer from similar problems, and are almost entirely without empirical validation. To reiterate, the USED regulations allow states some leeway in which measures they use, and how those measures are weighted. Yet these alternatives, too, are very problematic and under-researched. For example, the idea of holding TP programs accountable for the retention of their graduates, and whether or not they teach in high-needs schools, is ill-considered. One can only applaud TP institutions that encourage their graduates to make a career of teaching, or to teach in high-needs schools, but the purpose here should be to hold programs accountable for their graduates’ performance, not for these teachers' career trajectories or where they choose to be employed.

Moreover, where teachers end up teaching, and/or whether they stay in the profession, cannot really be attributed to TP programs in a manner that would not be severely confounded by non-program factors. For instance, perhaps a given program’s graduates are more likely to end up teaching in high-needs schools because of the institution’s proximity to these schools, and because teachers tend to prefer finding jobs near where they grew up (Boyd et al. 2005). Or maybe some programs attract more idealistic candidates, or candidates who are more certain about their career choice. Raw retention or placement outcomes vary for many reasons. Some of them may be part of TP programs (e.g., student teaching - see Ronfeldt 2012), but many others have nothing to do with which program teachers attend.

Similarly, surveys of graduates and employers seem like a reasonable idea, and perhaps they will be useful in TP accountability systems in the future. For now, however, there is precious little evidence on what these surveys are telling us and how they compare with the other measures being considered. In addition, once again, employers might give high marks to some programs and low marks to others, but these differences might very well be driven by the selection effects discussed above, or by teachers’ decisions about where they end up seeking employment. And it’s difficult to know how to interpret graduates’ opinions of their programs, since they have nothing to which they can compare their experiences and the perceived quality of the preparation they received.

We don’t know how TP programs (and other stakeholders) will respond to formal high-stakes accountability, and immediate, national implementation precludes informed decision making. Given the measures currently being put forth, high-stakes accountability under these regulations is virtually certain to reward and punish programs based partially, and perhaps even largely, on factors other than their actual quality.

Now, in fairness, it bears noting that the core purpose of accountability systems is to encourage productive behavior, and that this can happen even if the measures underlying TP ratings are highly imperfect. Yet the USED TP regulations require all states to begin high-stakes accountability within a few short years. These programs stand to lose not just prestige or applicant volume, but also federal funding and, perhaps, accreditation. There will be no opportunity to develop a more robust body of evidence on these measures, or even to see how these new systems work on a smaller scale before imposing them on the entire TP landscape, all at once.

Faced with stark consequences based on unproven, imprecise, and easily misinterpreted measures, it is not difficult to imagine a future in which TP programs (and candidates) respond unproductively. TP institution administrators and teachers might look to programs with high ratings and attempt to replicate their approaches, even though the ratings are transmitting severely biased (and/or noisy) information about the efficacy of those approaches.

Those programs that have traditionally recruited and trained more disadvantaged students might scale back these efforts. And prospective teachers may make suboptimal decisions about which program to attend based on ratings that are telling them more about the supply of applicants than the quality of programs those applicants attend. 

Policymakers cannot, of course, prevent unintended responses to accountability systems, but, in this case, the information will be so deeply flawed that it's unclear whether - perhaps even unlikely that - the benefits of productive behavioral changes will outweigh the cost of the unproductive changes (and states that have started using TP value-added for low-stakes accountability have not inspired much confidence that they understand the limitations of these measures).

The prudent approach here is to allow at least 5-7 years for additional research and smaller-scale, low-stakes implementation. Policymaking almost never entails certainty, and it is often worth going ahead with large-scale policy implementation even when proposed systems are imperfect. There will eventually be a right time for high-stakes nationwide TP accountability. But now is not that time. It is not yet sufficiently established that programs vary widely in their effectiveness, nor is it anywhere near apparent that current measurement tools can detect that variation if and where it exists.

Accordingly, there need to be several years of additional research on TP value-added (as well as retention) in multiple locations, including the persistence of program effects, the sensitivity of estimates to approaches that address the placement of graduates, and the degree to which estimates reflect the quality and characteristics of programs’ applicants, rather than actual program quality. On a related note, there seems to be some potential in examining the variation of pre-service teachers within programs, and/or in assessing the impact of programs’ master’s degree offerings, since researchers might exploit the fact that many teachers are in the classroom for a while before pursuing a master’s, which would provide a baseline for assessing program effects. States should also be encouraged to begin trying out accountability ratings without stakes, and to focus particularly on developing measures that do not rely on test scores (including testing those “recommended” by the USED regulations). In this sense, it is less the individual components of the USED regulations that are the problem than the scope and speed of their implementation.

Finally, there is a big picture here. Over the past 5-10 years, test-based accountability, both formal and informal, has ramped up systematically, and has been imposed on schools, teachers, principals and even superintendents. Additional high-stakes applications must be approached with extreme caution. Granted, recommendations to wait before implementing new policies can be frustrating for policymakers and supporters of those policies. These advocates are very sensitive to the political climate, and they know that waiting and/or slow implementation run the risk of subsequent administrations shutting policies down before they get off the ground. These are fair points, but they are political arguments, not policy arguments. And there is no political argument that can justify holding institutions accountable for measures that are not yet ready to be used in an accountability system.
