The Uncertain Short-Term Future Of School Growth Models

Over the past 20 years, public schools in the U.S. have come to rely more and more on standardized tests, and the COVID-19 pandemic has halted the flow of these data. This is hardly among the most important disruptions that teachers, parents, and students have endured over the past year or so. But one corollary of skipping a year (or more) of testing is the implications for estimating growth models, which are statistical approaches for assessing the association between students' testing progress and those students' teachers, schools, or districts.

This type of information, used properly, is always potentially useful, but it may be particularly timely right now, as we seek to understand how the COVID-19 pandemic affected educational outcomes, and, perhaps, how those outcomes varied by different peri-pandemic approaches to schooling. This includes the extent to which there were meaningful differences by student subgroup (e.g., low-income students who may have had more issues with virtual schooling). 

To be clear, the question of when states should resume testing should be evaluated based on what’s best for schools and students, and in my view this decision should not include consideration of any impact on accountability systems (the latest development is that states will not be allowed to cancel testing entirely but may be allowed to curtail it). In any case, though, the fate of growth models over the next couple of years is highly uncertain. The models rely on tracking student test scores over time, and so skipping a year (and maybe even more) is obviously a potential problem. A new working paper takes a first step toward assessing the short-term feasibility of growth estimates (specifically school and district scores). But this analysis also provides a good context for a deeper discussion of how we use (and sometimes misuse) testing data in education policy.

Teacher Evaluations And Turnover In Houston

We are now entering a time period in which we might start to see a lot of studies released about the impact of new teacher evaluations. This incredibly rapid policy shift, perhaps the centerpiece of the Obama Administration’s education efforts, was sold based on illustrations of the importance of teacher quality.

The basic argument was that teacher effectiveness is perhaps the most important factor under schools’ control, and the best way to improve that effectiveness was to identify and remove ineffective teachers via new teacher evaluations. Without question, there was a logic to this approach, but dismissing or compelling the exits of low-performing teachers does not occur in a vacuum. Even if a given policy causes more low performers to exit, the effects of this shift can be attenuated by turnover among higher performers, not to mention other important factors, such as the quality of applicants (Adnot et al. 2016).

A new NBER working paper by Julie Berry Cullen, Cory Koedel, and Eric Parsons addresses this dynamic directly by looking at the impact on turnover of a new evaluation system in Houston, Texas. It is an important piece of early evidence on one new evaluation system, but the results also speak more broadly to how these systems work.

Do Subgroup Accountability Measures Affect School Ratings Systems?

The school accountability provisions of No Child Left Behind (NCLB) institutionalized a focus on the (test-based) performance of student subgroups, such as English language learners, racial and ethnic groups, and students eligible for free- and reduced-price lunch (FRL). The idea was to shine a spotlight on achievement gaps in the U.S., and to hold schools accountable for serving all students.

This was a laudable goal, and disaggregating data by student subgroups is a wise policy, as there is much to learn from such comparisons. Unfortunately, however, NCLB also institutionalized the poor measurement of school performance, and so-called subgroup accountability was not immune. The problem, which we’ve discussed here many times, is that test-based accountability systems in the U.S. tend to interpret how highly students score as a measure of school performance, when it is largely a function of factors out of schools' control, such as student background. In other words, schools (or subgroups of their students) may exhibit higher average scores or proficiency rates simply because their students entered the schools at higher levels, regardless of how effective the school may be in raising scores. Although NCLB’s successor, the Every Student Succeeds Act (ESSA), perpetuates many of these misinterpretations, it still represents some limited progress, as it encourages greater reliance on growth-based measures, which look at how quickly students progress while they attend a school, rather than how highly they score in any given year (see here for more on this).
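
To make the status/growth distinction concrete, below is a minimal sketch in Python, using simulated, hypothetical data (a real growth model would use far richer data and include additional controls). It contrasts a status measure, a school's average score, with a bare-bones growth measure: the school's average residual from a regression of current scores on prior-year scores.

```python
# Illustrative sketch only: simulated data, two hypothetical schools.
# A "status" measure (average score) vs. a bare-bones "growth" measure
# (average residual from a regression of current scores on prior scores).
import numpy as np

rng = np.random.default_rng(0)
n = 500

# School A enrolls higher-scoring students but adds less to their learning;
# School B enrolls lower-scoring students but adds more.
prior_a = rng.normal(60, 10, n)
prior_b = rng.normal(40, 10, n)
current_a = 5 + prior_a + rng.normal(0, 5, n)    # smaller school contribution
current_b = 12 + prior_b + rng.normal(0, 5, n)   # larger school contribution

prior = np.concatenate([prior_a, prior_b])
current = np.concatenate([current_a, current_b])
school = np.array(["A"] * n + ["B"] * n)

# Status: average current-year score by school
for s in ("A", "B"):
    print(s, "status (mean score):", round(current[school == s].mean(), 1))

# Growth: regress current on prior scores, then average residuals by school
X = np.column_stack([np.ones_like(prior), prior])
coef, *_ = np.linalg.lstsq(X, current, rcond=None)
residuals = current - X @ coef
for s in ("A", "B"):
    print(s, "growth (mean residual):", round(residuals[school == s].mean(), 1))
```

In this toy example, the school serving higher-scoring students posts the higher average score, while the other school posts the higher growth estimate, which is precisely the distinction that proficiency-based measures obscure.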

Yet this evolution, slow though it may be, presents a distinct challenge for the inclusion of subgroup-based measures in formal school accountability systems. That is, if we stipulate that growth model estimates are the best available test-based way to measure school (rather than student) performance, how should accountability systems apply these models to traditionally lower-scoring student subgroups?

A Few Reactions To The Final Teacher Preparation Accountability Regulations

The U.S. Department of Education (USED) has just released the long-anticipated final regulations for teacher preparation (TP) program accountability. These regulations will guide states, which are required to design their own systems for assessing TP program performance for full implementation in 2018-19. The earliest year in which stakes (namely, eligibility for federal grants) will be attached to the ratings is 2021-22.

Among the provisions receiving attention is the softening of the requirement regarding the use of test-based productivity measures, such as value-added and other growth models (see Goldhaber et al. 2013; Mihaly et al. 2013; Koedel et al. 2015). Specifically, the final regulations allow greater “flexibility” in how and how much these indicators must count toward final ratings. For the reasons that Cory Koedel and I laid out in this piece (and I will not reiterate here), this is a wise decision. Although it is possible that value-added estimates will eventually play a significant role in these TP program accountability systems, the USED timeline provides insufficient time for the requisite empirical groundwork.

Yet this does not resolve the issues facing those who must design these systems, since putting partial brakes on value-added for TP programs also puts increased focus on the other measures which might be used to gauge program performance. And, as is often the case with formal accountability systems, the non-test-based bench is not particularly deep.

The Details Matter In Teacher Evaluations

Throughout the process of reforming teacher evaluation systems over the past 5-10 years, perhaps the most contentious and widely discussed issue was the importance, or weights, assigned to different components. Specifically, there was a great deal of debate about the proper weight to assign to test-based teacher productivity measures, such as estimates from value-added and other growth models.

Some commentators, particularly those more enthusiastic about test-based accountability, argued that the new teacher evaluations somehow were not meaningful unless value-added or growth model estimates constituted a substantial proportion of teachers’ final evaluation ratings. Skeptics of test-based accountability, on the other hand, tended toward a rather different viewpoint – that test-based teacher performance measures should play little or no role in the new evaluation systems. Moreover, virtually all of the discussion of these systems’ results, once they were finally implemented, focused on the distribution of final ratings, particularly the proportions of teachers rated “ineffective.”

A recent working paper by Matthew Steinberg and Matthew Kraft directly addresses and informs this debate. Their very straightforward analysis shows just how consequential these weighting decisions, as well as choices of where to set the cutpoints for final rating categories (e.g., how many points a teacher needs in order to receive an “effective” versus an “ineffective” rating), are for the distribution of final ratings.
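
To illustrate the basic mechanics (this is not Steinberg and Kraft's analysis, just a simulation with made-up numbers), the sketch below shows how both the weight placed on the test-based component and the location of the cutpoint can swing the share of teachers rated “ineffective.”

```python
# Hypothetical illustration: how component weights and cutpoints shape the
# distribution of final ratings. All numbers are made up.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Two components on a 0-100 scale: observation ratings tend to be compressed
# near the top of the scale, while value-added-style scores are more spread out.
observation = np.clip(rng.normal(80, 8, n), 0, 100)
value_added = np.clip(rng.normal(50, 20, n), 0, 100)

def share_ineffective(va_weight, cutpoint):
    """Composite = weighted average; 'ineffective' = composite below cutpoint."""
    composite = va_weight * value_added + (1 - va_weight) * observation
    return (composite < cutpoint).mean()

for va_weight in (0.2, 0.5):
    for cutpoint in (50, 60):
        print(f"VA weight {va_weight:.0%}, cutpoint {cutpoint}: "
              f"{share_ineffective(va_weight, cutpoint):.1%} rated ineffective")
```

With these assumed distributions, moving the value-added weight from 20 to 50 percent, or nudging the cutpoint by ten points, shifts the share of teachers falling below the line from a fraction of a percent to roughly a third.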

Research On Teacher Evaluation Metrics: The Weaponization Of Correlations

Our guest author today is Cara Jackson, Assistant Director of Research and Evaluation at the Urban Teacher Center.

In recent years, many districts have implemented multiple-measure teacher evaluation systems, partly in response to federal pressure from No Child Left Behind waivers and incentives from the Race to the Top grant program. These systems have not been without controversy, largely owing to the perception – not entirely unfounded – that such systems might be used to penalize teachers. One ongoing controversy in the field of teacher evaluation is whether these measures are sufficiently reliable and valid to be used for high-stakes decisions, such as dismissal or tenure. That is a topic that deserves considerably more attention than a single post; here, I discuss just one of the issues that arises when investigating validity.

The diagram below is a visualization of a multiple-measure evaluation system, one that combines information on teaching practice (e.g., ratings from a classroom observation rubric) with student achievement-based measures (e.g., value-added or student growth percentiles) and student surveys. The system need not be limited to three components; the point is simply that classroom observations are not the sole means of evaluating teachers.

In validating the various components of an evaluation system, researchers often examine how each component correlates with the others. To the extent that each component is an attempt to capture something about the teacher’s underlying effectiveness, it’s reasonable to expect that different measurements taken of the same teacher will be positively related. For example, we might examine whether ratings from a classroom observation rubric are positively correlated with value-added.
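
As a simple illustration (simulated data, not results from any actual evaluation system), the sketch below generates two noisy measures of the same underlying effectiveness and computes the correlation between them.

```python
# Hypothetical illustration: two components that each partly reflect the same
# underlying "effectiveness," plus their own measurement error.
import numpy as np

rng = np.random.default_rng(2)
n = 2_000

true_effectiveness = rng.normal(0, 1, n)
observation = true_effectiveness + rng.normal(0, 1.0, n)   # assumed noise level
value_added = true_effectiveness + rng.normal(0, 1.5, n)   # assumed noise level

r = np.corrcoef(observation, value_added)[0, 1]
print(f"Correlation between observation ratings and value-added: {r:.2f}")
```

With these assumed noise levels, the correlation comes out positive but far from perfect (around 0.4 here): the components appear related, yet each clearly reflects something the other does not, whether signal or error.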

Empower Teachers To Lead, Encourage Students To Be Curious

Our guest author today is Ashim Shanker, a former English Language Arts teacher in public schools in Tokyo, Japan. Ashim has a Master’s Degree in International Education Policy from Harvard University and is the author of three books, including Don’t Forget to Breathe. Follow him on Twitter at @ashimshanker.

In the 11 years that I was a public school teacher in Japan, I came to view education as a holistic enterprise. Schools in Japan not only imbued students with relevant skills, but also nurtured within them the wherewithal to experience a sense of connection with the larger world, and the exploratory capacity to discover their place within it.

In my language arts classes, I encouraged students to read about current events and human rights issues around the world. I asked them to make lists of the electronics they used, the garments they wore, and the food products they consumed on a daily basis. I then had them research where these products were made and under what labor conditions.

The students gave presentations on child laborers and modern-day slavery. They debated government secrecy laws in Japan and cover-ups in the aftermath of the Fukushima nuclear disaster. They read an essay on self-reliance by Emerson and excerpts on civil disobedience by Thoreau, and I asked them how these two activists might have felt about the actions of groups like Anonymous, or about whistleblowers like Edward Snowden. We discussed the Milgram Experiment and the Stanford Prison Experiment, exploring how obedience and situational role conformity might tip even those with the best of intentions toward acts of cruelty. We talked about bullying, and shared anecdotes of instances in which we might unintentionally have hurt others. There were opportunities for self-reflection, engagement, and character building—attributes that I would like to think foster the empathic foundations for better civic engagement and global citizenship.

Do We Know How To Hold Teacher Preparation Programs Accountable?

This piece is co-authored by Cory Koedel and Matthew Di Carlo. Koedel is an Associate Professor of Economics and Public Policy at the University of Missouri, Columbia.

The United States Department of Education (USED) has proposed regulations requiring states to hold teacher preparation programs accountable for the performance of their graduates. According to the proposal, states must begin assigning ratings to each program within the next 2-3 years, based on outcomes such as graduates’ “value-added” to student test scores, their classroom observation scores, how long they stay in teaching, whether they teach in high-needs schools, and surveys of their principals’ satisfaction.

In the long term, we are very receptive to, and indeed optimistic about, the idea of outcomes-based accountability for teacher preparation programs (TPPs). In the short to medium term, however, we contend that the evidence base underlying the USED regulations is nowhere near sufficient to guide a national effort toward high-stakes TPP accountability.

This is a situation in which the familiar refrain of “it’s imperfect but better than nothing” is false, and rushing into nationwide design and implementation could be quite harmful.

Will Value-Added Reinforce The Walls Of The Egg-Crate School?

Our guest author today is Susan Moore Johnson, Jerome T. Murphy Research Professor in Education at Harvard Graduate School of Education. Johnson directs the Project on the Next Generation of Teachers, which examines how best to recruit, develop, and retain a strong teaching force.

Academic scholars are often dismayed when policymakers pass laws that disregard or misinterpret their research findings. The use of value-added methods (VAMS) in education policy is a case in point.

About a decade ago, researchers reported that teachers are the most important school-level factor in students’ learning, and that their effectiveness varies widely within schools (McCaffrey, Koretz, Lockwood, & Hamilton 2004; Rivkin, Hanushek, & Kain 2005; Rockoff 2004). Many policymakers interpreted these findings to mean that teacher quality rests with the individual rather than the school and that, because some teachers are more effective than others, schools should concentrate on increasing their number of effective teachers.

Based on these assumptions, proponents of VAMS began to argue that schools could be improved substantially if they would only dismiss teachers with low VAMS ratings and replace them with teachers who have average or higher ratings (Hanushek 2009). Although panels of scholars warned against using VAMS to make high-stakes decisions because of their statistical limitations (American Statistical Association, 2014; National Research Council & National Academy of Education, 2010), policymakers in many states and districts moved quickly to do just that, requiring that VAMS scores be used as a substantial component in teacher evaluation.

Measurement And Incentives In The USED Teacher Preparation Regulations

Late last year, the U.S. Department of Education (USED) released a set of regulations, the primary purpose of which is to require states to design formal systems of accountability for teacher preparation (TP) programs. Specifically, states are required to evaluate annually the programs operating within their boundaries, and assign performance ratings. Importantly, the regulations specify that programs receiving low ratings should face possible consequences, such as the loss of federal funding.

The USED regulations on TP accountability put forth several outcomes that states are to employ in their ratings, including: student outcomes (e.g., test-based effectiveness of graduates); employment outcomes (e.g., placement/retention); and surveys (e.g., satisfaction among graduates/employers). USED proposes that states have their initial designs completed by the end of this year, and start generating ratings in 2017-18.

As was the case with the previous generation of teacher evaluations, teacher preparation is an area in which there is widespread agreement about the need for improvement. And formal high-stakes accountability systems can (even should) be a part of that at some point. Right now, however, requiring all states to begin assigning performance ratings to these programs, and imposing high-stakes accountability for those ratings within a few years, is premature. The available measures have very serious problems, and the research on them is in its relative infancy. If we cannot reliably distinguish between programs in terms of their effectiveness, it is ill-advised to hold them formally accountable for that effectiveness. The primary rationale for the current focus on teacher quality and evaluations was established over decades of good research. We are nowhere near that point for TP programs. This is one of those circumstances in which the familiar refrain of “it’s imperfect but better than nothing” is false, and potentially dangerous.