Value-Added

  • Do Subgroup Accountability Measures Affect School Ratings Systems?

    Written on October 28, 2016

    The school accountability provisions of No Child Left Behind (NCLB) institutionalized a focus on the (test-based) performance of student subgroups, such as English language learners, racial and ethnic groups, and students eligible for free or reduced-price lunch (FRL). The idea was to shine a spotlight on achievement gaps in the U.S., and to hold schools accountable for serving all students.

    This was a laudable goal, and disaggregating data by student subgroups is a wise policy, as there is much to learn from such comparisons. Unfortunately, however, NCLB also institutionalized the poor measurement of school performance, and so-called subgroup accountability was not immune. The problem, which we’ve discussed here many times, is that test-based accountability systems in the U.S. tend to interpret how highly students score as a measure of school performance, when scores are largely a function of factors outside schools' control, such as student background. In other words, schools (or subgroups of their students) may exhibit higher average scores or proficiency rates simply because their students entered the schools at higher levels, regardless of how effective the school may be in raising scores. Although NCLB’s successor, the Every Student Succeeds Act (ESSA), perpetuates many of these misinterpretations, it still represents some limited progress, as it encourages greater reliance on growth-based measures, which look at how quickly students progress while they attend a school, rather than how highly they score in any given year (see here for more on this).

    Yet this evolution, slow though it may be, presents a distinct challenge for the inclusion of subgroup-based measures in formal school accountability systems. That is, if we stipulate that growth model estimates are the best available test-based way to measure school (rather than student) performance, how should accountability systems apply these models to traditionally lower scoring student subgroups?
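
    To make the status-versus-growth distinction above concrete, here is a minimal sketch using entirely made-up numbers (not drawn from any real school data). A "status" measure ranks schools by how highly their students score; a "growth" measure ranks them by how much their students improve while enrolled; and the two can order the very same schools differently.

    ```python
    # Minimal illustration with hypothetical fall and spring scale scores for two schools.
    school_a = {"fall": [210, 215, 220, 225], "spring": [218, 224, 230, 236]}  # lower scoring, faster growth
    school_b = {"fall": [260, 265, 270, 275], "spring": [262, 268, 272, 278]}  # higher scoring, slower growth

    def status(school):
        """Status measure: average spring score, regardless of where students started."""
        return sum(school["spring"]) / len(school["spring"])

    def growth(school):
        """Crude growth measure: average fall-to-spring gain per student."""
        gains = [s - f for f, s in zip(school["fall"], school["spring"])]
        return sum(gains) / len(gains)

    for name, school in [("School A", school_a), ("School B", school_b)]:
        print(f"{name}: status = {status(school):.1f}, growth = {growth(school):.1f}")

    # School B "wins" on status (higher average scores), but School A "wins" on
    # growth: its students progress faster even though they started lower.
    ```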

    READ MORE
  • A Few Reactions To The Final Teacher Preparation Accountability Regulations

    Written on October 19, 2016

    The U.S. Department of Education (USED) has just released the long-anticipated final regulations for teacher preparation (TP) program accountability. These regulations will guide states, which are required to design their own systems for assessing TP program performance, with full implementation slated for 2018-19. The earliest year in which stakes (namely, eligibility for federal grants) will be attached to the ratings is 2021-22.

    Among the provisions receiving attention is the softening of the requirement regarding the use of test-based productivity measures, such as value-added and other growth models (see Goldhaber et al. 2013; Mihaly et al. 2013; Koedel et al. 2015). Specifically, the final regulations allow greater “flexibility” in how and how much these indicators must count toward final ratings. For the reasons that Cory Koedel and I laid out in this piece (which I will not reiterate here), this is a wise decision. Although it is possible that value-added estimates will eventually play a significant role in these TP program accountability systems, the USED timeline provides insufficient time for the requisite empirical groundwork.

    Yet this does not resolve the issues facing those who must design these systems, since putting partial brakes on value-added for TP programs also puts increased focus on the other measures which might be used to gauge program performance. And, as is often the case with formal accountability systems, the non-test-based bench is not particularly deep.

    READ MORE
  • The Details Matter In Teacher Evaluations

    Written on September 22, 2016

    Throughout the process of reforming teacher evaluation systems over the past 5-10 years, perhaps the most contentious and widely discussed issue was the importance, or weights, assigned to different components. Specifically, there was a great deal of debate about the proper weight to assign to test-based teacher productivity measures, such as estimates from value-added and other growth models.

    Some commentators, particularly those more enthusiastic about test-based accountability, argued that the new teacher evaluations somehow were not meaningful unless value-added or growth model estimates constituted a substantial proportion of teachers’ final evaluation ratings. Skeptics of test-based accountability, on the other hand, tended toward a rather different viewpoint – that test-based teacher performance measures should play little or no role in the new evaluation systems. Moreover, virtually all of the discussion of these systems’ results, once they were finally implemented, focused on the distribution of final ratings, particularly the proportions of teachers rated “ineffective.”

    A recent working paper by Matthew Steinberg and Matthew Kraft directly addresses and informs this debate. Their very straightforward analysis shows just how consequential these weighting decisions, as well as choices of where to set the cutpoints for final rating categories (e.g., how many points a teacher needs to receive an “effective” versus an “ineffective” rating), are for the distribution of final ratings.
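
    As a rough illustration of why these choices matter (a hypothetical simulation, not Steinberg and Kraft's data or method), the sketch below assigns two made-up component scores to simulated teachers and shows how the distribution across rating categories shifts when only the weights or the cutpoints change.

    ```python
    import random

    random.seed(0)

    # Hypothetical component scores (0-100 scale) for 1,000 simulated teachers:
    # a classroom observation score and a test-based growth score, loosely related.
    teachers = []
    for _ in range(1000):
        obs = random.gauss(75, 10)
        vam = 0.5 * obs + random.gauss(35, 12)
        teachers.append((obs, vam))

    def rating_distribution(obs_weight, cutpoints):
        """Combine components with the given weights, then bin composites using cutpoints."""
        vam_weight = 1 - obs_weight
        labels = ["ineffective", "developing", "effective", "highly effective"]
        counts = {label: 0 for label in labels}
        for obs, vam in teachers:
            composite = obs_weight * obs + vam_weight * vam
            for label, cut in zip(labels, cutpoints + [float("inf")]):
                if composite < cut:
                    counts[label] += 1
                    break
        return counts

    # Same simulated teachers, same component scores; different weights and
    # cutpoints produce visibly different rating distributions.
    print(rating_distribution(obs_weight=0.8, cutpoints=[60, 70, 85]))
    print(rating_distribution(obs_weight=0.5, cutpoints=[60, 70, 85]))
    print(rating_distribution(obs_weight=0.8, cutpoints=[55, 65, 80]))
    ```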

    READ MORE
  • Research On Teacher Evaluation Metrics: The Weaponization Of Correlations

    Written on July 21, 2015

    Our guest author today is Cara Jackson, Assistant Director of Research and Evaluation at the Urban Teacher Center.

    In recent years, many districts have implemented multiple-measure teacher evaluation systems, partly in response to federal pressure from No Child Left Behind waivers and incentives from the Race to the Top grant program. These systems have not been without controversy, largely owing to the perception – not entirely unfounded – that such systems might be used to penalize teachers. One ongoing controversy in the field of teacher evaluation is whether these measures are sufficiently reliable and valid to be used for high-stakes decisions, such as dismissal or tenure. That is a topic that deserves considerably more attention than a single post; here, I discuss just one of the issues that arises when investigating validity.

    The diagram below is a visualization of a multiple-measure evaluation system, one that combines information on teaching practice (e.g., ratings from a classroom observation rubric) with student achievement-based measures (e.g., value-added or student growth percentiles) and student surveys. The system need not be limited to three components; the point is simply that classroom observations are not the sole means of evaluating teachers.

    In validating the various components of an evaluation system, researchers often examine their correlation with other components. To the extent that each component is an attempt to capture something about the teacher’s underlying effectiveness, it’s reasonable to expect that different measurements taken of the same teacher will be positively related. For example, we might examine whether ratings from a classroom observation rubric are positively correlated with value-added.
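
    As a minimal sketch of this kind of check, using hypothetical numbers (statistics.correlation requires Python 3.10 or later):

    ```python
    import statistics

    # Hypothetical ratings for eight teachers.
    observation = [2.8, 3.1, 3.5, 2.4, 3.9, 3.0, 2.7, 3.6]    # rubric ratings, 1-4 scale
    value_added = [-0.2, 0.1, 0.3, -0.4, 0.5, 0.0, -0.1, 0.2]  # VAM estimates, test SD units

    r = statistics.correlation(observation, value_added)  # Pearson r
    print(f"Correlation between observation ratings and value-added: {r:.2f}")

    # A positive correlation is consistent with both measures capturing some shared
    # dimension of effectiveness; by itself, it does not establish that either one
    # is valid for high-stakes decisions.
    ```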

    READ MORE
  • Empower Teachers To Lead, Encourage Students To Be Curious

    Written on July 9, 2015

    Our guest author today is Ashim Shanker, a former English Language Arts teacher in public schools in Tokyo, Japan. Ashim has a Master’s Degree in International Education Policy from Harvard University and is the author of three books, including Don’t Forget to Breathe. Follow him on Twitter at @ashimshanker.

    In the 11 years that I was a public school teacher in Japan, I came to view education as a holistic enterprise. Schools in Japan not only imbued students with relevant skills, but also nurtured within them the wherewithal to experience a sense of connection with the larger world, and the exploratory capacity to discover their place within it.

    In my language arts classes, I encouraged students to read about current events and human rights issues around the world. I asked them to make lists of the electronics they used, the garments they wore, and the food products they consumed on a daily basis. I then had them research where these products were made and under what labor conditions.

    The students gave presentations on child laborers and about modern-day slavery. They debated about government secrecy laws in Japan and cover-ups in the aftermath of the Fukushima nuclear disaster. They read an essay on self-reliance by Emerson and excerpts on civil disobedience by Thoreau, and I asked them how these two activists might have felt about the actions of groups like Anonymous, or about whistleblowers like Edward Snowden. We discussed the Milgram Experiment and the Stanford Prison experiment, exploring how obedience and situational role conformity might tip even those with the best of intentions toward acts of cruelty. We talked about bullying, and shared anecdotes of instances in which we might unintentionally have hurt others. There were opportunities for self-reflection, engagement, and character building—attributes that I would like to think foster the empathic foundations for better civic engagement and global citizenship.

    READ MORE
  • Do We Know How To Hold Teacher Preparation Programs Accountable?

    Written on June 30, 2015

    This piece is co-authored by Cory Koedel and Matthew Di Carlo. Koedel is an Associate Professor of Economics and Public Policy at the University of Missouri, Columbia.

    The United States Department of Education (USED) has proposed regulations requiring states to hold teacher preparation programs accountable for the performance of their graduates. According to the proposal, states must begin assigning ratings to each program within the next 2-3 years, based on outcomes such as graduates’ “value-added” to student test scores, their classroom observation scores, how long they stay in teaching, whether they teach in high-needs schools, and surveys of their principals’ satisfaction.

    In the long term, we are very receptive to, and indeed optimistic about, the idea of outcomes-based accountability for teacher preparation programs (TPPs). In the short to medium term, however, we contend that the evidence base underlying the USED regulations is nowhere near sufficient to guide a national effort toward high-stakes TPP accountability.

    This is a situation in which the familiar refrain of “it’s imperfect but better than nothing” is false, and rushing into nationwide design and implementation could be quite harmful.

    READ MORE
  • Will Value-Added Reinforce The Walls Of The Egg-Crate School?

    Written on June 25, 2015

    Our guest author today is Susan Moore Johnson, Jerome T. Murphy Research Professor in Education at Harvard Graduate School of Education. Johnson directs the Project on the Next Generation of Teachers, which examines how best to recruit, develop, and retain a strong teaching force.

    Academic scholars are often dismayed when policymakers pass laws that disregard or misinterpret their research findings. The use of value-added methods (VAMS) in education policy is a case in point.

    About a decade ago, researchers reported that teachers are the most important school-level factor in students’ learning, and that their effectiveness varies widely within schools (McCaffrey, Koretz, Lockwood, & Hamilton 2004; Rivkin, Hanushek, & Kain 2005; Rockoff 2004). Many policymakers interpreted these findings to mean that teacher quality rests with the individual rather than the school and that, because some teachers are more effective than others, schools should concentrate on increasing their number of effective teachers.

    Based on these assumptions, proponents of VAMS began to argue that schools could be improved substantially if they would only dismiss teachers with low VAMS ratings and replace them with teachers who have average or higher ratings (Hanushek 2009). Although panels of scholars warned against using VAMS to make high-stakes decisions because of their statistical limitations (American Statistical Association, 2014; National Research Council & National Academy of Education, 2010), policymakers in many states and districts moved quickly to do just that, requiring that VAMS scores be used as a substantial component in teacher evaluation.

    READ MORE
  • Measurement And Incentives In The USED Teacher Preparation Regulations

    Written on April 22, 2015

    Late last year, the U.S. Department of Education (USED) released a set of regulations, the primary purpose of which is to require states to design formal systems of accountability for teacher preparation (TP) programs. Specifically, states are required to evaluate annually the programs operating within their boundaries, and assign performance ratings. Importantly, the regulations specify that programs receiving low ratings should face possible consequences, such as the loss of federal funding.

    The USED regulations on TP accountability put forth several outcomes that states are to employ in their ratings, including: student outcomes (e.g., test-based effectiveness of graduates); employment outcomes (e.g., placement/retention); and surveys (e.g., satisfaction among graduates/employers). USED proposes that states have their initial designs completed by the end of this year, and start generating ratings in 2017-18.

    As was the case with the previous generation of teacher evaluations, teacher preparation is an area in which there is widespread agreement about the need for improvement. And formal high stakes accountability systems can (even should) be a part of that at some point. Right now, however, requiring all states to begin assigning performance ratings to programs, and imposing high stakes accountability for those ratings within a few years, is premature. The available measures have very serious problems, and the research on them is in its relative infancy. If we cannot reliably distinguish between programs in terms of their effectiveness, it is ill-advised to hold them formally accountable for that effectiveness. The primary rationale for the current focus on teacher quality and evaluations was established over decades of good research. We are nowhere near that point for TP programs. This is one of those circumstances in which the familiar refrain of “it’s imperfect but better than nothing” is false, and potentially dangerous.

    READ MORE
  • The Persistent Misidentification Of "Low Performing Schools"

    Written on February 3, 2015

    In education, we hear the terms “failing school” and “low-performing school” quite frequently. Usually, they are used in soundbite-style catchphrases such as, “We can’t keep students trapped in ‘failing schools.’” Sometimes, however, they are used to refer to a specific group of schools in a given state or district that are identified as “failing” or “low-performing” as part of a state or federal law or program (e.g., waivers, SIG). There is, of course, interstate variation in these policies, but one common definition is that schools are “failing/low-performing” if their proficiency rates are in the bottom five percent statewide.

    Putting aside the (important) issues with judging schools based solely on standardized testing results, low proficiency rates (or low average scores) tell you virtually nothing about whether or not a school is “failing.” As we’ve discussed here many times, students enter their schools performing at different levels, and schools cannot control the students they serve, only how much progress those students make while they’re in attendance (see here for more).

    From this perspective, then, there may be many schools that are labeled “failing” or “low performing” but are actually of above average effectiveness in raising test scores. And, making things worse, virtually all of these will be schools that serve the most disadvantaged students. If that’s true, it’s difficult to think of anything more ill-advised than closing these schools, or even labeling them as “low performing.” Let’s take a quick, illustrative look at this possibility using the “bottom five percent” criterion, and data from Colorado in 2013-14 (note that this simple analysis is similar to what I did in this post, but this one is a little more specific; also see Glazerman and Potamites 2011; Ladd and Lauen 2010; and especially Chingos and West 2015).
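
    Before turning to the actual figures, here is a rough sketch of the mechanism using simulated (not Colorado) data: when proficiency rates depend heavily on student poverty, a "bottom five percent of proficiency" rule flags mostly high-poverty schools, some of which have above-average growth.

    ```python
    import random

    random.seed(1)

    # Simulate 200 hypothetical schools. Proficiency depends heavily on poverty
    # (outside schools' control); growth (the school's effect on progress) does not.
    schools = []
    for i in range(200):
        poverty = random.uniform(0, 1)              # share of disadvantaged students
        growth = random.gauss(0, 1)                 # school effect on progress, SD units
        proficiency = 80 - 50 * poverty + 5 * growth + random.gauss(0, 5)
        schools.append({"id": i, "poverty": poverty, "growth": growth, "proficiency": proficiency})

    # "Low performing" label: bottom five percent of proficiency rates.
    schools.sort(key=lambda s: s["proficiency"])
    flagged = schools[: len(schools) // 20]

    above_avg = [s for s in flagged if s["growth"] > 0]
    print(f"Flagged schools: {len(flagged)}")
    print(f"Flagged schools with above-average growth: {len(above_avg)}")
    print(f"Average poverty share among flagged schools: "
          f"{sum(s['poverty'] for s in flagged) / len(flagged):.2f}")
    ```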

    READ MORE
  • Multiple Measures And Singular Conclusions In A Twin City

    Written on November 12, 2014

    A few weeks ago, the Minneapolis Star Tribune published teacher evaluation results for the district’s public school teachers in 2013-14. This decision generated a fair amount of controversy, but it’s worth noting that the Tribune, unlike the Los Angeles Times and New York City newspapers a few years ago, did not publish scores for individual teachers, only totals by school.

    The data once again provide an opportunity to take a look at how results vary by student characteristics. This was indeed the focus of the Tribune’s story, which included the following headline: “Minneapolis’ worst teachers are in the poorest schools, data show.” These types of conclusions, which simply take the results of new evaluations at face value, have characterized the discussion since the first new systems came online. Though understandable, they are also frustrating and a potential impediment to the policy process. At this early point, “the city’s teachers with the lowest evaluation ratings” is not the same thing as “the city’s worst teachers.” Actually, as discussed in a previous post, the systematic variation in evaluation results by student characteristics, which the Tribune uses to draw conclusions about the distribution of the city’s “worst teachers,” could just as easily be viewed as one of the many ways that one might assess the properties and even the validity of those results.

    So, while there are no clear-cut "right" or "wrong" answers here, let’s take a quick look at the data and what they might tell us.
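
    As a purely illustrative sketch (hypothetical school-level numbers, not the Minneapolis data), the snippet below shows the kind of comparison at issue: a gap in average ratings between higher- and lower-poverty schools, which can be read either as a statement about where the "worst teachers" work or as a question about what the measures are picking up.

    ```python
    # Hypothetical school-level data: (share of students in poverty, average evaluation rating, 1-4 scale).
    schools = [
        (0.85, 2.6), (0.78, 2.8), (0.90, 2.5), (0.70, 2.9), (0.82, 2.7),
        (0.25, 3.2), (0.30, 3.1), (0.15, 3.4), (0.40, 3.0), (0.20, 3.3),
    ]

    high_poverty = [rating for poverty, rating in schools if poverty >= 0.5]
    low_poverty = [rating for poverty, rating in schools if poverty < 0.5]

    print(f"Average rating, higher-poverty schools: {sum(high_poverty) / len(high_poverty):.2f}")
    print(f"Average rating, lower-poverty schools:  {sum(low_poverty) / len(low_poverty):.2f}")

    # A gap like this admits at least two readings: lower-rated teachers are concentrated
    # in poorer schools, or the measures themselves partly reflect student characteristics.
    # The gap alone cannot distinguish between them.
    ```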

    READ MORE

