Accountability

  • The Busy Intersection Of Test-Based Accountability And Public Perception

    Written on June 28, 2012

    Last year, the New York City Department of Education (NYCDOE) rolled out its annual testing results for the city’s students in a rather misleading manner. The press release touted the “significant progress” between 2010 and 2011 among city students, while, at a press conference, Mayor Michael Bloomberg called the results “dramatic.” In reality, however, the increase in proficiency rates (1-3 percentage points) was very modest, and, more importantly, the focus on the rates hid the fact that actual scale scores were either flat or decreased in most grades. In contrast, one year earlier, when the city’s proficiency rates dropped due to the state raising the cut scores, Mayor Bloomberg told reporters (correctly) that it was the actual scores that “really matter.”

    Most recently, in announcing their 2011 graduation rates, the city did it again. The headline of the NYCDOE press release proclaims that “a record number of students graduated from high school in 2011.” This may be technically true, but the actual increase in the rate (rather than the number of graduates) was 0.4 percentage points, which is basically flat (as several reporters correctly noted). In addition, the city’s “college readiness rate” was similarly stagnant, falling slightly from 21.4 percent to 20.7 percent, while the graduation rate increase was higher both statewide and in New York State’s four other large districts (the city makes these comparisons when they are favorable).*

    We've all become accustomed to this selective, exaggerated presentation of testing data, which is of course not at all limited to NYC. And it illustrates the obvious fact that test-based accountability plays out in multiple arenas, formal and informal, including the court of public opinion.

  • Colorado's Questionable Use Of The Colorado Growth Model

    Written on June 25, 2012

    I have been writing critically about states’ school rating systems (e.g., Ohio, Florida, Louisiana), and I thought I would find one that is, at least in my (admittedly value-laden) opinion, more defensibly designed. It didn't quite turn out as I had hoped.

    One big starting point in my assessment is how heavily the systems weight absolute performance (how highly students score) versus growth (how quickly students improve). As I’ve argued many times, the former (absolute level) is a poor measure of school performance in a high-stakes accountability system. It does not address the fact that some schools, particularly those in more affluent areas, serve students who, on average, enter the system at a higher-performing level. This amounts to holding schools accountable for outcomes they largely cannot control (see Doug Harris' excellent book for more on this in the teacher context). Thus, to whatever degree testing results can be used to judge actual school effectiveness, growth measures, while themselves highly imperfect, are to be preferred in a high-stakes context.

    There are a few states that assign more weight to growth than absolute performance (see this prior post on New York City’s system). One of them is Colorado's system, which uses the well-known “Colorado Growth Model” (CGM).*

    In my view, putting aside the inferential issues with the CGM (see the first footnote), the focus on growth in Colorado's system is in theory a good idea. But, looking at the data and documentation reveals a somewhat unsettling fact: There is a double standard of sorts, by which two schools with the same growth score can receive different ratings, and it's mostly their absolute performance levels determining whether this is the case.

  • Louisiana's "School Performance Score" Doesn't Measure School Performance

    Written on June 18, 2012

    Louisiana’s "School Performance Score" (SPS) is the state’s primary accountability measure, and it determines whether schools are subject to high-stakes decisions, most notably state takeover. For elementary and middle schools, 90 percent of the SPS is based on testing outcomes. For secondary schools, it is 70 percent (and 30 percent graduation rates).*

    The SPS is largely calculated using absolute performance measures – specifically, the proportion of students falling into the state’s cutpoint-based categories (e.g., advanced, mastery, basic). This means that it is mostly measuring student performance, rather than school performance. That is, insofar as the SPS only tells you how high students score on the test, rather than how much they have improved, schools serving more advantaged populations will tend to do better (since their students tend to perform well when they enter the school) while those in impoverished neighborhoods will tend to do worse (even those whose students have made the largest testing gains).

    One rough way to assess this bias is to check the association between SPS and student characteristics, such as poverty. So let’s take a quick look.

  • We Should Only Hold Schools Accountable For Outcomes They Can Control

    Written on May 29, 2012

    Let’s say we were trying to evaluate a teacher’s performance for this academic year, and part of that evaluation would use students’ test scores (if you object to using test scores this way, put that aside for a moment). We checked the data and reached two conclusions. First, we found that her students made fantastic progress this year. Second, we also saw that the students’ scores were still quite a bit lower than their peers’ in the district. Which measure should we use to evaluate this teacher?

    Would we consider judging her even partially based on the latter – students’ average scores? Of course not. Those students made huge progress, and the only reason their absolute performance levels are relatively low is that they were low at the beginning of the year. This teacher could not control the fact that she was assigned lower-scoring students. All she can do is make sure that they improve. That’s why no teacher evaluation system places any importance on students’ absolute performance, instead focusing on growth (and, of course, non-test measures). In fact, growth models control for absolute performance (prior year’s test scores) so it doesn't bias the results.

    If we would never judge teachers based on absolute performance, why are we judging schools that way? Why does virtually every school/district rating system place some emphasis – often the primary emphasis – on absolute performance?

  • Herding FCATs

    Written on May 22, 2012

    About a week ago, Florida officials went into crisis mode after revealing that the proficiency rate on the state’s writing test (FCAT) dropped from 81 percent to 27 percent among fourth graders, with similarly large drops in the other two grades in which the test is administered (eighth and tenth). The panic was almost immediate. For one thing, performance on the writing FCAT is counted in the state’s school and district ratings. Many schools would end up with lower grades and could therefore face punitive measures.

    Understandably, a huge uproar was also heard from parents and community members. How could student performance decrease so dramatically? There was so much blame going around that it was difficult to keep track – the targets included the test itself, the phase-in of the state’s new writing standards, and test-based accountability in general.

    Despite all this heated back-and-forth, many people seem to have overlooked one very important, widely-applicable lesson here: That proficiency rates, which are not "scores," are often extremely sensitive to where you set the bar.
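
    To make the point about bar-setting concrete, here is a minimal sketch in Python. The scores and the two cut scores are hypothetical numbers chosen for illustration, not actual FCAT data, and `proficiency_rate` is a helper name introduced here. It shows the same set of scores yielding very different proficiency rates depending only on where the cutoff is placed.

```python
def proficiency_rate(scores, cut_score):
    """Share of students scoring at or above the cut score."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s >= cut_score) / len(scores)


# Hypothetical writing scores for ten students (illustrative only).
scores = [2.0, 2.5, 3.0, 3.0, 3.5, 3.5, 3.5, 4.0, 4.5, 5.0]

# Same students, same scores; only the bar moves (both cut scores are assumptions).
rate_low_bar = proficiency_rate(scores, 3.0)
rate_high_bar = proficiency_rate(scores, 4.0)
mean_score = sum(scores) / len(scores)

print(f"Mean score (unchanged): {mean_score:.2f}")
print(f"Proficient at cut 3.0: {rate_low_bar:.0%}")
print(f"Proficient at cut 4.0: {rate_high_bar:.0%}")
```

    In this made-up example the average score does not change at all, yet the proficiency rate falls from 80 percent to 30 percent because many students are clustered between the two cut scores.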

  • Growth And Consequences In New York City's School Rating System

    Written on May 14, 2012

    In a New York Times article a couple of weeks ago, reporter Michael Winerip discusses New York City’s school report card grades, with a focus on an issue that I have raised many times – the role of absolute performance measures (i.e., how highly students score) in these systems, versus that of growth measures (i.e., whether students are making progress).

    Winerip uses the example of two schools – P.S. 30 and P.S. 179 – one of which (P.S. 30) received an A on this year’s report card, while the other (P.S. 179) received an F. These two schools have somewhat similar student populations, at least so far as can be determined using standard education variables, and their students are very roughly comparable in terms of absolute performance (e.g., proficiency rates). The basic reason why one received an A and the other an F is that P.S. 179 received a very low growth score, and growth is heavily weighted in the NYC grade system (representing 60 out of 100 points for elementary and middle schools).

    I have argued previously that unadjusted absolute performance measures such as proficiency rates are inappropriate for test-based assessments of schools' effectiveness, given that they tell you almost nothing about the quality of instruction schools provide, and that growth measures are the better option, albeit one that also has its own issues (e.g., they are more unstable), and must be used responsibly. In this sense, the weighting of the NYC grading system is much more defensible than most of its counterparts across the nation, at least in my view.

    But the system is also an example of how details matter – each school’s growth portion is calculated using an unconventional, somewhat questionable approach, one that is, as yet, difficult to treat with a whole lot of confidence.

  • The Weighting Game

    Written on May 9, 2012

    A while back, I noted that states and districts should exercise caution in assigning weights (importance) to the components of their teacher evaluation systems before they know what the other components will be. For example, most states that have mandated new evaluation systems have specified that growth model estimates count for a certain proportion (usually 40-50 percent) of teachers’ final scores (at least those in tested grades/subjects), but it’s critical to note that the actual importance of these components will depend in no small part on what else is included in the total evaluation, and how it's incorporated into the system.

    In slightly technical terms, this distinction is between nominal weights (the percentage assigned) and effective weights (the percentage that actually ends up being the case). Consider an extreme hypothetical example – let’s say a district implements an evaluation system in which half the final score is value-added and half is observations. But let’s also say that every teacher gets the same observation score. In this case, even though the assigned (nominal) weight for value-added is 50 percent, the actual importance (effective weight) will be 100 percent, since every teacher receives the same observation score, and so all the variation between teachers’ final scores will be determined by the value-added component.
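
    The hypothetical above can be sketched in a few lines of Python. This is only one rough way to quantify an effective weight, namely each component's share of the variance in final scores (its nominal weight times its covariance with the final score, divided by the variance of the final score). That definition, the function names, and the teacher scores are all assumptions made for illustration.

```python
def covariance(xs, ys):
    """Population covariance of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def effective_weights(components, nominal):
    """Share of final-score variance attributable to each component.

    components: dict mapping component name -> list of scores per teacher
    nominal:    dict mapping component name -> assigned (nominal) weight
    """
    names = list(components)
    n = len(components[names[0]])
    # Final score is the nominal-weighted sum of the components.
    final = [sum(nominal[k] * components[k][i] for k in names) for i in range(n)]
    var_final = covariance(final, final)
    if var_final == 0:
        raise ValueError("final scores do not vary; weights are undefined")
    return {k: nominal[k] * covariance(components[k], final) / var_final
            for k in names}


# Hypothetical scores for five teachers (illustrative only).
components = {
    "value_added": [10, 20, 30, 40, 50],
    "observation": [3, 3, 3, 3, 3],  # every teacher gets the same score
}
nominal = {"value_added": 0.5, "observation": 0.5}

weights = effective_weights(components, nominal)
for name, w in weights.items():
    print(f"{name}: nominal {nominal[name]:.0%}, effective {w:.0%}")
```

    Because the observation scores never vary, they contribute nothing to differences between teachers, so value-added ends up with an effective weight of 100 percent despite its 50 percent nominal weight.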

    This issue of nominal versus effective weights is very important, and, with exceptions, it gets almost no attention. And it’s not just important in teacher evaluations. It’s also relevant to states’ school/district grading systems. So, I think it would be useful to quickly illustrate this concept in the context of Florida’s new district grading system.

  • There's No One Correct Way To Rate Schools

    Written on April 10, 2012

    Education Week reports on the growth of websites that attempt to provide parents with help in choosing schools, including rating schools according to testing results. The most prominent of these sites is GreatSchools.org. Its test-based school ratings could not be more simplistic – they are essentially just percentile rankings of schools’ proficiency rates as compared to all other schools in their states (the site also provides warnings about the data, along with a bunch of non-testing information).

    This is the kind of indicator that I have criticized when reviewing states’ school/district “grading systems.” And it is indeed a poor measure, albeit one that is widely available and easy to understand. But it’s worth quickly discussing the fact that such criticism is conditional on how the ratings are employed – there is a difference between the use of testing data to rate schools for parents versus for high-stakes accountability purposes.

    In other words, the utility and proper interpretation of data vary by context, and there's no one "correct way" to rate schools. The optimal design might differ depending on the purpose for which the ratings will be used. In fact, the reasons why a measure is problematic in one context might very well be a source of strength in another.

  • If Your Evidence Is Changes In Proficiency Rates, You Probably Don't Have Much Evidence

    Written on March 22, 2012

    Education policymaking and debates are under constant threat from an improbable assailant: Short-term changes in cross-sectional proficiency rates.

    The use of rate changes is still proliferating rapidly at all levels of our education system. These measures, which play an important role in the provisions of No Child Left Behind, are already prominent components of many states’ core accountability systems (e.g., California), while several others will be using some version of them in their new, high-stakes school/district “grading systems.” New York State is awarding millions in competitive grants, with almost half the criteria based on rate changes. District consultants issue reports recommending widespread school closures and reconstitutions based on these measures. And, most recently, U.S. Secretary of Education Arne Duncan used proficiency rate increases as “preliminary evidence” supporting the School Improvement Grants program.

    Meanwhile, on the public discourse front, district officials and other national leaders use rate changes to “prove” that their preferred reforms are working (or are needed), while their critics argue the opposite. Similarly, entire charter school sectors are judged, up or down, by whether their raw, unadjusted rates increase or decrease.

    So, what’s the problem? In short, it’s that year-to-year changes in proficiency rates are not valid evidence of school or policy effects. These measures cannot do the job we’re having them do, even on a limited basis. This really has to stop.

  • Guessing About NAEP Results

    Written on February 15, 2012

    Every two years, the release of data from the National Assessment of Educational Progress (NAEP) generates a wave of research and commentary trying to explain short- and long-term trends. For instance, there have been a bunch of recent attempts to “explain” an increase in aggregate NAEP scores during the late 1990s and 2000s. Some analyses postulate that the accountability provisions of NCLB were responsible, while more recent arguments have focused on the “effect” (or lack thereof) of newer market-based reforms – for example, looking to NAEP data to “prove” or “disprove” the idea that changes in teacher personnel and other policies have (or have not) generated “gains” in student test scores.

    The basic idea here is that, for every increase or decrease in cross-sectional NAEP scores over a given period of time (both for all students and especially for subgroups such as minority and low-income students), there must be “something” in our education system that explains it. In many (but not all) cases, these discussions consist of little more than speculation. Discernible trends in NAEP test score data are almost certainly due to a combination of factors, and it’s unlikely that one policy or set of policies is dominant enough to be identified as “the one.” Now, there’s nothing necessarily wrong with speculation, so long as it is clearly identified as such, and conclusions presented accordingly. But I find it curious that some people involved with these speculative arguments seem a bit too willing to assume that schooling factors – rather than changes in cohorts’ circumstances outside of school – are the primary driver of NAEP trends.

    So, let me try a little bit of illustrative speculation of my own: I might argue that changes in the economic conditions of American schoolchildren and their families are the most compelling explanation for changes in NAEP.


DISCLAIMER

This web site and the information contained herein are provided as a service to those who are interested in the work of the Albert Shanker Institute (ASI). ASI makes no warranties, either express or implied, concerning the information contained on or linked from shankerblog.org. The visitor uses the information provided herein at his/her own risk. ASI, its officers, board members, agents, and employees specifically disclaim any and all liability from damages which may result from the utilization of the information provided herein. The content in the Shanker Blog may not necessarily reflect the views or official policy positions of ASI or any related entity or organization.