Schools' Effectiveness Varies By What They Do, Not What They Are

There may be a mini-trend emerging in certain types of charter school analyses, one that seems a bit trivial but has interesting implications that bear on the debate about charter schools in general. It pertains to how charter effects are presented.

Usually, when researchers estimate the effect of some intervention, the main finding is the overall impact, perhaps accompanied by a breakdown by subgroups and supplemental analyses. In the case of charter schools, this would be the estimated overall difference in performance (usually testing gains) between students attending charters versus their counterparts in comparable regular public schools.

Two relatively recent charter school reports, however – both generally well-done given their scope and available data – have taken a somewhat different approach, at least in the “public roll-out” of their results.

A Dark Day For Educational Measurement In The Sunshine State

Just this week, Florida announced its new district grading system. These systems have been popping up all over the nation, and given that designing one is a requirement for states applying for No Child Left Behind waivers, we are sure to see more.

I acknowledge that the designers of these schemes have the difficult job of balancing accessibility and accuracy. Moreover, the latter requirement – accuracy – cannot be directly tested, since we cannot know “true” school quality. As a result, to whatever degree it can be approximated using test scores, disagreements over which specific measures to include and how to include them are inevitable (see these brief analyses of Ohio and California).

As I’ve discussed before, there are two general types of test-based measures that typically comprise these systems: absolute performance and growth. Each has its strengths and weaknesses. Florida’s attempt to balance these components is a near-total failure, and it shows in the results.
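To make that distinction concrete, here is a minimal sketch contrasting an absolute measure (the share of students above a proficiency cutoff) with a growth measure (students' average year-to-year gain) for the same hypothetical school. All of the numbers are made up for illustration; this is not Florida's actual formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical school: the same 200 students tested in consecutive years.
scores_2010 = rng.normal(640, 30, 200)
scores_2011 = scores_2010 + rng.normal(5, 15, 200)  # modest average gain
PROFICIENT = 650  # hypothetical cut score

# Absolute performance: the share of students over the bar right now.
# This depends heavily on where students started.
proficiency_rate = np.mean(scores_2011 >= PROFICIENT)

# Growth: how much students improved, regardless of starting point.
average_gain = np.mean(scores_2011 - scores_2010)

print(f"proficient: {proficiency_rate:.1%}; average gain: {average_gain:.1f} points")
```

Note that the proficiency rate is driven largely by where students started, while the average gain is not; that tension is exactly what a grading system has to balance.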

Performance And Chance In New York's Competitive District Grant Program

New York State recently announced a new $75 million competitive grant program, which is part of its Race to the Top plan. To receive some of the money, districts must apply, and their applications receive a score between zero and 115. Almost a third of the points (35) are based on proposals for programs geared toward boosting student achievement, 10 points are based on need, and 20 points are awarded for a description of how the proposal fits into the district’s budget.

The remaining 50 points – almost half of the total – are based on “academic performance” over the prior year. Four measures are used to produce the 0-50 point score: one is the year-to-year change (between 2010 and 2011) in the district’s graduation rate, and the other three are changes in the state “performance index” in math, English Language Arts (ELA) and science. The “performance index” in these three subjects is calculated using a simple weighting formula that accounts for the proportions of students scoring at levels 2 (basic), 3 (proficient) and 4 (advanced).
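For illustration, here is a minimal sketch of how such a weighted index might be computed. The specific weights (level 2 counted once, levels 3 and 4 counted twice, scaled to a 0-200 range) are an assumption for demonstration purposes and may not match the state's exact formula.

```python
def performance_index(counts):
    """Weighted index from counts of students at achievement levels 1-4.

    Assumed weights for illustration: level 2 counts once, levels 3 and 4
    count twice, scaled to a 0-200 range. The state's exact formula may
    differ.
    """
    total = sum(counts.values())
    weighted = counts.get(2, 0) + 2 * (counts.get(3, 0) + counts.get(4, 0))
    return 100.0 * weighted / total

# A hypothetical district with 100 tested students:
print(performance_index({1: 10, 2: 30, 3: 45, 4: 15}))  # 30 + 2*(45+15) = 150.0
```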

The idea of using testing results as a criterion in awarding grants is to reward those districts that are performing well. Unfortunately, due to the choice of measures and how they are used, the 50-point score will be biased and, to no small extent, based on chance.
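To see why single-year changes are so vulnerable to chance, consider a deliberately simplified simulation (all parameters here are assumptions, not New York data). If each district's underlying performance is stable, and each year's measured index is that performance plus error, then the year-to-year change being rewarded is, by construction, pure noise:

```python
import numpy as np

rng = np.random.default_rng(1)
n_districts = 700  # rough order of magnitude for NY districts (assumption)

# Each district has a stable underlying performance level; each year's
# measured index is that level plus measurement/sampling error.
true_quality = rng.normal(0, 1, n_districts)
index_2010 = true_quality + rng.normal(0, 1, n_districts)
index_2011 = true_quality + rng.normal(0, 1, n_districts)

# The rewarded quantity is the one-year change, which in this setup
# carries no information about which districts actually perform well.
change = index_2011 - index_2010
print(np.corrcoef(true_quality, change)[0, 1])  # approximately 0
```

Real districts' performance is not perfectly stable, of course, but the basic point holds: the smaller the true year-to-year movement relative to measurement error, the more a change score reflects luck.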

Burden Of Proof, Benefit Of Assumption

** Also posted here on "Valerie Strauss' Answer Sheet" in the Washington Post

Michelle Rhee, the controversial former chancellor of D.C. public schools, is a lightning rod. Her confrontational style has made her many friends as well as enemies. As is usually the case, people’s reaction to her approach in no small part depends on whether or not they support her policy positions.

I try to be open-minded toward people with whom I don’t often agree, and I can certainly accept that people operate in different ways. Honestly, I have no doubt as to Ms. Rhee’s sincere belief in what she’s doing; and, even if I think she could go about it differently, I respect her willingness to absorb so much negative reaction in order to try to get it done.

What I find disturbing is how she continues to try to build her reputation and advance her goals based on interpretations of testing results that are insulting to the public’s intelligence.

Trial And Error Is Fine, So Long As You Know The Difference

It’s fair to say that improved teacher evaluation is the cornerstone of most current education reform efforts. Although very few people have disagreed on the need to design and implement new evaluation systems, there has been a great deal of disagreement over how best to do so – specifically with regard to the incorporation of test-based measures of teacher productivity (i.e., value-added and other growth model estimates).

The use of these measures has become a polarizing issue. Opponents tend to adamantly object to any degree of incorporation, while many proponents do not consider new evaluations meaningful unless they include test-based measures as a major element (say, at least 40-50 percent). Despite the air of certainty on both sides, this debate has mostly been proceeding based on speculation. The new evaluations are just getting up and running, and there is virtually no evidence as to their effects under actual high-stakes implementation.

For my part, I’ve said many times that I'm receptive to trying value-added as a component in evaluations (see here and here), though I disagree strongly with the details of how it’s being done in most places. But there’s nothing necessarily wrong with divergent opinions over an untested policy intervention, or with trying one. There is, however, something wrong with fully implementing such a policy without adequate field testing, or at least without ensuring that the costs and effects will be carefully evaluated post-implementation. To date, I am aware of virtually no states or districts that have mandated large-scale, independent evaluations of their new systems.*

If that remains the case, the breathless, speculative debate happening now will simply continue in perpetuity.

The Year In Research On Market-Based Education Reform: 2011 Edition

** Also posted here on "Valerie Strauss' Answer Sheet" in the Washington Post

If 2010 was the year of the bombshell in research in the three “major areas” of market-based education reform – charter schools, performance pay, and value-added in evaluations – then 2011 was the year of the slow, sustained march.

Last year, the landmark Race to the Top program was accompanied by a set of extremely consequential research reports, ranging from the policy-related importance of the first experimental study of teacher-level performance pay (the POINT program in Nashville) and the preliminary report of the $45 million Measures of Effective Teaching project, to the political controversy of the Los Angeles Times’ release of teachers’ scores from their commissioned analysis of Los Angeles testing data.

In 2011, on the other hand, as new schools opened and states and districts went about the hard work of designing and implementing new evaluation and compensation systems, the research almost seemed to adapt to the situation. There were few (if any) "milestones," but rather a steady flow of papers and reports focused on the finer-grained details of actual policy.*

Nevertheless, a review of this year's research shows that one thing remained constant: Despite all the lofty rhetoric, what we don’t know about these interventions outweighs what we do know by an order of magnitude.

Education Advocacy Organizations: An Overview

Our guest author today is Ken Libby, a graduate student studying educational foundations, policy and practice at the University of Colorado at Boulder.

Education advocacy organizations (EAOs) come in a variety of shapes and sizes. Some focus on specific issues (e.g. human capital decisions, forms of school choice, class size) while others approach policy more broadly (e.g. changing policy environments, membership decisions). Proponents of these organizations claim they exist, at least in part, to provide a counterbalance to various other powerful interest groups.

In just the past few years, Stand for Children, Democrats for Education Reform (DFER), 50CAN, and StudentsFirst have emerged as well-organized, well-funded groups capable of influencing education policy. While these four groups support some of the same legislation – most notably teacher evaluations based in part on test scores and the expansion of school choice – each group has some distinct characteristics that are worth noting.

One thing’s for sure: The proliferation of EAOs, especially during the past five or six years, is playing a critical role in certain education policy decisions and discussions. They are not, contrary to some of the rhetoric, dominating powerhouses, but they aren’t paper tigers either.

What Value-Added Research Does And Does Not Show

The use of value-added and other types of growth models is probably the most controversial issue in education today. These methods, which use sophisticated statistical techniques to attempt to isolate a teacher’s effect on student test score growth, are rapidly assuming a central role in policy, particularly in the new teacher evaluation systems currently being designed and implemented. Proponents view them as a primary tool for differentiating teachers based on performance/effectiveness.
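For readers unfamiliar with the basic mechanics, the sketch below is a deliberately bare-bones version of the idea, using simulated data and made-up parameters: regress students' current test scores on their prior scores plus teacher indicators, and read the teacher coefficients as crude value-added estimates. Actual models are far more elaborate (multiple prior scores, student and classroom characteristics, shrinkage of estimates), so treat this strictly as an illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Simulated data: 3 teachers, 60 students each (all values hypothetical).
teachers = np.repeat(["A", "B", "C"], 60)
true_effect = {"A": 0.0, "B": 0.2, "C": -0.1}
prior = rng.normal(0, 1, 180)                      # prior-year test score
current = (0.7 * prior                             # persistence of achievement
           + np.array([true_effect[t] for t in teachers])
           + rng.normal(0, 0.5, 180))              # everything else (noise)

df = pd.DataFrame({"current": current, "prior": prior, "teacher": teachers})

# Regress current scores on prior scores plus teacher indicators; the
# teacher coefficients serve as crude "value-added" estimates.
model = smf.ols("current ~ prior + C(teacher)", data=df).fit()
print(model.params.filter(like="teacher"))
```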

Opponents, on the other hand, including a great many teachers, argue that the models’ estimates are unstable over time, subject to bias and imprecision, and reliant entirely on standardized test scores, which are, at best, an extremely partial measure of student performance. Many have come to view growth models as exemplifying all that’s wrong with the market-based approach to education policy.

It’s very easy to understand this frustration. But it's also important to separate the research on value-added from the manner in which the estimates are being used. Virtually all of the contention pertains to the latter, not the former. Actually, you would be hard-pressed to find many solid findings in the value-added literature that wouldn't ring true to most educators.

Has Teacher Quality Declined Over Time?

** Also posted here on "Valerie Strauss' Answer Sheet" in the Washington Post

One of the common assumptions lurking in the background of our education debates is that the “quality” of the teaching workforce has declined a great deal over the past few decades (see here, here, here and here [slide 16]). There is a very plausible storyline supporting this assertion: Prior to the dramatic rise in female labor force participation that began in the 1960s, professional women were concentrated in a handful of female-dominated occupations, chief among them teaching. Since then, women’s options have changed, and many have moved into professions such as law and medicine instead of the classroom.

The result of this dynamic, so the story goes, is that the pool of candidates for the teaching profession has been “watered down." This in turn has generated a decline in the aggregate “quality” of U.S. teachers, and, it follows, a stagnation of student achievement growth. This portrayal is often used as a set-up for a preferred set of solutions – e.g., remaking teaching in the image of the other professions into which women are moving, largely by increasing risk and rewards.

Although the argument that “teacher quality” has declined substantially is sometimes taken for granted, its empirical backing is actually quite thin, and not as clear-cut as some might believe.

The Categorical Imperative In New Teacher Evaluations

There is a push among many individuals and groups advocating new teacher evaluations to predetermine the number of outcome categories – e.g., highly effective, effective, developing, ineffective – that these new systems will include. For instance, a "statement of principles" signed by 25 education advocacy organizations recommends that the reauthorized ESEA law require “four or more levels of teacher performance." The New Teacher Project’s primary report on redesigning evaluations made the same suggestion.* For their part, many states have followed suit, mandating new systems with a minimum of four or five categories.

The rationale here is pretty simple on the surface: Those pushing for a minimum number of outcome categories believe that teacher performance must be adequately differentiated, a goal on which prior systems, most of which relied on dichotomous satisfactory/unsatisfactory schemes, fell short. In other words, the categories in new evaluation systems must reflect the variation in teacher performance, and that cannot be accomplished when there are only a couple of categories.

It’s certainly true that the number of categories matters – it is an implicit statement as to the system’s ability to tease out the “true” variation in teacher performance. The number of categories a teacher evaluation system employs should depend on how well it can differentiate teachers with a reasonable degree of accuracy. If a system is unable to pick up this “true” variation, then using several categories may end up doing more harm than good, because it will be providing faulty information. And, at this early stage, despite the appearance of certainty among some advocates, it remains unclear whether all new teacher evaluation systems should require four or more levels of “effectiveness."
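As a rough illustration of this final point (all quantities here are assumptions for demonstration), the simulation below assigns teachers to performance categories twice: once using their "true" performance and once using a noisy estimate of it. The more categories the system uses, the smaller the share of teachers who land in their "correct" category:

```python
import numpy as np

rng = np.random.default_rng(3)
n_teachers = 10_000

# Suppose measured performance equals "true" performance plus error of
# comparable size (a rough stand-in for a noisy evaluation score).
true_perf = rng.normal(0, 1, n_teachers)
measured = true_perf + rng.normal(0, 1, n_teachers)

def categorize(scores, n_cats):
    # Assign equal-sized categories using quantile cutoffs.
    cuts = np.quantile(scores, np.linspace(0, 1, n_cats + 1)[1:-1])
    return np.digitize(scores, cuts)

for n_cats in (2, 5):
    agree = np.mean(categorize(true_perf, n_cats) == categorize(measured, n_cats))
    print(f"{n_cats} categories: {agree:.0%} of teachers land in their 'true' category")
```

Under these assumptions, a two-category system places roughly three-quarters of teachers correctly, while a five-category system places only around a third correctly; finer distinctions demand more accuracy than a noisy measure can deliver.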