Trial And Error Is Fine, So Long As You Know The Difference

It’s fair to say that improved teacher evaluation is the cornerstone of most current education reform efforts. Although very few people dispute the need to design and implement new evaluation systems, there has been a great deal of disagreement over how best to do so – specifically with regard to the incorporation of test-based measures of teacher productivity (i.e., value-added and other growth model estimates).

The use of these measures has become a polarizing issue. Opponents tend to adamantly object to any degree of incorporation, while many proponents do not consider new evaluations meaningful unless they include test-based measures as a major element (say, at least 40-50 percent). Despite the air of certainty on both sides, this debate has mostly been proceeding based on speculation. The new evaluations are just getting up and running, and there is virtually no evidence as to their effects under actual high-stakes implementation.

For my part, I’ve said many times that I'm receptive to trying value-added as a component in evaluations (see here and here), though I disagree strongly with the details of how it’s being done in most places. But there’s nothing necessarily wrong with divergent opinions over an untested policy intervention, or with trying one. There is, however, something wrong with fully implementing such a policy without adequate field testing, or at least without ensuring that the costs and effects will be carefully evaluated post-implementation. To date, virtually no state or district that I'm aware of has mandated a large-scale, independent evaluation of its new system.*

If this remains the case, the breathless, speculative debate happening now will simply continue in perpetuity.

New Report: Does Money Matter?

Over the past few years, due to massive budget deficits, governors, legislators and other elected officials have been forced to slash education spending. As a result, incredibly, there are at least 30 states in which state funding for 2011 is actually lower than in 2008. In some cases, including California, the amounts are over 20 percent lower.

Only the tiniest slice of Americans believe that we should spend less on education, while a large majority actually supports increased funding. At the same time, however, there’s a concerted effort among some advocates, elected officials and others to convince the public that spending more money on education will not improve outcomes, while huge cuts need not do any harm.

Often, their evidence comes down to some form of the following graph:

Do Half Of New Teachers Leave The Profession Within Five Years?

You’ll often hear the argument that half or almost half of all beginning U.S. public school teachers leave the profession within five years.

The implications of this statistic are, of course, that we are losing a huge proportion of our new teachers, creating a “revolving door” of sorts, with teachers constantly leaving the profession and having to be replaced. This is costly, both financially (it is expensive to recruit and train new teachers) and in terms of productivity (we are losing teachers before they reach their peak effectiveness). And this doesn’t even include teachers who stay in the profession but switch schools and/or districts (i.e., teacher mobility).*

Needless to say, some attrition is inevitable, and not all of it is necessarily harmful. Many new teachers, like all workers, leave (or are dismissed) because they just aren’t good at the job – and, indeed, there is test-based evidence that novice leavers are, on average, less effective. But there are many other excellent teachers who exit due to working conditions or other negative factors that might be improved (for reviews of the literature on attrition/retention, see here and here).

So, the “almost half of new teachers leave within five years” statistic might serve as a useful diagnosis of the extent of the problem. As is so often the case, however, it's rarely accompanied by a citation. Let’s quickly see where it comes from, how it might be interpreted, and, finally, take a look at some other relevant evidence.

New Policy Brief: The Evidence On Charter Schools And Test Scores

In case you missed it, today we released a new policy brief, which provides an accessible review of the research on charter schools’ testing effects, how their varying impacts might be explained, and what this evidence suggests about the ongoing proliferation of these schools.

The brief is an adaptation of a three-part series of posts on this blog (here is part one, part two and part three).

Download the policy brief (PDF)

The abstract is pasted directly below.

The Deafening Silence Of Unstated Assumptions

Here’s a thought experiment. Let’s say we were magically granted the ability to perfectly design our public education system. In other words, we were somehow given the knowledge of the most effective policies and how to implement them, and we put everything in place. How quickly would schools improve? Where would we be after 20 years of having the best possible policies in place? What about after 50 years?

I suspect there is much disagreement here, and that answers would vary widely. But, since there is a tendency in education policy to shy away from even talking realistically about expectations, we may never really know. We sometimes operate as though we expect immediate gratification – quick gains, every single year. When schools or districts don't achieve gains, even over a short period of time, they are subject to being labeled as failures.

Without question, we need to set and maintain high expectations, and no school or district should ever cease trying to improve. Yet, in the context of serious policy discussions, the failure to even discuss expectations in a realistic manner hinders our ability to interpret and talk about evidence, as it often means that we have no productive standard by which to judge our progress or the effects of the policies we try.

Do Teachers Really Come From The "Bottom Third" Of College Graduates?

** Also posted here on "Valerie Strauss' Answer Sheet" in the Washington Post

The conventional wisdom among many education commentators is that U.S. public school teachers “come from the bottom third” of their classes. Most recently, New York City Mayor Michael Bloomberg took this talking point a step further, and asserted at a press conference last week that teachers are drawn from the bottom 20 percent of graduates.

All of this is supposed to imply that the U.S. has a serious problem with the “quality” of applicants to the profession.

Despite the ubiquity of the “bottom third” and similar arguments (which are sometimes phrased as massive generalizations, with no reference to actual proportions), it’s unclear how many of those who offer them know specifically what the claims refer to (e.g., GPA, SAT/ACT, college rank, etc.). This is especially important since many of these measurable characteristics are not associated with future test-based effectiveness in the classroom, while those that are associated are only modestly so.

Still, given how often it is used, as well as the fact that it is always useful to understand and examine the characteristics of the teacher labor supply, it’s worth taking a quick look at where the “bottom third” claim comes from and what it might or might not mean.

What Value-Added Research Does And Does Not Show

Value-added and other types of growth models are probably the most controversial issue in education today. These methods, which use sophisticated statistical techniques to attempt to isolate a teacher’s effect on student test score growth, are rapidly assuming a central role in policy, particularly in the new teacher evaluation systems currently being designed and implemented. Proponents view them as a primary tool for differentiating teachers based on performance/effectiveness.
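To give a rough sense of what these models do – and this is a simplified sketch, not the specification used by any particular state or system – a basic value-added model might regress each student's current test score on his or her prior score and observed characteristics:

y_{it} = \beta_0 + \lambda y_{i,t-1} + X_{it}\gamma + \theta_{j(i,t)} + \varepsilon_{it}

Here, y_{it} is student i's test score in year t, y_{i,t-1} is the prior-year score, X_{it} is a vector of observed student (and sometimes classroom or school) characteristics, and \varepsilon_{it} is an error term. The teacher-level component \theta_{j(i,t)} – whatever systematic difference in growth remains among students assigned to teacher j after these adjustments – is the estimated teacher “effect” used to rate or rank teachers.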

Opponents, on the other hand, including a great many teachers, argue that the models’ estimates are unstable over time, subject to bias and imprecision, and that they rely entirely on standardized test scores, which are, at best, an extremely partial measure of student performance. Many have come to view growth models as exemplifying all that’s wrong with the market-based approach to education policy.

It’s very easy to understand this frustration. But it's also important to separate the research on value-added from the manner in which the estimates are being used. Virtually all of the contention pertains to the latter, not the former. Actually, you would be hard-pressed to find many solid findings in the value-added literature that wouldn't ring true to most educators.

Has Teacher Quality Declined Over Time?

** Also posted here on "Valerie Strauss' Answer Sheet" in the Washington Post

One of the common assumptions lurking in the background of our education debates is that “quality” of the teaching workforce has declined a great deal over the past few decades (see here, here, here and here [slide 16]). There is a very plausible storyline supporting this assertion: Prior to the dramatic rise in female labor force participation since the 1960s, professional women were concentrated in a handful of female-dominated occupations, chief among them teaching. Since then, women’s options have changed, and many have moved into professions such as law and medicine instead of the classroom.

The result of this dynamic, so the story goes, is that the pool of candidates to the teaching profession has been “watered down.” This in turn has generated a decline in the aggregate “quality” of U.S. teachers, and, it follows, a stagnation of student achievement growth. This portrayal is often used as a set-up for a preferred set of solutions – e.g., remaking teaching in the image of the other professions into which women are moving, largely by increasing risk and rewards.

Although the argument that “teacher quality” has declined substantially is sometimes taken for granted, its empirical backing is actually quite thin, and not as clear-cut as some might believe.

Smear Review

A few weeks ago, the National Education Policy Center (NEPC) issued a review of the research on virtual learning. Several proponents of online education issued responses that didn't offer much substance beyond pointing out NEPC’s funding sources. A similar reaction ensued after the release last year of the Gates Foundation's preliminary report on the Measures of Effective Teaching Project. There were plenty of substantive critiques, but many of the reactions amounted to knee-jerk dismissals of the report based on pre-existing attitudes toward the foundation's agenda.

More recently, we’ve even seen unbelievably puerile schemes in which political operatives actually pretend to represent legitimate organizations requesting consulting services. They record the phone calls and post out-of-context snippets online to discredit the researchers.

Almost all of the people who partake in this behavior share at least one fundamental characteristic: They are unable to judge research for themselves, on its merits. They can’t tell strong work from weak, so they default to attacking substantive work based on nothing more than the affiliations and/or viewpoints of the researchers.

The Categorical Imperative In New Teacher Evaluations

There is a push among many individuals and groups advocating new teacher evaluations to predetermine the number of outcome categories – e.g., highly effective, effective, developing, ineffective, etc. – that these new systems will include. For instance, a "statement of principles" signed by 25 education advocacy organizations recommends that the reauthorized ESEA law require “four or more levels of teacher performance.” The New Teacher Project’s primary report on redesigning evaluations made the same suggestion.* For their part, many states have followed suit, mandating new systems with a minimum of 4-5 categories.

The rationale here is pretty simple on the surface: Those pushing for a minimum number of outcome categories believe that teacher performance must be adequately differentiated, a goal on which prior systems, most of which relied on dichotomous satisfactory/unsatisfactory schemes, fell short. In other words, the categories in new evaluation systems must reflect the variation in teacher performance, and that cannot be accomplished when there are only a couple of categories.

It’s certainly true that the number of categories matters – it is an implicit statement as to the system’s ability to tease out the “true” variation in teacher performance. The number of categories a teacher evaluation system employs should depend on how well it can differentiate teachers with a reasonable degree of accuracy. If a system is unable to pick up this “true” variation, then using several categories may end up doing more harm than good, because it will be providing faulty information. And, at this early stage, despite the appearance of certainty among some advocates, it remains unclear whether all new teacher evaluation systems should require four or more levels of “effectiveness.”
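To make this concrete, here is a minimal simulation sketch (in Python, with entirely hypothetical numbers for both the signal and the noise – not parameters from any actual system): it sorts simulated teachers into quartiles twice, once by their “true” performance and once by a noisy estimate of it, and reports how often the two placements agree.

    import numpy as np

    rng = np.random.default_rng(0)

    # All numbers below are illustrative assumptions, not estimates from
    # any actual evaluation system.
    n_teachers = 10000
    true_sd = 1.0    # spread of "true" teacher performance (arbitrary units)
    noise_sd = 1.0   # measurement error assumed comparable to the true spread

    true_effect = rng.normal(0.0, true_sd, n_teachers)
    estimate = true_effect + rng.normal(0.0, noise_sd, n_teachers)

    # Sort teachers into four equal-sized categories (quartiles), once by
    # true performance and once by the noisy estimate a system observes.
    def quartile(x):
        return np.searchsorted(np.quantile(x, [0.25, 0.5, 0.75]), x)

    agreement = (quartile(true_effect) == quartile(estimate)).mean()
    print(f"Share of teachers placed in their 'true' quartile: {agreement:.0%}")

Under these assumed values, roughly half of the simulated teachers land outside their “true” quartile. The specific figure depends entirely on the assumed signal-to-noise ratio; the point is simply that finer categories convey real information only when the underlying measure is precise enough to support them.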