Revisiting The Widget Effect
In 2009, The New Teacher Project (TNTP) released a report called “The Widget Effect.” You would be hard-pressed to find many recent publications from an advocacy group that have had a larger influence on education policy and the debate surrounding it. To this day, the report is mentioned regularly by advocates and policymakers.
The primary argument of the report was that teacher performance “is not measured, recorded, or used to inform decision making in any meaningful way.” More specifically, the report shows that most teachers received “satisfactory” or equivalent ratings, and that evaluations were not tied to most personnel decisions (e.g., compensation, layoffs, etc.). From these findings and arguments comes the catchy title – a “widget” is a fictional product commonly used in situations (e.g., economics classes) where the product itself doesn’t matter. Thus, treating teachers like widgets means that we treat them all as if they’re the same.
Given the influence of “The Widget Effect,” as well as how different the teacher evaluation landscape is now compared to when it was released, I decided to read it closely. Having done so, I think it’s worth discussing a few points about the report.
My first reaction was that the report’s primary empirical contribution -- which, as is often the case, could have been expressed with a single table and a few paragraphs of text -- was important and warranted much of the attention it received. Namely, it was the finding that, in the 12 districts included in the study, only a tiny minority (about 1-5 percent) of evaluated teachers received an “unsatisfactory” or equivalent rating (and thus, predictably, very few tenured teachers were dismissed during this time).
That was a bombshell of sorts. Even though these findings are sometimes portrayed inappropriately as national estimates (the study covers just 12 districts), and even though the report puts forth the somewhat misleading argument that low proficiency rates in these districts mean the evaluation results must be wrong, ratings this uniformly high are simply implausible.
The inadequacy of teacher evaluation systems had been a long-standing issue, but this was one of the first times that the public, outside of the education field at least, was made aware of the simple fact that teacher evaluations in these 12 districts, at least insofar as final ratings are the gauge, seemed to be little more than a formality.
Such a conclusion was among the big catalysts for current efforts to redesign evaluations. It also shows how even the simplest descriptive statistics -- in this case, a percentage -- can be more powerful than the most complex statistical approaches. The report is cited frequently in academic journal articles.
Of course, this immediately raises the question: why? There’s a simple, albeit obvious, starting point for the explanation: principals almost never assigned low ratings, and so most ratings were not low.
The report tends to imply that this was a systemic or a design failure. For example, the authors note that half of the 12 systems for which they had data offered only two categories, that many of them required only infrequent evaluations, and that principals were not properly trained to conduct them. In addition, they note that most personnel decisions, such as compensation, were not tied to evaluation ratings, and that dismissal procedures can be burdensome, which could mean that principals lacked the incentive to assign low ratings to their teachers.
These are all very important points, and probably played a role in producing the implausibly high results. Still, it bears mentioning that, in at least a few of the states that have released results of their new teacher evaluations, the ratings have not exhibited a whole lot more differentiation (see more discussion of this here). These systems have attempted to address many of the concerns discussed above, yet in some cases continue to reward the vast majority of teachers with the highest ratings.
Unfortunately, I haven't seen much analysis of which components of these systems are driving the results, but it stands to reason that classroom observations are one of the big factors. If so, it may simply be the case that principals think their teachers are doing a good job, or are unwilling to give them unsatisfactory ratings for some other reason unconnected to personnel policies, such as their estimation of the prospects for finding better replacements. It is therefore important to examine which design features, whether in the composition of ratings or the incentives attached to them, are associated with different outcomes, and to balance the need for differentiation with sound measurement.*
Beyond these important descriptive findings, the rest of the report consists mostly of two elements. The first, mentioned briefly above, is advocacy for policies that might help address the issue of implausibly low differentiation (and a summary of whether these policies were in force in the 12 districts included in the study). Some of them, such as new, better evaluations and administrator training, receive relatively wide support (at least in general, without regard to specifics), whereas others, such as performance-based pay, are more controversial.
The second element making up most of the rest of the report is a summary of results from a survey of teachers and administrators in the 12 districts included in the study. The authors use these survey results liberally throughout the report, in some cases to make rather bold statements, such as that “teachers and administrators broadly agree about the existence and scope of the problem and about what steps need to be taken to address poor performance in schools.”
The problem is that these surveys were entirely non-random (participation was voluntary and online), and the report makes no effort to compare the characteristics of the survey sample to those of the teacher workforces in the 12 districts. The data, therefore, could be (and were) used to offer useful insights, but they cannot be used to draw generalized conclusions about “what teachers and administrators think,” even in these 12 districts, to say nothing of nationally.**
Overall, however, “The Widget Effect” did have a considerable impact on the debate about education policy, and probably on policymaking as well. It remains to be seen how the issue on which it shone a light -- the results of teacher evaluations -- will shake out going forward.
- Matt Di Carlo
* This is one of the reasons why value-added estimates, which essentially impose a distribution on teacher performance -- i.e., some teachers must be poorly rated by design -- are valued by many of those who are focused on the spread of ratings across categories.
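The footnote’s point about value-added estimates imposing a distribution can be made concrete with a minimal sketch. Everything here is hypothetical: the scores are simulated, and the percentile cutoffs (bottom 5 percent “unsatisfactory,” top 20 percent “highly effective”) are illustrative choices, not anything from the report. The point is only that a norm-referenced scheme guarantees differentiation by construction, whereas an observation-based rating has no such mechanical floor.

```python
import random

random.seed(0)

# Simulated value-added scores for 100 teachers. The actual distribution
# doesn't matter; relative rank is all a percentile scheme uses.
scores = [random.gauss(0, 1) for _ in range(100)]

def rate_by_percentile(scores, bottom_pct=5, top_pct=20):
    """Assign categorical ratings from relative rank, not an absolute bar.

    The bottom `bottom_pct` percent are rated 'unsatisfactory' no matter
    how well the group performs in absolute terms -- the spread of ratings
    across categories is built in by design.
    """
    ranked = sorted(scores)
    low_cut = ranked[int(len(ranked) * bottom_pct / 100)]
    high_cut = ranked[int(len(ranked) * (100 - top_pct) / 100)]
    ratings = []
    for s in scores:
        if s < low_cut:
            ratings.append("unsatisfactory")
        elif s >= high_cut:
            ratings.append("highly effective")
        else:
            ratings.append("satisfactory")
    return ratings

ratings = rate_by_percentile(scores)
print(ratings.count("unsatisfactory"))  # prints 5: exactly 5% of 100, by construction
```

Raising or lowering `bottom_pct` changes the count mechanically, which is exactly why rank-based measures appeal to those focused on differentiation, and also why they raise separate questions about sound measurement.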
As a member of one of the participating districts, it was interesting to read your reflections on this report. At the time, we as a local were looking to make significant changes to what we believed was an old and ineffective system. While we were not surprised by the result, we did believe there were some obvious flaws in the report.
As you point out, the low number of poor ratings is often quoted. What the study did not bring to light is the number of teachers who were convinced to resign in order to receive a less punitive rating.
Since that time, we have tried to implement a newer system, and yes, without a significant shift in culture, the results still lack credibility in truly reflecting teacher practice. We thought we were making headway until the state legislature once again tried to reform that which did not need change and missed a chance to help districts really adopt better systems.
However, we do think the change was necessary and we will continue to push for more authentic appraisal. The real work is in finding expertise in the appraiser and equity in the application of the system.
I'm a little surprised that such a big deal was made of this study. Anyone paying attention knew the facts on the ground. I recall a conversation with Adam Urbanski in Rochester 20 years ago when the first results came down of their new system - 5-6 teachers were found to be unsatisfactory. C'MON, Adam, I said. He said: it's a start, and we're on the right track.
Perhaps not too surprisingly, the author doesn't mention the 800-pound gorilla in the room: the grievances from the unions that are predictable if hundreds of teachers are found wanting, because then the union can (correctly) say: WHAT??? You gave these people satisfactory reviews for x number of years. This is unfair, and we're going to fight it.
In short, there are very few incentives for Principals to rate a teacher poorly; indeed, there are many disincentives. Until and unless Principals are held accountable for building results, and until and unless the entire district agrees to 'own' the evaluation system done right, little will change - and more kids will be cheated out of an education.
As a former teacher, I was involved in teacher evaluation reform, based in large part on this study. Now, as a staffer at AFT, still involved in teacher evaluation reform, I find myself thinking that that 5% will probably not shift dramatically, precisely because it's not an unrealistic number. For instance, I doubt 25% of the national teacher workforce is "ineffective" regardless of the instrument used. And as Grant Wiggins pointed out, districts need to "own" their evaluation systems, which is exactly the stance AFT takes: through labor-management collaboration, and through the process of continuous improvement, districts can and will grow their teachers and mitigate the number of ineffective ones in their ranks. So whether by "convincing [ineffective] teachers to resign," as Kathryn Castle says, or by providing targeted professional development, we will continue to see small, and increasingly smaller, numbers of ineffective teachers and larger numbers of effective and highly effective ones. Let's revisit these data in another 5, 10, 15 years.