PISA For Our Time: A Balanced Look
Press coverage of the latest PISA results over the past two months has almost been enough to make one want to crawl under the bed and hide. Over and over, we’ve been told that this is a “Sputnik moment,” that the U.S. is among the lowest-performing nations in the world, and that we’re getting worse.
Thankfully, these claims are largely misleading. Insofar as we’re sure to hear them repeated often over the next few years (at least until the next set of international results comes in), it makes sense to try to correct the record (also see here and here).
But, first, I want to make it very clear that U.S. PISA results are not good enough by any stretch of the imagination, and we can and should do a whole lot better. Nevertheless, international comparisons of any kind are very difficult, and if we don’t pay careful attention to what the data are really telling us, it will be more difficult to figure out how to respond appropriately.
This brings me to three basic points about the 2009 PISA results that we need to bear in mind.
The first is that the U.S. is not among the worst-performing developed nations. We are, on the whole, average. The “evidence” for our dismal performance is, almost invariably, expressed in terms of national rankings. Out of 34 OECD nations, we rank 14th in reading, 17th in science, and 25th in math. These figures are certainly alarming, but what do they really mean?
As it happens, rankings are often among the worst ways to express quantitative data. Not only do they ignore the size of differences between nations (two nations with consecutive ranks might have hugely different scores), but they also tend to ignore error margins (part of the difference between any two nations is simply random noise, so nations that are far apart in rank might really be statistically indistinguishable). As is the case with value-added, characterizing PISA scores without accounting for error is almost certain to mislead.
When you account for the error margins and look at scores instead of rankings, the picture is different. In the table below, I present PISA scores for OECD nations, by subject. For each subject, the nations shaded in yellow had scores that were higher than the U.S. average by a statistically significant margin, while the nations shaded in blue were lower. Scores in the unshaded nations were not measurably different from the U.S. (i.e., they are tied with us).
As you can see, there is a lot of “bunching” among nations in the middle of the distribution. For example, while the U.S. “raw” score in science (the center columns) places us at the rank of 17, this obscures the fact that only 12 nations had a score that was significantly higher (in this sense, the U.S. was in a 13-nation tie for 13th place). Similarly, only six OECD nations had a significantly higher average reading score, and 17 had a significantly higher math score.
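To make the difference between a raw rank and a significance-based comparison concrete, here is a minimal Python sketch. It is not the OECD's actual procedure, which is more elaborate. It simply assumes that each nation's average score comes with a standard error, treats the national samples as independent, and applies a two-sided test at roughly the 95 percent level. The nation labels, scores, and standard errors are invented for illustration, and `compare_to_reference` is a hypothetical helper.

```python
import math

def compare_to_reference(scores, std_errors, reference, z_crit=1.96):
    """Sort nations into significantly higher, lower, or tied with a reference."""
    ref_score = scores[reference]
    ref_se = std_errors[reference]
    higher, lower, tied = [], [], []
    for nation, score in scores.items():
        if nation == reference:
            continue
        # Standard error of the difference between two independent means.
        se_diff = math.sqrt(std_errors[nation] ** 2 + ref_se ** 2)
        z = (score - ref_score) / se_diff
        if z > z_crit:
            higher.append(nation)
        elif z < -z_crit:
            lower.append(nation)
        else:
            tied.append(nation)
    # Raw rank ignores error margins: one plus the number of higher scores.
    raw_rank = 1 + sum(1 for s in scores.values() if s > ref_score)
    return {"raw_rank": raw_rank, "higher": higher, "lower": lower, "tied": tied}

# Invented example data (NOT real PISA results).
scores = {"A": 540, "B": 512, "C": 505, "D": 503, "Ref": 500, "E": 498, "F": 470}
std_errors = {"A": 3.0, "B": 3.0, "C": 3.5, "D": 3.0, "Ref": 3.5, "E": 3.0, "F": 3.0}

result = compare_to_reference(scores, std_errors, reference="Ref")
print("Raw rank:", result["raw_rank"])
print("Significantly higher:", result["higher"])
print("Statistically tied:", result["tied"])
print("Significantly lower:", result["lower"])
```

In this made-up example the reference nation has a raw rank of 5, but only two nations score significantly higher, so it is better described as sharing third place with the three nations it is tied with.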
Once you take these error margins into consideration, it also becomes clear that the U.S. scores are indistinguishable from the average OECD scores in reading and science, while our math score is below average by a small but significant margin. Now, these results are definitely cause for serious concern, and there are other breakdowns in which U.S. performance is worse (e.g., how our highest-scoring students do versus those in other nations). They do, however, represent a rather different characterization of U.S. performance than the raw rankings alone, which are all that most people hear about.
So, needs a lot of work? Absolutely. A “Sputnik moment”? Not exactly.
The second point to keep in mind is that U.S. scores actually improved in two subjects. One of the more positive developments in education policy over the past decade or two has been a shift in focus from absolute scores to growth. The idea is that schools with relatively low average scores need not be considered failing if they exceed expectations in boosting their students’ performance every year. Rather, they are often considered successful. This same standard might apply to PISA as well.
And the new 2009 math and science scores are higher than they were in 2006 (the last PISA round) by a statistically significant margin (though the math scores are not higher than in 2003, because there was a drop between 2003 and 2006). There are no 2006 data for U.S. reading (due to a printing error in the test booklets), but reading scores are not statistically different from 2003 or 2000.
It’s therefore somewhat misleading to say that we’ve gotten worse. Overall science and math scores have actually improved since the last administration of PISA.
Again, the fact that some U.S. scores increased hardly means that we can spike the ball and do an end zone dance. Reading scores are flat, while the increase in math only served to get us back to our 2003 level (and it’s below the OECD average). And, of course, we’re still way behind the highest performers in all three subjects. But it does show that there has been some discernible progress in math and science (which may generate benefits for economic growth), and this was scarcely acknowledged in the coverage of the results.
The third and final point is oft-discussed, but it is also the most important: U.S. students are demographically different from those in many other OECD nations, and comparing countries en masse is basically an exercise in inaccuracy. For instance, our child poverty rate is almost twice the OECD average, and the proportion of our students who are not native English speakers is also high. There are also institutional differences, such as those between welfare states. We all know how important these and other social/economic factors are, and this is clear in how we quantitatively assess charter schools and teachers – by trying to compare them to schools and teachers with similar students. Yet we persist in comparing apples and oranges when talking about PISA.
In order to illustrate how accounting for these differences among students changes the interpretation of results, I fit a quick regression model of PISA scores that controls for child poverty and per-pupil spending (on all services). I include the spending variable to address the efficiency-based argument that we can’t assess the U.S. education system without accounting for how much we spend on education relative to other nations. Keep in mind, however, that this spending variable is not comparable between nations, since we spend money on things (e.g., health insurance) that many other developed countries provide outside of the education system. The poverty and spending variables both come from the OECD website, and they are both from 2007, which is the latest year I could find (due to missing data, four participating PISA nations – Chile, Estonia, Israel, and Slovenia – are excluded).
Before reviewing the results, I want to strongly warn against overinterpreting them. Cross-national differences in performance—and how this performance relates to factors like poverty and spending—are incredibly complex. Even large, micro-level datasets are limited in their ability to uncover this dynamic, to say nothing of my nation-level data (within-nation differences are often as important as those between nations). Moreover, there are dozens and dozens of influential factors that are not included in this model. Thus, the results should be regarded as purely illustrative.
That said, take a look at the graph below. The red dots each represent a nation, plotted by child poverty (x-axis) and PISA score (y-axis). To keep things simple, I use the three-subject average of scores for each nation. The line in the middle of the graph comes from the regression, and it represents the average relationship between scores and poverty level among these nations (controlling for spending). In a sense, it represents the model’s “expected performance” of nations with varying child poverty levels. Nations above the line did better than average, while nations below did worse (though the nations very close to the line on either side are probably within the error margin, and should therefore be considered average in this relationship).
The downward slope of the line clearly indicates a negative association (not necessarily an “effect”) — the higher a nation’s child poverty level, the lower its scores tend to be. According to this relationship, a nation with the poverty level of the U.S. would tend to score around 480, whereas the U.S. score is around 10 points higher. Therefore, given our poverty level, and even when we control for spending, U.S. performance may be seen as very slightly above average (again – only according to these purely illustrative results).
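The “expected performance” idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the model or data behind the graph: the figures below are invented, `fit_expected_scores` is a hypothetical helper, and the fit is plain ordinary least squares using NumPy.

```python
import numpy as np

def fit_expected_scores(poverty, spending, score):
    """Ordinary least squares of score on an intercept, poverty, and spending.

    Returns the coefficients, each nation's expected score, and the
    residual (actual score minus expected score).
    """
    X = np.column_stack([np.ones_like(poverty), poverty, spending])
    coef, *_ = np.linalg.lstsq(X, score, rcond=None)
    expected = X @ coef
    return coef, expected, score - expected

# Invented nation-level data (NOT real PISA or OECD figures).
nations = ["A", "B", "C", "D", "E", "F", "G", "H"]
poverty = np.array([5.0, 8.0, 10.0, 12.0, 15.0, 18.0, 20.0, 25.0])  # child poverty, percent
spending = np.array([9.0, 8.5, 10.0, 7.0, 8.0, 11.0, 6.5, 5.0])     # per-pupil spending, $1000s
score = np.array([520.0, 510.0, 505.0, 495.0, 500.0, 485.0, 490.0, 440.0])

coef, expected, residual = fit_expected_scores(poverty, spending, score)

# A positive residual means the nation sits above the regression line.
for name, exp, res in zip(nations, expected, residual):
    position = "above" if res > 0 else "below"
    print(f"{name}: expected {exp:.0f}, actual differs by {res:+.1f} ({position} the line)")
```

The residual is the quantity being read off the graph: how far a nation's actual score sits above or below what the fitted relationship predicts for its poverty and spending levels. With so few observations, small residuals should not be read as meaningful differences.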
Similarly, Poland and Finland, which have a roughly 20 percentage point difference in poverty, also score above what the model “expects” (even more so than the U.S.). By contrast, Norway, which has an extremely low poverty rate, scores well below the line. So does Mexico, which has a child poverty rate that is roughly as high as in the U.S. and Poland, but which scores much, much lower. (Side note: These results don’t change much if Mexico is removed from the analysis.)
Yes, my methods are crude compared with the complexity of the subject, but even this very partial accounting for cross-national differences paints a somewhat different picture of U.S. performance (and that of some other nations) than what we’re used to hearing. Moreover, the strong relationship between poverty and performance appears pretty obvious, but high poverty is no guarantee of failure, just as low poverty is no guarantee of success.
So, we still lag way behind the world’s highest performing nations (especially in math), but characterizing the U.S. as completely dismal is kind of like saying that the average KIPP school is a failure just because it gets lower overall test scores than do schools in an affluent suburb nearby. You can't compare student performance without comparing students, and we should stop making such comparisons without qualification.
Given the “sky is falling” rhetoric that accompanied these latest PISA results, it’s hard to shake the feeling that some people have a vested interest in portraying the U.S. education system as an across-the-board failure. They have policy preferences which gain strength if policymakers and the general public believe that our current system is among the worst in the world. Or perhaps they just fear complacency, as we all should.
Whatever the case, propagating mass, simplistic comparisons of U.S. test scores with those of other nations is misleading and counterproductive, plain and simple. It’s hard enough to compare nations using thorough, sophisticated analysis (there should be plenty available for this round of PISA, and there’s already some good stuff in the official OECD report). So, let’s stop with the sound-bite comparisons of rankings and the overly alarmist rhetoric about Soviet satellites.
Nobody is satisfied with mediocrity. The case against it is strong enough to stand on its own. We need a lot of improvement, but we should also be as clear as possible about where we started from.