Herding FCATs

About a week ago, Florida officials went into crisis mode after revealing that the proficiency rate on the state’s writing test (FCAT) dropped from 81 percent to 27 percent among fourth graders, with similarly large drops in the other two grades in which the test is administered (eighth and tenth). The panic was almost immediate. For one thing, performance on the writing FCAT is counted in the state’s school and district ratings. Many schools would end up with lower grades and could therefore face punitive measures.

Understandably, a huge uproar was also heard from parents and community members. How could student performance decrease so dramatically? There was so much blame going around that it was difficult to keep track – the targets included the test itself, the phase-in of the state’s new writing standards, and test-based accountability in general.

Despite all this heated back-and-forth, many people seem to have overlooked one very important, widely applicable lesson here: proficiency rates, which are not "scores," are often extremely sensitive to where you set the bar.

Some very quick background on what happened: Since 2008-09, Florida’s FCAT writing exam has been an essay-only test. This year, the state increased the rigor of the scoring, with more emphasis on spelling, punctuation, the quality of evidence presented, and other factors, depending on grade level. As a result, the average scores decreased in all three grades that took the test – for example, the fourth grade mean went from just over 4 in 2011 to around 3.25 in 2012 (with possible scores ranging from 0 to 6).

(Side note: Some of the people leveling accusations regarding the “drop in scores,” in addition to often missing the distinction between scores and rates, seem to implicitly dismiss the massive complications involved in comparing these two years of scores, given the change in scoring standards. There’s no way to know, for example, whether this year’s fourth graders performed better than their predecessors, because we don’t know how last year’s students might have scored under the new standards.)

But it wasn’t a 0.75 decrease in the average score that made this a national story (though that's a large change relative to the total variation in scores). Rather, it was how this change affected proficiency rates, in turn. For all three tested grades, these were around 40-50 percentage points lower than last year.

How did Florida respond to the uproar? They simply lowered the bar, or cutpoint - the score at which students are classified as proficient.

More accurately, they re-lowered it. Last year, it was 3.5. Months ago, the state announced that it was raising the bar to 4.0, effective this year. A few days ago, it was dropped to 3.0, even lower than last year (though, again, these scores are not necessarily comparable between years).*

As a result, the percentage of fourth graders rated as proficient in writing tripled, going from 27 percent to 81 percent in the blink of an eye. The eighth grade rate was upgraded from 33 percent to 77 percent, an increase of 133 percent.
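For readers who want to check the arithmetic, here is a minimal sketch that separates the two ways of describing these changes: percentage points versus relative percent change. The function name `rate_change` is made up for illustration; the four rates are the ones quoted above.

```python
def rate_change(old_pct, new_pct):
    """Return (percentage-point change, relative percent change)."""
    point_change = new_pct - old_pct
    relative_change = 100.0 * point_change / old_pct
    return point_change, relative_change

# Rates under the 4.0 cutpoint vs. the 3.0 cutpoint, as quoted in the text.
fourth = rate_change(27, 81)  # fourth grade
eighth = rate_change(33, 77)  # eighth grade

print("4th grade: +%d points, +%.0f%% relative" % fourth)
print("8th grade: +%d points, +%.0f%% relative" % eighth)
```

The fourth grade rate rises by 54 percentage points, which is a 200 percent relative increase (i.e., a tripling), while the eighth grade rate rises by 44 points, or roughly 133 percent.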

I cannot say whether 3.0 or 4.0 is the “correct” proficiency level (for their part, it seems that Florida officials could – and did – accept either one as defensible). Nor does this whole series of events necessarily tell us much about the quality of the FCAT writing exam.

Rather, in my view, the primary takeaway from this situation is that it illustrates perfectly how the (often somewhat arbitrary) choice of cutpoints can completely change the proficiency rates upon which states, Florida in particular, rely so heavily. Lowering the bar from 4.0 to 3.0 led to vastly different interpretations of the exact same set of scores.
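To make the point concrete, here is a small sketch of how a single, unchanged set of scores yields very different rates depending on the cutpoint. The score distribution below is entirely hypothetical (it is not actual FCAT data), and `proficiency_rate` is an illustrative helper, not anything the state uses.

```python
# Hypothetical number of students at each score (NOT actual FCAT data).
# Scores fall in half-point steps, and most students land between 3 and 4.
score_counts = {2.0: 5, 2.5: 10, 3.0: 30, 3.5: 25, 4.0: 20, 4.5: 7, 5.0: 3}

def proficiency_rate(score_counts, cutpoint):
    """Percent of students scoring at or above the cutpoint."""
    total = sum(score_counts.values())
    proficient = sum(n for score, n in score_counts.items() if score >= cutpoint)
    return 100.0 * proficient / total

# Same scores, three different bars.
rates = {cut: proficiency_rate(score_counts, cut) for cut in (3.0, 3.5, 4.0)}
for cut, rate in rates.items():
    print("cutpoint %.1f -> %.0f%% proficient" % (cut, rate))
```

With this made-up distribution, the rate is 85 percent at a cutpoint of 3.0, 55 percent at 3.5, and 30 percent at 4.0. Nothing about the students' performance differs across the three lines; only the bar moves.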

These rates have their role – they are an accessible, goal-oriented means of summarizing test score data that might otherwise be difficult for many people to interpret. But they are not “scores,” and are severely limited in terms of the scope and reliability of the information they transmit about those scores.

Granted, this is an extreme example, as scores from most state tests vary much more than those of the writing FCAT (on which most students receive scores between 3 and 4). Nevertheless, I hope that this situation compels people, in Florida and elsewhere, to exercise extreme caution when interpreting the rates that so dominate our discussion of school and student performance.

- Matt Di Carlo

*****

* Theoretically, the state also had the option of re-establishing last year’s cutpoint of 3.5. This would have resulted in rates of 48 and 52 percent in fourth and eighth grade, respectively – still lower than last year, but much less so than under the 4.0 definition. Interestingly, last year, this change - lowering the bar from 4.0 to 3.5 - would have been completely meaningless. Students could only receive whole number scores (0-6). So, it didn’t matter whether the proficiency bar was set at 3.5 or 4.0, because students could not receive a score of 3.5, and they needed a 4.0 to be deemed proficient. But yet another change Florida made to this year’s writing FCAT was to require that each test be independently scored twice. This was done (commendably, in my view) to increase the reliability of the scores, but it also means that possible scores are in 0.5 increments – for example, a student who gets a three from one scorer and a four from the other receives a score of 3.5.
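The whole-number point in this footnote can be sketched in a few lines. This assumes the final score is the simple average of the two scorers' ratings, which is what the 3.5 example implies; the helper names are hypothetical.

```python
def final_score(first, second):
    """Assumed rule: the final score is the mean of the two scorers' ratings."""
    return (first + second) / 2

def is_proficient(score, cutpoint):
    return score >= cutpoint

# One scorer: only whole-number scores 0-6 are possible.
single = [float(s) for s in range(7)]
# Do cutpoints of 3.5 and 4.0 classify every possible score the same way?
same = all(is_proficient(s, 3.5) == is_proficient(s, 4.0) for s in single)

# Two scorers: averaging produces half-point scores as well.
double = sorted({final_score(a, b) for a in range(7) for b in range(7)})
# Scores that the two cutpoints classify differently.
differ = [s for s in double if is_proficient(s, 3.5) != is_proficient(s, 4.0)]

print(same)    # True: with whole numbers, 3.5 and 4.0 are the same bar
print(differ)  # [3.5]: with averaged scores, only a 3.5 is treated differently
```

Under single scoring the two cutpoints are interchangeable. Under double scoring they diverge only for students who score exactly 3.5, which, on a test where most students fall between 3 and 4, can be a large group.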


This is important and highlights the fact that the setting of a cut score is not a statistical or scientific decision, but a political one. Too few people--including educators--realize that the setting of cut scores is essentially a political decision. Further, those setting the cut scores can make achievement appear (artificially, I might add) either high or low. The decision can also change the achievement gap in any direction that the committee choosing the cut score would like to see the gap move. The Florida case is a perfect example of politicians responding to a public outcry because the identification of failure reached beyond the poor/minority/inner-city schools and out into the wealthier suburbs. Whenever this happens, cut scores are typically adjusted in a way that protects the suburbs and, concomitantly, policymakers and elected officials.