Review: Making the Grades


Jim Feast

Originally published in The Evergreen Review Issue 120 in October, 2009.

Making the Grades: My Misadventures in the Standardized Testing Industry
by Todd Farley
(Sausalito: PoliPoint Press, 2009)

Making the Grades by Todd Farley is a devastating look at the U.S. standardized testing industry that works on two levels. First, it shows that tests are scored in ways that are arbitrary and screwball, ending with results that are often (no, in most cases) doctored. Beyond this, it shows this travesty is unavoidable, given current economic conditions.

Let’s begin by noting why test results are skewed every which way, usually with little relation to a student’s actual achievement. Understand, Farley is not writing about students answering a question such as: In what year was Reagan first elected president? Those can be scored by a machine. He is discussing tests where students give written responses, ones that must be evaluated by fallible humans.

The first problem is that the questions and, more importantly, the rubrics (the set of written standards, such as “good use of details” or “lacks details,” that those scoring the test are supposed to use to grade responses) are written by experts who are blithely ignorant of students’ potential creativity and ingenuity. Take the problems that arise in grading these two “science” questions: What is your favorite food? How does it taste? The second one is even simplified in that the students are told to choose sweet, sour, salty or bitter. Fine. But for pizza, for example, different students give all four answers to the second question: sweet, sour, salty and bitter. The graders who defend these responses to the supervisor note (among other things) that pizza with pineapple is sweet, while pizza with anchovies can be salty, and so on. Not only is pizza categorized with different tastes, but so is ice cream – isn’t lemon sherbet sour? – as are multiple other foods.

But let’s go further. What is food? What if a student says his favorite food is dirt? No, the supervisor tells the grader, that’s not right, because a food has to be nutritious. Then a grader brings up a different student, one who says her favorite food is … grass. Isn’t that nutritious? What about ice cubes? And so on.

Add to that the fact that many of the scorers have considerable deficits. Take Michi. Now, you need a college degree to be hired, and Michi has one, but it’s from Japan. When Farley, her supervisor, talks to her, he notices one problem: she barely understands English. Of course, the higher-ups have assigned her to score a reading exam!

Her problems surface when she has to give a numerical grade to responses in which students describe how a character named Dallas, in a book they had read, got mad at her aunt for dancing in the grandstand. Sentences such as “Dallas [the character] didn’t like it” get full credit. Checking Michi’s work, Farley learns she gave these sentences, “Dallas found her aunt’s behavior irksome. She was embarrassed by the dancing,” no credit. Farley notes, though, “It was the single best answer I’d seen in the entire time we scored the question.” When he asks Michi why she gave it no credit, she replies, “‘Irksome’ not on rubric … What is embasser … embras?” He explains, so she says, “Not on rubric.”

It’s not simply that some readers have language difficulties, though that’s common. Some have difficulty reading, such as an elderly gentleman who takes 15 minutes to grade a one-sentence response. Others lack comprehension skills, due to such causes as being personally devastated (evidenced by the many new hires who appear after being laid off, without pensions, from decades of working in the same factory or office), being drug or alcohol abusers, or even (Farley surmises in the case of an extreme fighter who had suffered many head injuries) being addled.

With test graders like these – there are good scorers mixed in, of course – it might seem surprising these tests pass muster, meeting such benchmarks as showing congruence in double reads, matching national trends, and in other ways coming up to snuff. The answer is simple enough: Cheating.

Take double reads, in which two scorers grade each test and their scores are expected to match. What does a supervisor do when she or he finds that many of the scores of certain double readers in the crew are not matching? When Farley is new to the business and working as a scorer, his supervisor explains how to deal with such problems. “If I go into the system and see you gave a ‘0’ to some student response and another scorer gave it a 1, I change one of the scores. … The computer counts it as a disagreement only as long as the scores don’t match. Once I change one of the scores, the reliability number goes up.”

Farley can’t help but ask, “The reliability numbers aren’t legit?”

The supervisor explains, “The reliability numbers are what we make them.”

Of course, this is a high-tech project, using computers. When Farley becomes a supervisor, he “corrects” mismatched double scores in a more hands-on way, with an eraser.

What about the qualifying tests that have to be taken at the beginning of scoring any given test to make sure the graders have grasped the rubric? Here there are also problems, such as the one Farley meets – by this time he is a troubleshooter – when he is brought in to deal with an emergency. One group of ten scorers hasn’t been able to start grading, because they have failed the qualifying test two or three times a day for two weeks. No problem. Farley and a willing lower supervisor identify one of the ten who really gets the rubric, then the three of them score everyone’s tests. The other nine are given bogus tests, and are not told they are doing busy work. Instead, when the group passes the qualifiers, everyone assumes they did it right.

However, there is a more serious distortion than any of this juggling of statistics carried out by the supervisors, bad as it may sound. Worse distortions occur when a higher-up demands the scores be “pushed” in one direction or another. Take the time when, in a different testing situation in which Farley has better scorers, he is confronted by a psychometrician from Princeton, who tells him, “Don’t do too good a job scoring it [the test]. … Someone’s going to have to match your reliability in 2009.” After she tells him how to do this, such as by giving more work to “less good scorers,” he understands her (and the company’s) standpoint. “She cared more about getting a reliability number that could be easily matched in four years than she did about the correct scores on the tests.”

An even more shocking display of the power of the psychometricians appears at a high-level conference at Princeton University attended by top educators and testing experts, all loaded with “sheepskins,” that is, Ph.D.s and other academic diplomas. After a weekend break, their director, Clarice, calls everyone together to say,

“It seems we haven’t been scoring correctly.”

There was a buzz in the room … a disbelief anyone could doubt us, the country’s preeminent experts on scoring writing assignments….

“Says who?” a bespectacled fellow asked Clarice.

“Well,” she replied, “our numbers don’t match up with what the psychometricians predicted … They predicted they’d see between 6 and 9 percent 1’s, but from the work we did last week, we’re only seeing 3 percent … [we’re] short on 1’s.”

The extraordinary thing is that the psychometricians rule the roost. Grumbling, the “experts” sit down and “rescore the items” done last week to bring them in line with what the mathematicians, who never look at a test, want. So much for objectivity.

I’ve said Farley not only exposes the testing industry’s insanity but displays sociological insight into why it operates as it does. Much is explained by one sentence: All the employees are temps. Well, not everybody, but everyone except the highest level of management: the scorers, their supervisors, and their supervisors’ supervisors. They are hired for one job, lasting two or three months, then out the door. No benefits, no health insurance for them as they work away at mind-numbing, alienated labor. Under these conditions, even the best-intentioned, most intelligent grader will err. Farley describes his time as a scorer with such candor and power that I have to quote him.

For the duration of the writing project, four weeks long, I worked eight hours a day for at least five days a week and was expected to score approximately 30 essays an hour, or one essay every two minutes: a two- or three-page essay every two minutes about some high school kid’s inability to make the baseball team….

The downside of such work is that it leaves one frazzled and uncommitted to the job. The interesting upside is that the workers and lower supervisors, all in the same boat, show remarkable loyalty and solidarity. Why, a reader might have wondered earlier, was a blunderer like Michi kept on? The author, in charge of the project, says, about his subordinate supervisor, Heidi, and himself, “we decided to keep the sweetheart employed even if she was doing an atrocious job. The only other option … was to get Michi fired and neither Heidi nor I (temporary employees ourselves) wanted any part of that” (my emphasis).

Or take the case of Butch, a native speaker this time, but one who can’t grasp the rubric. His supervisor Scott (a good pal of Farley’s) sets him up grading last year’s tests, which won’t be counted, without telling him, so his scores won’t mess up the group’s results. Farley comments, “Maybe it would have been easier to get Butch taken off the project, [but] that meant Scott – a temporary employee – would be costing Butch – another temporary employee – his job. No one had the heart for that, so everyone fudged the statistics instead.”

In sum, Farley has, with great humor and aplomb, indicted the unscientific world of standardized testing, in which lofty authorities devise rubrics that in no way can be applied to scoring the exuberant, creative productions of children and young people. Since the experts, given all those sheepskins, can never be told they are wrong, the rubrics have to be applied in an improvisatory, fluid and, above all, haphazard way by the temps who do the grunt work.

This should, but probably won’t, demolish the pretension that tests will help us assess students and their teachers. Let me end with my own theory of why this crazed testing is gaining so much ground, a theory inspired by, but not based on, Farley’s statements. Society is now filled with temps, not so much losers as those purposely lost in an economic system driven by downsizing, cutbacks and outsourcing. The system can’t supply the jobs and, God forbid, can’t blame itself, so it thinks up a method to mete out more and more individual blame. Blame for students who fail a test, for teachers who are failures because their students have failed, and for principals who fail when their teachers’ students fail. A lot of blame to go around while the big culprit goes free.