Wednesday, April 30, 2014

Prose must be evaluated by HUMANS

This is one of the most upsetting things I have read in a long time on the subject of contemporary education. Decades ago, when it was already common to purchase essays by ghost writers, and diplomas became less reliable indicators of "being educated," I predicted a time when honored human "examiners" would orally test and interview applicants for "certification." Here we see the opposite approach and its potential for disaster. Shame on anyone who calls him- or herself an educator and relies on a machine to evaluate prose!

Boston Globe Opinion Section
Flunk the robo-graders
By Les Perelman | APRIL 30, 2014

“According to professor of theory of knowledge Leon Trotsky, privacy is the most fundamental report of humankind. Radiation on advocates to an orator transmits gamma rays of parsimony to implode.’’

ANY NATIVE speaker over age 5 knows that the preceding sentences are incoherent babble. But a computer essay grader, like the one Massachusetts may use as part of its new public school tests, judges them to be exceptionally good prose.

PARCC, the consortium of states including Massachusetts that is developing assessments for the Common Core Curriculum, has contracted with Pearson Education, the same company that graded the notorious SAT essay, to grade the essay portions of the Common Core tests. Some students throughout Massachusetts just took the pilot test, which wasted precious school time on an exercise that will provide no feedback to students or to their schools.

It was, however, not wasted time for Pearson. The company is using these student essays to train its robo-grader to replace one of the two human readers grading the essay, although there are no published data on its effectiveness in correcting human readers.

Robo-graders do not score by understanding meaning but almost solely by use of gross measures, especially length and the presence of pretentious language. The fallacy underlying this approach is confusing association with causation. A person makes the observation that many smart college professors wear tweed jackets and then believes that if she wears a tweed jacket, she will be a smart college professor.

Robo-graders rely on the same twisted logic. Papers written under time pressure often have a significant correlation between length and score. Robo-graders are able to match human scores simply by over-valuing length compared to human readers. A much publicized study claimed that machines could match human readers. However, the machines accomplished this feat primarily by simply counting words.
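The scoring logic described above can be made concrete with a toy sketch. This is purely illustrative and assumes nothing about any vendor's actual algorithm: it shows how a scorer that looks only at gross surface features, such as word count and the share of long words, can rate lengthy gibberish far above short, clear prose without understanding a single sentence.

```python
# Toy illustration only -- NOT any vendor's real scoring model.
# It demonstrates how surface measures (length, "big words") alone
# can inflate scores for long, pretentious gibberish.

def surface_score(essay: str) -> float:
    """Score an essay on a 0-6 scale using only gross surface measures."""
    words = essay.split()
    word_count = len(words)
    # Reward sheer length: contribution capped at 500 words (up to 4 points).
    length_points = min(word_count, 500) / 500 * 4
    # Reward "pretentious" vocabulary: share of words 9+ letters long
    # (up to 2 points).
    long_words = sum(1 for w in words if len(w) >= 9)
    vocab_points = min(long_words / max(word_count, 1) * 10, 2)
    return round(length_points + vocab_points, 1)

# Long, word-heavy gibberish (echoing the op-ed's opening example)
# scores at the top of the scale; short, cogent prose scores near zero.
gibberish = ("Radiation on advocates to an orator transmits "
             "gamma rays of parsimony to implode. ") * 60
short_cogent = "Machines cannot read. They count."

print(surface_score(gibberish))
print(surface_score(short_cogent))
```

No meaning is ever consulted: swapping the gibberish's words for equally long nonsense would produce the same top score, which is exactly the flaw the MIT and Harvard students exploited.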

Recently, three computer science students, Damien Jiang and Louis Sobel from MIT and Milo Beckman from Harvard, demonstrated that these machines are not measuring human communication. They have been able to develop a computer application that generates gibberish that one of the major robo-graders, IntelliMetric, has consistently scored above the 90th percentile overall. In fact, IntelliMetric scored most of the incoherent essays they generated as “advanced” in focus and meaning as well as in language use and style.

Unfortunately, the problem in evaluating these machines is the lack of transparency on the part of the private vendors and the researchers associated with them. None of the major testing companies allow easy access or open-ended demonstrations of their robo-graders. My requests to the testing companies to examine their products have largely gone unanswered. A Pearson vice president explained that I was denied access to test the product now being considered by PARCC because I wanted “to show why it doesn’t work.” I was able to obtain access to Vantage’s IntelliMetric only by buying its Home School Edition.

PARCC should experiment with much more effective and educationally defensible methods for quality control, such as having expert teachers check readers by reading and scoring 20 percent of their papers. This method, already used by the National Writing Project and some of the Advanced Placement examinations, is more reliable for ensuring accuracy. Moreover, the more the scoring is done by real teachers, the more the process produces a double benefit in providing in-service training for teachers along with assessment.

Education, like medicine, is too important a public resource to allow corporate secrecy. If PARCC does not insist that Pearson allow researchers access to its robo-grader and release all raw numerical data on the scoring, then Massachusetts should withdraw from the consortium. No pharmaceutical company is allowed to conduct medical tests in secret or deny legitimate investigators access. The FDA and independent investigators are always involved. Indeed, even toasters have more oversight than high-stakes educational tests.

Our children deserve better than having their writing evaluated by machines whose workings are both flawed and hidden from public scrutiny. Whatever benefit current computer technology can provide emerging writers is already embodied in imperfect but useful word processors. Conversations with colleagues at MIT who know much more than I do about artificial intelligence have led me to Perelman's Conjecture: People's belief in the current adequacy of Automated Essay Scoring is proportional to the square of their intellectual distance from people who actually know what they are talking about.

Les Perelman recently retired as director of Writing Across the Curriculum at MIT, where he is now a research affiliate.
