Could a computer really be a good judge of student writing?
Pennsylvania education officials say yes. They have tested computerized essay scoring with about 30,000 students. Meanwhile, in Indiana, about 29,000 students are participating this spring in a pilot test of online essay-grading software designed by the Educational Testing Service.
Other states—and many educators—are watching those developments to decide if they should consider using such technology.
“One of our goals was to see how online scoring compared to human scoring—they both ranked very equally,” said Mary Gaydos, a spokeswoman for the Pennsylvania Department of Education.
Still, some educators and testing experts caution that essay-scoring systems are far from perfect, and that using them to evaluate students on high-stakes exams could be a mistake.
Pennsylvania conducted three pilot tests, from 1999 to 2001, of the Intellimetric essay-scoring system, which was developed by Yardley, Pa.-based Vantage Learning. Students in grades 6, 9, and 11 used the Web-based system to take reading and writing tests.
As it is, the state has no immediate plans to replace paper-and-pencil testing with Web-based assessments, Ms. Gaydos said. She said such a decision would have to consider whether all schools have the computer capabilities to administer such tests.
Indiana is conducting a test this spring of a competing essay-grading tool called the “e-rater,” which was developed by the ETS, based in Princeton, N.J. High school students whose schools volunteered for the trial were scheduled to take Indiana’s end-of-course test for English 11 online. That test is a mixture of multiple-choice items and essay questions.
Other states are watching the trial closely.
“We’re very excited about the potential” of essay-scoring technology, said Robert Olsen, the head of the online-assessment program for the Oregon Department of Education. Oregon is in the second year of pilot-testing a multiple-choice online assessment. (“Testing Computerized Exams,” May 23, 2001.)
Essay-scoring technology could soon be added to the Oregon system. “We are in the process of completing a study in Oregon to verify the reports of the vendor [Vantage Learning] in terms of its accuracy and utility,” Mr. Olsen said, “and are very, very seriously looking at implementing it in this state.”
The Massachusetts Department of Education has also announced a test of an online writing-analysis tool that uses the Vantage Learning engine through the state’s “Virtual Education Space,” a Web site devoted to preparing students for state-sponsored assessments.
Testing the Software
If they prove effective, the new tools could have many benefits, some educators and policymakers say. Lessening the reliance on human scorers would reduce costs, for instance, and could help avert a possible shortage of scorers when state and federal mandates strain the capacity of testing programs over the next few years.
Some experts also argue that the tools could help improve online-testing systems that rely on multiple-choice questions, because tests with essay items are generally regarded as a more complete measure of student abilities than tests with multiple-choice items alone.
And online, computer-scored tests can return results to schools almost instantly, helping educators address students’ academic weaknesses soon after they’re spotted. Educators say it often takes months to get the results of paper tests.
ETS Technologies, the for-profit subsidiary of the nonprofit developer of the SAT college-entrance exam, approached the Indiana education department in January of this year and offered to set up a small pilot for online assessment, said Wes Bruce, the department’s director of the division of school assessment.
Indiana officials asked for a large-scale statewide trial that would use not the Indiana Statewide Testing for Educational Progress, the state’s high-stakes academic test, but the Core 40, a set of tests that the state has devised to get a sense of how students are performing in core academic courses. Those voluntary tests will become mandatory over the next few years.
“If you look at our [state educational accountability law], see all of its components, and the timeline for rolling it out, it will become particularly obvious why we piloted online testing this year,” said Mary Tiede Wilhelmus, the communications director of the state education department.
Human vs. Machine
People hired to score student essays typically have a four-year college degree and good writing skills, said Alison Lyden, an official at Data Recognition Corp., a testing company in Maple Grove, Minn. She said scorers, who are paid about $12 an hour, are trained before scoring student essays. And two people usually score each test independently.
Still, officials from the testing-technology companies suggest that the essay-scoring software can match the human scorers.
Generally, the computer scores a student response by comparing it with hundreds of human-scored responses to the same test item. If it looks most like a response that human experts have given, say, a 5 on a 1-to-5 scale, then the machine will assign it a 5.
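The matching step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any vendor's actual method: it compares word counts with cosine similarity and copies the score of the single closest human-scored sample. The function names and the three sample responses are invented for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count the lowercase words in a response."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def score_response(response, scored_samples):
    """Return the human score attached to the most similar sample."""
    target = bag_of_words(response)
    _, best_score = max(
        scored_samples,
        key=lambda sample: cosine(target, bag_of_words(sample[0])),
    )
    return best_score


# Invented human-scored samples on a 1-to-5 scale (illustration only).
samples = [
    ("The author argues clearly that recycling saves energy "
     "and reduces waste in cities.", 5),
    ("Recycling is good because it is good.", 2),
    ("I like dogs.", 1),
]

new_essay = "Recycling reduces waste and saves energy, the author argues."
print(score_response(new_essay, samples))
```

A real system would draw on hundreds of scored samples per test item and on far more than word overlap, but the principle is the same: the new response inherits the score of the human-scored responses it most resembles.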
The Intellimetric engine used in Pennsylvania is prepped by scanning in thousands of test items, said Scott Elliot, the chief operating officer of Vantage Learning, adding that he prefers to have 300 scored responses for each item on a test. “By learning the characteristics of 300 typical responses, it can apply that learning to score a novel response,” he said.
Once primed, the software looks for patterns in about 76 different features of the responses, some of which might not be readily discernible to every human scorer, the company maintains.
Some are structural, mechanical elements, such as spelling, punctuation, syntax, and subject-verb agreement. Other features involve content— “concepts and relationships among those concepts,” said Mr. Elliot.
“It ultimately comes down to vocabulary,” he said.
All those patterns, layered together and anchored in the human-scored samples, create an effective scorer, Mr. Elliot argued.
“The bottom line,” he said, “is our engine typically matches [human] experts more often than two [human] experts can match each other.”
And, the computer “doesn’t need a cigarette break, doesn’t need a cup of coffee, and scores the first and last essay the same,” he said.
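One common way to put a number on a claim like Mr. Elliot's is the exact-agreement rate: the share of essays on which two scorers assign the same score. The article does not say which measure Vantage Learning uses, so the sketch below is an assumption, and the three score lists are invented purely to show the arithmetic.

```python
def exact_agreement(scores_a, scores_b):
    """Fraction of essays on which two scorers gave the same score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


# Invented scores for eight essays on a 1-to-5 scale (illustration only).
human_1 = [5, 4, 3, 3, 2, 4, 1, 5]
human_2 = [5, 3, 3, 4, 2, 4, 2, 5]
machine = [5, 4, 3, 4, 2, 4, 1, 5]

print("human 1 vs. human 2:", exact_agreement(human_1, human_2))
print("machine vs. human 1:", exact_agreement(machine, human_1))
print("machine vs. human 2:", exact_agreement(machine, human_2))
```

In this made-up data, the machine agrees with each human more often than the two humans agree with each other, which is the pattern the company describes. Whether that holds on real essays is what the state pilots are meant to check.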
The essay-scoring engine created by Knowledge Analysis Technologies uses another analytical method, called “latent semantic analysis,” that is based on a broader model of English, said Lynn A. Streeter, the business-development officer of the company, based in Boulder, Colo.
It involves creating three lexicons, or collections of words: The first is a general model of English for the typical test-taker, such as a college freshman; the second is words pertaining to the subject of the test; the third is specific to each essay question, she said.
Ms. Streeter claims that having the first “general semantic space” allows the computer to recognize student responses that might be further afield from the average. For example, she said, if the word “doctor” was consistently used in a sample essay question, “then somebody writes a test essay in which they refer to a dermatologist, in our model we’d know that it’s very close to doctor and essentially means almost [the] same thing.”
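The “doctor” and “dermatologist” example can be illustrated with a simplified stand-in for latent semantic analysis. The sketch below is not the company's engine. It skips the matrix-decomposition step that full latent semantic analysis relies on and simply represents each word by the other words it appears with, so that words used in similar contexts come out as similar. The six-sentence corpus, the stopword list, and the function names are all invented for the example.

```python
import math
from collections import Counter, defaultdict

# Invented mini-corpus standing in for a general model of English.
corpus = [
    "the doctor examined the patient and prescribed medicine",
    "the dermatologist examined the patient and prescribed skin medicine",
    "the doctor treated the patient at the clinic",
    "the dermatologist treated a rash at the clinic",
    "the farmer planted corn in the field",
    "the farmer harvested wheat in the field",
]

# Very common words carry little meaning, so they are ignored here.
STOPWORDS = {"the", "a", "and", "at", "in"}


def context_vectors(sentences):
    """Map each word to counts of the words that share its sentences."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = [w for w in sentence.split() if w not in STOPWORDS]
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two context-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


vectors = context_vectors(corpus)
close = cosine(vectors["doctor"], vectors["dermatologist"])
far = cosine(vectors["doctor"], vectors["farmer"])
print(f"doctor vs. dermatologist: {close:.2f}")
print(f"doctor vs. farmer: {far:.2f}")
```

Because “doctor” and “dermatologist” keep the same company in this toy corpus (patients, medicine, clinics), they score as near neighbors, while “doctor” and “farmer” share no context words at all.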
Potential Problems
But the use of essay-scoring software faces some big hurdles before becoming a part of state or federally mandated academic assessments. For starters, the uneven availability of computers and high-speed Internet connections in schools is a problem.
In addition, several studies by Boston College researchers suggest that students perform better on essay tests when the test-delivery method—whether on paper or computer—is the same method they use for regular writing assignments.
For now, Ms. Streeter said, machine scoring of essays is best used to grade practice tests or to help teachers wade through student writing exercises, which would allow teachers to assign more such exercises. “It should be more about helping a person, than ‘you flunk,’” she said.
For example, her company’s essay-scoring tool is used in a literacy project at the University of Colorado, called “Summary Street,” in which students in grades 3-12 write summaries of book chapters they have read. The computer gives them feedback on how to improve their writing and points out concepts they have missed.
Michael K. Russell, a researcher at the Center for the Study of Testing, Evaluation, and Assessment, at Boston College, suggests that essay-scoring software might be best used as a diagnostic tool to analyze student essays to reveal misconceptions about academic topics.
Beyond that, Mr. Russell said, increased use of essay-scoring technologies must first be matched by more use of computers for student writing and classroom learning.
Coverage of technology is supported in part by the William and Flora Hewlett Foundation.