In Holliston, Mass., a middle-class, college-minded suburb west of Boston, students are accustomed to taking standardized tests. And teachers like Kenneth L. Worsley, a longtime math instructor at Holliston High, usually review the test scores to get a handle on their students’ year-to-year progress.
But when results from Massachusetts’ tough new state exams began trickling in a little more than a year ago, it became apparent to Worsley that there was nothing usual about the state testing program.
Poring over the numbers, Worsley noticed that one 8th grader who was rated “advanced” in mathematics on the Stanford Achievement Test-9th Edition, the nationally normed test that Holliston students ordinarily took, fell into the “failure” category on the math portion of the state exam. Two others, both strong performers in the classroom, dropped from “advanced” on the national test to “needs improvement” on the Massachusetts test.
In all, 35 percent of the school system’s 8th graders had achieved advanced status on the Stanford-9. Yet, a few months later, only 3 percent of the same group of students had earned the same label on the state test.
“What a test!” Worsley wrote in an e-mail to colleagues around the state. How, he wondered, could two math tests given to the same group of students in such a short time span produce such different results?
Many explanations probably exist for the test-score differences the Massachusetts teacher noticed. The easiest is to dismiss the discrepancies as evidence that the two tests, though meant for students at the same grade level, do not cover identical material. But experts say the varied results may also illustrate something else: What is meant by “high standards” is largely in the eye of the beholder.
Across the country, states have been moving to hold schools, students, and educators accountable for their performance. As they do so, states must decide how high to set the bar and how fast they can expect students and schools to improve.
Although established technical procedures exist for determining passing scores and the like, the final decision about where states set the academic bar in their accountability programs is a judgment call. Someone--usually several groups of people--has to pin down what to test and how difficult to make test items, where to set cutoff scores for who passes and who gets labeled “proficient” or “in need of improvement,” how much test-score improvement is realistic to demand from schools, and how long it should take to get there.
“Regardless of the process, it’s always a decision of judgment, and people believe there’s an exact science to creating a cut-score point, but there’s not,” says Catherine L. Horn, a research associate for the National Board on Educational Testing and Public Policy, based at Boston College.
But as increasing numbers of states add teeth to their testing programs, such decisions are becoming critical. In Kenneth Worsley’s state, for example, a few scale-score points can mean the difference between graduating on time or putting in another year of school for students in the class of 2003. And making sure that academic expectations are high, yet realistically attainable, can mean the difference between the ultimate success or failure of a state’s system of standards and testing.
“It’s hard to know in the scheme of setting expectations what’s the right thing to do,” says Brian M. Stecher, a senior social scientist for the RAND Corp., a Santa Monica, Calif.-based think tank. “If you set them too low, that’s not going to lead to closing the achievement gap. If you set them too high, you’re encouraging people to ‘game’ the system.”
Discrepancies Across Exams
In building the Massachusetts Comprehensive Assessment System, known as MCAS, policymakers staked out the high end of the academic-challenge spectrum.
“It’s a very, very hard test,” says Worsley of the Holliston district, which enrolls about 3,000 students. “I don’t know how the inner-city schools in our state are ever going to get there.”
The idea was to make the state’s academic standards and its schools “world class.”
“It’s something to stretch for rather than something that simply validates the existing curriculum,” says James A. Peyser, the chairman of the state school board.
Worsley was not the only educator in the state to notice that there were sometimes big differences in how students ranked on the MCAS tests and the labels they were given on other measures.
Horn and her colleagues from the Boston College testing center did a similar, though more sophisticated, analysis using scores from four Massachusetts districts. In all four, students had also taken at least one standardized test in addition to the state tests in the same year. Those tests included the Stanford-9, Educational Records Bureau exams, and the Preliminary SAT.
Looking first at students’ scale scores on the exams, the researchers found few surprises. As might be expected, students who did well on the state tests also scored high on the other measures.
But the Massachusetts tests also assign students labels based on where they fall along an 80-point continuum. A scale score of 200 to 220, for example, signifies failure, while a student scoring 221 is classified “in need of improvement.” Higher scorers are labeled either “proficient” or “advanced.”
The problem was that students scoring at advanced levels on the comparison tests wound up in all four categories on the MCAS--much as Worsley’s students did.
“When you take an 80-point continuum and reduce it to four points, that becomes problematic,” Horn says. “And, at least in Massachusetts, the focus is really not on scale scores. It’s on performance levels because they’re so easy.”
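Horn’s point about collapsing a scale into labels can be seen in a rough sketch. Only the boundary between “failure” and “in need of improvement” (220 versus 221) comes from the figures above; the cutoffs used here for “proficient” and “advanced” are hypothetical placeholders, as is the function name.

```python
from bisect import bisect_right

# Lowest scale score in each band above "failure".
# 221 is taken from the article; 241 and 261 are HYPOTHETICAL placeholders.
CUTS = [221, 241, 261]
LABELS = ["failure", "in need of improvement", "proficient", "advanced"]


def performance_level(scale_score: int) -> str:
    """Map a 200-280 scale score to one of four performance labels."""
    if not 200 <= scale_score <= 280:
        raise ValueError("scale score must be between 200 and 280")
    # bisect_right counts how many cutoffs the score has reached.
    return LABELS[bisect_right(CUTS, scale_score)]


if __name__ == "__main__":
    # Two students one scale point apart land in different categories.
    for score in (220, 221):
        print(score, performance_level(score))
```

Under these assumed cutoffs, a one-point difference in scale score changes a student’s label, while a 19-point difference inside a band changes nothing.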
But Massachusetts officials say the more important point to keep in mind is that all the tests studied were highly correlated, because the top performers on the state test also did well, for the most part, on the other tests.
“Just because one student scores high on the [Stanford-9] but does poorly on the MCAS doesn’t tell you anything about correlation,” says Peyser.
What’s more, he notes, the two are completely different kinds of tests. “One is a criterion-referenced test based on published standards, and one is a norm-referenced test based on no published standards. If they were the same tests, we’d be wasting our money on MCAS when we could buy the Stanford tests off the shelf,” Peyser says.
While the variation found in the test scores may give the impression that policymakers were pulling the cutoff scores for their tests out of thin air, that was hardly the case. Like most states with student-testing programs, Massachusetts used some well-established test-development procedures to determine where to draw academic distinctions.
“Almost all of these tests will get challenged in court,” says P. Uri Treisman, who, as the director of the Charles A. Dana Center at the University of Texas at Austin, has watched Texas’ accountability system evolve. “A meaningful part of court proceedings, something all state agency people know, is that you have to have your psychometrics together,”--meaning the statistical underpinnings of the test design are sound.
In its process, Massachusetts policymakers used a performance-level-setting procedure known as the “booklet classification” method. In every subject area, 20-member panels made up of teachers, administrators, and community representatives spent two days reviewing examples of student responses to test questions, says Jeffrey M. Nellhaus, the state testing director.
Their task was to decide whether the work in the test booklet represented a minimal understanding of the content tested, partial mastery of the material, solid understanding, or comprehensive, in-depth understanding.
“Each panelist classified the same set of booklets,” Nellhaus explains. “Since we know the raw scores on the booklets, we could then establish the cut scores.” The state board, in turn, adopted the panels’ recommendations.
High Failure Rates
To the south of Massachusetts, Virginia--which has also set its academic goals high--relied on a 30-year-old method known as the “Modified Angoff” procedure to set the passing scores for its new state tests.
That approach centered on 20-member, geographically balanced committees that included teachers, curriculum experts, and school administrators. Committee members were shown test items and asked to judge the probability that a minimally competent person would get each right. By averaging those verdicts, the committees came up with ranges of passing scores, which were sent on as recommendations to the state board of education.
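The arithmetic behind that kind of judgment-based averaging can be sketched roughly as follows. This is a simplified illustration, not Virginia’s actual procedure: the ratings are invented, the function name is hypothetical, and reporting the lowest and highest panelist totals as the “range” is an assumption, since the committees’ exact method for deriving ranges is not described here.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """Return (mean, low, high) raw-score cutoffs from panelist ratings.

    ratings[p][i] is panelist p's estimate of the probability that a
    minimally competent student answers item i correctly.
    """
    n_items = len(ratings[0])
    if any(len(row) != n_items for row in ratings):
        raise ValueError("every panelist must rate every item")
    if any(not 0.0 <= p <= 1.0 for row in ratings for p in row):
        raise ValueError("ratings must be probabilities between 0 and 1")
    # Summing one panelist's probabilities gives the raw score that
    # panelist expects a minimally competent student to earn.
    panelist_cuts = [sum(row) for row in ratings]
    # ASSUMPTION: the recommended range is taken as min..max of panelists.
    return mean(panelist_cuts), min(panelist_cuts), max(panelist_cuts)


if __name__ == "__main__":
    # Invented ratings: three panelists, four test items.
    example = [
        [0.6, 0.5, 0.8, 0.4],
        [0.7, 0.5, 0.9, 0.5],
        [0.5, 0.4, 0.7, 0.4],
    ]
    avg, low, high = angoff_cut_score(example)
    print(f"recommended cut: {avg:.2f} (range {low:.2f} to {high:.2f})")
```

The sketch makes the judgment call visible: change a few panelists’ probability estimates and the passing score moves with them.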
Most of the time, the Virginia state school board chose scores from the high end of the range. In two cases, it exceeded committee recommendations.
The result in both Massachusetts and Virginia was a testing program with some very high hurdles for either schools or students to jump over.
In 1998, when the Massachusetts test results were reported for the first time, 81 percent of 4th graders were either failing or in need of improvement on the English/language arts exam; 71 percent of 8th graders fared just as poorly on the science/technology tests; and 74 percent of 10th graders got failing or needs-improvement ratings on the math test.
In Virginia, 98 percent of schools were given failing marks on the first administration of the Virginia Standards of Learning tests in 1998. But on some of the tests, such as 8th grade science, as many as 71 percent of individual students were earning passing grades that year. Last year, the percentage of individual students passing the tests ranged from 39 percent in 10th grade U.S. history to 85 percent in writing.
The share of schools labeled “accredited with warning”--the lowest possible grade on the tests--dropped to 12.8 percent.
The initially high failure rates prompted protests against the testing programs in both states. In Massachusetts, teachers said they were worried about the potentially harmful effect of describing so many students as somehow deficient--particularly minority students who have traditionally scored lower on standardized tests.
“My concern with labeling students so young is will we have increased dropout rates?” Worsley remarks.
Hundreds of students, most of them from suburban western Massachusetts, boycotted the MCAS tests altogether last spring. The protesters represented only a small fraction of the 220,000 4th, 8th, and 10th graders scheduled to take the tests, however.
Still, the protests have not disappeared. On this past Election Day, for example, voters in six, mostly urban districts approved a nonbinding resolution to suspend plans to use the test as a graduation requirement.
In Virginia, where penalties for poor performance are still years away, protests were more muted. But the state school board last July extended from 2001 to 2004 the date by which students will have to pass the state tests to graduate. Schools now have until 2007, rather than 2004, to get their students’ passing rates up to 70 percent in order to avoid losing their state accreditation.
“You can’t set the bar and expect everybody to jump over it in the same period of time with the same basic instruction,” says William C. Bosher Jr., Virginia’s former state superintendent. “I believe the Virginia board of education is making adjustments that will enable the time to be flexible while not forgoing the standards.”
Inch by Inch
But the transition to higher academic standards might be less painful, some observers have argued, if state policymakers set a lower academic bar in the beginning.
“There’s an axiom that if your constituents can’t meet a requirement, you’d better not pass it into law,” says Treisman of the Dana Center in Texas. “Some legislators are sensitive to that, and others set the standards so high as to violate that axiom.”
In contrast, he says, Texas policymakers in 1993 set low passing scores for the Texas Assessment of Academic Skills and then raised the bar, inch by inch. State school officials notify districts of the upcoming changes to the testing program up to two years in advance.
“The fact that the state was able to set passing standards and ratchet them up five points every year was the genius of the system,” Treisman says. Even with rising standards, overall student passing rates on the tests have increased from 53 percent in 1994 to 80 percent last spring, with some of the biggest gains coming among minority students. (In the Texas system, schools have to demonstrate that learning is improving for their minority populations as well as for their entire enrollments.)
But that system, known as the TAAS, has had its share of detractors, too. Hispanic and black students who had failed the state’s high school exit test brought an unsuccessful lawsuit against the state in 1999. Citing passing rates that were two-thirds those for white students, minority groups contended that the testing program was unfairly stacked against them.
Some Texas teachers, meanwhile, have argued that the push to do well on the tests is effectively narrowing the curriculum for all students.
“No one state has gotten all of this right straight out of the chute,” says Jim Watts, the vice president for state services for the Southern Regional Education Board, an Atlanta-based group that promotes school improvement in 23 states. “To some extent, it doesn’t get real until it’s real.”
Policymakers in Kentucky, for example, overhauled that state’s 6-year-old accountability system in 1998, going so far as to replace some tests. Now, schools have until 2014 to reach a score of 100 on an index that is based on improving dropout and retention rates as well as test scores.
Under the old system, schools were given test-score targets to meet in 20 years. They were expected to reach one-tenth of the distance toward their targets every two years, and rewards and punishments were meted out based on their progress. High-achieving schools complained they were unfairly penalized because they were topping out on the tests.
The new system sets the same target of 100 for everyone and plots a growth line for schools to follow as they move toward that goal. Schools are designated to be “in reward” or “in assistance” based on how far above or below that line they fall.
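A growth line of that kind can be sketched in a few lines. The sketch assumes the line is straight and runs from a school’s baseline index to 100 in 2014; the baseline year, the example scores, and the function names are all hypothetical, and the state’s actual bands for “in reward” and “in assistance” are not given here, so the code reports only the distance from the line.

```python
def expected_index(baseline, start_year, year, target=100.0, target_year=2014):
    """Index a school should have reached by `year` on a straight line
    from `baseline` in `start_year` to `target` in `target_year`.

    ASSUMPTION: the growth line is linear; the article does not say so.
    """
    if start_year >= target_year:
        raise ValueError("start_year must precede target_year")
    if not start_year <= year <= target_year:
        raise ValueError("year must fall between start_year and target_year")
    fraction = (year - start_year) / (target_year - start_year)
    return baseline + fraction * (target - baseline)


def gap_from_line(actual, baseline, start_year, year):
    """Positive when a school is above its growth line, negative when below."""
    return actual - expected_index(baseline, start_year, year)


if __name__ == "__main__":
    # Hypothetical school: index of 60 in 1998, index of 83 in 2006.
    print(expected_index(60, 1998, 2006))
    print(gap_from_line(83, 60, 1998, 2006))
```

Because every school aims at the same 100, a school that starts lower gets a steeper line, which is the concern some superintendents raise about the timeline.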
Whether 16 years or 20, the target date is an arbitrary number, supporters and critics agree.
“It’s a policy decision made by the legislature based on what’s reasonable and what do taxpayers think is reasonable,” says Robert F. Sexton, the executive director of the Prichard Committee for Academic Excellence, a citizens’ group that promotes school reform in Kentucky.
Even that long timeline, however, is too short, say some superintendents. “It’s our belief that only 30 percent of the schools here in Kentucky can make that mark,” says Stephen Daeschner, the superintendent in Jefferson County. With 96,000 students and the city of Louisville in its domain, his district is the state’s most urban. “It’s not realistic,” he contends.
State officials, for their part, say it’s too soon to tell whether Daeschner’s projections have any merit because the new system is just getting under way.
Achievement Underestimated
In contrast, California school officials may have underestimated the number of schools that would qualify last year as having raised their achievement-test scores. State officials had predicted that educators in 60 percent of schools would be entitled to bonuses of up to $25,000 each because of their schools’ gains. But when scores were calculated last October, two-thirds of schools--67 percent--had met their target goals.
The new program is primarily based on results from the Stanford-9. Schools in 1999 were each given a baseline score, and their improvement targets were set at 5 percent of the difference between that starting number and a statewide performance target of 800.
The 800 target is an interim number based on data from the Stanford publisher projecting that 10 percent of students across the state could score at that level. “People just sort of accepted the fact that if you’re in the top 10 percent of anything, that’s probably a good thing,” says William L. Padia, the director of the office of policy and evaluation for the California education department.
State lawmakers came up with the 5 percent figure for the improvement target; expert panels were appointed to figure out 5 percent of what. Should the formula be 5 percent of the previous year’s test scores? Or 5 percent of the average test-score growth across the state?
“They attempted to balance a number of things,” Stecher of RAND says. “They wanted to put the greatest incentive on the schools doing the poorest, and the 5 percent of the distance to the target metric does that.”
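The formula Stecher describes is simple enough to work through. The sketch below uses the two figures reported above, the 5 percent rate and the statewide target of 800; the baseline scores are invented, the function name is hypothetical, and treating a school already at or above 800 as having a zero target is an assumption.

```python
STATE_TARGET = 800   # interim statewide performance target cited above
GROWTH_RATE = 0.05   # share of the remaining distance a school must close


def improvement_target(baseline: float) -> float:
    """Points a school is expected to gain in one year.

    ASSUMPTION: schools at or above the statewide target get a target
    of zero; the article does not say how such schools are handled.
    """
    distance = max(STATE_TARGET - baseline, 0)
    return GROWTH_RATE * distance


if __name__ == "__main__":
    # Invented baselines for a low-scoring and a high-scoring school.
    for baseline in (500, 700):
        gain = improvement_target(baseline)
        print(f"baseline {baseline}: gain of about {gain:.1f} points expected")
```

A school starting at 500 is asked for roughly three times the gain of a school starting at 700, which is how the formula concentrates the incentive on the lowest-scoring schools.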
It also helped, says Padia, that state school board members knew, in adopting their new targets, that the legislature had given them the authority to make changes later as the testing program evolves.
Experts predict that such adjustments, in fact, will inevitably occur in most states.
If setting cutoff scores for performance levels is an inexact science, adds Stecher, determining what kinds of academic-growth expectations are reasonable for schools can be pretty close to an educated guess.
“We have a lot of history now with regard to setting passing rates for licensing exams or professional certification,” he says. “What we have less history on is setting standards for gains or improvements.”
But to policymakers, the bigger mistake would be to have no goals at all to which students and schools could aspire.
“You may never get there if you don’t set them high,” Sexton of Kentucky’s Prichard Committee says. “You can give us every research-based argument as to why that can’t happen, and we will ignore them all because we will not go to the public and say we can’t educate your child.”