Making Every Test Count

Almost without exception, the primary yardstick that states use when they want to judge schools and students is a test. Testing in American schools is ubiquitous. And every year, the stakes attached to such tests rise higher and higher. “If anything, the trend is to do more testing. To add more subjects, more grade levels,” says Edward G. Roeber, the vice president for external relations for Advanced Systems in Measurement and Evaluation Inc., a testing company based in Dover, N.H.

But as the stakes mount, so do public criticism and scrutiny. Sometimes states are adopting tests that don’t adequately reflect their standards for what students should know and be able to do. Too often, experts warn, policymakers are using tests in ways that ignore their limitations or outstrip their technical capacity.

That’s particularly true when existing tests are used for new, high-stakes purposes, such as deciding whether a student will graduate or a teacher will get a cash bonus.

In designing their testing programs, states frequently face a conflict between two basic goals of testing. States want tests that will improve the learning process in classrooms; they also need reliable information to hold students and schools accountable for results. Then, other factors must be considered: How can states get the most information for the least cost? How much time will the tests take? How can tests be designed so that they promote needed changes without alienating parents and the public?

There are no easy answers. Even the best tests provide only a few pieces of the evidence to judge what youngsters are learning and whether schools are doing their job.

State-level decisions about who is tested, how, and when can have far-reaching implications that affect the lives of millions of children and teachers and their principals. Yet in trying to design a system that is fair and educationally sound, each state also must negotiate the maze of political issues and financial concerns that often surround tests and what they’re used for. In too many instances, contends Robert F. Sexton, the director of the Prichard Committee for Academic Excellence, a nonpartisan coalition in Kentucky, tests are the “weak underbelly” of state accountability systems. “It’s where, if you wanted to destroy something like this, you would attack.”

“I think the big question is how you manage a complex technical project like this in a political environment,” he observes, “and especially one that has been polemicized to the point that education has in this country today.”

By 2000, every state but Iowa will have at least one form of a statewide test. And even in Iowa, the vast majority of districts voluntarily administer the Iowa Tests of Basic Skills, a national multiple-choice exam.

But state testing programs vary widely in terms of who is tested and when, the subjects that are covered, the design and format of the questions, how long the exams take, and how much they cost. In many ways, the tests reflect the educational priorities and politics of the state that gives them.

“The one constant is the fact that they are all different,” says Linda Bond, a national assessment consultant for CTB/McGraw-Hill, one of the nation’s largest commercial test publishers.

For example:

Every public school student in grades 2-11 in California takes a multiple-choice test known as the Stanford Achievement Test-9th Edition. The results of the commercially developed test are used to compare the performance of California students, schools, and districts with the rest of the country.
In Maryland, 3rd, 5th, and 8th graders spend three weeks each spring on the Maryland State Performance Assessment Program. But instead of filling in bubbles on an answer sheet, the students may work in groups, write individually about their work, and then explain why they solved tasks as they did.
Parents in Maryland generally don’t receive information about how their own children rated, since no student takes all parts of the exam. (However, the state may soon begin providing parents with information on their children.) And schools and districts aren’t compared with one another. Instead, the state uses the results to determine how far schools are from state standards and to help place low-performing schools on an academic watch list.
Massachusetts is working on assessments that will compare students and schools with those nationally as well as with the state’s own standards for what young people should know and be able to do. The results will be used both to judge schools and districts and to decide whether students should be promoted or receive a diploma.

Traditionally, states have relied on multiple-choice tests. And they have reported results in terms of “norms,” which compare students with the average for a nationally representative sample of young people.

But in the 1980s, both the format of state tests and their validity underwent severe criticism.

Educators argued that multiple-choice questions could not adequately measure complex thinking and problem-solving skills. Critics contended that the tests narrowed the curriculum and encouraged teachers to focus on drill-and-skill learning, rote memorization, and decontextualized bits of information.

Studies showed that the vast majority of students in most states were scoring “above average.” Such findings led people to question whether such norm-referenced tests had become so corrupted--through outright cheating or direct coaching on test questions--that results were meaningless.

As states adopted standards for what students should know and be able to do, they also wanted tests that reflected those decisions. An off-the-shelf test from a commercial publisher may only loosely match a given state’s academic goals and objectives.

By the late 1980s, some states had begun to experiment. In 1988, Vermont made the leap to portfolios that pull together individual students’ work over time. California, Kentucky, and other states experimented with so-called performance assessments, which ask students to write short answers or essays, conduct experiments, or complete other concrete tasks on demand.

Many of the new tests also compared students against a publicly articulated benchmark, rather than against the average for other students who took the exam. The assumption was that if experts could design better tests--ones that strongly reflected the curricular goals--they would be “worth teaching to.”

But states that went too far, too fast often outpaced the available technology and lost the public trust.

In 1994, Gov. Pete Wilson of California dismantled that state’s system of performance assessments because of concerns that it could not yield reliable scores for individual students, was too subjective, and didn’t focus enough on the basics. The public was skeptical not only of the tests, but also of some of the new instructional approaches on which they were based, says Lorraine M. McDonnell, a professor of political science at the University of California, Santa Barbara.

Wisconsin spent three years and more than $1.5 million to produce innovative performance assessments in communications skills, mathematics, and science. But in 1995 lawmakers there eliminated funding for the program and decided instead to stick with a commercially developed test and a state writing assessment.

“Basically, we lost the legislative support,” says Susan Ketchum, an education program specialist in the state education department. “It was too much money and too subjective.” The program was also time-consuming: Just taking the math exam required three class periods or more.

To reflect its academic standards, Wisconsin is revising its tests and is creating a new 12th grade graduation exam.

Criticism has been particularly vocal when new assessments are used to determine rewards and sanctions for schools, as they are in Kentucky. In 1991-92, Kentucky launched an ambitious assessment system that included portfolios in writing and math and “performance events” that students worked on individually and in groups. The results helped determine whether schools were penalized or received cash rewards.

Over the next six years, state officials struggled to address technical problems with the tests and respond to criticisms. They revived multiple-choice questions so the tests could cover more content more quickly. And they stopped using results from the performance events and math portfolios to judge individual schools. After some scores were miscalculated in 1996, the state fired its testing contractor.

But rather than build public confidence, the changes increased the public’s distrust. Last year, legislators scrapped the old system altogether.

The new tests developed to replace it will still include a mix of essay and multiple-choice questions, as well as a writing portfolio. But in addition, every student in grades 3, 6, and 9 will take a commercial, norm-referenced test. And the state will be required to measure each student’s progress from year to year.

In particular, states are discovering that parents want to be able to compare test scores for individual students. And they feel more comfortable with tests that resemble those they took as children.

It’s hard for parents, says Lisa Gross, a spokeswoman for the Kentucky Department of Education, “to see the value in a test that didn’t show how well their students were doing compared with other students.”

“There’s a real tension here,” adds Anthony S. Bryk, an education professor at the University of Chicago. To improve instruction, he explains, states may lean toward portfolios and performance assessment. But those methods, he adds, have troubling issues of reliability and validity that, for purposes of high-stakes testing, “tend to drive you back to the multiple-choice format.”

Multiple-choice tests remain the most common tool, though most states are balancing them with at least some open-ended formats, such as writing an essay. Increasingly, states also are reporting how well students do compared against a state benchmark, not just against other students.

Such changes reflect attempts to design assessments that are useful for both raising accountability and influencing instruction.

“When you want to hold schools accountable, you’re presumably looking for the least expensive, least time-consuming measure of student performance,” Roeber of Advanced Systems says. “Something that would be a consistent yardstick and that could be used consistently across the state.” Multiple-choice tests generally fit that need.

“On the other side of the equation,” Roeber adds, “are people who say, ‘I want this test to have a positive impact on what students learn and how they learn it.’” Open-ended items, written responses, and custom-developed tests may suit that need better, he says. They take more time, more money, and more effort, he acknowledges, “but it’s worth it because they have a greater impact on instruction.”

Delaware’s new testing system, for example, will include the short version of the Stanford-9, so that parents can compare their children’s performance with their peers’ nationwide. But the multiple-choice test will be embedded in a larger assessment developed by Delaware teachers to measure how well students are learning the state’s own academic standards.

Students in grades 3, 5, 8, and 10 will take reading, writing, and math tests for about 10 hours of testing per grade. Science and social studies exams will be given in grades 4, 6, 8, and 11.

“It gives us, in our minds, the best of both worlds,” says John R. Tanner, the director of assessment and analysis for the Delaware Department of Education. “It gives us the opportunity to have a purely standards-based instrument, designed and developed by Delaware educators for Delaware kids, while also being able to tell our constituents how their students are performing when compared to a national average.”

The underlying assumption is that better tests will give a more accurate picture and lead to better instruction. But whether state tests can actually improve achievement or change classroom teaching remains to be seen.

“Most standards-based assessments have only recently been implemented or are still being developed,” a recent report from the Washington-based National Research Council points out. “It is too early to determine whether they will produce the intended effects on classroom instruction.”

Some of the new testing formats may be just as susceptible to flaws as the old ones. In Kentucky, for example, mathematics gains in state test results from 1992 to 1996 were far larger than those on the National Assessment of Educational Progress, the federal program that tests a sampling of students in core subjects. That difference casts doubt on the reported improvement.

Many teachers and principals complain that the high-stakes nature of much state testing is pushing instruction in unproductive directions.

“Principals and teachers are feeling hard pressed to do the right kind of instruction because we’ve got several dozen states in which the de facto curriculum has become the test,” Larry Myatt, the director of Fenway Middle College High School in Boston, says. “How much are we going to let these very narrow ideas of accountability become our curriculum?” In addition, many state tests still do not match the state standards and won’t for several years, if then.

“In some cases, it may be that the tests are much more demanding than the standards that are on paper,” says Matt Gandal, the director of standards and assessments for Achieve Inc., an organization of governors and business leaders committed to improving U.S. schools. “In other cases, it may be that the standards are demanding, but the tests are not.”

Achieve is working with interested states to see just how well their tests and standards are aligned. In Michigan, Achieve found that the state’s assessment program was substantially more comprehensive and demanding than one might assume from reading its standards. In North Carolina, the state’s standards were strong and well balanced, but its assessments were not as challenging as the standards suggested. Achieve is also working with about 20 states to devise a common block of test items that they could embed in their existing tests to make cross-state comparisons easier.

“I don’t think we have super-strong evidence that tests are going to improve student learning,” says William A. Mehrens, a professor of educational measurement at Michigan State University. “We have some data that are consistent with that hypothesis, but they’re not gathered in a way that would permit causative inferences.”

Research by Mehrens and others does suggest that performance assessments can change instruction in desired ways. Such assessments may encourage teachers to place more emphasis on problem-solving, communication skills, or writing. But the changes are perhaps not as profound as their advocates had hoped.

In particular, researchers are finding that tests alone can’t change teaching practice.

Brian Stecher, a social scientist at the RAND Corp., who has looked at the instructional effects of both the Vermont and Kentucky programs, says the assessments have clearly changed the way some teachers teach. “But there’s a great deal of scaffolding, of support, that’s required to bring this about,” he stresses. “One of the key things is going beyond assessments to help create the standards, frameworks, curriculum, and lessons that back it up.”

McDonnell of UC-Santa Barbara has studied the influence of the Kentucky and North Carolina assessments on teaching, and has drawn similar conclusions. “Testing does seem to have an effect on classroom instruction,” she says, “like more group work, more writing, a greater use of manipulatives. But it’s having considerably less effect in terms of getting more conceptually sophisticated content into the classroom.”

Some critics fear that state assessments have tilted too far in the direction of accountability--and as a result yield too little information that is useful to teachers. They also worry that many of the tests are not sensitive enough to reflect improvements in instruction and learning, even when they do occur.

Few systems, for example, track gains by individual students over time. They include only limited information on background factors, such as poverty, that are strongly correlated with test scores. And teachers often don’t receive test results until the following school year, long after students have gone on to other classes.

Monty Neill, the executive director of the National Center for Fair & Open Testing, or FairTest, a nonprofit watchdog organization in Cambridge, Mass., complains that no matter what the test, it inevitably measures a lot less than is reflected in a state’s standards documents.

“It’s not true that the standards are represented in a comprehensive and balanced fashion in these tests,” he says. “Even the writing samples are pretty limited.”

As a result, Neill argues, educators will be encouraged to teach to the tests in a narrow sense, rather than focus on the broader subject matter the tests are supposed to sample.

In a nationally representative survey conducted by Public Agenda in conjunction with Quality Counts, 78 percent of high school students said their teachers usually spend class time helping students prepare for standardized tests. And 89 percent of students said they took the tests “somewhat” or “very seriously.”

In 1995, a national forum on assessments, convened by FairTest, released principles for testing systems that were signed by more than 80 national education and civil rights groups. The principles urge that multiple-choice and short-answer methods, if used, should be a limited part of a state assessment system. In addition, assessments intended to rank students or compare them with each other should not be a significant part of the total assessment package.

“At best, tests are indicators,” Bryk of the University of Chicago says. Too much emphasis on assessments can have a “narrowing and distorting effect because they don’t come close to capturing what we’re aiming for in education.”

What’s needed, many argue, is not just better tests but better use of them.

Unless tests are aligned with curriculum and teaching, the National Research Council warns, they should not be used to make high-stakes decisions about individual students. And no important decision about a student’s future should be made solely or automatically on the basis of a single test score.

Tests alone are also insufficient for holding schools or teachers accountable, many experts contend. They say the sooner the public recognizes that, the better.

“We’re not going to test our way out of our educational problems,” says George F. Madaus, the Boisi professor of education and public policy at Boston College. “There definitely is a place for assessment,” he adds. “We need that kind of information. The problem here is that it’s become the ultimate criterion.”

Lynn Olson

Lynn Olson was managing editor of special projects for Education Week. She also covered national policy (including “P-16” issues, NCLB standards, accountability, and reform), assessment and testing.

A version of this article appeared in the January 11, 1999 edition of Education Week