Alfred North Whitehead’s dictum about the virtues and dangers of simplicity helps explain why we are confused about what kind of evidence should be used to guide education policy. We often have lots of evidence to choose from; the problem is making sense of it and drawing the right lessons. Let’s look at some examples.
Educational researchers David C. Berliner and Audrey L. Amrein, both from Arizona State University, published a 2002 report, “The Impact of High-Stakes Tests on Student Academic Performance.” They concluded that such testing failed to have the intended positive impact on student learning and was often bad for students. The New York Times ran the story on Page 1. The Times’ editorial page, as well as others nationwide, featured the research and urged caution on the implementation of high-stakes testing. The editorial expressed particular concern over the testing included in the federal No Child Left Behind Act, under which schools can be financially penalized if test scores show that they are “in need of improvement.” It was a great story and an important one. Research evidence had thrown a serious wrench into the very heart of the Bush (and Clinton) school reform strategies. Or had it?
A week later, economists Martin Carnoy and Susanna Loeb of Stanford University reported their own findings from a similar database (since published in the journal Educational Evaluation and Policy Analysis) and, using different methods of analysis, concluded that Berliner and Amrein got it wrong: High-stakes testing actually was pretty good for kids. A month later, Margaret E. Raymond and Eric A. Hanushek of the Hoover Institution published their own analysis and concluded that high-stakes testing was actually very good for kids. They used much the same data as Berliner, Carnoy, and their respective colleagues, but analyzed it differently, aggregated the information at different levels, and drew conclusions quite different from Berliner and Amrein’s and reasonably congruent with Carnoy and Loeb’s.
These contradictions motivated statistician Henry Braun of the Educational Testing Service to conduct a new study in which he used four different modes of analysis to evaluate the data on the connections between statewide high-stakes testing and student achievement. He concluded that the decisions that researchers made about methods of analysis largely determined which kinds of findings they reported. Analyzed in some ways, the evidence showed positive effects for high-stakes testing; analyzed in other ways, there was no discernible effect.
I happen to know personally most of the players in this drama. They are all serious scholars, careful quantitative analysts, and passionate educators. They reported evidence instead of anecdote or opinion. And they disagreed wildly. What’s a policymaker (or parent, for that matter) to do, especially when we are urged to engage in “evidence-based education”?
A similar conundrum emerged a couple of years ago when a research team from Harvard University, led by political science professor Paul E. Peterson, announced the results of a carefully designed experimental study concluding that school vouchers work to raise academic achievement for poor kids. (The high-stakes-testing studies were not experiments; they were post-hoc analyses of existing databases from the states.) The Harvard team’s claims were challenged by critics, including some of their own collaborators from the policy-research firm Mathematica. The folks from Mathematica cautioned that all we can conclude from this study is that vouchers worked positively for 6th grade African-American boys in New York City. In fact, only if the scores for all the kids in these studies are combined, including those of the African-American 6th graders, would there be a statistically significant benefit for the voucher group. A columnist in The Wall Street Journal attacked the critics, arguing that as long as there was an overall positive effect and no evidence that vouchers were harmful to anyone, it made sense to proceed with this policy initiative.
Evidence is supposed to make life easier, or at least more rational, for policymakers in education. Instead of battling over ideologies, we are urged to conduct careful research, design real experiments whenever possible, collect data, and then dispassionately draw our conclusions. Would that the world were that simple. Truth is, research is all about exercising judgment under conditions of uncertainty, and even experimental designs don’t relieve us of those judgmental burdens. Designing an experiment itself involves value judgments, and interpreting its results always demands careful judgment. As the late pioneer in educational psychology Lee Cronbach often observed, even a carefully designed experiment is ultimately a case study, conducted with particular teachers and students, in particular places at a particular time. And the analysis of any study depends heavily on the analytic methods used, the level at which the data are aggregated and either combined or separated, and the interpretive powers and predilections of the scholars.
For the same reasons that jury members and Supreme Court justices often disagree with one another, and appeals courts often reverse the judgments of lower courts, evidence alone never tells the story. This is not a problem unique to education or the social sciences. Economists battle over whether lowering taxes stimulates the economy more than it increases deficits, and each side offers evidence. In medicine, cancer researchers give competing interpretations of studies on the efficacy of different kinds of mastectomies, and therefore of the value of alternative treatments. Surgeons disagree about the relative value of surgical vs. medical interventions for treatment of atherosclerosis. From global warming to diet and nutrition, scientists conduct studies, offer evidence, and disagree about practical or policy implications.
Does this mean that evidence is irrelevant and research is unnecessary? Does it mean that education policy cannot be based on careful research? Not at all. But we need to give up the fantasy that any single study will resolve major questions. We need to recognize that research evidence rarely speaks directly to the resolution of policy controversies without the necessary mediating agencies of human judgment, human values, and a community of scholars and actors prepared to deliberate and weigh alternatives in a world of uncertainty. Researchers in education (and in most other fields) are rarely neutral. Advocates cite evidence and research. Researchers themselves often are advocates. Indeed, it’s not very interesting for scholars to pursue studies of issues they don’t give a damn about.
So whose evidence should we believe? Let me propose a few preliminary guidelines for adjudicating the claims and counterclaims of conflicting studies.
First, I would live by the motto “Seek simplicity … and distrust it.” It is nearly unimaginable that any one study would support a simple policy conclusion, across the board. If a study claims to demonstrate that “bilingual education doesn’t work,” or that “all high-stakes testing is bad for kids,” or that “phonics is the only way to learn to read,” don’t trust the claim. Most studies of complex policy issues yield results that are themselves complex; they must be interpreted with caution and nuance. In the study of tuition vouchers, for example, the actual findings were highly variable in terms of effects on kids by race, grade, and location. Simple conclusions emerged only if we totally ignored all the variations and seriously oversimplified the findings.
It isn’t that simplicity is unachievable. The preponderance of the evidence on the value of holding back children who “fail” 1st grade appears to be both overwhelming and clear: Holding kids back is educationally worthless. But that’s a simple conclusion that comes from more than a decade of quite different studies, and, in particular circumstances involving particular kids, the best judgment may well be otherwise.
Second, I would give greater credence to any study conducted either by investigators who had no discernible stake in the results or, even better, by those whose findings run counter to their own values, tastes, and preferences. As Judge David S. Tatel of the U.S. Court of Appeals for the District of Columbia Circuit observed last year, it is very difficult for the courts to take social-science research evidence seriously when it often appears that the scientists doing the research have a political or ideological stake in the desired results.
If conflict of interest is a problem with pharmaceutical research, it is certainly an impediment with educational research as well. In some cases, investigators have a long and public record of advocating for one of the results they offer evidence to support. In other cases, their prior preferences are either unknown or unformed. As we typically do in qualitative studies, we should expect investigators to put their values, preferences, and commitments on the table when they offer their evidence and interpretations. It’s unrealistic to expect that every important study will be conducted by scholars who are disinterested in the findings. We need to go further to increase the credibility of evidence.
Third, I would insist that every major study with policy significance undergo serious peer review before its findings, and the policy interpretations associated with them, are trumpeted to the media. The review should deal with at least three aspects of the study. How well do the design and analysis support the claims being made in interpreting the data? What other studies offer complementary or contradictory findings, and how does this study compare with them? And perhaps most important, even if the findings meet the strict canons of scholarly work in one’s discipline, how reasonably does the evidence of this study support the more general policy claims now being put forward?
Each of the three studies on high-stakes testing did undergo some form of peer review, at least with respect to a substantial chunk of the evidence each presented. But peer review is not a universal process, and current modes of peer review for journals are unbearably slow. If this proposal is to be realistic, we need a much swifter mechanism for such critical appraisals. How can a serious form of review precede high-profile press releases and press conferences and yet not unacceptably impede dissemination?
Fourth, I would remind investigators that they have a social responsibility to act as “stewards” of their fields. They are responsible not only for zealously conducting their own studies and organizing the rhetoric to support their claims, but also, like lawyers, for serving as “officers of the court” who bear responsibility for the fidelity of their work to the integrity of their field. They should organize their studies so that someone is designated whose role and responsibility is to examine the procedures, data, and interpretations and ask: “How might it be otherwise? How consistent with the findings is an interpretation opposite to the one offered by the study directors?” In many European countries, all doctoral dissertations are defended publicly, with the participation of a formal “antagonist” whose job is to challenge the findings of the study.
A research study needs someone whose job it is to ask how susceptible the evidence and its interpretation are to intelligent (or just plain politically motivated) criticism. Journalists, for their part, have a professional obligation to vet stories about research more critically before publishing them, asking about peer review and about the questions the research’s critics have raised.
The bottom line is that we must move to a more evidence-based strategy for crafting our education policies, but we cannot pretend that there are some forms of research—even controlled experiments—that are guaranteed to provide answers to our questions without requiring the exercise of expert judgment and structured peer review. Evidence informs and enlightens decisionmaking; it does not bypass the need for interpretation and judgment. It’s unrealistic to expect that educational research will regularly be conducted by those who have absolutely no stake in the outcomes. Education is not, and never will be, a values-free zone. Nevertheless, we need ways to review research findings, evaluate the evidence, consider the values inherent in the situation, and render judgments that our citizenry can trust.
Beyond these proposals, I would recommend the formation of a new policy forum to assist in regularly reviewing and evaluating policy-relevant educational research. In some areas, we may need the equivalent of research-review SWAT teams that can be called in on a regular basis to review competing claims and the evidence that supports them. In other cases, the use of “consensus panels” can be quite useful in the face of complex, multiple studies with a range of findings, interpretations, and policy recommendations, though the pace of their efforts can be snail-like.
The National Research Council of the National Academies might well take the lead in such an activity, assisted by a range of both self-consciously partisan and intentionally nonpartisan bodies. Such forums would organize quick-response review panels and also conduct periodic reviews when serious policy controversies arise. The forum should be nongovernmental, to avoid conflicts of interest with the education policy missions of any federal, state, or local government. (The current swirl of controversy around the Bush administration’s implementation of the No Child Left Behind program exemplifies this problem.)
If we can follow those guidelines, there will remain big, unanswered questions about the impact of high-stakes testing on the achievement of elementary school kids, and about the value of vouchers to reduce educational inequality. But we will have much more confidence in the value of the evidence put forward to help us traverse the thickets of education policy. I can assure you, however, that the picture that emerges from the evidence won’t likely be simple. That’s not necessarily a problem with the quality of the research; it may simply be a characteristic of the world in which we live.