“Like a steady drip from a leaky faucet, the experimental studies being released this school year by the federal Institute of Education Sciences are mostly producing the same results: ‘No effects,’ ‘No effects,’ ‘No effects.’ ” So began Education Week’s description this past spring of findings from the more rigorous, scientific evaluations conducted by the 7-year-old research agency. (“‘No Effects’ Studies Raising Eyebrows,” April 1, 2009.)
The reports evaluated the impact of a wide array of federally supported programs and practices, in areas such as professional development, after-school programs, mentoring of new teachers, and Title I instruction. Often—though not always—the results were “no effects.”
These findings have agitated program developers and academics, who hate to see their widely touted remedies fail to significantly help children when implemented on a large scale. It is less clear that veteran educators feel the same way. After all, most of them have seen too many widely touted research-based innovations come and go with little serious change in student achievement.
The “no effects” results also raise a multitude of questions. Should we stop conducting large-scale evaluations that reflect typical implementation and rely exclusively on studies of tightly controlled implementations of curricular programs? Could the problem be that the methodology is flawed?
On a deeper level, was the National Academy of Sciences wrong when its panel determined in 2002 that there was no reason to conclude that rigorous scientific research could not be conducted in education? Was Congress also wrong in insisting that the federal government improve the quality of education research? Is it really harder to conduct rigorous research in education than in equally messy fields such as public health, job training, mental health, and others that rely on hard-to-quantify human responses and are subject to multiple variables?
As one of the principal investigators of a recent IES study of supplemental reading-comprehension programs—a study that also found no significant effects, and even one negative effect—I will attempt to articulate the logic behind the type of evaluations conducted by the IES over the past six years. Here, for a start, are four frequently heard statements about these evaluations that can be answered true or false:
• Standards for demonstrating that an instructional program is effective have gotten far more demanding over the past five to eight years.
TRUE. Advances in our understanding of statistics and research design have made it a good deal more difficult to find significant effects in educational research now than 20 years ago. With the advent of hierarchical linear modeling, we now understand the errors of our earlier ways, when we sometimes disregarded the fact that we assigned classes or schools to intervention conditions and then acted as if we had assigned the students to these conditions. The advantage of this realization is that now, when we find that effects are significant, we are much more certain that an intervention will succeed in similar settings.
There are several downsides, however. In the past, we were often happy with 30 or 40 students per condition. Now we need 30 classes or 20 schools to ensure that analyses are accurate according to contemporary understanding of statistical assumptions. Thus, it is much more difficult to obtain significant effects, since we are looking at large-scale use of the program and we can no longer have graduate students closely supervising implementation in each classroom on a daily basis.
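To make the clustering point concrete, here is a minimal sketch in Python. It is not drawn from any of the studies discussed here; the intraclass correlation of 0.20, the class sizes, and the effective_sample_size helper are illustrative assumptions. It applies the standard design-effect adjustment to show why 40 students nested in a couple of classrooms carry far less statistical information than 40 independently assigned students, and why dozens of classrooms are needed instead.

```python
# Minimal sketch (illustrative only): the standard design-effect adjustment
# for cluster-randomized studies. When whole classes are assigned to a
# condition and students within a class resemble one another (nonzero
# intraclass correlation, ICC), the effective sample size shrinks.

def effective_sample_size(n_clusters: int, cluster_size: int, icc: float) -> float:
    """Approximate effective N: total students divided by the design effect."""
    total_students = n_clusters * cluster_size
    design_effect = 1 + (cluster_size - 1) * icc  # DEFF = 1 + (m - 1) * ICC
    return total_students / design_effect

# Hypothetical numbers chosen for illustration, not taken from any IES study.
# Two classes of 20 students per condition, analyzed as if the 40 students
# had been independently assigned:
print(round(effective_sample_size(n_clusters=2, cluster_size=20, icc=0.20), 1))   # 8.3
# Thirty classes of 20 students per condition, with clustering acknowledged:
print(round(effective_sample_size(n_clusters=30, cluster_size=20, icc=0.20), 1))  # 125.0
```

Under these assumptions, two classes of 20 students behave roughly like eight independent students, while 30 classes of 20 behave like about 125. Hierarchical linear models make this loss of information explicit rather than ignoring it, which is why studies that once got by with 30 or 40 students per condition now require dozens of classrooms or schools.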
Use of more accurate procedures, however, makes it more likely that any program with a significant effect is, in fact, likely to be effective in the realistic world of schools. In the long haul, this can reduce a major source of cynicism in the profession—the fact that when schools adopt so-called “research-based interventions” they rarely experience the growth that developers or academics promise.
• Many of the recent IES studies include classrooms where teachers partially implemented the programs, or implemented them with low quality.
TRUE. Why include all teachers, rather than only those who implement the program faithfully? A clue comes from the field of public health, which realized over time that it was critical to evaluate what people really do, rather than what they should or say they do, in order to find and disseminate effective means for assisting patients with diabetes, for example, or parents with aggressive children.
So, in recent studies, implementation is not tightly controlled. But it is always measured. Often, as in the recently released reading-comprehension study, only about 80 to 90 percent of teachers reported using the curriculum regularly, and, for some aspects of each of the programs, implementation of the procedures advocated by the developers ranged from a low of 44 percent in one classroom to as high as 100 percent in others. This was precisely the situation we had hoped for as evaluation researchers: an opportunity to assess the impact of supplemental programs on the comprehension of 5th graders whose classroom experience reflected typical implementation. In other words, the teachers in the study received the typical two to three days of training with a few follow-up visits, which is what publishers and districts generally provide when a district decides to purchase a supplemental reading-comprehension program.
One reason there are often “no effects” in such large-scale evaluations is that the intervention programs under review were just educated guesses as to what might work. One sobering insight from these studies is that conventional wisdom does not always, or even often, lead to the desired outcomes in the real world of schools.
• The recent set of large-scale studies has consistently found no effects.
FALSE. An evaluation of Reading First demonstrated significant effects in word reading but not comprehension. The study of reading professional development indicated that teachers did learn significantly more about how to teach, and did change their practice; there was no significant impact, however, on their students’ reading achievement. The recently released study of four mathematics curricula showed that, at the end of 1st grade, two of them—Math Expressions and Saxon Math—led to significantly better outcomes than the other two: Both programs were far more systematic in their approach than the others were. And, in the 9th grade reading studies, effects were double for the schools that implemented programs reasonably faithfully.
Evaluations that include all types of classrooms and measure implementation allow for understandings of whether careful implementation matters. In some cases, it does not. We should then question the “expert guidance” provided by the developers.
• We learn nothing from studies with no significant effects.
FALSE. In the reading-comprehension study, we found that using supplemental programs and providing several days of training does not dramatically increase the use of productive teaching practices or improve students’ comprehension ability. Although hardly a cause for celebration, this is important information for schools to have.
With no-effects results, we have learned that much—but not all—of the conventional wisdom about best practices does not, in fact, raise student-achievement levels. For example, the results of the reading-comprehension study support the belief of prominent comprehension researchers, such as Isabel Beck, that merely having students practice “strategies”—summarization or sequencing, for example—doesn’t improve comprehension. Perhaps longer-term investments in innovative approaches to building teachers’ skill at engaging students in dialogue about text may pay off in a way that use of supplemental curricula does not.
Yet this kind of research did demonstrate that two widely available intervention programs for struggling 9th grade readers led to consistent gains, and that inquiry-based mathematics instruction led to lower achievement levels than curricula that systematically and explicitly taught math in a way that linked the arithmetic procedures 1st graders learned to the underlying mathematical principles.
Rigorous research can begin to liberate the field from its tendency to jump on every fad that claims to have some research basis. In the long term, we might even be able to discern the few widely advocated practices that actually do help students learn. In the meantime, there are likely to be more disappointing findings and more questioning of conventional wisdom—which, after all, is one key to the development of a science.
As Bob Dylan wrote 45 years ago, “There’s no success like failure.” He also bleakly noted that “failure’s no success at all.” Development of feasible programs that succeed in the real world of schools is, as the education researchers Milbrey McLaughlin and Richard F. Elmore have noted, “steady work.” It requires a mix of innovative thinking, often grueling and frustrating field-testing, and the ability to rethink approaches that don’t succeed when rigorously evaluated.