

5 Comparative Studies
Pages 96-166



From page 96...
... Student placement and curricular choice are decisions that involve multiple groups of decision makers, accrue over time, and are subject to day-to-day conditions of instability, including student mobility, parent preference, teacher assignment, administrator and school board decisions, and the impact of standardized testing. This complex set of institutional policies, school contexts, and individual personalities makes comparative studies, even quasi-experimental approaches, challenging, and thus demands an honest and feasible assessment of what can be expected of evaluation studies (Usiskin, 1997; Kilpatrick, 2002; Schoenfeld, 2002; Shafer, in press).
From page 97...
... , and commercially generated) in relation to their reported outcome measures.
From page 98...
... Therefore, of the 95 comparative studies, 67 studies were coded as NSF-supported curricula and 28 were coded as commercially generated materials. The 11 evaluation studies of the UCSMP secondary program that we reviewed, not including White et al.
From page 99...
... are not shown above since no comparative studies were reviewed. ... particular curricular program evaluations, in which case all 19 programs are listed separately.
From page 100...
... Synthesis studies referencing a variety of evaluation reports are summarized in Chapter 6, but relevant individual studies that were referenced in them were sought out and included in this comparative review. Other less formal comparative studies are conducted regularly at the school or district level, but such studies were not included in this review unless we could obtain formal reports of their results, and the studies met the criteria outlined for inclusion in our database.
From page 101...
... Definition of the outcome measures and disaggregated results by program; 6. The choice of statistical tests, including statistical significance levels and effect size; and 7.
From page 102...
... Using this rubric, the committee identified a subset of 63 comparative studies to classify as at least minimally methodologically adequate and to analyze in depth to inform the conduct of future evaluations. There are those who would argue that any threat to the validity of a study discredits the findings, thus claiming that until we know everything, we know nothing.
From page 103...
... Most of the studies reported on multiple grade levels, as shown in Figure 5-2. Using the seven critical design elements of at least minimally methodologically adequate studies as a design template, we describe the overall database and discuss the array of choices on critical decision points with examples.
From page 104...
... DESCRIPTION OF COMPARATIVE STUDIES DATABASE ON CRITICAL DECISION POINTS

An Experimental or Quasi-Experimental Design

We separated the studies into experimental and quasi-experimental, and found that 100 percent of the studies were quasi-experimental (Campbell and Stanley, 1966; Cook and Campbell, 1979; Rossi et al., 1999).1 Within the quasi-experimental studies, we identified three subcategories of comparative study.
From page 105...
... A third category of comparative study involved a comparison to some form of externally normed results, such as populations taking state, national, or international tests or prior research assessment from a published study or studies. We categorized these studies and divided them into NSF, UCSMP, and commercial, and labeled them by the categories above (Figure 5-3).
From page 106...
... It is also a challenge in curriculum implementation because students coming into a program do not experience its cumulative, developmental effect. Longitudinal studies also have unique challenges associated with outcome measures; a study by Romberg et al.
From page 107...
... Scores are reported as the mean percentage correct for a series of tests on number computation, number concepts and applications, geometry, measurement, and data analysis. It is difficult to compare performances on different tests over different groups over time against a single longitudinal group from EM, and it is not possible to determine whether the students' performance is increasing or whether the changes in the tests at each grade level are producing the results. Thus the results from longitudinal studies that lack a control group or sophisticated methodological analysis may be suspect and should be interpreted with caution.
From page 108...
... The third category of quasi-experimental comparative studies measured student outcomes on a particular curricular program and simply compared them to performance on national tests or international tests. When these tests were of good quality and were representative of a genuine sample of a relevant population, such as NAEP reports or TIMSS results, the reports often provided a reasonable indicator of the effects of the program if combined with a careful description of the sample.
From page 109...
... . Additionally, one must be sure that the outcome measures are appropriate for the range of performances in the groups and valid relative to the curricula under investigation.
From page 110...
... In the comparative studies, investigators first identified participation of districts, schools, or classes that could provide sufficient duration of use of curricular materials (typically two years or more) , availability of target classes, or adequate levels of use of program materials.
From page 111...
... . To further provide a fair and complete comparison, adjustments were made based on regression analysis of the scores to minimize bias prior to calculating the difference in scores and reporting effect sizes.
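A minimal sketch of such a regression adjustment, under assumptions of our own (an ANCOVA-style residualization over hypothetical pretest/posttest arrays, not the procedure of any particular reviewed study): fit a pooled regression of posttest on pretest, residualize each group against that fit, and report the adjusted difference and an effect size in pooled-SD units.

```python
import numpy as np

def adjusted_difference(pre_t, post_t, pre_c, post_c):
    """ANCOVA-style adjustment: residualize posttest scores against a
    pooled regression on pretest scores, then compare group means."""
    pre = np.concatenate([pre_t, pre_c])
    post = np.concatenate([post_t, post_c])
    slope, intercept = np.polyfit(pre, post, 1)   # pooled fit of post on pre
    adj_t = post_t - (intercept + slope * pre_t)  # treatment residuals
    adj_c = post_c - (intercept + slope * pre_c)  # comparison residuals
    diff = adj_t.mean() - adj_c.mean()
    pooled_sd = np.sqrt((adj_t.var(ddof=1) + adj_c.var(ddof=1)) / 2)
    return diff, diff / pooled_sd  # adjusted difference and effect size

# Hypothetical scores: the comparison group starts slightly ahead,
# so raw posttest differences would be biased against the treatment.
rng = np.random.default_rng(0)
pre_t = rng.normal(50, 10, 120); post_t = pre_t + rng.normal(6, 8, 120)
pre_c = rng.normal(53, 10, 120); post_c = pre_c + rng.normal(3, 8, 120)
print(adjusted_difference(pre_t, post_t, pre_c, post_c))
```

The design point is that the adjustment happens before the difference and effect size are computed, so pre-existing group differences do not masquerade as curricular effects.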
From page 112...
... Most results are presented for the eight identified pairs and for an accumulated set of means. The outcomes of this particular study are described below in a discussion of outcome measures (Thompson et al., 2003).
From page 113...
... We would argue in all cases that reports of how sites are selected must be explicit in the evaluation report. For example, one set of evaluation studies selected sites by advertisements in a journal distributed by the program and in NCTM journals (UCSMP).
From page 114...
... In coding the comparative studies, we identified three types of components that help to document the character of the treatment: implementation fidelity, professional development treatments, and attention to teacher effects.

Implementation Fidelity

Implementation fidelity is a measure of the basic extent of use of the curricular materials.
From page 115...
... If the extent of implementation was used in interpreting the results, then we classified the study as having adjusted for implementation differences. Across all 63 at least minimally methodologically adequate studies, 44 percent reported some type of implementation fidelity measure, 3 percent reported and adjusted for it in interpreting their outcome measures, and 53 percent recorded no information on this issue.
From page 116...
... Evaluators contrasted the performance of students of teachers with high and low implementation quality, and showed the results on two contrasting outcome measures, the Iowa Test of Basic Skills (ITBS) and the Balanced Assessment.
From page 117...
... performance for 1996, 1997, and 1998 by level of Everyday Mathematics implementation: percentage of students who achieved the standard.
From page 118...
... A study was coded as positive if it either reported on the professional development provided to the experimental group or reported the data on both treatments. Across all 63 at least minimally methodologically adequate studies, 27 percent reported some type of professional development measure, 1.5 percent reported and adjusted for it in interpreting their outcome measures, and 71.5 percent recorded no information on the issue.
From page 119...
... Across all 63 at least minimally methodologically adequate studies, 16 percent reported some type of teacher effect measure, 3 percent reported and adjusted for it in interpreting their outcome measures, and 81 percent recorded no information on this issue. One can see that the potential confounding factors of teacher effects, in terms of the provision of professional development or the measure of teacher effects, are not adequately considered in most evaluation designs.
From page 120...
... Identification of a Set of Outcome Measures and Forms of Disaggregation

Using the selected student outcomes identified in the program theory, one must conduct an impact assessment that refers to the design and measurement of student outcomes. In addition to selecting what outcomes should be measured within one's program theory, one must determine how these outcomes are measured, when those measures are collected, and what
From page 121...
... Furthermore, there are questions such as whether to present only gain scores or effect sizes, how to link pretests and posttests, and how to determine the relative curricular sensitivity of various outcome measures. The findings of comparative studies are reported in terms of the outcome measure(s)
From page 122...
... . The use of item- or content strand-level comparative reports had the advantage of permitting the evaluators to assess student learning strategies specific to a curriculum's program theory.
From page 123...
... Their design of outcome measures permitted them to examine differences in performance with and without context and to conclude with statements such as "This result illustrates that CPMP students perform better than control students when setting up models and solving algebraic problems presented in meaningful contexts while having access to calculators, but CPMP students do not perform as well on formal symbol-manipulation tasks without access to context cues or calculators" (p.
From page 124...
... These data sets demonstrate the challenge of selecting appropriate outcome measures, the sensitivity of the results to those decisions, and the importance of full disclosure of decision-making processes in order to permit readers to assess the implications of the choices. The methodology utilized sought to ensure that the material in the course was covered adequately by treatment teachers while finding ways to make comparisons that reflected content coverage.
From page 125...
... In this small study, he randomly assigned students to treatment groups and then measured their performance on four unit tests composed of items common to both curricula and their progress on the Orleans-Hanna Algebraic Prognosis Test. Peters' study showed no significant difference in placement scores between Saxon and UCSMP on the posttest, but did show differences on the embedded assessment.
From page 126...
... In Table 5-3, we document the number of studies using a variety of types of outcome measures that we used to code the data, and also report on the types of tests used across the studies.
From page 127...
... .

TABLE 5-3 Number of Studies Using a Variety of Outcome Measures by Program Type

              Total Test    Content Strands    Test Match to Program    Multiple Test
Program       Yes   No      Yes   No           Yes   No                 Yes   No
NSF           43    3       28    18           26    20                 21    25
Commercial    8     1       4     5            2     7                  2     7
UCSMP         7     1       7     1            7     1                  7     1

A Choice of Statistical Tests, Including Statistical Significance and Effect Size

In our first review of the studies, we coded what methods of statistical evaluation were used by different evaluators.
From page 128...
... This question is an important one because statistical significance is related to sample size, and as a result, studies that inappropriately use the student as the unit of analysis could conclude that significant differences exist where they do not. For example, if achievement differences between two curricula are tested in 16 classrooms with 400 students, it will always be easier to show significant differences using scores from those 400 students than using 16 classroom means.
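A small simulation makes the point concrete (a hypothetical sketch of our own, not data from any reviewed study): with classroom-level clustering and no true curricular effect, a student-level t-test is anti-conservative relative to a t-test on class means.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_classes, n_students = 16, 25   # 16 classrooms, 400 students total

# Classroom-level "random effects": scores cluster within classrooms,
# but neither curriculum is truly better than the other.
class_effects = rng.normal(0, 5, n_classes)
scores = np.array([rng.normal(70 + e, 10, n_students) for e in class_effects])
grp_a, grp_b = scores[:8], scores[8:]          # 8 classrooms per curriculum

# Student as the unit of analysis: 400 scores treated as independent
p_students = stats.ttest_ind(grp_a.ravel(), grp_b.ravel()).pvalue
# Classroom mean as the unit of analysis: 16 observations
p_classes = stats.ttest_ind(grp_a.mean(axis=1), grp_b.mean(axis=1)).pvalue

print(f"student-level p = {p_students:.4f}, class-mean p = {p_classes:.4f}")
```

Because the 400 student scores are not independent (they share classroom effects), the student-level p-value is spuriously small across repeated runs, while the class-mean analysis respects the true unit of assignment.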
From page 129...
... The analysis used students as the unit of analysis and showed a significant difference, as shown in Table 5-4. To examine the robustness of this result, we reanalyzed the data using an independent sample t-test and a matched pairs t-test with class means as the unit of analysis in both tests (Table 5-5)
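A sketch of that reanalysis, assuming eight matched pairs of class means (the numbers below are placeholders, not the values from Tables 5-4 and 5-5): run an independent-samples t-test and a matched-pairs t-test over the same pairs.

```python
from scipy import stats

# Hypothetical class means for eight matched pairs (treatment vs. comparison);
# the actual Table 5-4/5-5 values are not reproduced here.
treatment_means = [72.1, 68.4, 75.0, 70.2, 66.8, 74.3, 69.5, 71.7]
comparison_means = [69.8, 67.9, 72.5, 70.6, 64.1, 71.0, 68.2, 70.3]

# Independent-samples t-test: ignores the pairing of classes
ind = stats.ttest_ind(treatment_means, comparison_means)
# Matched-pairs t-test: exploits the pairing, testing the mean within-pair difference
paired = stats.ttest_rel(treatment_means, comparison_means)

print(f"independent:   t = {ind.statistic:.3f}, p = {ind.pvalue:.3f}")
print(f"matched pairs: t = {paired.statistic:.3f}, p = {paired.pvalue:.3f}")
```

With only 8 or 16 units, the choice between these tests, and between students and classes as units, can easily flip a finding across the significance threshold, which is the robustness question being probed here.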
From page 130...
... Underline indicates statistically significant differences between the mean percentage correct for each pair. ... in a change in finding.
From page 131...
... However, the committee further noted that in conducting meta-analyses across these studies, effect size was likely to be of little value. These studies used an enormous variety of outcome measures, and even using effect size as a means to standardize units across studies is not sensible when the measures in each
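For reference, the standardized mean difference usually meant by "effect size" is Cohen's d; this is a standard textbook definition, not a formula quoted from the reviewed studies:

    d = \frac{\bar{x}_T - \bar{x}_C}{s_p},
    \qquad
    s_p = \sqrt{\frac{(n_T - 1)\,s_T^2 + (n_C - 1)\,s_C^2}{n_T + n_C - 2}}

where \bar{x}_T and \bar{x}_C are the treatment and comparison means and s_p is the pooled standard deviation. Dividing by s_p standardizes units within a single study, but it cannot make d values comparable across studies whose outcome measures tap different constructs, which is the committee's caution here.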
From page 132...
... In designing an evaluation study, one must carefully consider, in the selection of units of analysis, how various characteristics of those units will affect the generalizability of the study. It is common for evaluators to conflate issues of representativeness for the purpose of generalizability (external validity)
From page 133...
... Results for Forms 1 and 2 of the test, for the experimental and norm group, are shown in Table 5-7 for 8th graders. In our coding of outcomes, this study was coded as showing no significant differences, although arguably its results demonstrate a positive set of ...

TABLE 5-7 Comparing Iowa Algebraic Aptitude Test (IAAT)
From page 134...
... Summary of Results by Student Achievement Among Program Types

We present the results of the studies as a means to further investigate their methodological implications. To this end, for each study, we counted across outcome measures the number of findings that were positive, negative, or showed no significant difference.
From page 135...
... FIGURE 5-9 Disaggregation of subpopulations: number of NSF and commercial studies reporting disaggregation by race/ethnicity, gender, LEP, socioeconomic status, and prior achievement.
[A second chart reports the proportion of studies by program type (NSF, UCSMP, commercial) and sample size, beginning with Small (0-299); its remaining values are not recoverable from this excerpt.]
From page 136...
... We caution readers that these results are summaries of the results presented across a set of evaluations that meet only the standard of at least minimally methodologically adequate.

TABLE 5-8 Comparison by Curricular Program Types

Proportion of Results That Are:     NSF-Supported (n=46)   UCSMP (n=8)   Commercially Generated (n=9)
In favor of treatment               .591                   .491          .285
In favor of comparison              .055                   .087          .130
Show no significant difference      .354                   .422          .585
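As an illustration of how a summary like Table 5-8 is assembled (the coded findings below are hypothetical placeholders, not the committee's database), one tallies each finding into the three categories and normalizes within program type:

```python
from collections import Counter

# Hypothetical coded findings: (program_type, finding) for each outcome measure.
findings = [
    ("NSF", "treatment"), ("NSF", "none"), ("NSF", "treatment"),
    ("UCSMP", "none"), ("UCSMP", "treatment"),
    ("Commercial", "comparison"), ("Commercial", "none"),
]

counts = Counter(findings)
for program in ("NSF", "UCSMP", "Commercial"):
    total = sum(c for (p, _), c in counts.items() if p == program)
    row = {cat: counts[(program, cat)] / total
           for cat in ("treatment", "comparison", "none")}
    print(program, {k: round(v, 3) for k, v in row.items()})
```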
From page 137...
... In effect, given the limitations of time and support, and the urgency of providing advice related to policy, we offer this filtering approach as an informal meta-analytic technique sufficient to permit us to address our primary task, namely, evaluating the quality of the evaluation studies. This approach reflects the committee's view that to deeply understand and improve methodology, it is necessary to scrutinize the results and to determine what inferences they provide about the conduct of future evaluations.
From page 138...
... · Significant results reflect inadequate outcome measures that focus on a restricted set of activities. · The results are due to evaluator bias because too few evaluators are independent of the program developers.
From page 139...
... One could also generate reasons why the curricular programs produced results showing no significance when one program or the other is actually more effective. This could include high degrees of variability in the results, samples that used the correct unit of analysis but did not obtain consistent participation across enough cases, implementation that did not show enough fidelity to the measures, or outcome measures insensitive to the results.
From page 140...
... 9. Did the outcome measures match the curriculum?
From page 141...
... These findings indicate that to date, with this set of studies, there is no statistically significant difference in results when one reports or adjusts for changes in SES. It appears that by adjusting for SES, one sees increases in the positive results, and this result deserves a closer examination for its implications should it prove to hold up over larger sets of studies.
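The statistical check behind a statement like this can be sketched as a contingency-table test: compare the distribution of outcome categories between studies that did and did not adjust for SES. The counts below are hypothetical placeholders, not the committee's tallies.

```python
from scipy import stats

# Rows: studies that adjusted for SES vs. those that did not.
# Columns: in favor of treatment, in favor of comparison, no difference.
table = [
    [14, 2, 6],   # adjusted for SES (hypothetical counts)
    [22, 3, 16],  # did not adjust   (hypothetical counts)
]
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```

A non-significant p-value here corresponds to the finding that, with this set of studies, the filter does not change the distribution of results beyond chance.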
From page 142...
... Studies of Commercial Materials and the Filters

To ensure enough studies to conduct the analysis (n=17), our filtering analysis of the commercially generated studies included UCSMP (n=8)
From page 143...
... We hypothesize, and confirm with a separate analysis, that this is because UCSMP frequently reported on treatment fidelity in their designs while studies of Saxon typically did not, and the change reflects the preponderance of these different curricular treatments among the studies of commercially generated materials.

Impact of Identification of Curricular Program on Probabilities

The significant differences reported under specificity of curricular comparison also merit discussion for studies of NSF-supported curricula.
From page 146...
... In the case of the studies of commercially generated materials, significantly different results occurred in the categories of ability and sample size. In the studies of NSF-supported materials, the significant differences occurred in ability and disaggregation by subgroups.
From page 147...
... Overall, these results suggest that increased rigor seems in general to lead to less strong outcomes, but never to reports of completely contrary results. These results also suggest that in recommending design considerations to evaluators, careful attention should be paid to having evaluators include measures of treatment fidelity; consider the impact on all students as well as on particular subgroups; use the correct unit of analysis; and use multiple tests that are also disaggregated by content strand.
From page 148...
... suggest that national tests tend to produce less positive results, with the resulting gains shifting into the category of no significant differences, suggesting that national tests demonstrate less curricular sensitivity and specificity.

TABLE 5-10 Percentage of Outcomes by Test Type

Test Type        National/Local    Local Only    National Only    All Studies
All studies      (.48, .18, .34)
From page 149...
... Our analysis was conducted across the major content strands at the level of NSF-supported versus commercially generated, initially by all studies and then by grade band. It appeared that such analysis permitted some patterns to emerge that might prove helpful to future evaluators in considering the overall effectiveness of each approach.
From page 150...
... These results are based on studies identified as at least minimally methodologically adequate. The quality of the outcome measures in measuring the content strands has not been examined.
From page 151...
... FIGURE 5-12 Major content strand result: All NSF (n=27)
From page 152...
... No content strand analysis for commercially generated materials was possible. Evaluations
From page 153...
... Equity Analysis of Comparative Studies

When the goal of providing a standards-based curriculum to all students was proposed, most people could recognize its merits: the replacement of dull, repetitive, largely dead-end courses with courses that would lead all students to be able, if desired and earned, to pursue careers in mathematics-reliant fields. It was clear that the NSF-supported projects, a stated goal of which was to provide standards-based courses to all students, called for curricula that would address the problem of too few students persisting in the study of mathematics.
From page 154...
... What is challenging is how to evaluate curricular programs on their progress toward equity in meeting the needs of a diverse student body. Consider how the following questions provide one with a variety of perspectives on the effectiveness of curricular reform regarding equity: · Does one expect all students to improve performance, thus raising the bar, but possibly not to decrease the gap between traditionally well-served and under-served students?
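The distinction in that first question can be illustrated numerically (entirely hypothetical means, for illustration only): every subgroup may improve, raising the bar, while the gap between subgroups stays unchanged.

```python
# Hypothetical mean scores before and after a curricular intervention.
pre = {"well-served": 72.0, "under-served": 60.0}
post = {"well-served": 80.0, "under-served": 68.0}

gain = {g: post[g] - pre[g] for g in pre}
gap_before = pre["well-served"] - pre["under-served"]
gap_after = post["well-served"] - post["under-served"]

print("gains:", gain)                              # both groups improve: the bar is raised
print("gap before/after:", gap_before, gap_after)  # 12.0 -> 12.0: the gap is not closed
```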
From page 155...
... Developing more effective methods to monitor the achievement of these objectives may need to go beyond what is reported in this study. Among the 63 at least minimally methodologically adequate studies, 26 reported on the effects of their programs on subgroups of students.
From page 156...
... other 37 reported on the effects of the curricular intervention on means of whole groups and their standard deviations, but did not report on their data in terms of the impact on subpopulations. Of those 26 evaluations, 19 studies were on NSF-supported programs and 7 were on commercially generated materials.
From page 157...
... to study disaggregation by subgroup, and two reported on comparative effect sizes. In the studies using statistical tests other than t-tests or Chi-squares, two were evaluations of commercially generated materials and the rest were of NSF-supported materials.
From page 158...
... One of the two evaluations of African American students' performance reported for the commercially generated materials showed significant positive results, as mentioned previously. For Hispanic students, 12 of 15 reports of the NSF-supported materials were significantly positive, with the other 3 showing no significant difference.
From page 159...
... Interactions Among Content and Equity, by Grade Band

By examining disaggregation by content strand by grade levels, along with disaggregation by diverse subpopulations, the committee began to discover grade band patterns of performance that should be useful in the conduct of future evaluations. Examining each of these issues in isolation can mask some of the overall effects of curricular use.
From page 160...
... The benefits are most consistently evidenced in the broadening topics of geometry, measurement, probability, and statistics, and in applied problem solving and reasoning. It is important to consider whether the outcome measures in these areas demonstrate a depth of understanding.
From page 161...
... It suggests that student perceptions are an important source of evidence in conducting evaluations. As we examined these curricular evaluations across the grades, we paid particular attention to the specificity of the outcome measures in relation to curricular objectives.
From page 162...
... , the common standardized outcome measures (Preliminary Scholastic Assessment Test [PSAT] scores or national tests)
From page 163...
... It is an approach that integrates content strands; relies heavily on the use of situations, applications, and modeling; encourages the use of technology; and has a significant dose of mathematical inquiry. One could ask whether this approach as a whole is "effective." That question is beyond the charge and scope of this report, but it is a worthy target of investigation if one uses proper care in design, execution, and analysis.
From page 164...
... Sixty-nine percent of NSF-supported and 61 percent of commercially generated program evaluations met basic conditions to be classified as at least minimally methodologically adequate studies for the evaluation of effectiveness. These studies were ones that met the criteria of including measures of student outcomes on mathematical achievement, reporting a method of establishing comparability among samples, and reporting on implementation elements, disaggregating by content strand, or using precise, theoretical analyses of the construct or multiple measures.
From page 165...
... 4. It is essential to examine the implementation components through a set of variables that include the extent to which the materials are implemented, teaching methods, the use of supplemental materials, professional development resources, teacher background variables, and teacher effects.
From page 166...
... In addition to these prototypical decisions to be made in the conduct of comparative studies, the committee suggests that it would be ideal for future studies to consider some of the overall effects of these curricula and to test more directly and rigorously some of the findings and alternative hypotheses. Toward this end, the committee reported the tentative findings of these studies by program type.

