Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.

COMMISSIONED PAPERS

EFFECTS OF MANDATED TESTING ON INSTRUCTION

DESIGN INNOVATIONS IN MEASURING MATHEMATICS ACHIEVEMENT

LEGAL AND ETHICAL ISSUES IN MATHEMATICS ASSESSMENT


EFFECTS OF MANDATED TESTING ON INSTRUCTION

LYNN HANCOCK

JEREMY KILPATRICK

UNIVERSITY OF GEORGIA

The past two decades have seen a striking increase in the use of testing in the United States by school officials and legislators attempting to determine whether funds invested in schools are yielding an educated citizenry. Testing is viewed as the major instrument for holding schools accountable for the resources they have received. It has become a vital tool of state and federal education policy. Governments and local school authorities have mandated the administration of tests, usually at the end of major phases of schooling but sometimes at the end of each grade, in the belief that test scores provide critical information on how well students are learning and how effective instruction has been.

Testing of all types seems to be on the rise in the United States, but the increase in mandated testing has been especially dramatic. In 1990, 46 states had mandated testing programs as compared with 29 in 1980. As the school population increased 15 percent from 1960 to 1989, revenues from the sales of standardized tests increased 10 times as fast.1 More than a third of the elementary school teachers in a recent survey2 saw the emphasis on standardized testing in U.S. education as strong and getting stronger. Somewhat fewer saw the same increasing emphasis in their school district, and even fewer saw it in their own school. Almost no teachers, however, said that the emphasis on standardized testing was weak or diminishing.

As the amount of instructional time lost to mandated testing increases, teachers and other educators have begun to express concern about the effects of such testing on instruction. Because most of the standardized tests used in mandated testing programs are of the multiple-choice variety, particular attention has been given to the argument that these tests promote a narrow approach to teaching, passive and low-level forms of learning, and a fragmented school curriculum.

Research addressing these concerns and arguments, however, is sparse. Much of it consists of surveys of teachers' appraisals of the effects of mandated testing rather than direct observation or independent judgments of those effects. The findings are often inconclusive and sometimes conflicting.

The purpose of this paper is to review the literature on the effects of mandated testing on school instruction. Because the climate of educational testing in the United States has changed so rapidly over the past decade, we give special attention to the most recent studies. Furthermore, although the research is not confined to mandated testing in mathematics, we have tried to draw conclusions of particular relevance to the mathematics education community.

EFFECTS ON CURRICULUM

Resnick and Resnick3 portray the process by which state legislatures and departments of education use accountability programs to control curriculum content and standards of performance as follows. Tests of desired educational objectives are mandated and administered, and the scores are widely disseminated. Because of the attention given to test results, teachers gradually adapt their instruction to the test objectives and format. Adaptation of the curriculum takes place as teachers who administer a test every year have the opportunity to see test forms and compare test content with the content they are teaching. The result is that "you get what you assess, and you do not get what you do not assess."4


Others have also described this process. The term WYTIWYG—what you test is what you get—was coined by Burkhardt et al.5 to describe the effects of public examination systems on the curriculum. Burkhardt6 claims that desirable changes in the mathematics curriculum can be brought about through modest, carefully planned changes in examinations. In this way, WYTIWYG can serve as a lever of educational reform.

Some contend that the power of that lever depends on the importance that has been placed on the test results. Popham7 used the expression measurement-driven instruction to describe classroom practices motivated by consequences, or stakes, attached to the test results. He identified two types of high stakes for tests. One type is characterized by the use of scores to make important decisions about students, such as promotion to the next grade, reward of course credit, or qualification for a high school diploma. The other type of high stakes is associated with news media reports of school or district test results. Thus, high-stakes tests draw their power from educators' concerns for students' welfare and for their own standing in the community. One of Madaus's principles of measurement-driven instruction8 is that high-stakes tests have the power to transfer what was once local control over the school curriculum to the agency responsible for the examination.

We begin, therefore, by examining the evidence on whether externally mandated tests are influencing school mathematics curricula and, if so, what the nature of that influence is.

CHANGES IN CONTENT

Several recent studies have looked at the effects that mandated testing programs have on curriculum content. They show that, to various degrees, the WYTIWYG phenomenon is at work in classrooms. According to Stake and Theobold,9 the most frequently reported change in school conditions that teachers attributed to the increased emphasis on testing is greater pressure to teach stated goals. Darling-Hammond and Wise10 collected data from in-depth interviews with 43 randomly selected teachers from three large school districts in three mid-Atlantic states. When asked what impact standardized tests had upon their classroom behavior, the most common response was that they changed their curriculum emphasis. Some teachers reported that the emphasis on standardized tests has caused them "to teach skills as they are tested instead of as they are used in the real world."11

Nature of the effects As part of a study of the impact of the minimum competency testing program that the state of Kansas implemented in 1980 at grades 2, 4, 6, 8, and 10, Glasnapp et al.12 solicited the opinions of school board members, superintendents, principals, and teachers. A different questionnaire was used for each group, with 1,358 teachers participating in the 1982 survey, 816 in 1983, and 1,244 in 1987. The data show that as the test objectives were more widely distributed and as the teachers reported increased encouragement to direct their instruction to the state objectives, there was a corresponding increase in teachers' reports that the test was influencing their instruction. Over half the teachers surveyed in 1987 reported that the test objectives were valuable for identifying what needed to be taught and that they had given those objectives increased emphasis in their instruction. Nearly half said they used the state-distributed minimum competency objectives to plan instructional activities, up from 38 percent in 1983 and 23 percent in 1982. The Kansas teachers also reported that the state minimum competency testing reduced the time they spent teaching skills that the tests did not cover.

Smith and Rottenberg13 interviewed 19 elementary school teachers and then observed the classes of four of these teachers for an entire semester, during which externally mandated tests were administered. The researchers noted a definite trend, which they attributed to time constraints and a packed curriculum, to neglect topics not included on the standardized tests and to focus on those that were. Mathematics beyond what was to be covered on the tests was given very little attention.

Romberg et al.14 undertook a study of eighth-grade teachers' perceptions of the impact that their state- or district-mandated testing programs had on mathematics instruction. A national sample of 552 teachers responded to the survey questionnaire. Of the 252 respondents who said they administered a state-mandated test, 34 percent reported placing a greater emphasis on the topics emphasized on the test and 16 percent reported placing less emphasis on topics not emphasized. In reaction to their state testing programs, 23 percent of the teachers were placing a greater emphasis on paper-and-pencil computation whereas only 1 percent were decreasing that emphasis. As for familiarity with their state tests, 46 percent said they looked at the test to see whether the topics were those they were teaching whereas 33 percent reported that they did not examine the test at all.

Strength of the effects Clearly, these studies demonstrate that mandated testing is having an impact on the content of mathematics instruction, but the strength of that impact is another question. Porter et al.,15 in a review of several studies of elementary school teachers' decisions about instructional content, found little evidence that standardized tests given once a year significantly influence the choices made about what to teach.

These studies did not, however, take into account the importance the elementary school teachers attached to the tests, either for their students or for themselves. Recent studies seem to bear out the claim of Madaus16 and Popham17 that the higher the stakes attached to the test results, the greater the impact of the testing program on the curriculum. In interviews conducted by Darling-Hammond and Wise,18 teachers typically reported that when tests are used to measure teacher effectiveness or student competence, incentives are created to teach the precise test content instead of underlying concepts or untested content.

Corbett and Wilson19 studied the effects of state-mandated minimum competency testing programs in Maryland and Pennsylvania. At the time of the study, Maryland students needed passing scores on the reading and mathematics tests in order to receive a high school diploma. In Pennsylvania the purpose of the minimum competency tests in language and mathematics was to identify students in need of remedial instruction. Thus the Maryland test was considered by the researchers to have higher stakes than the Pennsylvania test. The study results consistently showed that the Maryland testing program had the more powerful influence on the school curriculum. For example, in Maryland 53 percent of the educators surveyed reported a major or total change in class content resulting from their state testing program. In Pennsylvania only 7 percent reported a major change in their instructional content.

A national survey of teachers on the influence of mandated testing on mathematics and science teaching was conducted as part of a larger study by the Center for the Study of Testing, Evaluation, and Educational Policy.20 The survey findings were based on over 1,800 responses from teachers whose classes were given mandated tests in mathematics. The results showed that teachers with high-minority classes (greater than 60 percent minority students) perceived standardized tests to be of greater importance than did teachers with low-minority classes (less than 10 percent minority students). Teachers of high-minority mathematics classes were more likely to use mandated test scores to place students in special services, to recommend students for graduation, and to evaluate student progress. These teachers also felt more pressure to improve their students' scores on mandated mathematics tests. Two thirds of the high-minority classroom teachers said their students' scores on mandated mathematics tests were below their districts' expectations, compared with only one fifth of the teachers of low-minority classes. Three quarters of the teachers of high-minority classes agreed that they felt pressure from their districts to improve their students' scores on mandated mathematics tests. Asked about the influences that mandated standardized tests have on their instructional practice, teachers of high-minority classes indicated stronger curriculum effects than did teachers of low-minority classes. Teachers of high-minority classes were more likely to be influenced by mandated tests in their choice of topics and in the emphasis they gave those topics in their mathematics classes.

Direction of the effects Although some researchers have tried to determine whether mandated testing is causing a shift in curriculum content, others have tried to discern the direction of the shift. Shepard and Smith21 reported, from interviews with and observations of kindergarten and first-grade teachers, that standardized tests at third and sixth grades have served to fix requirements for the end of the first grade. In a position paper on appropriate guidelines for curriculum content and assessment programs, the National Association for the Education of Young Children and the National Association of Early Childhood Specialists in State Departments of Education22 point to the overemphasis on test results as causing a downward shift in content, so that what used to be taught in first grade is now taught in kindergarten. The impact of such testing has even trickled down into programs for 3- and 4-year-old children.

Many of the state testing directors interviewed by Shepard23 emphasized that it is the conscious purpose of state testing programs to ensure that essential skills are taught. Several recent studies indicate that the state programs are achieving some success in steering instruction towards basic skills.

Stake and Theobold24 surveyed 285 teachers in Illinois, Indiana, Minnesota, South Dakota, Washington, Maryland, and North Carolina. When asked for a summary judgment on formal standardized testing, 36 teachers indicated that testing is helpful in many ways and 173 said it is a generally positive factor that is more helpful than harmful. Asked for the single most positive contribution that testing makes in their school, they most often cited the increased time spent teaching basic skills. Corbett and Wilson25 found that 85 percent of the Maryland educators and 30 percent of the Pennsylvania educators surveyed perceived at least a moderate spread of basic skills instruction throughout the curriculum as a result of their state minimum competency testing programs. Nearly 80 percent of the teachers surveyed by Lomax et al.26 either agreed or strongly agreed that mandated testing influences teachers so that they spend more instructional time in mathematics classes on basic skills.

Some have concluded from these studies that mandated standardized tests are causing school curricula to move towards an emphasis on basic skills. Archbald and Porter,27 however, are not so sure. They contend that mandated testing, rather than causing instruction to focus on basic skills, is merely consistent with the instructional practice that would take place in any case. Their skepticism is supported by research findings indicating that teachers have a positive view of teaching basic skills. Research by Glasnapp et al.28 found that 89 percent of the Kansas teachers surveyed were satisfied or extremely satisfied with their district's emphasis on basic skills instruction. Even though 86 percent of the teachers who participated in the study by Romberg et al.29 characterized state tests as primarily tests of basic or essential skills, only 31 percent said they placed a greater emphasis on basic skills than they would otherwise. Whether mandated testing programs are the cause or merely a contributing factor, the important point is that the resulting emphasis on basic skills is certainly far from the mathematics curriculum called for by the National Council of Teachers of Mathematics (NCTM) in their Curriculum and Evaluation Standards.30

There is also evidence that an emphasis on problem solving and critical thinking, which is in line with the NCTM Standards, is on the rise as well. Most of the teachers surveyed by Stake and Theobold31 (215 of 285) report that an increase in emphasis on problem solving and critical thinking has taken place in their schools over the last year or two. When asked what changes they thought were at least partly caused by an emphasis on testing, one of the three changes most frequently noted was a gain in emphasis on problem solving and critical thinking.

In the study by Romberg et al.,32 81 percent of the teachers reported that they knew problem-solving items were on the state test. Whereas 20 percent reported placing a greater emphasis on problem solving because of the state test, only 8 percent reported less emphasis. The researchers suspected, however, that the "teachers who consider problem solving to be on the test are probably thinking of simple word problems"33 and do not hold the broader conception of problem solving called for in the NCTM Standards.

CURRICULUM ALIGNMENT

Leinhardt and Seewald34 referred to the extent to which instructional content matches test content as overlap. They pointed out that teachers are well aware of the notion that the greater the test overlap in their instructional emphasis, the higher their students' test scores are likely to be. The result, according to Resnick and Resnick,35 is that "school districts and teachers try to maximize overlap … by choosing tests that match their curriculum. When they cannot control the test, … they strive for overlap by trying to match the curriculum to the tests, that is, by 'curriculum alignment.'"36

Though no recent studies directly address the extent to which teachers recognize and strive for overlap, various research methods have been used to measure overlap indirectly. For example, Freeman et al.37 conducted year-long case studies of several fourth-grade teachers to analyze their styles of textbook use and to determine how the different styles affected content overlap between the mathematics textbook used and five standardized tests of fourth-grade mathematics. The researchers defined five models of textbook use on the basis of their classroom observations. In every case, a substantial proportion of the problems presented during the teachers' lessons dealt with tested topics.


As noted before, the majority of the Kansas teachers who participated in a 1987 survey38 found the state's minimum competency test objectives to be a valuable guide for their curriculum and reported changing their instructional emphasis accordingly. As teachers strive to maximize overlap, some observers have expressed concern that the curriculum will eventually narrow until instruction and learning are focused exclusively on what is tested.39 However, there is little in the way of research to support the claim that the curriculum is actually narrowing in response to mandated testing programs. Only 16 percent of those same Kansas teachers had seen indications that the school curriculum was being narrowed as a result of the state minimum competency tests.40 In fact, according to Stake and Theobold,41 199 out of 285 teachers surveyed reported that a general broadening of the curriculum had taken place in their schools over the last few years.

Perhaps these differences in perspective can be attributed to the different ways in which researchers and teachers interpret "narrowing". To researchers, narrowing refers to teaching to the test. Some teachers, however, appear to interpret "narrowing" as teaching fewer topics. That only a small fraction of Kansas teachers believed that their curriculum had narrowed can perhaps be explained by two other statistics: 45 percent of them reported adding lessons or units to the curriculum as a result of the tests, whereas only 17 percent reported sacrificing instruction in other areas or skills to teach to the state objectives. In response to pressures to alter content, teachers have demonstrated a greater willingness to add topics to the curriculum than to delete them.42

The Corbett and Wilson research43 suggests that a narrower curriculum may be welcomed by some teachers. Even though 64 percent of the Maryland educators surveyed said that there had been at least a moderate narrowing of their school curriculum as a result of the state minimum competency test, 56 percent also reported at least a moderate improvement in the curriculum. In follow-up interviews, Maryland educators said the curriculum was "structured, coordinated, more focused, more defined, sequentially ordered, more systematic, consistent, and created a consciousness (about what was being taught)."44 For these Maryland teachers, narrowing was associated with bringing an unwieldy curriculum under control.


EFFECTS ON TEACHING PRACTICE

Mandated testing is also impinging on the teaching methods used in mathematics classrooms. The study of eighth-grade teachers by Romberg et al.45 offers some insight into the magnitude and nature of the impact. With respect to teaching methods, 18 percent of the teachers reported an increased use of direct instruction and only 1 percent a decreased use as a result of their state mathematics test. Alternatives to direct instruction did not prosper. Although 11 percent increased their use of small group instruction, another 9 percent decreased their small group instruction. Despite increased advocacy of cooperative learning, the 6 percent who reported an increase in its use were offset by the 10 percent reporting a decrease. Similarly, Lomax46 reported that mathematics teachers believed mandated testing had resulted in more time spent in whole group instruction.

Extended projects have been advocated as a way of engaging students in a deeper and more sustained encounter with mathematics. Mandated testing appears to work against such projects, perhaps because they are seen as taking class time that might be used to prepare for the tests. Only 2 percent of the teachers in the Romberg et al.47 sample reported an increase in extended project work, whereas 22 percent reported a decreased emphasis on extended projects. A similar phenomenon affects the use of technology in teaching when such technology is not incorporated into mandated testing programs. Whereas 5 percent of the eighth-grade teachers reported increasing their emphasis on calculator activities, 20 percent reported a decreased emphasis.48 Only 2 percent reported an increased emphasis on computer activities whereas 16 percent reported a decrease in computer activities.

These results indicate that the magnitude of the impact of mandated testing on teachers' instructional methods is rather limited, a finding that is supported by the Glasnapp et al. study.49 Only 19 percent of the Kansas teachers surveyed in 1987 said they changed their instructional practices because of the state testing program, even though most of them said they used the test results to assess their teaching effectiveness. It is quite possible that the test results gave the Kansas teachers no reason to change their teaching because students scored well on the 1987 mathematics tests. Most school principals perceived student performance on the tests to be at or above expectations. Also, how strongly a mandated testing program influences instructional methods may, as with instructional content, be a function of the test stakes. The Kansas minimum competency test was a low-stakes test.50 When Corbett and Wilson51 asked whether teachers had adopted new instructional approaches as a result of the state minimum competency testing program, they found dramatically different responses in high-stakes Maryland compared with low-stakes Pennsylvania. Nearly twice as many Maryland educators (82 percent) as Pennsylvania educators reported teachers' methods had changed at least moderately.

Although the strength of the impact of the state tests on instructional practice, as reflected in the opinions of samples of eighth-grade teachers and of Kansas teachers, may not be cause for concern, the direction in which practice is moving is. A greater reliance on direct instruction, accompanied by a de-emphasis on projects, calculators, and computers, is directly opposed to the practices envisioned by the 1991 NCTM Professional Standards for Teaching Mathematics.52

Some of the reported effects of testing on teaching practice can be considered positive. According to Glasnapp et al.,53 about half of the Kansas teachers believed that the state-mandated test of minimum competency allowed them to match their instructional methods to the performance levels of individual students. Two thirds believed that the minimum competency tests informed their instructional decisions by increasing their understanding of a student. Over two thirds of the teachers surveyed by Stake and Theobold54 perceived an increase in the attention given to differences in individual students over the last few years in their school, with many of them attributing the increase at least partly to the emphasis on testing.

TEST PREPARATION

Teachers' test preparation practices give rise to two concerns: the amount of time taken from regular instruction for test preparation and the educational value of that preparation. Certainly, some attention to teaching students how to manage testing time and answer sheets is appropriate and can lead to more valid results. As Shepard55 pointed out, however, repeated practice aimed strictly at the content of the test rather than the content domain of the test can increase scores without increasing student achievement. Teachers' test preparation practices have received much scrutiny and criticism.56 According to Madaus,57 teachers respond to the pressures of high-stakes tests by preparing students to meet the requirements of previous test questions, which are reduced to the level of strategies in which the students are drilled.

Three recent research studies have asked teachers about the extent of their test preparation practices. Smith and Rottenberg58 report that in the four elementary classes they observed, an average of 54 hours of class time was spent preparing for externally mandated standardized tests, which, in addition, required about 18 hours to administer. The survey questionnaire used by Romberg et al.59 asked the grade 8 teachers to indicate the preparation practices for which they set aside several days a year, several weeks a year, or time on a frequent and regular basis. Most typically, the teachers indicated that they prepared for the state tests only several days a year. In addition, 30 percent of the teachers reported allocating no instructional time for state test preparation and 46 percent reported no time for district test preparation.

The extent of test preparation is undoubtedly affected by several factors. Lomax et al.60 found the percentage of teachers who spent more than 20 hours of class time on mandated test preparation was three times higher for high-minority classes than for low-minority classes. Also, nearly three quarters of the teachers of high-minority classes began their mathematics test preparations a month or more before the test, well over twice the percentage of teachers of low-minority classes who did so.

Information is also available about the nature of teachers' test preparation practices. In the 1987 survey by Glasnapp et al.,61 40 percent of the teachers indicated that the Kansas minimum competency test had led to drills, coaching, and test item practice. Lomax et al.62 report that the most common practices teachers reported using in preparing students for mandated mathematics tests were teaching test-taking skills (73 percent), encouraging students to work hard (64 percent), teaching topics known to be on the test (50 percent), providing students with items similar to test items (47 percent), and using motivating materials (45 percent).


Are test preparation practices worth the valuable class time they take? Popham63 asserts that two standards must be met for test preparation activities to be considered suitable for classroom instruction: The practice must meet professional standards of ethics, and it should be of educational value to students. He considers any action that violates test security procedures to be outside the boundaries of professional ethics. Instruction aimed at increasing students' test scores without increasing their mastery of the domain of concepts or skills to be measured by the test is of no educational value.

To gauge the beliefs of educators about test preparation practices, Popham conducted a brief survey of the participants in three workshops he held in late 1989 and early 1990. The first workshop was attended by teachers and administrators from Ohio, Indiana, and Kentucky. Teachers, administrators, and school board members from Southern California took part in the last two workshops. The participants were asked to supply anonymous judgments as to whether five test preparation practices were appropriate or inappropriate: previous-form preparation, current-form preparation, generalized instruction on test-taking skills, same-format preparation, and varied format preparation. The survey included a description of the five practices and of professional ethics and educational defensibility standards. Over a quarter of the teachers said they provided test-specific materials and used practice tests to prepare their students for mandated mathematics tests.

Popham64 found substantial numbers of teachers in both of his samples who considered previous-form or current-form preparation to be appropriate. More than half of each sample deemed same-format use appropriate. Popham, on the other hand, considers all three practices to be educationally indefensible, and current-form use to be unethical. Although Popham's samples are small, his results suggest that many teachers may be willing to engage in questionable practices to improve their students' test performance.

In a study by Hall and Kleine,65 1,012 superintendents, testing coordinators, principals, and teachers responded to a questionnaire on the use of test preparation materials. These educators were from districts across the country where standardized, group-administered, norm-referenced tests were given. Of the 176 teachers surveyed, 55 percent reported using some type of test preparation materials, the most common of which were locally developed.

Mehrens and Kaminski66 examined four commercially available test preparation programs to see how well they fit the content and format of the California Achievement Test (CAT). They found substantial variation. In the most extreme case, the Test-Specific Scoring High materials were matched so closely to the CAT that, in the authors' words, their use "serves as a pre-test for the CAT in the same manner as if one actually used the CAT as a pretest prior to giving the same CAT at a later time."67 In the Hall and Kleine study,68 19 percent of the teachers using test preparation materials were using the Scoring High materials.

In an ideal world, all of the instructional materials teachers use to prepare students for standardized tests would be interesting tasks of sound instructional value. Research suggests, however, that there is a move toward test preparation practices that are debatable at best, from the standpoint of both professional ethics and mathematics education.

CLASSROOM ASSESSMENT PRACTICES

The question of whether externally mandated testing programs affect a teacher's own assessment practices has been considered by a few researchers. The answer depends on which assessment practice is being considered. Interviews conducted by Salmon-Cox69 with 68 elementary school teachers revealed that these teachers did not give standardized test scores much attention when assessing their students' progress. When discussing general assessment techniques, the teachers most frequently mentioned observation as well as teacher-made tests and interaction with students. Only three of the teachers spoke of standardized tests when discussing how they assess students.

Other studies, however, show that standardized tests do influence the ways in which teachers design their own tests. Madaus70 points out that teachers pay particular attention to the form of the questions on a high-stakes test. According to Romberg et al.,71 more than 70 percent of the grade 8 teachers surveyed perceive that typical district- and state-mandated test items require a single correct answer and are in a multiple-choice format. There is concern in the mathematics education community that such a format offers students no opportunity to gather, organize, or interpret data, to model, or to communicate, all of which are called for in the NCTM Standards.72 The form of the questions widely used on standardized tests implies to students that their task is not to engage in interpretive activity but to find, or guess, in a quick, nonreflective way, the single correct answer that has already been determined by others.73 A student who does not know an answer immediately has no way of arriving at a sensible response through thought and elaboration. In the study by Darling-Hammond and Wise,74 teachers reported feeling the need to use similar types of test items and fewer essay tests in their own assessment practice.

Only 13 percent of the teachers in the study by Romberg et al.75 reported that they were not familiar with the format and style of typical test items on the state test. Over a third considered the format and style of state-mandated test items when planning their instruction. On a one-page follow-up questionnaire sent to teachers who had not returned the first survey, 51 percent of the 142 respondents reported considering the style and format of test items when planning their own tests.

Teachers' assessment tools provide a well-defined medium for indicating to students what it is about mathematics that is most important. Because of the importance of teachers' tests and the indications from research that teachers see standardized tests as a model of what those tests should be, further study of teachers' testing practices is critical.

EFFECTS ON TEACHERS

Research studies give conflicting reports of how teachers feel about and react to mandated testing programs. According to Glasnapp et al.,76 62 percent of the Kansas teachers surveyed in 1987 either agreed or strongly agreed that the pressure on local districts to perform well on the state minimum competency tests led to undesirable educational practices. In the same survey, however, 60 percent of the teachers indicated that they considered the Kansas minimum competency testing, overall, to be beneficial to education in the state. The survey by Lomax et al.77 also gives conflicting views from teachers. Although over half of the teachers agreed that the mandated testing program in mathematics led to teaching methods that went against what they considered good instructional practice, a substantial portion, 28 percent, agreed that mandated testing helps schools achieve the goals of the current education reform movement. Most of the teachers surveyed by Stake and Theobold78 judged standardized testing to be a generally positive influence on the quality of education at their schools.

Teachers are apparently able to separate what they may see as generally favorable influences of testing on education from how they are responding in their own classes to mandated testing. The teachers in the Darling-Hammond and Wise study79 commented that they felt that teaching to standardized tests is not really teaching. Teachers committed to developing a new school curriculum at an elementary school observed by Livingston et al.80 and at a magnet high school observed by McNeil81 had a common reaction to the demands of mandated testing. Livingston et al.82 reported the experiences of Westwood School, a K–2 school in Dalton, Georgia, where the teachers undertook a revision of the state mathematics curriculum. As the changes proceeded, the teachers became concerned that the content and format of the state-mandated standardized tests might not reflect their students' experiences with the revised curriculum. The teachers believed that their desire for innovation was constrained by the need to teach to the test. The perceived conflict between the new curriculum and the curriculum based on mandated test objectives was resolved at Westwood by attempting to teach both curricula simultaneously.

McNeil83 observed the experiences of teachers at an innovative magnet high school as the district piloted a system of proficiency examinations to be administered at the end of each semester of coursework. Test results were linked to teacher merit pay and principals' bonuses. School scores were compared in the newspapers. The teachers responded by finding ways of working around the proficiencies, believing that they were too confining. The teachers coped with the pressure to teach the objectives of the proficiency-based curriculum by delivering what McNeil calls "double-entry" lessons, in which lessons geared to the proficiencies were delivered in addition to the regular course instruction.84

Smith and Rottenberg85 observed some negative effects of mandated testing on teachers. The elementary school teachers in their study expressed feelings of shame and embarrassment if their students' scores were low or did not meet district standards. The researchers noted a sense of alienation resulting from teacher beliefs that test scores are more a function of students' socioeconomic status and effort than of classroom instruction. Also, the teachers believed that test results were not properly interpreted in the community, where low scores were attributed to weak school programs and lazy teachers.

The Westwood School curriculum committee members contended that discrepancies between teachers' judgments and students' test scores lead to a deprofessionalization of teachers because of the view by parents that test scores are absolute indicators of students' learning.86 Of the Maryland teachers surveyed by Corbett and Wilson,87 58 percent reported that mandated testing has led to at least a moderate decrease in professional judgment in instructional matters. Included in the survey instrument were questions on the effects of mandated testing on teachers' work life: 70 percent of the respondents reported a major increase in demands on their time, 66 percent a major increase in paperwork, 64 percent a major increase in pressure for student performance, 55 percent at least moderate changes in staff reassignment, and 44 percent at least a moderate increase in worry about lawsuits.

These studies do not present a positive picture of the impact of mandated testing on the teachers who administer the tests. Tests required by agencies outside of the classroom have added to teachers' work loads. Meanwhile, tying students' scores to promotion or course credit decisions has eroded teachers' authority. As the research of McNeil88 and of Livingston et al.89 shows, mandated standardized testing programs can pose a real dilemma for teachers who want to implement changes in the curriculum. As long as the mathematics content of standardized tests differs from the mathematics curriculum called for in the NCTM Standards, teachers faced with mandated testing will find themselves in a difficult position. For example, even teachers who recognize the benefits of calculators often justify their reluctance to use them in their mathematics classes by arguing that students are not allowed to have them while taking standardized tests.90

EFFECTS ON STUDENTS

There is evidence that standardized test scores play a major role in determining students' educational experiences. According to Salmon-Cox,91 about one quarter of the elementary school teachers she interviewed reported that they consider standardized test results, in conjunction with other information, in grouping and tracking students. Smith and Rottenberg92 also observed that test scores were used to make decisions about placement of students into groups and tracks. They note that test scores were the single most important factor used in the decision to place children into gifted programs and into an advanced junior high school curriculum. In their study, Romberg et al.93 report that 35 percent of the eighth-grade teachers surveyed indicate that district-mandated tests influence decisions about grouping students within the class for instruction, and 62 percent say the district test scores influence recommendations of students for course or program assignments.

Mandated tests can have a negative impact on the students who take them. The National Association for the Education of Young Children and the National Association of Early Childhood Specialists in State Departments of Education94 hold that many young children experience unnecessary frustration as they struggle with developmentally inappropriate standardized tests for kindergarten and first grade. Smith and Rottenberg95 note that most teachers believe the frequency and nature of the tests and the way in which they are administered cause "stress, frustration, burnout, fatigue, physical illness, misbehavior and fighting, and psychological distress,"96 particularly in younger students.

On the other hand, Corbett and Wilson97 note some positive effects of mandated testing on students. Their survey asked teachers about their perceptions of the impact of the state minimum competency tests on students' work life. When asked if students had become more serious about their classes, 40 percent of the Maryland teachers indicated that they perceived at least a moderate change in this direction. Also, most of the Maryland teachers reported at least a moderate increase in their empathy for students who are poor achievers and in their knowledge of students with serious learning problems. The Maryland teachers were able to turn what they reported as a mostly negative influence on their own work lives into what they saw as a more positive experience for their students.


CONCLUSIONS

The movement for accountability in education that has increased the volume of mandated testing in American schools in the past two decades has come more from a desire to find out what students are learning than from a demand to change the content of that learning or to shape its acquisition. Nonetheless, raising the level of academic performance has been part of the agenda from the outset. Rewards and punishments are built into each system of mandated testing, whether what is at stake is graduation from high school or a headline in the local paper.

Because most of the mandated mathematics testing has concentrated on basic concepts and manipulative skills that can be assessed by multiple-choice or short-answer tests, any effect of that testing on the school curriculum has been to increase the already substantial attention teachers give to such concepts and skills. Teachers have not necessarily found such a narrowing of the curriculum to be bad: It allows them to direct their attention to topics some authority considers important, and it is in line with what they are likely to feel comfortable teaching. Their instructional practice has, if anything, shifted to an even greater reliance on direct instruction, which is marked by organized lessons presented through lecture and discussion to the entire class. Because teachers are familiar with direct instruction and usually feel comfortable using it, they may see the effects of mandated testing on their teaching as positive.

Teachers are not necessarily comfortable, however, with everything that mandated testing may require of them. They may feel called upon to use valuable class time preparing their students for the tests, at times engaging in practices with dubious educational or ethical value. They may feel that testing programs devalue their skills as assessors of students' learning. When the demands of mandated testing programs conflict with practices they deem more appropriate, however, they tend not to challenge these programs. Rather, they seek a middle ground in which they strive to meet both the demands of the testing program and their own view of what and how they should be teaching. Even as reports surface of some undesirable effects on students and their educational opportunities, teachers continue to see as many benefits as flaws in mandated testing. They have accommodated to the system.


The available research does not lead to the unqualified conclusion that mandated testing is having harmful effects on mathematics instruction. The picture is both more mixed and more indistinct. One reason may be that the research has focused on teachers' views of mandated testing and has seldom pushed very far beyond what teachers have said on questionnaires or in response to interview questions. Teachers may not be all that dissatisfied with testing programs that are largely in tune with their styles and beliefs. Moreover, they may not see effects that could be detected through other means.

A hint that there may be more to the story comes from the study by Stake and Theobold.98 Although none of the teachers in their sample said that the description of their school, as presented in the survey answers, was in any way biased or misleading, Stake and Theobold drew the following conclusion: "We are not satisfied with the data presented here. We do not believe these data tell us what is happening to schooling in America."99 Stake and Theobold contend that teachers—like the rest of us—lack a language for representing the curriculum so as to distinguish personal concepts of education from the official indices provided by learning objectives and test items. In other words, teachers may not be able to tell us clearly about the effects of mandated testing. As we work to develop a richer language to describe the curriculum, we need also to consider means of investigating the effects of mandated testing that do not rely exclusively on teachers' reports.

The WYTIWYG phenomenon is clearly quite limited as an explanation of how mandated testing produces effects on mathematics instruction and, therefore, learning. Students always learn some mathematics that is not tested, and they do not always learn all the mathematics being tested. In addition to portraying the student as a faultless receptacle for instruction, WYTIWYG leaves out the teacher as a medium for turning test prescriptions into learning experiences. As Silver100 observed, "perhaps WYTIWYG should be more accurately dubbed WYGIWICT—what you get is what I can teach."101

Silver's point becomes especially important as some mandated testing programs change to incorporate reforms sought in such documents as the NCTM Standards. As these programs incorporate items with extended answers, calling upon students to perform investigations of open-ended problems, write up their findings, and perhaps collect them in a portfolio, the intention is that teachers will be drawn away from a basic skills curriculum in mathematics that is delivered through direct instruction. Teachers will almost certainly find open-ended work more difficult to manage than direct instruction and an ambitious curriculum more difficult to implement than basic skills. The limitations on their ability and their willingness to teach in the ways sought by reformers will then begin to govern how the mandated testing affects their instruction. We may begin to see some teachers challenging or attempting to subvert a system of assessment that suits neither their teaching style nor their beliefs about essential mathematics content.

The effects of mandated testing on instruction have not been well studied and are not clear. Furthermore, changes currently under way in mandated testing may modify whatever effects there are. The picture given by the available research is neither so bleak as the one advanced by foes of standardized, multiple-choice testing nor so rosy as that offered by proponents of testing as the engine of school reform. It is instead a blurred picture. Improvements in research techniques and more extensive investigations may ultimately yield a more focused view. The landscape of mathematics instruction and assessment is itself changing, and even the tentative conclusions we have drawn in this paper seem unlikely to hold for long.


ENDNOTES

1  

U.S. Congress, Office of Technology Assessment, Testing in American schools: Asking the right questions (OTA-SET-519) (Washington, DC: U.S. Government Printing Office, 1992), 3-4.

2  

R. E. Stake and P. Theobold, "Teachers' views of testing's impact on classrooms," in Advances in program evaluation: Effects of mandated assessment on teaching, ed. R. E. Stake and R. G. O'Sullivan, (Vol. 1, Part B) (Greenwich, CT: JAI Press, 1991), 189-201.

3  

Lauren B. Resnick and Daniel P. Resnick, "Assessing the thinking curriculum: New tools for educational reform," in Changing assessments: Alternative views of aptitude, achievement, and instruction, ed. B. R. Gifford and M. C. O'Connor, (Washington, DC: National Commission on Testing and Public Policy, 1991), 37-75.

4  

Ibid., 59.

5  

Hugh Burkhardt, R. Fraser, and J. Ridgway, "The dynamics of curriculum change," in Developments in school mathematics education around the world (Proceedings of the Second UCSMP International Conference on Mathematics Education, 7-10 April 1988), ed. I. Wirszup and R. Streit, (Reston, VA: National Council of Teachers of Mathematics, 1990), 3-30.

6  

Hugh Burkhardt, "Curricula for active mathematics," in Developments in school mathematics education around the world (Proceedings of the UCSMP International Conference on Mathematics Education, 28-30 March 1985), ed. I. Wirszup and R. Streit, (Reston, VA: National Council of Teachers of Mathematics, 1987), 321-361.

7  

W. J. Popham, "The merits of measurement-driven instruction," Phi Delta Kappan, 68, (1987), 679-682.

8  

George F. Madaus, "The influence of testing on the curriculum," in Critical issues in curriculum (87th Yearbook of the National Society for the Study of Education, Part 1), ed. L. N. Tanner, (Chicago: University of Chicago Press, 1988), 83-121.

9  

"Teachers' views."

10  

Linda Darling-Hammond and Arthur E. Wise, "Beyond standardization: State standards and school improvement," Elementary School Journal, 85, (1985), 315-336.

11  

Ibid., 320.

12  

D. R. Glasnapp, J. P. Poggio, and M. D. Miller, "Impact of a 'low stakes' state minimum competency testing program on policy, attitudes, and achievement," in Advances in program evaluation: Effects of mandated assessment on teaching, ed. R. E. Stake and R. G. O'Sullivan (Vol. 1, Part B, pp. 101-140), (Greenwich, CT: JAI Press, 1991).

13  

M. L. Smith and C. Rottenberg, "Unintended consequences of external testing in elementary schools," Educational Measurement: Issues and Practice, 10(4), (1991), 7-11.

14  

Thomas A. Romberg, E. A. Zarinnia, and S. R. Williams, The influence of mandated testing on mathematics instruction: Grade 8 teachers' perceptions, (Madison: University of Wisconsin-Madison, National Center for Research in Mathematical Science Education, 1989).

15  

Andrew C. Porter, R. Floden, D. Freeman, W. Schmidt, and J. Schwille, "Content determinants in elementary school mathematics," in Perspectives on research on effective mathematics teaching, ed. Douglas A. Grouws and Thomas J. Cooney, (Hillsdale, NJ: Lawrence Erlbaum, 1988), 96-113.

16  

"Influence of Testing."

17  

"Measurement-driven instruction."

18  

"Beyond standardization."

19  

H. D. Corbett and B. L. Wilson, Testing, reform, and rebellion, (Norwood, NJ: Ablex, 1991).

20  

R. G. Lomax, The influence of testing on teaching math and science in grades 4-12, Appendix A: Nationwide teacher survey, (Chestnut Hill, MA: Boston College, Center for the Study of Testing, Evaluation, and Educational Policy, 1992); R. G. Lomax, M. M. West, M. C. Harmon, K. A. Viator, and G. F. Madaus, The impact of mandated testing on minority students, (Chestnut Hill, MA: Boston College, Center for the Study of Testing, Evaluation, and Educational Policy, 1992).

21  

L. A. Shepard and M. L. Smith, "Escalating academic demand in kindergarten: Counterproductive policies," Elementary School Journal, 89, (1988) 135-145.

22  

National Association for the Education of Young Children and the National Association of Early Childhood Specialists in State Departments of Education, "Guidelines for appropriate curriculum content and assessment in programs serving children ages 3 through 8: Position statement," Young Children, 46(3), (1991), 21-38.

23  

L. A. Shepard, "Inflated test score gains: Is it old norms or teaching the test?", (Paper presented at the annual meeting of the American Educational Research Association, San Francisco, 1989). (ERIC Document Reproduction Service No. ED 334 204).

24  

"Teachers' views."

25  

Testing, reform, and rebellion.

26  

Impact of mandated testing.

27  

D. A. Archbald and A. C. Porter, A retrospective and an analysis of roles of mandated testing in education reform, paper prepared for the Congressional Office of Technology Assessment (Washington, DC, 1990).

28  

"Impact of 'low stakes' testing."

29  

Mandated testing.

30  

National Council of Teachers of Mathematics, Curriculum and evaluation standards for school mathematics, (Reston, VA: Author, 1989).

31  

"Teachers' views."

32  

Mandated testing.

33  

Ibid., 84.


34  

G. Leinhardt and A. M. Seewald, "Overlap: What's tested, what's taught?" Journal of Educational Measurement, 18, (1981), 85-96.

35  

"Thinking curriculum."

36  

Ibid., 57.

37  

D. J. Freeman, G. M. Belli, A. C. Porter, R. E. Floden, W. H. Schmidt, and J. R. Schwille, "The influence of different styles of textbook use on instructional validity of standardized tests," Journal of Educational Measurement, 20, (1983), 259-270.

38  

"Impact of 'low stakes' testing."

39  

"The influence of testing."

40  

"Impact of 'low stakes' testing."

41  

"Teachers' views."

42  

R. E. Floden, A. C. Porter, W. H. Schmidt, D. J. Freeman, and J. B. Schwille, "Responses to curriculum pressures: A policy-capturing study of teacher decisions about content," Journal of Educational Psychology, 73, (1981), 129-141.

43  

Testing, reform, and rebellion.

44  

Ibid., 71.

45  

Mandated testing.

46  

Influence of testing.

47  

Mandated testing.

48  

Ibid.

49  

"Impact of 'low stakes' testing."

50  

Ibid.

51  

Testing, reform, and rebellion.

52  

National Council of Teachers of Mathematics, Professional standards for teaching mathematics, (Reston, VA: Author, 1991).

53  

"Impact of 'low stakes' testing."

54  

"Teachers' views."

55  

"Inflated test score gains."

56  

See, e.g., J. J. Cannell, How public educators cheat on standardized achievement tests, (Albuquerque, NM: Friends for Education, 1989).

57  

"The influence of testing."

58  

"Unintended consequences."

59  

Mandated testing.

60  

Impact of mandated testing.

61  

"Impact of 'low stakes' testing."

62  

Impact of mandated testing.


63  

W. J. Popham, "Appropriateness of teachers' test-preparation practices," Educational Measurement: Issues and Practice, 10(4), (1991), 12-15.

64  

Ibid.

65  

J. L. Hall and P. F. Kleine, Preparing students to take standardized tests: Have we gone too far? (Oklahoma City: University of Oklahoma, 1990). (ERIC Document Reproduction Service No. ED 334 249).

66  

W. A. Mehrens and J. Kaminski, "Methods for improving standardized test scores: Fruitful, fruitless, or fraudulent?" Educational Measurement: Issues and Practice, 8(1), (1989), 14-22.

67  

Ibid., 18.

68  

Preparing students.

69  

L. Salmon-Cox, "Teachers and standardized achievement tests: What's really happening?" Phi Delta Kappan, 62, (1981), 631-634.

70  

"The influence of testing."

71  

Mandated testing.

72  

Mathematical Sciences Education Board and Board on Mathematical Sciences, National Research Council, Everybody counts: A report to the nation on the future of mathematics education, (Washington, DC: National Research Council, 1989); Jean Kerr Stenmark, ed., Mathematics assessment: Myths, models, good questions, and practical suggestions, (Reston, VA: National Council of Teachers of Mathematics, 1991).

73  

"Thinking curriculum."

74  

"Beyond standardization."

75  

Mandated testing.

76  

"Impact of 'low stakes' testing."

77  

Impact of mandated testing.

78  

"Teachers' views."

79  

"Beyond standardization."

80  

C. Livingston, S. Castle, and J. Nations, "Testing and curriculum reform: One school's experience," Educational Leadership, 46(7), (1989), 23-25.

81  

L. M. McNeil, "Contradictions of control: Part 3, Contradictions of reform," Phi Delta Kappan, 69, (1988), 478-485.

82  

"Testing and curriculum reform."

83  

"Contradictions of control."

84  

Ibid., 483.

85  

"Unintended consequences."

86  

"Testing and curriculum reform."

87  

Testing, reform, and rebellion.

88  

"Contradictions of control."

89  

"Testing and curriculum reform."


90  

J. W. Kenelly, ed., The use of calculators in the standardized testing of mathematics, (New York & Washington, DC: College Board & Mathematical Association of America, 1989).

91  

"Teachers and standardized achievement tests."

92  

"Unintended consequences."

93  

Mandated testing.

94  

National Association for the Education of Young Children and the National Association of Early Childhood Specialists in State Departments of Education, "Guidelines."

95  

"Unintended consequences."

96  

Ibid., 10.

97  

Testing, reform, and rebellion.

98  

"Teachers' views."

99  

Ibid., 200.

100  

Edward A. Silver, "Assessment and mathematics education reform in the United States," International Journal for Educational Research, 17, (1992), 489-502.

101  

Ibid., 500.


DESIGN INNOVATIONS IN MEASURING MATHEMATICS ACHIEVEMENT

STEPHEN B. DUNBAR

UNIVERSITY OF IOWA

ELIZABETH A. WITT

UNIVERSITY OF KANSAS

Nearly a century ago, a movement was afoot in American education, a movement with its origins in a prevailing perception among educators and the public alike that our schools were failing to provide the leadership needed to prepare the next generation for the twentieth century. In a retrospective commentary on that movement and on the uses and abuses of examinations in pursuit of its educational reform efforts, McConn1 described the avowed purpose of nearly all achievement testing at the time as ensuring

the maintenance of standards, including, as already noted, the enforcement of both prescribed subject matter and of some more or less definitely envisaged degree of attainment.

If one is to raise any objections here, he must tread softly, because he is approaching what is to many educators in service, especially many of the older ones, the Ark of the Covenant. When those of us who are now in our forties and fifties were learning our trade, "Standards" was the great word, the new Gospel, in American education. To set Standards, and enforce Standards, and raise Standards, and raise them ever more, was nearly the whole duty of teachers, principals and presidents.2

McConn goes on to discuss the various unanticipated outcomes of the movement toward what he called Uniform Standards of the Nineties, the principal one being a nearly complete lack of concern for individual differences and their importance in the development of differentiated standards. A second was the use of tests as exclusionary devices rather than as instruments that could potentially guide individual students and teachers to more effective approaches to instruction and learning at all levels of education.

McConn's remarks set an appropriate context for the present discussion of design innovations in the development and evaluation of large-scale measures of achievement in mathematics; the clear impetus for new approaches is very similar to the concern for standards that was raised at the end of the nineteenth century. Perhaps it is mere coincidence that such issues come into clear focus at the end of a century. That is a debate for historians. What is at issue in this paper is the perspective from which we approach the very real concern that American society has voiced regarding the preparation of students for a way of life and work that relies increasingly on technological innovations and the ability to think critically and solve problems of a technical nature.

Mathematics instruction is presently seen as a principal vehicle through which American schools will prepare students in this domain, so it is clearly appropriate to consider the role of new tests in enhancing mathematics education. It is equally important to recognize the possible contradiction between the ideals of diversified approaches to assessment on the one hand and the specification of uniform standards of achievement for all students on the other. History does tell us that the primacy of the latter can completely undermine the anticipated benefits of the former, and today's rhetoric on standards is characterized by a uniformity of goals of instruction, albeit a well-intentioned uniformity.

Unlike the press for educational standards at the turn of the last century, the current movement for innovation in methods of measuring achievement in mathematics has the benefit of extensive experience in measuring achievement on a large scale. Although accountability remains the focus of many who are interested in using tests to monitor educational reform efforts, the profession is mindful of expanded definitions of the criteria traditionally used to evaluate measures of achievement. Content quality and cognitive complexity, generalizability and transfer, content coverage and meaningfulness, consequences and fairness, and cost and efficiency are some of the criteria that have been recently proposed to characterize expectations for traditional and new forms of assessment and the validation of their appropriateness.3 This paper addresses the problems and issues faced in attempting to satisfy these criteria when developing and implementing new mathematics assessment procedures.

The multifaceted approach to test validation advocated by Linn et al.4 encourages a less mechanistic approach to investigations of validity and forces a confrontation with what might be called the "broad brush syndrome" in educational assessment. In education and policy circles, there is a strong tendency to paint pictures of critical issues with an extremely broad brush. A given test is either valid or invalid. A performance-based approach to measuring achievement in mathematics will improve student learning of strategies for solving complex, real-world problems. Ratings of portfolios are inherently unreliable. Multiple-choice items cannot measure higher-order skills. Statements like these are symptomatic of the broad brush syndrome, a way of thinking that sees all tests of a given type as alike in their inherent attributes and influences on educational process.

To advance the debate about the role of assessment in the improvement of instruction and learning in mathematics is to approach the easel with a full palette and array of tools. This means facing the fundamental questions of validity for all types of assessments and understanding the importance of consequences, intended and unintended, in the overall evaluation of both traditional and innovative approaches to the design of instruments. Some of this effort can be exerted in the process of developing alternative measures of mathematics achievement themselves, but the effort requires that developers establish empirical grounds for the consequential and evidential bases for test use.5 Exemplary projects carried out on a national scale can provide some empirical evidence in this regard. The extent to which their results can be generalized to yet to be designed systems for large-scale assessment of mathematics achievement is a key consideration, however.

CONTENT CONSIDERATIONS FOR MATHEMATICS ACHIEVEMENT

By and large, the data available from large-scale, performance-based assessments of educational achievement come from operational assessment programs in the area of direct writing assessment.6 Writing is an area with a long tradition of performance-based measures being used either to supplement or to supplant multiple-choice tests of formal English grammar. Although curriculum specialists argue over the extent to which timed writing samples can support general inferences about achievement in writing, writing samples have been generally regarded as critical to comprehensive assessment programs in the language arts. Over time, the content domains sampled in direct writing assessments have become organized around traditional rhetorical modes of discourse, and the content specifications for the development of writing tasks and scoring protocols in many testing programs reflect an evolved conception of the domains to be sampled.7

In considering anticipated features of innovative assessments in mathematics, the definitions of content domains should be carefully evaluated. Traditionally, mathematics has been regarded by test developers as an area in which substantial consensus existed with regard to content and the sequencing of subject matter. However, with the introduction of the National Council of Teachers of Mathematics (NCTM) Standards8 for mathematics curricula, the domain of the mathematics teacher has been expanded considerably, and what was once a clear scope and sequence subject for teachers is in the process of being redefined and, as some suggest, "conceptualized as an ill-structured discipline."9

When the NCTM Standards are used as guidelines for the design of innovative approaches to assessment, complications for measurement arise as a result of the interdisciplinary nature of the standards and the media through which certain standards may be amenable to assessment. For example, Standard 2 describes various ways in which mathematics is used to communicate ideas, and the statements of objectives include such phrases as "reflect and clarify thinking," "relate everyday language to mathematical language," and "representing, discussing, reading, writing, and listening to mathematics." Standard 3 (mathematics as reasoning) emphasizes the importance of explaining mathematical thinking and justifying answers. The goals for instruction reflected by these standards entail an integration of formal mathematical thinking with more generalized reasoning and problem solving throughout the school curriculum, generally observed by teachers through verbal interactions with learners. A more integrated approach to curriculum design is seen as critical for the development of higher-order thinking skills that will be required in an increasingly technological society. The implications for measurement have to do with the role of more generalized cognitive skills in the observable outcomes of mathematics learning.

The QUASAR project described by Lane10 illustrates one formal attempt to use the NCTM Standards as the principal basis for structuring a large-scale nontraditional assessment of mathematics achievement. The project was created to demonstrate the feasibility of implementing programs based on the NCTM Standards in middle schools located in economically disadvantaged communities. Development efforts in the QUASAR project focused on the specification of a content framework for both tasks and scoring rubrics. The content framework for the performance tasks can be understood in traditional terms as a table of specifications with process and content dimensions; however, the dimensions of the QUASAR blueprint for task development are not isomorphic with those used in traditional test development for achievement in mathematics. In addition to a more detailed explication of the cognitive processes associated with mathematics content, the QUASAR frameworks included dimensions for mode of problem presentation (e.g., written, pictorial, graphic, arithmetic stimulus materials) and for task context (whether or not the task was placed in a realistic setting).

Assessment frameworks in QUASAR also incorporated carefully delineated specifications for scoring. As was done originally in the development of the Writing Supplement for the Iowa Tests of Basic Skills, the QUASAR mathematics assessment developed a focused-holistic scoring protocol11 for each task. The scoring protocols were organized with respect to three criteria for evaluating responses: mathematical knowledge, strategic knowledge, and communication. As discussed by Lane,12 these criteria were used to develop specific scoring rubrics for each task, rubrics that at once reflected the unique mathematical demands of the task and the common framework of standards that raters were to use in scoring. The influence of this structure for task development and scoring on the technical quality of results from the QUASAR assessments is discussed below.
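The structure described above — common criteria at the top level, with level descriptors tailored to each task — can be sketched as a small data structure. This is a hypothetical illustration only: the criterion names come from the text, but the task, the descriptors, and the function are invented and do not reproduce QUASAR's actual scoring materials.

```python
# Hypothetical sketch of a focused-holistic scoring protocol: three common
# QUASAR criteria, with task-specific descriptors for each 0-4 score level
# (the descriptors below are invented for illustration).
CRITERIA = ("mathematical_knowledge", "strategic_knowledge", "communication")

ladder_task_rubric = {
    "mathematical_knowledge": [
        "no relevant mathematics", "major conceptual errors",
        "partial understanding", "minor errors only", "complete and correct",
    ],
    "strategic_knowledge": [
        "no strategy evident", "inappropriate strategy",
        "partially workable strategy", "workable strategy, minor gaps",
        "efficient, fully executed strategy",
    ],
    "communication": [
        "no explanation", "unclear or unrelated explanation",
        "incomplete explanation", "mostly clear explanation",
        "clear, complete justification",
    ],
}

def describe(rubric: dict, ratings: dict) -> dict:
    """Map a rater's 0-4 level on each criterion to its task-specific descriptor."""
    return {c: rubric[c][ratings[c]] for c in CRITERIA}

profile = describe(ladder_task_rubric, {
    "mathematical_knowledge": 3, "strategic_knowledge": 4, "communication": 2,
})
```

The point of the structure is that raters score every task against the same three criteria, while the meaning of each score level is anchored to the mathematics of the particular task.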

The careful delineation of the conceptual framework for developing the QUASAR assessment instrument given by Lane13 provides a clear picture of the magnitude of a development effort that responds to the current demands of subject-matter and measurement specialists for high degrees of content quality in new measures of mathematics achievement. The content framework alone specifies ten cognitive processes, six content categories, six modes of stimulus presentation, and two levels of task context, for a total of 720 potentially distinct types of tasks for performance assessments in mathematics. Clearly it would be absurd to propose that filling all the cells in the QUASAR content framework is necessary to constitute a content-valid and appropriate set of performance tasks for purposes of measurement. In practice, a given task likely involves many cognitive processes simultaneously, and it may display information in several modes of representation. However, the QUASAR example does serve to highlight the many aspects of a large-scale performance assessment that must be monitored to ensure fidelity to the evolving content standards of the mathematics community.
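The combinatorics of a blueprint like QUASAR's can be checked directly. Only the dimension sizes (10, 6, 6, 2) come from the text; the labels below are placeholders, not the project's actual category names.

```python
from itertools import product

# Placeholder labels; only the dimension sizes (10, 6, 6, 2) follow the text.
processes = [f"process_{i}" for i in range(10)]   # cognitive processes
contents  = [f"content_{i}" for i in range(6)]    # content categories
modes     = [f"mode_{i}" for i in range(6)]       # modes of stimulus presentation
contexts  = ["realistic", "abstract"]             # levels of task context

# Every cell of the framework is one (process, content, mode, context) tuple.
cells = list(product(processes, contents, modes, contexts))
print(len(cells))  # 10 * 6 * 6 * 2 = 720 potential task types
```

Enumerating the cells makes the practical point obvious: even a modest number of classification dimensions yields a task space far too large to sample exhaustively, which is why coverage must be monitored rather than completed.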

Beyond the care that should go into the test development process to ensure content quality, it is important to recognize ways in which evolving content standards may unknowingly undermine the content validity of an assessment. Baker14 described the difficulty of measuring certain complex skills with responses to extended performance tasks without placing perhaps undue weight on an examinee's facility with language in constructing the formal response that is the focus of evaluation. Standard 2 from the NCTM Standards explicitly identifies verbal components of the cognitive domain of mathematical competence. In accord with this standard, one of the QUASAR tasks asks students to study a bar graph depicting a typical day in someone's life and to respond to the graph by writing a brief story about a day in that person's life. For mathematics assessment, the variance of scores associated with linguistic factors in the evaluation of responses needs to be investigated. Depending on how domain definitions from the NCTM Standards are made operational, this component of variance may represent a confounding factor in the use of results from extended samples of performance on complex assessment tasks. Whether or not it is considered a confounding factor, the variance associated with verbal aspects of the responses to mathematical problems is likely to loom larger than it has in more traditional approaches to measuring mathematics achievement.

Shifting definitions of content domains, not to mention the introduction of domains that are new from the standpoint of the classroom teacher, can be expected to affect certain characteristics of measures based on those domains.15 One sees this effect in the reader and score reliabilities observed when writing samples are collected for persuasive essays in the middle elementary grades.16 Performance of both readers and writers is much less consistent in the persuasive domain, where instructional opportunities are usually limited by a curriculum that emphasizes narrative writing. Shavelson et al.17 illustrated the effect of domain misspecifications on estimates of score reliability in the case of hands-on science exercises. The Shavelson analyses in particular show that estimates of score reliability are likely to be markedly higher when assessment tasks are more narrowly defined and that unreliability is characteristic of poorly defined domains. As a result, inferences to broad content categories become problematic.
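The relation between domain breadth and score reliability can be illustrated with a simplified generalizability analysis of a persons-by-tasks design on synthetic data. This is a sketch of the general idea, not a reproduction of the Shavelson analyses: the data, sample sizes, and effect sizes are invented, and a real study would model additional facets (raters, occasions).

```python
import numpy as np

rng = np.random.default_rng(0)

def g_coefficient(scores: np.ndarray) -> float:
    """Generalizability coefficient for a persons x tasks design
    (one observation per cell), via expected mean squares."""
    n_p, n_t = scores.shape
    grand = scores.mean()
    ms_p = n_t * ((scores.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
    resid = (scores - scores.mean(axis=1, keepdims=True)
             - scores.mean(axis=0, keepdims=True) + grand)
    ms_res = (resid ** 2).sum() / ((n_p - 1) * (n_t - 1))
    var_p = max((ms_p - ms_res) / n_t, 0.0)  # universe-score (person) variance
    return var_p / (var_p + ms_res / n_t)    # reliability of the task mean

n_persons, n_tasks = 200, 5
ability = rng.normal(0, 1, size=(n_persons, 1))

# Narrow domain: tasks are nearly interchangeable (small person-by-task
# interaction), so the task mean reflects ability well.
narrow = ability + rng.normal(0, 0.5, size=(n_persons, n_tasks))
# Broad, loosely defined domain: performance depends heavily on which
# tasks happened to be sampled.
broad = ability + rng.normal(0, 1.5, size=(n_persons, n_tasks))

print(g_coefficient(narrow), g_coefficient(broad))
```

With the same examinees and the same number of tasks, the broadly defined domain yields a markedly lower coefficient, which is exactly why inferences from a few tasks to a broad content category are hazardous.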

In the development of novel problem-solving activities to be used for assessment, researchers and practitioners will need to monitor the extent to which classroom practices keep pace with innovations in the assessment process. The development of school delivery standards18 is a necessary part of that monitoring, but as yet no explicit guidelines have been developed for how information about opportunity to learn can be used to provide feedback for revising assessment tasks, to evaluate their technical characteristics, and to frame differentiated interpretations of assessment results and policy implications so as to enhance validity.

BEYOND PROFESSIONAL JUDGMENT IN THE VALIDATION PROCESS

The evolution of the NCTM Standards can be understood as a reflection of changing professional judgment about the role of mathematics education in the general cognitive development of students. The development of standards in mathematics, as well as in other parts of the school curriculum, presents a new challenge to the developers of achievement measures with respect to content quality and cognitive complexity, two aspects of the validity question discussed by Linn et al.19 All major test publishers are presently engaged in efforts to revise instruments so that their content is more closely aligned with the NCTM Standards. The methods used to ascertain alignment typically involve the review of test materials by specialists in mathematics education and the classification of items according to the explicit statements of mathematics objectives.


The heavy reliance on professional judgments of content quality and, given the nature of the new mathematics standards, cognitive complexity raises critical methodological questions about this part of the validation process. The obvious question of subjectivity in the rating process can be evaluated empirically. The empirical evidence gathered so far indicates that the judgments of content experts may not be highly reliable.20 Data that are available from content classifications of traditional test items raise questions about the fidelity of expert judges in evaluating test content. Comparisons of recent evaluations of the content of standardized achievement tests in mathematics21 with the content specifications supplied by developers (typically determined by subject matter experts' formal analysis of content and process required to obtain correct solutions) suggest that judgments of content quality may depend heavily on the point of view of the expert making the judgments. Professional judgments, then, should not serve as the sole basis of support for or against validity in traditional testing, much less in alternative assessment procedures, without due attention to the factors that influence such judgments.
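One standard way to quantify how (un)reliable expert content judgments are is a chance-corrected agreement index such as Cohen's kappa. The sketch below implements the textbook formula; the two judges, the items, and the content-strand labels are invented for illustration and are not data from the studies cited.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' category labels."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed proportion of exact agreements.
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    p_exp = sum(freq_a[c] * freq_b[c] for c in categories) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Two hypothetical judges classifying eight items by content strand.
judge_1 = ["number", "number", "geometry", "algebra",
           "number", "geometry", "algebra", "number"]
judge_2 = ["number", "geometry", "geometry", "algebra",
           "number", "number", "algebra", "number"]
print(cohens_kappa(judge_1, judge_2))
```

Here the judges agree on 6 of 8 items (75 percent), yet kappa is only 0.6 once chance agreement from the shared marginal frequencies is removed — a reminder that raw percent agreement overstates the fidelity of expert classifications.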

Glaser et al.22 discuss the need for supplementing expert opinions with empirical evidence of cognitive validity of open-ended performance assessments. Such assessments, they argue, are usually developed on the basis of rational analysis and expert judgment and are assumed to measure higher-order reasoning because of the complexity of the tasks involved. Level of performance is often defined by psychometric difficulty and illustrative items, unaccompanied by any evidence or explanation of the underlying cognitive processes required for solutions. They also point out that what is best depicted as rational development of scoring protocols is seldom supplemented by empirical evidence indicating what knowledge and cognitive processes are actually being tested.

Potential refinements of procedures in the collection of judgment data should be considered in establishing validity of innovative assessment designs with respect to content quality and cognitive complexity. As noted by Magone et al.,23 content analyses of tasks themselves can be expected to tell only part of the story concerning the level of cognitive complexity elicited by the task. Magone et al. used logical analyses by expert judges to validate a series of open-ended prompts designed to measure conceptual understanding, problem solving, reasoning, and communication in mathematics. Such analyses were an integral part of the development of scoring rubrics but were also considered crucial to the validation process. However, they did not constitute the bulk of validity evidence presented to support score interpretations.

In addition to logical analyses of content, Magone et al. also examined responses to pilot and operational tasks for evidence of cognitive complexity. They coded responses to open-ended tasks in terms of solution strategy, representation and quality of written explanation, description of solution processes, and mathematical errors, arguing that empirical evidence of this sort was necessary evidence for the validity of tasks as cognitive measures of achievement in mathematics. Results were used to revise the prompts used for each task, to provide feedback for teachers about student performance, and to delete tasks from the pool to be used for later assessments.

Glaser et al.24 selected several science performance assessment tasks for examination via student protocol analysis, including extended interviews with subjects participating in these assessments. Such analyses aim to reveal the degree of correspondence between the cognitive processes and skills the tasks were intended to measure and those actually elicited. Results of studies such as this are expected to be useful in designing more innovative approaches in which the meaning of students' scores might be more explicitly described in terms of the reasoning skills and other processes underlying their performance. Preliminary results of the Glaser et al. study suggest that the same task presented in different forms (e.g., physical manipulation of objects versus computer simulation) may elicit qualitative differences in performance.25 Snow and Lohman26 report other examples in which small changes in wording (e.g., abstract mathematical terms or equivalent renderings in everyday language) can have apparently large effects on cognitive processing and performance.

Interpreting the results of a content analysis of responses is not without difficulties. For example, in the Magone et al.27 study, some tasks were discarded because responses showed little evidence of cognitive complexity. Instead, students' explanations of how they arrived at their answers used phrases such as "I used my brain," or, somewhat ironically, "logical reasoning." Although the tasks in question may have yielded no apparent evidence of "cognitively complex responses," inferring that no complex skills were required to solve the problems in question is another matter and may depend on, among other things, whether the problem was answered correctly or incorrectly.28 In fact, the authors noted that their analysis was limited by their using written responses as the sole basis for judgments regarding the complexity of a problem or response. Like Glaser et al.,29 they saw an important role for think-aloud protocols and interviews in understanding what is measured by open-ended tasks. To evaluate the mathematical complexity of a purportedly complex task or response clearly requires deeper analyses of the assessment process.

Unfortunately, one might question whether it is at all feasible to conduct in-depth analyses of student responses that would be useful in interpreting the results of a large-scale assessment. Time and cost considerations certainly make it impractical to interview a large number of students; even if it were practical, the mass of data collected would be extremely difficult—if not impossible—to summarize. Responses to protocol probes are often eccentric and may be interpretable only by someone who is familiar with the situation and with the student responding. On the other hand, such analyses may prove to be quite informative during task development and revision.

The most practical use for the results of such analyses with regard to score interpretation may be to point out the lack of generalizability of responses obtained from a simple content analysis. A thorough protocol analysis may reveal that similar statements correspond to radically different cognitive procedures for different students. If, for example, a student claims to have solved a problem by "using my brain," the meaning of this statement will depend on the student. One student may have had no clue as to how to solve the problem and therefore decided to disguise ignorance with an ambiguous response. Another may have understood the problem but lacked the verbal skills to describe the procedure. Still another may have been so skilled or knowledgeable that the problem was solvable almost instantaneously without any conscious awareness of the cognitive steps involved. Experts are often less able than novices to provide detailed descriptions of their problem-solving activities.30 These concerns about the interpretability of a think-aloud protocol raise similar questions about the interpretability of written responses to probes about solution strategies in an operational assessment program.


INFLUENCES ON GENERALIZABILITY AND TRANSFER

The aspect of validity that has attracted by far the most attention from test specialists interested in performance-based alternatives to conventional achievement tests is the generalizability of performance and the transfer of measured content to the unmeasured aspects of the content domain. The inferences of greatest interest to teachers, parents, and the public at large concern the broad objectives of instruction in mathematics, reflected in the NCTM Standards by their emphasis on problem solving, solution strategies, communication, and the like.

The concern about generalizability and transfer is not unique to performance-based approaches to measuring mathematical skills. In fact, these concepts occupy such a prominent place in present-day discussions of test validity because of experiences in large-scale testing programs that use conventional measures of achievement.31 Teaching to the test poses difficulties for score interpretation not just because it compromises normative information that accompanies most standardized tests. It is a practice that challenges the validity of test scores as indicators of the achievement domain sampled during test construction and has been shown in high-stakes situations to distort inferences to that domain.32 In evaluating novel approaches to assessment in mathematics, the generalizability of scores over raters, tasks, formats, and even subdomains has received considerable attention.

Influences on the variability of performance assessment tasks are observed in both the response process and the scoring process. In general, issues related to consistency in the scoring process are well understood, and the overall component of score variance due to the effect of raters has been generally viewed as one aspect that, for a given performance assessment, can be easily controlled with appropriate training of raters.33 However, the resources required to attain a given level of rater reliability are likely to vary across assessments for a number of reasons. Dunbar et al. noted the difference between rater reliabilities estimated under laboratory and field conditions in this regard, the former possibly giving estimates of rater consistency that are optimistic when an operational assessment is conducted under less than ideal conditions. Recent results from a state-level assessment of mathematics achievement based on the scoring of writing and mathematics portfolios34 suggest possible reasons for the discrepancy between laboratory and field experiences with regard to rater reliability.

The Vermont Portfolio Assessment Program is a statewide initiative in alternative assessment unlike any other in the United States in its decentralized development of materials and its emphasis on using assessment to encourage a diversity of good teaching practices.35 It provides an exemplary operational program of the sort that is sometimes envisioned as the future of large-scale assessment, a future in which assessment tasks are determined at a local or regional level and concerns for comparability are handled through some kind of linking or calibration procedure.36 In Vermont, writing and mathematics portfolios were assembled by fourth- and eighth-grade students from participating school districts, with "best pieces" identified. By design, there was no attempt by the state to prescribe the range of portfolio entries. In the mathematics portfolios, responses from five to seven best pieces of classroom-based tasks were scored in terms of seven scoring criteria chosen by teachers: language of mathematics, mathematical representations, presentation, understanding of task, procedures, decisions, and outcomes. The operational definitions of the four points on each of these criterion scales were determined by committees of mathematics teachers from throughout the state. Teachers also determined the method that was used to combine scores from the separate entries into a composite score for the portfolio on each criterion scale.

Koretz et al.37 report the rater reliability coefficients for each criterion scale at each grade level. In grade four, three of the seven coefficients were below .30 (language of mathematics, outcomes, and understanding of task). The highest rater reliability coefficient (.45) was for the presentation scale. The results for grade eight were not markedly different. Although only one scale had a rater reliability below .30 (language of mathematics), the highest coefficient (.42) was again associated with the presentation scale. The teacher-defined composite scales contained more error due to raters than would have been observed if a simple average or sum of scores from the separate entries had been used as the aggregate measure for individuals. Further, Koretz et al. indicate that the maximum boost to rater reliability that could be achieved with the data collected for the statewide assessment, obtained by aggregating over both portfolio entries and criterion scales, was only about .57. Such results are in stark contrast to the levels of rater reliability observed in more carefully designed and structured performance assessments.38
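Why aggregation tops out well below conventional standards when per-score reliability is low can be seen from the classical Spearman-Brown projection. This is an idealized sketch, not the Koretz et al. analysis itself: it assumes parallel measurements, which real portfolio entries are not, so observed gains fall short of even these projections.

```python
def spearman_brown(r: float, k: float) -> float:
    """Projected reliability of an average of k parallel measurements,
    each with reliability r (classical Spearman-Brown formula)."""
    return k * r / (1 + (k - 1) * r)

# If a single portfolio-entry score carries rater reliability near .30,
# averaging over several entries improves matters only so much, even
# under the ideal parallel-measurements assumption.
for k in (1, 3, 7):
    print(k, round(spearman_brown(0.30, k), 2))
```

With per-entry reliability of .30, even seven parallel entries project to a composite reliability of only .75 under ideal assumptions, so a ceiling near .57 for heterogeneous real entries and scales is unsurprising.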

The data from the Vermont program are revealing for several reasons. First, they reinforce the remarks made earlier about rater reliability in field settings. Second, they suggest that the teachers on the reading committees either had not reached consensus regarding the definitions of score points for each criterion scale or perhaps had not had enough experience with the criterion definitions to make consistent judgments about how levels would be revealed in actual student responses. Third, they show that the task of keeping straight four score points on seven different criterion scales (28 scale points in all) may be too demanding for even experienced and motivated teacher-raters. Finally, the Vermont results exemplify some of the procedural difficulties of using essentially unstructured tasks for large-scale assessment purposes. The only structure built into the assessment design involves the specification of best pieces on the response end and explicitly defined criteria on the rating end.

Another aspect of the Vermont data that deserves comment involves the frequency distributions of ratings, which showed an extremely high concentration of ratings at one or two points on the scale for several criteria. For such criteria the reliability coefficients demonstrate the usual effects of restriction of range on the ratings and are somewhat difficult to interpret as a result. Koretz et al.39 note this effect. That for some scales the concentration appeared as a strong floor effect (92 percent of the grade 4 sample received the lowest possible rating of 1 on the outcomes scale), whereas for others it fell in the middle of the distribution, raises doubts about the quality of the anchor points across the seven criterion scales. The authors argue that explaining low rater reliability in terms of statistical artifacts such as attenuation due to range restriction does not answer the more important question of what caused the reduction in variability in the first place.

Although early studies of performance-based assessment concentrated on raters in the estimation of components of score variance, recent studies of the generalizability of extended responses to complex tasks have also raised fundamental questions about the behavior of examinees during the response process. Linn40 observed that "high levels of generalizability across tasks alone do not guarantee valid inferences or justify particular uses of assessment results. But low levels of generalizability across tasks … pose serious problems regarding comparability and fairness to individuals who are judged against performance standards based on a small number, and perhaps different set, of tasks."41 Several recent examples of the generalizability of performance-based tasks in mathematics illustrate the challenges faced by designers of innovative large-scale assessments.

Lane et al.42 reported on the reliability and validity of the QUASAR Cognitive Assessment Instrument (QCAI), a set of open-ended tasks measuring the mathematical reasoning skills of middle-school students. Conscientious selection and training of raters resulted in high interrater reliability. Variation in scores due to choice of rater, or to any interaction between rater and student or rater and task, was negligible. However, substantial variation due to student-by-task interaction (between 55 and 68 percent of total variance) was revealed in the generalizability study. Scores were found to depend to a substantial degree on the particular set of items administered, and the authors indicate that clear inferences about a student's mathematical reasoning ability were therefore difficult to justify. The Lane et al. results are particularly relevant because their tasks and scoring rubrics were developed with considerable attention to the NCTM Standards.

Despite the large variance component due to person-by-task interaction, the overall generalizability coefficients for the nine tasks included on a given form of the assessment were in the .7 to .8 range. These values are markedly higher than many generalizability coefficients reported in the performance assessment literature, and they suggest a principle for performance assessment design that has long been recognized in the development of conventional achievement tests, namely, that high levels of person-by-task interaction can be tolerated as long as the number of tasks (items) in the assessment is sufficiently large. The addition of a few more tasks in the QUASAR assessment would bring the generalizability coefficients into the range that is typical for objective tests of mathematics achievement (each performance task is worth roughly two multiple-choice items from the standpoint of reliability).
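The arithmetic behind these observations is the standard relationship for a persons-by-tasks design: the generalizability coefficient is the person variance divided by the person variance plus the residual (person-by-task interaction and error) variance averaged over tasks. The sketch below uses hypothetical variance components chosen only to mirror the magnitudes discussed here; the QUASAR components themselves are not reproduced.

```python
def g_coefficient(var_person: float, var_residual: float, n_tasks: int) -> float:
    """Generalizability coefficient for a persons-by-tasks design:
    person variance over person variance plus residual (person-by-task
    interaction and error) variance averaged across the n tasks."""
    return var_person / (var_person + var_residual / n_tasks)

def tasks_needed(var_person: float, var_residual: float, target: float) -> int:
    """Smallest number of tasks yielding a coefficient at or above target."""
    n = 1
    while g_coefficient(var_person, var_residual, n) < target:
        n += 1
    return n

# Hypothetical components: person variance 0.30, person-by-task plus
# error variance 0.62 (roughly two-thirds of single-task variance).
print(round(g_coefficient(0.30, 0.62, 9), 2))  # 0.81 with nine tasks
print(tasks_needed(0.30, 0.62, 0.90))          # 19 tasks needed for .90
```

On these assumed components, nine tasks land near the top of the .7 to .8 range reported for the QCAI, while a coefficient of .90 would require roughly twice as many tasks, consistent with the observation that additional tasks go a long way when person-by-task interaction dominates the variance.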

There is a growing body of evidence collected from studies of performance tasks in a variety of content domains that a substantial number of open-ended tasks may be required to support high-stakes uses of results for decisions about individuals.43 Offering specific guidelines for the number of tasks needed to secure a reliability coefficient of, say, .90 can be quite misleading because of the many factors (quality of raters, parallelism of additional tasks, delineation of the content domain, etc.) known to influence such assessments. As noted by Ruiz-Primo et al.44 in their study of hands-on science performance exercises, "increasing the number of occasions would increase the [generalizability] coefficients (four occasions to achieve reliability .80), but it would do so at considerable cost."45

Another example of the use of open-ended tasks in large-scale mathematics assessment provides a further perspective on the nature of the information about achievement in mathematics that is obtained by innovative approaches. Stevenson et al.46 describe the characteristics of open-ended geometry proofs administered to more than 43,000 high school students in North Carolina. In contrast to the Vermont results, reader reliabilities based on two independent readings of the same task, scored with a focused holistic approach, were above .90 for all proofs included in the assessment. Whether through content definitions or training practices, geometry proofs were clearly amenable to the rating process in a way that the unstructured portfolios in Vermont were not.

The generalizability question was addressed by Stevenson et al. through the inclusion of a subsample of students who took a multiple-choice proofs test in addition to the open-ended version. The disattenuated correlations between the multiple-choice proofs test and the open-ended problems were quite high (approximately .85). The disattenuated correlation between an individual open-ended proof and the same proof in multiple-choice format would be estimated to be nearly perfect (.99, assuming the reliability of the multiple-choice proofs test to be around .80). From these results, it is clear that performance on what are in this case highly structured, open-ended tasks in geometry does transfer to performance in a traditional format. The irony of this example is that the degree of transfer appears to be so high as to raise the question of whether the formats are measuring anything different about achievement in the relevant domain. That the formats send different messages to the audiences of an assessment about what is valued in the geometry curriculum is, of course, a separate issue.
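The estimates above rest on the classical correction for attenuation: the observed correlation divided by the square root of the product of the two measures' reliabilities. A minimal sketch with illustrative numbers follows; the observed correlation below is hypothetical, and only the reliability values echo the discussion above.

```python
import math

def disattenuate(r_observed: float, rel_x: float, rel_y: float) -> float:
    """Classical correction for attenuation: estimates the correlation
    between true scores from an observed correlation and the
    reliabilities of the two measures."""
    return r_observed / math.sqrt(rel_x * rel_y)

# Illustrative: an open-ended proof read with reliability .90 and a
# multiple-choice proofs test with reliability .80. A hypothetical
# observed correlation of .84 disattenuates to nearly 1.0.
print(round(disattenuate(0.84, 0.90, 0.80), 2))  # 0.99
```

The formula makes plain why a modest observed correlation between formats can imply an almost perfect true-score relationship once unreliability in both measures is taken into account.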


Given the results of the North Carolina study, a relevant question for the development of mathematics performance tasks and rating scales might be phrased as follows: When the factors that produce lack of generalizability on complex tasks in mathematics are controlled to a degree deemed necessary for large-scale applications, will the constructs measured by the tasks rank-order examinees any differently than would a conventional test of related mathematics skills?

The influence of problem format on variability in responses is an obvious consideration in understanding the levels of score reliability or generalizability that have been observed in open-ended assessments of mathematics achievement. Although the influence of format on difficulty was noted earlier, the effects of format can be quite subtle and relate more to the construct interpretations that are made of scores. Webb and Yasui47 examined the performance of seventh graders on three types of open-ended mathematics items: unembellished numerical exercises; short, one-question word problems; and relatively complex, extended word problems. No differences in difficulty were found among the three types of problems; students displayed similar performance regardless of the item format. Students were especially variable in their ability to set up the problem correctly, regardless of the amount of verbal context supplied.

The differences that were apparent across item types concerned the kinds of errors students made. Computational errors occurred less frequently on verbal items, perhaps because the context provided clues enabling students to check the reasonableness of their answers. Extended word problems elicited more misinterpretations of the question, more uninterpretable responses, and more omitted answers than did the shorter verbal items. The authors suggested that cognitive overload, frustration, and abated motivation may affect performance on these lengthy, "realistic" problems.

Each of the extended problems was presented in several parts or subquestions; many students failed to see the connection between the parts and recomputed figures already obtained in a previous subquestion. It was not clear to the researchers to what extent "erosion of performance"48 on the extended word problems revealed a weakness in mathematical skills. Student errors may have reflected, for example, difficulty in reading or interpreting lengthy verbal passages, poor skills in organizing verbal information, or an inability to sustain interest and motivation without a new stimulus. Failure to build on previous subquestions could have been the result of previous experience with mathematics tests in which each item presents a problem unrelated to the questions preceding it.

Webb and Yasui demonstrate the substantive importance of understanding the reasons for lack of generalizability in the context of extended samples of student performance in mathematics. In so doing they reveal an important connection in the validity argument between generalizability analyses and construct interpretations of the results of a mathematics assessment. The extent to which performance transfers from one complex task to another, or perhaps from sets of related tasks to other sets, in one sense determines the specificity of the domain to which defensible inferences can be made from the results of an assessment. Evidence of poor transfer does not necessarily undermine the value of an assessment for a given purpose. However, poor transfer severely restricts the range of legitimate uses of assessment information and exacerbates the problem of how to communicate results to audiences who have developed high expectations regarding the utility of the enterprise.

TASK SAMPLING AND AGGREGATE REPORTS TO THE PUBLIC

How many tasks will be needed from new forms of assessment to secure a valid measure of achievement for a particular purpose, and how will results of locally controlled assessment programs using novel tasks and formats be combined for use in policy discussions at state and national levels? The recent report to Congress by the Office of Technology Assessment (OTA)49 reflects on the shared experience of nearly all students when the essay question just happened to cover what had been studied the night before, as well as "the time they 'bombed' on a test, unjustly they felt, because the essays covered areas they had not understood or studied well."50 The OTA report also addresses the concerns that arise when aggregate results are of most interest.

The issue of task sampling presents exactly the kind of ill-structured problem that has no generic right answer but instead has many potentially right answers depending on the circumstances of test use. A broadly defined content domain such as problem solving might require extensive sampling of tasks because of the influence context can have on problem-solving behavior.51 As guideposts for the development of new kinds of assessment instruments, the NCTM Standards tend to emphasize the importance of similarly broad domains of mathematical competence. Unfortunately, there is limited empirical evidence from experimental measures in such domains of the kinds of generalizability that might be expected and, hence, little empirical basis for recommendations concerning the number of tasks that might be necessary for a given use of results. What is known about content sampling from the standpoint of conventional achievement tests provides clear evidence that the meaning of a test score can be quite easily manipulated by purposeful selection of items to match the objectives of a local curriculum or policy initiative.52

Whenever there is a general concern about the sampling of tasks, there is a concomitant concern over the possibility that influences on task performance will be concentrated in subpopulations of examinees—subpopulations differentiated by race, gender, or some other correlate of opportunity to learn.53 On the subject of differential functioning of test questions by group, some specialists go so far as to argue that there is no such thing as an unbiased item; rather, the responsibility of test developers is to ensure that content domains are sampled in such a way as to balance out the bias, that is, to include enough variety in stimulus materials and enough balance in content that the assessment as a whole does not systematically favor one group over another. It is perhaps this aspect of alternative measures of achievement in mathematics that the research community knows the least about. Understanding the nature of group differences on novel measures of performance is also an aspect of instrument development that may have the greatest impact on the consequential validity of the next generation of assessments.54 The fact that new assessments of achievement in mathematics are likely to focus on new aspects of the mathematics curriculum identified by the NCTM Standards makes the monitoring of shifts in teaching practice critical to the valid use of results.

In addition to the extremely limited data available on the differential item functioning (DIF) of performance tasks with respect to gender or ethnicity, there is also a limited understanding of how to detect the phenomenon. Recently proposed methods by Welch and Hoover55 and Miller and Spray56 are in their infancy when compared with methods developed for dichotomously scored test items. In part because of the absence of suitable methodology, there are virtually no systematic investigations of DIF in the performance assessment literature in any content domain. As data from large-scale performance-based measures of mathematics achievement become more readily available, the issue of differential task functioning will need to be carefully evaluated.

Combining the results of novel assessments across disparate tasks, geographic regions, school district boundaries, or other demographic groupings is of great interest to policymakers intent on using the results of new forms of assessment to monitor educational reform efforts. Linn57 distinguishes statistical and judgmental approaches to the process of linking the results of distinct assessments and describes a continuum of inferences that might be justified depending on the conditions under which the link across assessments was established. Generally speaking, the kinds of assessments currently being proposed in the context of educational reform efforts can be linked across sites only through some form of calibration, and only calibration by professional judgment at that.

One empirical example of an attempt to link direct writing assessments across states was described by Linn et al.,58 in which essays written by students from one state were evaluated with the scoring protocols from another state. Linn et al. found a surprising degree of consistency in the way students were rank-ordered by the panels of readers from different states. However, absolute judgments about the level of performance reflected by an essay response were quite different, revealing that readers from one state did not share the same standards for performance as readers from another state. The implications of this finding for large-scale, decentralized implementations of performance-based approaches to assessment are profound, given the usual impetus for such programs: the maintenance of standards.59

Linn60 discusses the use of what he calls "social moderation" to link the results of distinct assessments. Social moderation seeks to develop consensus among educators about standards for performance and exemplars of those standards, and it relies on such shared understanding to provide the link that statistical procedures do for more conventional assessments. As was implied previously, common understandings about a content domain and shared standards for performance do evolve over time. The extent to which social moderation can be relied on for the calibration of scores on innovative assessments in mathematics, however, is a matter of speculation. At present, no procedure for linking disparate assessments in mathematics can be recommended on the strength of the data it has produced to date. The Vermont results are sobering in this regard and compel us as researchers to inquire further into the methods that will be needed if a decentralized approach to the assessment process is to yield broad inferences to direct state and national education policy through the linking of locally developed instruments.

SUMMARY AND CONCLUSIONS

In exploring innovations in large-scale measures of achievement in mathematics, it is important to recognize that a diversity of assessment procedures applied in a diversity of situations precludes the production of sweeping, general statements about various types of assessments and the technical issues involved in implementing them. Traditional concerns with reliability and validity are expanded to encompass issues raised by a host of related criteria.

The domain and definition of mathematics have been expanded in the NCTM Standards to include skills with components that might previously have been classified as nonmathematical, perhaps verbal or analytical. This expanded domain creates new problems for assessment. How can verbal skills, for example, that are part of the new mathematics domain be measured without confounding their measurement with that of verbal skills that are separate from mathematics achievement? The broadening of the domain also makes it more difficult to achieve highly reliable measurement of mathematical skills.

The classification of assessment tasks according to content and complexity relies heavily on the judgment of content experts. Consensus among experts has often proved more difficult to attain than one might expect. Evidence of content validity provided by professional judgments needs to be supplemented with empirical evidence of cognitive validity. The logistics of collecting such evidence have not been fully researched; the methods used may themselves present new problems. In-depth protocol analyses, for example, are impractical to conduct on a large scale, and the mass of data they produce may be difficult to interpret and impossible to synthesize for use in the interpretation of aggregate reports of mathematics achievement.

With a variety of alternative assessment procedures, issues of generalizability and transfer necessarily become more complicated. Inferences to the domain can still be hampered by teaching to the test, although the practice may take different forms when the assessment instrument is not a traditional test. Issues related to generalizing over raters (scoring reliability) are relatively well understood. Nevertheless, raters do not always behave as expected. When well-trained raters fail to score tasks consistently, as in the case of the Vermont Portfolio Assessment Program, clearly there are aspects of the scoring process that are not yet fully understood. A deeper understanding of the scoring process is essential if we wish to avoid wasting precious resources on large-scale assessments that produce nearly useless results.

Evidence obtained thus far indicates that generalizability across tasks is often low. Students may vary greatly in their performance on mathematics assessments depending on the particular tasks by which they are tested. Apparently some aspects of performance in mathematics are highly subdomain-specific or require specific knowledge above and beyond transferable skills. The inferences that can be made about the scope of students' mathematical abilities are severely limited unless evidence of across-task generalizability can be obtained. Further research is needed to understand better the factors affecting this aspect of generalizability. Short of such understanding, broad sampling of assessment tasks from specified content frameworks will be needed to support the broad inferences to domains of mathematical achievement that have caught the fancy of education and public policymakers.

Although there is evidence that the use of different formats to assess the same knowledge and skills may have little effect on students' level of performance, the types of errors students make (and therefore perhaps the cognitive skills being tested) may depend on the format in which an assessment task is presented. This is the thrust of the argument in favor of new formats for tests; however, newly developed item types must be studied to determine whether the skills they measure differ from those assessed by other formats, as well as what potential advantages and pitfalls they may present.

Like the 1890s, the 1990s are characterized by optimistic expressions of faith in new educational standards. According to the broad brush of optimism, by setting standards high and holding all students to them, leaving none behind in Hamlin, educational leaders expect to see tomorrow's students stride into adulthood fully prepared for the demands of life and work in the twenty-first century. As the last century saw the rise of new assessment procedures to support the maintenance of the educational standards of that time, so now a proliferation of new, innovative assessments is already arising to measure progress in meeting today's new standards. The literature is full of optimistic statements about the purported advantages of these new procedures. Surely optimism is good, yet naive optimism can be treacherous. It is imperative to recognize that new assessments may bring new problems. Long periods of debugging and refining new procedures may be required before alternative assessments can produce results that are meaningful and widely applicable. Moreover, it would be naive to expect that new procedures will be less amenable to abuse than traditional measures have been. Disregard for individual differences, the exclusionary use of assessment results, various forms of teaching to the test, and other undesirable outcomes are as likely to occur with today's alternative assessments as with traditional instruments.

Unlike in the standards movement at the last turn of the century, many of these outcomes are now anticipated and can be guarded against. If we can curb our optimism long enough to examine the efficacy, challenges, and potential consequences of new assessment procedures before they are implemented for high-stakes purposes, we can avoid possible negative outcomes, invest more time and resources in positive refinements, and ultimately produce better, more useful measures of achievement in mathematics.


ENDNOTES

1  

M. McConn, "The uses and abuses of examinations," in The Construction and Use of Achievement Examinations, ed. H. E. Hawkes, E. F. Lindquist, and C. R. Mann (Boston: Houghton Mifflin Company, 1936).

2  

Ibid., 447. Emphasis in the original.

3  

R. L. Linn, E. L. Baker, and S. B. Dunbar, "Complex performance-based assessments: Expectations and validation criteria," Educational Researcher, 20, (1991), 15-21.

4  

Ibid.

5  

S. Messick, "The Interplay Between Evidence and Consequences in the Validation of Performance Assessments" (Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, April, 1992).

6  

S. B. Dunbar, D. M. Koretz, and H. D. Hoover, "Quality control in the development and use of performance assessments," Applied Measurement in Education, 4, (1991) 289-304.

7  

Cf. A. N. Hieronymus and H. D. Hoover, Iowa tests of basic skills: Writing supplement teacher's guide (Chicago: Riverside, 1987); A. N. Applebee, J. A. Langer, and I. V. S. Mullis, Writing: Trends across the decade, 1974-84 (National Assessment of Educational Progress Rep. No. 15-W-01) (Princeton, NJ: Educational Testing Service, 1986).

8  

National Council of Teachers of Mathematics, Curriculum and evaluation standards for school mathematics, (Reston, VA: Author, 1989).

9  

S. Lane, "The conceptual framework for the development of a mathematics performance assessment instrument," Educational Measurement: Issues and Practice, 12, (1993), 16-23.

10  

Ibid.

11  

Iowa tests of basic skills.

12  

"Conceptual framework."

13  

Ibid.

14  

E. L. Baker, The role of domain specifications in improving the technical quality of performance assessment, Technical Report, (Los Angeles: Center for Research on Evaluation, Standards, and Student Testing, 1992).

15  

Baker, Domain specifications; R. J. Shavelson, X. Gao, and G. P. Baxter, Content validity of performance assessments: Centrality of domain specification, Technical Report, (Los Angeles: Center for Research on Evaluation, Standards, and Student Testing, 1992).

16  

Iowa tests of basic skills.

17  

Content validity.

18  

National Council on Education Standards and Testing [NCEST], Raising standards for American education, (Washington, DC: United States Congress, 1992).

19  

"Complex performance-based assessments."


20  

W. J. Popham, "Appropriate expectations for content judgments regarding teacher licensure tests," Applied Measurement in Education, 5, (1992), 285-302.

21  

Cf. G. F. Madaus et al., The Influence of Testing on Teaching Math and Science in Grades 4-12: Executive Summary, (Boston, MA: Boston College, Center for the Study of Testing, Evaluation, and Educational Policy, 1992), 1; T. A. Romberg and L. D. Wilson, "Alignment of tests with the standards," Arithmetic Teacher, 40, (1992), 18-22.

22  

R. Glaser, K. Raghavan, and G. P. Baxter, Cognitive theory as the basis for design of innovative assessment: Design characteristics of science assessments, Technical Report, (Los Angeles: Center for Research on Evaluation, Standards, and Student Testing, 1992).

23  

M. Magone, J. Cai, E. A. Silver, and N. Wang, "Validity evidence for cognitive complexity of performance assessments: An analysis of selected QUASAR tasks," International Journal of Educational Research, in press.

24  

Cognitive theory.

25  

G. P. Baxter, Exchangeability of science performance assessments, (Unpublished doctoral dissertation, University of California, Santa Barbara, 1991).

26  

R. E. Snow and D. L. Lohman, "Implications of cognitive psychology for educational measurement," in Educational Measurement, 3rd ed., ed. R. L. Linn, (New York: Macmillan, 1989), 263-331.

27  

Magone et al., "Validity evidence for cognitive complexity."

28  

"Implications of cognitive psychology."

29  

Cognitive theory.

30  

"Implications of cognitive psychology."

31  

R. L. Linn, M. E. Graue, and N. M. Sanders, "Comparing state and district test results to national norms: The validity of the claims that 'Everyone is above average'," Educational Measurement: Issues and Practice, 9, (1990), 5-14; L. A. Shepard, "Inflated test score gains: Is the problem old norms or teaching to the test?" Educational Measurement: Issues and Practice, 9, (1990), 15-22.

32  

D. M. Koretz, R. L. Linn, S. B. Dunbar and L. A. Shepard, "The effects of high-stakes testing on achievement: Preliminary findings about generalization across tests," (Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, April, 1991).

33  

"Quality control."

34  

D. M. Koretz, D. McCaffrey, S. Klein, R. Bell, and B. Stecher, The reliability of scores from the 1992 Vermont portfolio assessment program, Technical Report, (Washington, DC: The RAND Corporation, 1992).

35  

Ibid.

36  

Cf. R. L. Linn, "Linking results of distinct assessments," Applied Measurement in Education, 6, (1993), 83-102.

37  

Reliability of Vermont scores.


38  

Cf. Dunbar et al., "Quality control"; S. Lane, C. A. Stone, R. D. Ankenmann, and M. Liu, "Empirical evidence for the reliability and validity of performance assessments," International Journal of Educational Research, in press.

39  

Reliability of Vermont scores.

40  

R. L. Linn, "Educational assessment: Expanded expectations and challenges," Educational Evaluation and Policy Analysis, 15, (1993), 1-16.

41  

Ibid., 27. Emphasis in the original.

42  

"Empirical evidence."

43  

"Educational assessment."

44  

M. A. Ruiz-Primo, G. P. Baxter, and R. J. Shavelson, "On the stability of performance assessments," Journal of Educational Measurement, 30, (1993), 41-54.

45  

Ibid., 46.

46  

Z. Stevenson, C. P. Averett, and D. Vickers, "The reliability of using a focused-holistic scoring approach to measure student performance on a geometry proof," (Paper presented at the Annual Meeting of the American Educational Research Association, Boston, April, 1990).

47  

N. Webb and E. Yasui, Alternative approaches to assessment in mathematics and science: The influence of problem context on mathematics performance, Technical Report, (Los Angeles: Center for Research on Evaluation, Standards, and Student Testing, 1991).

48  

Ibid., 23.

49  

Office of Technology Assessment, Testing in American Schools: Asking the Right Questions, (Washington, DC: United States Congress, 1992).

50  

Ibid., 242.

51  

J. H. Larkin, "What kind of knowledge transfers?" in Knowing, learning, and instruction: Essays in honor of Robert Glaser, ed. L. B. Resnick, (Hillsdale, NJ: Erlbaum, 1989).

52  

R. L. Linn, and R. Hambleton, "Customized Tests and Customized Norms," Applied Measurement in Education, 4, (1991), 185-207.

53  

L. Feinberg, "Multiple choice and its critics," The College Board Review, No. 157. (1990); Linn et al., "Complex performance-based assessments"; S. B. Dunbar, "Comparability of indirect measures of writing skill as predictors of writing performance across demographic groups," (Paper presented at the annual meeting of the American Educational Research Association, Washington, D.C., April, 1987).

54  

"Evidence and consequences."

55  

C. Welch and H. D. Hoover, "Procedures for extending item bias detection techniques to polytomously scored items," Applied Measurement in Education, 6, (1993), 1-19.

56  

T. R. Miller and J. A. Spray, "Logistic discriminant function analysis for DIF identification of polytomously scored items," Journal of Educational Measurement, 30, (1993) 107-122.

Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×

57  

"Linking results of distinct assessments."

58  

R. L. Linn, V. L. Kiplinger, C. W. Chapman, and P. G. LeMahieu, "Cross-state comparability of judgments of student writing: Results from the New Standards Project," Applied Measurement in Education, 5, (1992), 89-110.

59  

Cf. National Council on Education Standards and Testing, Raising standards for American education.

60  

"Linking results of distinct assessments."

Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×

LEGAL AND ETHICAL ISSUES IN MATHEMATICS ASSESSMENT

DIANA C. PULLIN1

BOSTON COLLEGE

After two decades of efforts at the state and local levels to reform the nation's elementary and secondary schools, the United States moves toward the millennium firmly committed to a series of national reform initiatives. These national efforts consist of two types of often intersecting approaches to driving enhanced educational productivity. First, at the urging of the nation's governors, the federal government has initiated a series of efforts to promote national curriculum, performance, and opportunity-to-learn standards that would be adopted by states and local school districts on a voluntary basis. These federal initiatives seek to promote systemic state reform by creating a new federal role in developing national education standards and assessments and in setting benchmarks to measure progress toward attaining those goals. Although no federal mandates or sanctions are proposed, federal seed money and other financial aid, coupled with technical assistance, will have a significant impact. Moreover, the power of the federal government to lead a public forum addressing the goals will also mean a strong central vision for the reforms.

At the same time as these federal initiatives are being pursued through President Bush's America 2000 initiative and President Clinton's Goals 2000: Educate America Act, several professional organizations have been pursuing similar objectives. For example, the National Council of Teachers of Mathematics (NCTM) and other professional associations have defined curricular content standards or frameworks, benchmarks, and performance standards in their subject areas. Efforts such as these mark an important change within the field of education: systematic political reform driven from within the profession rather than externally mandated change compelled by a governmental entity.

Current efforts at educational reform include the work of the National Research Council's Mathematical Sciences Education Board (MSEB), which has proposed that assessments be used to meet three goals. First, assessment should be used to support or improve teaching of important mathematics content and procedures. Second, mathematics assessment should support good instructional practice. Finally, assessment should support every student's opportunity to learn important mathematics.

MSEB's proposals to enhance educational achievement in mathematics and to increase access to opportunities to learn can be evaluated as part of a movement that has provoked public debate and scrutiny of our schools for over 40 years. Past efforts, particularly federal ones, to mandate equality of educational opportunity through laws, regulations, and court decisions will affect the current reform proposals, even if those proposals are characterized as "voluntary" rather than mandatory and "national" rather than federal. The proposed use of assessment as a tool of educational reform prompts comparison with prior efforts to enhance educational achievement through high-stakes tests, which have significant consequences for individual test-takers. It is in these "high-stakes" situations that the legal impact of such initiatives has been most complicated.

In the years since the decision of the U. S. Supreme Court in Brown v. Board of Education,2 which found a constitutional bar to state laws segregating schools on the basis of race, there has been a large increase in the use of state and federal statutes, regulations, and court decisions to regulate educational practices and educational testing efforts. Judges and lawmakers have scrutinized educational reform proposals or imposed legal mandates upon educational practices in efforts to attain desirable social, political and educational goals, especially our commitment to equity and fairness.

Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×

Few would dispute the desirability of a dramatic increase in educational achievement of the nation's students. However, part of the discussion of reform initiatives must focus on whether these changes can be implemented without undermining the nation's longstanding commitment to equity. These questions are rooted largely in consideration of both public policy and the law, because the law is frequently used to address efforts to maintain our traditional commitments to fair treatment of all and our aspiration to educate all children well. Although not providing answers to all of these questions, this paper will attempt to highlight the issues that should be considered by those participating in this debate.

RACE AND THE EFFECTS OF EDUCATIONAL REFORM

A precise legal analysis of assessment reform proposals based on MSEB's new principles for mathematics assessment will depend upon how those proposals are implemented. However, MSEB's dual goals of using assessment to enhance mathematics learning and promoting equity by supporting every student's opportunity to learn parallel some earlier initiatives that were subject to keen policy debates and intense legal scrutiny.

An earlier and similar educational reform effort involved the use of minimum competency tests (MCT) to determine whether a student would receive a high school diploma. The analogy is useful if a national system of tests or assessments, or what might become, de facto, a national exam, results from these proposals, or if a school or a state or local education agency were to use mathematics assessments to determine the award of diplomas, proficiency certificates, or access to employment. Constitutional scrutiny of any educational program, as was the situation with the state and local MCT programs reviewed previously by the courts, can be triggered whenever any governmental body, be it federal, state, or local, acts, even on a voluntary basis, to sort people into groups for differential treatment. The level of this scrutiny intensifies as the stakes attached to the groupings go up; if the stakes attached to a mathematics assessment are high enough to involve denial of a high school diploma, limitations on access to particular curricular tracks, to higher education, or to the workplace, or stigmatizing labels for individuals who do not succeed on the assessment, then the possibility of successful legal challenges to the program increases.

Perhaps the most noteworthy legal review of an educational reform initiative with a high stakes use of testing was Debra P. v. Turlington, a challenge to the state of Florida's program to condition the award of a high school diploma upon successful performance on a minimum competency test.3 Florida's legislative goals were to promote educational accountability and insure that every school district provided "instructional programs which meet minimum performance standards compatible with the state's plan for education… [and] information to the public about the performance of the Florida system of public education in meeting established goals and providing effective, meaningful and relevant educational experiences designed to give students at least the minimum skills necessary to function and survive in today's society".4 The test used to measure student performance in reading, writing, and mathematics was commonly known as the Functional Literacy Test. Initial failure rates on the test were high and a disproportionate number of those failing the test were black; the early failure rate among black students was approximately ten times that for white students.5

The racial impact of Florida's test required the courts to assess the program under Title VI of the Civil Rights Act of 19646 and the U.S. Constitution. In reviewing constitutional challenges to the program presented by students of all races who failed the test, the courts that decided Debra P. addressed validity and reliability issues in educational testing. In deciding the case, the courts looked, by analogy, to standards used in reviewing employment testing under Title VII of the Civil Rights Act of 19647 and to a series of teacher testing cases brought under the U.S. Constitution.8 In Debra P., the Fifth Circuit held that a state "may not constitutionally so deprive its students [of a high school diploma based upon test performance] unless it has submitted proof of the curricular validity of the test."9 The court further explained that "if the test covers material not taught the students, it is unfair and violates the Equal Protection and Due Process clauses of the U.S. Constitution."10 The constitutional protections were triggered because of the magnitude of the consequences of test failure, that is, denial of the diploma, restrictions on access to employment and higher education, disproportionate impact on minority students, and the stigmatizing effect of being regarded as "functionally illiterate."

In considering test performance in which black students consistently fail at rates far higher than whites, Debra P. looked at issues of test bias and also held that Title VI would require a demonstration that the test was a fair test of what was taught in schools11 and that the government had taken steps to eliminate the effects of past unlawful racial discrimination that might impact test performance.12 Further, "in attempting to justify the use of an examination having… a disproportionate impact upon one race… [the government must] demonstrate either that the disproportionate failure of blacks was not due to the present effects of past intentional segregation or, that as presently used, the diploma sanction was necessary to remedy those effects."13

Some of the same types of race discrimination identified in the Debra P. situation could occur in a mathematics assessment system. Any assessment tasks that require knowledge that might not be taught in school or might not be part of a common cultural norm could negatively impact performance on the basis of race, ethnicity, gender, or socioeconomic status.14 For example, a proposal by the 1991 Victorian (Australia) Curriculum and Assessment Board promotes the use of simulation to assess performance. One of their proposed simulations asks students to investigate "the chance of winning a tennis game after being two sets down." The success of former African American tennis pro Arthur Ashe aside, few minority youngsters would have a fair chance at succeeding on this task.15 Similarly, the use of a projects-based approach to mathematics assessment could have a deleterious effect on low-income or limited-English-proficient youngsters if a significant portion of the assignment was to be done as homework where parental assistance might play a role in successful project completion.

The courts have also invalidated the use of testing and instructional practices that resulted in dead-end educational tracking for low-performing students. In the McNeil v. Tate case,16 class assignment practices that segregated African American students into low-track placements, which they never left and which provided limited opportunities for educational achievement, were declared unlawful.17 Similarly, in Larry P. v. Riles,18 federal courts banned the use of intelligence tests to place black students in special education classes in California because of undesirable consequences for these students. Any use of mathematics assessments to place students in educational tracks will probably not withstand such legal review unless educators can demonstrate that students placed in low tracks have the opportunity for greater educational achievement as a result of the tracking.

One should assume, at least during the early period of implementation of the project, that disproportionate numbers of minority students will not succeed on the assessments, whatever form they take. If as a consequence, minority students are denied competency certificates, diplomas or access to greater educational opportunity at a rate higher than their proportion of the total population of students being assessed, then significant legal problems could ensue.

Other legal standards apply if there is any effort to link education and the workplace through the use of assessments of mathematics skills to certify proficiency to employers. There is currently widespread interest among politicians, business leaders, and some educators in creating closer links between schools and the workplace. Such initiatives could trigger legal provisions protecting against discrimination in employment on the basis of race, ethnicity, national origin, gender, and handicapping condition.19

As one example of the effort to link schools and the workplace, a group of educational, business, and labor leaders participated in the U.S. Labor Secretary's Commission on Achieving Necessary Skills (SCANS), which seeks to fulfill the mission set forth in President Bush's America 2000 initiative and "to establish job-related… skills standards, built around core proficiencies."20 SCANS determined that proficiency ought to be defined at several levels: preparatory, work-ready, intermediate, advanced, and specialist.21 SCANS' perceptions of proficiency are extremely ambitious. For example, for mathematics and computational skills, SCANS concludes that "virtually all employees should be prepared to maintain records, estimate results, use spread sheets, or apply statistical process controls if they negotiate, identify trends, or suggest new courses of action."22 SCANS estimated that "less than half of young adults can demonstrate the SCANS reading and writing minimums; even fewer can handle the mathematics."23 To the extent that mathematics assessments track the SCANS goals, there may be both education- and employment-related legal consequences attached to mathematics reform initiatives. If a mathematics assessment has the impact of determining an individual's access to employment, then, for purposes of legal and public policy analyses, the assessment may become, in essence, an employment test.

Much legal scrutiny has been applied to the use of tests or assessments, both formal and informal, to determine the employment opportunities of racial and ethnic minorities and of men or women underrepresented in a workforce. Title VII of the Civil Rights Act of 1964 bars discrimination in employment,24 but allows employers to use "professionally developed ability tests… [which are not] designed, intended, or used to discriminate."25 These provisions were interpreted by the Supreme Court in Griggs v. Duke Power as a bar to the use of employment tests that have an adverse impact on protected groups unless the employer can establish that the test "… bear[s] a demonstrable relationship to successful performance of the jobs for which it [is] used."26 This interpretation of Title VII was ratified by Congress when it enacted the Equal Employment Opportunity Act of 1972 (P.L. 92-261).27 Further, the Supreme Court has held that "Title VII forbids the use of employment tests that are discriminatory in effect unless the employer meets 'the burden of showing that any given requirement [has]… a manifest relation to the employment in question' [and showing] that other tests or selection devices, without a similarly undesirable racial effect, would also serve the employer's legitimate interest in 'efficient and trustworthy workmanship.'"28 It is important to note that the types of tests and criteria struck down by the Court under these "job relatedness" standards have included general high school diploma requirements and standardized tests of general ability, such as the Wonderlic.29 Discriminatory tests have been found impermissible

… unless shown, by professionally acceptable methods, to be predictive of or significantly correlated with important elements of work behavior which comprise or are relevant to the job or jobs for which candidates are being evaluated.30

In pursuing the inquiry required by this legal standard, the courts have delved deeply into the technical details of employers' validity and reliability studies,31 including the accuracy of job analyses for determining content validity.32
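The disparate-impact inquiry described above has a simple quantitative core. As an illustration only (the paper does not present this computation, and the "four-fifths" screen comes from EEOC enforcement guidelines rather than from the court decisions discussed here), the following Python sketch compares group selection rates, using hypothetical pass counts that echo the roughly tenfold failure-rate disparity reported in Debra P.:

```python
def selection_rate(passed: int, tested: int) -> float:
    """Fraction of test-takers who passed (the 'selection rate')."""
    return passed / tested

def impact_ratio(group_rate: float, reference_rate: float) -> float:
    """Ratio of a group's selection rate to the most-favored group's rate.

    Under the EEOC's four-fifths guideline, a ratio below 0.80 is
    generally treated as evidence of adverse impact.
    """
    return group_rate / reference_rate

# Hypothetical counts: a 5% failure rate for white students versus a
# 50% failure rate for black students (a tenfold disparity).
white_rate = selection_rate(passed=950, tested=1000)  # 0.95
black_rate = selection_rate(passed=500, tested=1000)  # 0.50

ratio = impact_ratio(black_rate, white_rate)
print(f"impact ratio = {ratio:.2f}")  # 0.53, well below the 0.80 screen
```

A ratio this far below the guideline would shift the burden to the test user to demonstrate job relatedness or, in the educational setting, curricular validity.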

The Supreme Court in Griggs also noted that performance tests should be keyed to higher-level jobs in an employment context only when an employer can demonstrate that "new employees will probably, within a reasonable period of time and in a great majority of cases, progress to a higher level."33 Further, where disparate results occur, differential validation should be done on minority and white groups whenever technically feasible.34

These technical and legal standards would, in a traditional Title VII case, be imposed on employers. However, given the potential for an interrelationship between educational institutions and employers who might wish to rely on individual assessment data, there is the possibility of new types of scrutiny of educational practices if mathematics assessment is linked to employment opportunities.

An irony associated with the current national proposals is that the effort to move to uniform competency standards for all students, standards that go beyond the minimum, increases the potential difficulty for an employer (or a school) attempting to defend the standards in a job-relatedness inquiry under Title VII. The further the skills move from minimal, basic standards, the harder it will be to establish the business necessity of performing all of the skills in all jobs in a particular workplace (or, to set a somewhat lower goal, in all jobs to which all employees in a workplace might realistically aspire).35

The effect of the Civil Rights Act of 199136 has been both a clarification of the standards for assessing discrimination in employment and a strengthening of the legal remedies for intentional, unlawful discrimination. The act codifies a long history of U.S. Supreme Court decisions since Griggs v. Duke Power37 defining the "business necessity" defense for discriminatory acts in employment and the "job relatedness" requirement for employment criteria. The act also bars the practice, used in some employment testing programs, of statistically adjusting scores or using different cutoff scores on the basis of race, color, religion, sex, or national origin.38

Many of those currently promoting the use of assessment to enhance educational achievement and the infusion of workplace-related skills into the assessments have proposed the use of assessment data to determine such things as the award of certificates of mastery of workplace skills. Under such proposals, schools would award competency certificates and employers would use them to make hiring, placement, and promotion decisions. This approach opens the assessments to challenge as employment tests and, as such, could subject them to the "business necessity" or "job-relatedness" standards set forth by the courts if there is a disproportionate impact on the groups protected by Title VII. As consideration of the implementation of assessment reform proceeds, caution should be exercised concerning these school-to-workplace linkages in the use of mathematics assessments.

IMPACT ON PEOPLE WITH HANDICAPPING CONDITIONS

Through both the Rehabilitation Act of 197339 and the Americans with Disabilities Act of 1990,40 federal law now has a system of protections patterned after those set forth in Title VII. These laws create a protected class of those possessing a physical or mental impairment that substantially limits one or more major life activities, if such individuals are otherwise qualified to perform the essential functions of their job with the provision of reasonable accommodation for their disabilities by the employer. These protections, coupled with the provisions of Section 504 of the Rehabilitation Act governing students in educational settings receiving federal financial assistance and the protections of the Education of the Handicapped Act (now the Individuals with Disabilities Education Act),41 suggest that mathematics assessment proposals deserve close scrutiny to determine whether they present potential problems under these statutes. Each content standard needs to be examined to determine whether it would serve as an unlawful bar to participation by a handicapped person in either an educational program or employment. Of greatest importance for mathematics assessment will be the extent to which authentic tasks might present an unreasonable impediment to those with physically handicapping conditions or specific learning disabilities. The mechanisms for implementing both instruction and assessment of the standards will require similar scrutiny. Finally, assurance will be needed that employers do not unlawfully employ the standards to deny access to employment to individuals with handicapping conditions. Here, the job-relatedness and business necessity requirements will be of the utmost importance, and the burden of proof will rest on employers and, perhaps, on the educators making certifications to employers.

Given these standards, the potential for disproportionate impact on the handicapped is high. Even if assessments rather than tests and subjective indicators such as portfolio assessments are used, legal risks are high given the potential impact of any mathematics assessment on decision-making for education or employment.

GENDER-RELATED EFFECTS

Although there is potential for gender differences in almost any testing or assessment program, the issue may be particularly troublesome in mathematics assessment because of significant gender gaps in previous mathematics testing. Performance on such indicators as the Scholastic Aptitude Test (now known as the Scholastic Assessment Test), the National Assessment of Educational Progress, and various vocational aptitude tests is consistently lower for females than for males, particularly on higher-order tasks.

Allegations of gender bias in a mathematics assessment program could be subject to several types of legal challenge. Gender discrimination in education is directly addressed by the provisions of Title IX of the Education Amendments of 197242 and its implementing regulations.43 Title IX bars discrimination on the basis of sex in all educational programs and activities conducted by recipients of federal financial aid. Many states have similar provisions.44 The legal analysis of Title IX challenges to gender disparities on mathematics assessments would probably follow the type of analysis used under Title VII of the Civil Rights Act of 196445 to assess discrimination in employment testing.46 In addition, the provisions of Title VII barring gender discrimination in employment could also apply to use of the assessments in the workplace.

Judicial review of gender-related effects of assessment programs might also occur under the Equal Protection Clause of the Fourteenth Amendment to the U.S. Constitution or under analogous state constitutional provisions. In addition, approximately sixteen states have added equal rights amendments to their state constitutions in an effort to regulate gender discrimination. The state constitutional provisions differ to some extent in their interpretation and applicability by state courts.47

The use of mathematics assessments can, if challenged on the basis of gender discrimination, result in a judicial order to terminate the program, revalidate the assessment, create new assessments, or reconfigure the use of assessment results. It has also now been established that violations of Title IX can result in the award of monetary damages to provide recompense to the victims of unlawful gender discrimination.48

ACCESS TO INSTRUCTION FOR ALL STUDENTS

In addition to the legal issues that pertain to protected groups, there are legal challenges to educational reforms that can be mounted by any student. According to Debra P., under the Due Process Clause of the Constitution, a program may be struck down by the courts if it is found to be "fundamentally unfair in that it may have covered matters not taught in the schools…"49 Courts have deemed this a content validity issue.50 Further, a test must be "a fair test of that which was taught" in order to withstand scrutiny under the Equal Protection Clause. Test fairness, under the Debra P. standard, hinges at least in part on test validity and reliability, two substantial technical hurdles that are apparently far from being resolved in the current discussions of the use of assessment to improve learning. Courts to date have not generally questioned the appropriateness of the content of what is taught except when challenges have been asserted on the basis of claims of denials of liberty,51 establishment or free exercise of religion,52 invasion of privacy, or free speech grounds,53 which will be discussed briefly below.

The judicial holdings in Debra P. on curricular validity reinforced a behaviorist orientation in education at the time, and fairly widespread attention began to be paid, for both educational and constitutional reasons, to requirements that teachers teach to the content of high-stakes tests. The constitutional standards set forth in Debra P., and reiterated in subsequent federal cases,54 will, for reasons discussed below, need to be considered in assessing the potential legal consequences of mathematics assessment, particularly if the individual stakes associated with assessment performance are high.

In 1983, the courts determined that Florida had met its burden of proving that these legal requirements were satisfied after the presentation of a massive set of surveys from local schools in which educators responded that they had addressed the competencies covered on the test. According to subsequent commentators, in part because of the test-curriculum match issue addressed by the courts, the national impact of the minimum competency movement was measurable; states were able to overcome local control effectively by requiring accountability, and pass rates did increase, although in most instances school curricula were diluted because the standards on the tests were set so low.55 Because judicial scrutiny of governmental practices intensifies as the consequences of governmental action increase, the review of implementation efforts may be stricter than that applied to the test-for-diploma programs if the consequences of assessment performance include not only diploma denial but also access to higher education or the workplace. If a national scheme of voluntary local participation in the initiative is created, legal responsibility for defending high-stakes programs under the Due Process and Equal Protection Clauses of the Constitution will rest with each participating state or local governmental entity.

Another issue worthy of further discussion is the legal consequence of moving from challenges to standardized practices, which was most often the case in previous educational testing cases, to circumstances in which assessments involve open-ended tasks and more subjective judgments about the success of performance. Most prior test challenges were class actions brought against standardized testing practices. The use of different tasks or items that vary from school to school or state to state, with subjective assessments of performance, opens the door to thousands of potential individual cases. Courts can be expected to be reluctant to allow this expansion of litigation, particularly since it is terrain that judges are ordinarily very hesitant to traverse. A plethora of individual discrimination cases will be difficult for members of protected groups to pursue. On the other hand, individual challenges by students and other groups on broad constitutional grounds may increase in number and rate of success as the more litigious members of our society apply their financial resources to obtaining legal redress for educational grievances. This appears to be happening at present, for example, in cases involving challenges to disputed test scores from the Educational Testing Service.56

The constitutional standards set forth in Debra P. v. Turlington require explicit recognition of the need to discuss state curriculum frameworks tied to standards.57 Further, the view that standards and assessments should be used not only to help measure progress but also to implement it reflects both the constitutional due process standard embodied in Debra P. and the previous research literature indicating the extent to which a test can drive curriculum and instruction.

Finally, far more consideration needs to be given to how teachers can help to implement the educational reforms being proposed. Setting aside technical questions of assessment validity and reliability, governance, and policy implementation, the initiative will work only if teachers can make it work. MSEB, as well as the National Council of Teachers of Mathematics, has already recognized that teachers will require considerable support to achieve this goal. Empowering teachers to meet the goals through teacher education and professional development will be critical to the success of the endeavor.

ISSUES OF PERSONAL CHOICE

Additional constitutional issues may arise if programs using mathematics assessments follow the lead set by SCANS in its definitions of skills for the workplace. Some of these issues touch on the more controversial political matters presently confronting the nation. Privacy issues have rarely arisen in past debates over curriculum and assessment or testing. However, efforts to assess such variables as those enumerated by SCANS under the "personal qualities" and "interpersonal skills" categories may invite significant constitutional problems.

This nation has a tradition of judicial protection of privacy interests in the face of government attempts to collect sensitive information about individuals or to use such information to determine how government will treat them. For example, one federal court vetoed a junior high school's effort to use a drug-use profile questionnaire to determine student placement in a drug abuse prevention program, holding it a violation of the right to privacy.58 There is also a series of issues regarding what government should do with potentially private information about a student once that information is acquired. For example, the Family Educational Rights and Privacy Act (FERPA) is designed to ensure parental and student access to a student's educational records and to place limitations on the release of information about a student
without prior consent from the family.59 The statute was passed in an era in which Congress' privacy concerns were focused on such issues as student grade transcripts, letters of reference, and school psychologists' evaluation reports, and on the extent to which such information was being used outside a student's school to make potentially damaging judgments about the student. But the language and intent of FERPA apply to certification of mathematics proficiency and perhaps also to the information that may serve as the basis for granting certification, such as a student's performance on a particular assessment task. Given the interest of potential employers in obtaining access to such information, clear privacy protections must be in place.

Related to these privacy issues is a concern about the possibility of challenges on religious grounds to the content and assessment of standards. Policymakers should be prepared for the fact that some religious groups will have bona fide objections that standards or assessment techniques interfere with the free exercise of their religion or their freedom of speech. The U.S. Supreme Court recognized exemptions from certain public education requirements for the Amish on the basis of this religious freedom argument.60 Such a challenge could be raised against some of the more fundamental and objective skills, such as higher-level mathematics or technology, as well as against some of the more subjective assessment techniques. Assessments could conceivably be designed to identify a student's attitudes toward individual responsibility, sociability, or integrity in ways that embody religious or cultural biases. It is also possible that assessments could be implemented in a way that curtails an individual's opportunity to engage in free speech, as might occur if assessments of the SCANS "sociability" or "works with diversity" skills were applied to favor behavior that is, in the currently popular terminology, "politically correct." The primary goal of the free speech clause of the First Amendment to the Constitution is, after all, the protection of all expressions of a point of view, even the most politically unpopular.61

EQUITY AND THE GOVERNANCE OF EDUCATION

With the proposal from some to establish a national assessment system that would truly be national, not federal, current reform initiatives acknowledge the long-standing tradition of state control of education. At the same time, however, the national reform movement encourages conformity in curriculum content, performance goals, and standards of assessment across states and localities. The Tenth Amendment to the U.S. Constitution provides that "the powers not delegated to the United States by the Constitution, nor prohibited by it to the states, are reserved to the states respectively, or to the people." Because education is not a power specifically given to the federal government, this doctrine of enumeration may be seen as barring efforts to create a federally mandated system of standards for states and localities. However, one power explicitly given to the federal government is the power to regulate interstate and foreign commerce. A proposal to use assessment to reform the educational preparation of workers who will participate in interstate commerce may therefore fall within the purview of the federal government's constitutional power to regulate commerce. Given the breadth and depth of political enthusiasm for national education reform, the so-called "states' rights" concerns may be minimal, particularly in comparison with some of the other issues set forth in this paper. Further, any number of federal initiatives have withstood scrutiny under the Tenth Amendment and have been vigorously enforced by the federal courts against the states under the terms of other constitutional provisions. For example, the Fourteenth Amendment's Due Process and Equal Protection Clauses have been used numerous times to enforce a national policy goal. In the education context, the most notable instances were the school desegregation cases, many of which inquired deeply into matters of local school curriculum and instruction.

The failure to adopt a federal system of curriculum standards and assessments presents the potential for fifty different sets of issues concerning validity, reliability, and fairness, with the ensuing possibility of fifty different sets of legal problems. If implementation is local rather than at the state level, each local district could confront its own set of potential legal difficulties. These legal problems are accentuated whenever assessments are used for high-stakes decisions related to high school graduation, college admission, continuing education, and certification for employment. The technical problems inherent in a proposed system of high-stakes assessment are substantial. The potential legal difficulties and the enormity of the policy questions surrounding such a proposal urge great caution on the part of its proponents. A laudable goal such as that of the National Council on Education Standards and Testing (NCEST) to create a system of "tests worth teaching to"62 can be lost among all the other possible goals for the program: improving classroom instruction; improving learning outcomes for all students; informing students, parents, and teachers about student progress; measuring and holding students, schools, school districts, states, and the nation accountable for educational performance; assisting education policymakers with programmatic decisions; certifying proficiency for future employment; providing credentials for college admission; and so on.63 The pursuit of multiple reform goals puts multiple pressures on both the psychometricians designing such a system and the preventive-law specialists anticipating how challenges to it might be mounted. The possibility of a separate set of practices in each governmental entity choosing to implement the program, not to mention in any employer participating in the use of any resulting certification, opens the door to numerous legal challenges. Further, in any such challenge, the national or federal bodies with whom the localities are working might also face the risk of involvement. In short, it may be much wiser in the long run, particularly given that a truly national approach to these problems is being sought, simply to make this a federal effort and abandon the pretense of state and local control. Several commentators have already noted that all of the America 2000 initiatives are moving us inexorably toward a national curriculum.64

EQUITY AND ECONOMICS

Another set of potential legal issues centers on school finance and the current inequities, from district to district and building to building, in financial resources for education. Related to this is the potential role of assessment information in discussions of state takeovers of low-performing or educationally bankrupt school districts. In the past several years, state governments have become more willing to expand their oversight of local education efforts. Some states, such as New Jersey, have implemented receiverships for certain low-performing districts. If mathematics assessment information or other educational accountability reports begin to inform state-level reviews of local district educational achievement, then the mathematics assessment will come to have very high-stakes consequences not only for students but also for local school districts. As a result, such efforts might be subject to
local district challenge on federal constitutional bases concerning due process and equal protection; they would, in addition, invite a broad array of state constitutional and statutory challenges concerning financing of education. Local district or individual school challenges to the use of assessment results might also be mounted under any state or federal school choice scheme in which assessment data could be used to limit a student's opportunity to attend a particular school.

Another practical issue, and one fraught with legal difficulties of another sort, is the disturbing question of how to pay for these ambitious initiatives. In his powerful reflection Savage Inequalities, Jonathan Kozol warns that the decision in Brown v. Board of Education "did not seem to have changed very much for children in the schools I saw, not, at least, outside of the Deep South"65 and that "the dual society, at least in public education, seems in general to be unquestioned."66 Further, most of the urban and less-affluent suburban schools he visited were untouched by school reform, and in the few instances where reform initiatives had been tried, they amounted to little more than "moving around the same old furniture within the house of poverty… In public schooling, social policy has been turned back almost one hundred years."67 At the core of all of these inequities, he finds, is a system of public finance of education that subsidizes and perpetuates these gross denials of educational opportunity.

The Implementation Task Force of the National Council on Education Standards and Testing suggests that equitable distribution of resources among districts, and among schools within districts, is a critical component of implementation at each level of government.68 That group recognized that equity in funding is a key factor in the success of the endeavor69 and will become a major issue in all of the states.70

Federal programs in the past have been critical in providing assistance for the educationally disadvantaged. Such endeavors will need to continue but should be linked tightly to the common content and performance standards.71 NCEST in some respects seems to dismiss problems related to fiscal equity, hoping instead that national standards can create targets toward which educators can strive.72 NCEST argues that states and local districts could work together to overcome deficiencies in resources.73 Given the substantial difficulties that even one state, Texas, has had attempting to arrive at an equalization formula to
address constitutional deficiencies with school funding, this seems an excessively optimistic position.74

Participants in the reform debate must maintain constant awareness of the possibility of unintended legal consequences. Once government defines minimum educational outcomes for all students and creates a presumption that sufficient educational services will be provided so that all students can meet this level of proficiency, it may create an entitlement to an education that the federal courts have never previously been in a position to recognize for constitutional protection. In San Antonio School District v. Rodriguez,75 the U.S. Supreme Court refused to recognize education as a fundamental right under the Constitution. If, however, a fundamental right is in essence created by the creation of an entitlement, then governmental practices may become subject to the burdensome "strict scrutiny" level of analysis applied to practices that work to deny citizens' fundamental interests, a burden nearly impossible for government to meet.

A related issue concerns the fact that the government will have created a legitimate expectation on the part of students that school attendance will result in attainment of a certain level of mathematics skills. This also creates a need to assess whether the doors previously closed by state court judges to claims of "educational malpractice" may be wedged open again as a result of the new national standards and goals.76

William Clune identifies four generic problems confronting efforts to enhance student achievement: poor understanding of effective practice (weak technology, that is, a lack of understanding of which practices produce improved learning); serious problems of policy implementation (central control can do little to affect the activities of millions of teachers and learners across the nation); serious problems of political organization and policy formation (effective educational policy must be carefully designed and tightly coordinated); and significant cost constraints (massive infusions of new capital would be needed to subsidize major change).77 One cannot hope, Clune asserts, to pursue strong educational goals successfully through weak policy instruments; he views efforts at reform through the use of educational indicators and assessments as requiring, in particular, further development of
assessments that are technically defensible; efforts to influence instructional content and practice, in his view, will require tighter systems to guide instruction, perhaps including teacher education, systems to maximize teacher participation, enthusiasm, and responsibility, and a greater curricular focus on higher-order thinking and problem solving. Each of these concerns has an analog in the legal issues discussed above. Without a satisfactory solution to each of these problems, the legal consequences could be substantial. In particular, specific attention must be paid to the impact of these proposals on educationally disadvantaged students. From a policy perspective, issues of equity should be of the utmost importance. From a legal perspective, it may be those who have traditionally been the most educationally disadvantaged who will be able to bring the most successful legal challenges to the endeavor. From an economic perspective, a failure to address effectively the needs of all students will have devastating consequences for the future economic welfare of the entire nation.

CONCLUSION

This paper has provided a brief summary of the principal legal and policy issues that might arise in challenges to a mathematics assessment initiative by members of protected groups traditionally underserved by the nation's schools, by any student who performs poorly on an assessment, or by individual school districts. Enhanced educational attainment in mathematics is a goal with which few could disagree. However, educators and public policymakers should take care that all schools are provided sufficient resources to meet that goal effectively and that all students, no matter their race, ethnicity, language, background, or handicapping condition, are given a fair opportunity to learn and a fair opportunity to demonstrate their learning through assessments. Finally, without an adequate system of financing mathematics education and assessment in all schools, no effort at education reform will succeed.


ENDNOTES

1. This paper was considerably influenced by a previous paper by the author commissioned by the Secretary's Commission on Achieving Necessary Skills of the U.S. Department of Labor.

2. Brown v. Board of Education, 347 U.S. 483 (1954).

3. Debra P. v. Turlington, 474 F. Supp. 244 (M.D. Fla. 1979); aff'd in part, rev'd in part, 644 F.2d 397 (5th Cir. 1981); reh'g en banc denied.

4. 474 F. Supp. at 247.

5. Id. at 249.

6. 42 U.S.C. 2000d.

7. 474 F. Supp. at 252.

8. Id., citing Armstead v. Starkville Municipal Separate School District, 461 F.2d 276 (5th Cir. 1972).

9. 644 F.2d 397, at 400.

10. Id. at 402.

11. See text accompanying endnotes 2-6.

12. 644 F.2d 397, 406-407.

13. 644 F.2d at 407.

14. Although the latter is not a basis for a legal claim in most circumstances, it correlates with race and ethnicity and may thus provide a basis for a legal challenge.

15. An item from the February 1991 Maryland School Performance Assessment Program Grade 8 Mathematics Assessment teacher's guide involves a task asking students to develop a survey plan to collect information on potential respondents to assist a developer's efforts to build a new restaurant. The lowest-scoring sample student answer is "I would ask people in the rich part of the county." Without doubt, that response lacks the richness of detail that would reflect much understanding of sampling methodology even at the eighth-grade level, but for a low-income student who could never contemplate having the opportunity to be a developer, the sample answer says it all.

16. 508 F.2d 1017 (5th Cir. 1975).

17. See also Hobson v. Hansen, 269 F. Supp. 401 (D.D.C. 1967), aff'd sub nom. Smuck v. Hansen, 408 F.2d 175 (D.C. Cir. 1969) (en banc).

18. Larry P. v. Riles, 343 F. Supp. 1306 (N.D. Cal. 1972), aff'd 502 F.2d 963 (9th Cir. 1974).

19. There are also issues concerning both education and employment of persons with limited English proficiency (LEP); these issues are not addressed here on the assumption (perhaps erroneous) that courts will find English proficiency requirements quite acceptable for the nation's future workplaces. However, even if this assumption is true, there is another set of legal issues, unaddressed here, concerning the rights of LEP students to education that meets their special needs.
20. What Work Requires of Schools: A SCANS Report for America 2000. The Labor Secretary's Commission on Achieving Necessary Skills, U.S. Department of Labor (hereinafter SCANS), 1991, p. 24.

21. SCANS, p. 25.

22. SCANS, pp. 26-27.

23. SCANS, p. 27.

24. 42 U.S.C. 2000e-2(a)(1).

25. 42 U.S.C. 2000e-2(h).

26. Griggs v. Duke Power Co., 401 U.S. 424 at 431 (1971).

27. See P. Patterson, "Employment Testing and Title VII of the Civil Rights Act of 1964," in Gifford and O'Connor, pp. 93-95.

28. Albemarle Paper Co. v. Moody, 422 U.S. 405 at 425 (1975).

29. Griggs and Albemarle.

30. Albemarle, at 431.

31. Albemarle, op. cit.

32. See B. Schlei and P. Grossman, Employment Discrimination Law (1983), pp. 98-161 and 1985 Supp. p. 18; see Test Policy and the Politics of Opportunity Allocation: The Workplace and the Law, B. Gifford, ed., Kluwer, Boston (1989).

33. 422 U.S. at 434.

34. 422 U.S. at 435.

35. Note also that to the extent that proposals may be implemented not in a manner analogous to a scored test but rather as a less uniform assessment according to subjective criteria, Title VII is applicable and the Griggs standard is followed. See Schlei, B. L., & Grossman, P. (1983). Employment Discrimination Law (2nd ed.). Washington, DC: Bureau of National Affairs, Inc., pp. 162-190, and 1983-84 Cum. Supp., pp. 21-23.

36. P.L. 102-166.

37. Senate sponsors of the law, including Senators Danforth, Kennedy, and Dole, and the administration created a specific legislative history for the law stating that the terms "business necessity" and "job-relatedness" are intended to reflect the concepts enunciated by the Supreme Court in Griggs v. Duke Power Co., 401 U.S. 424 (1971), and in the other Supreme Court decisions prior to Wards Cove Packing Co. v. Atonio, 490 U.S. 642 (1989). When a decision-making process includes particular, functionally integrated practices that are components of the same criterion, standard, method of administration, or test, such as the height and weight requirements designed to measure strength in Dothard v. Rawlinson, 433 U.S. 321 (1977), the particular, functionally integrated practices may be analyzed as one employment practice.

38. P.L. 102-166, Sec. 106.

39. 29 U.S.C. 701 et seq.

40. 42 U.S.C. 12101 et seq.

41. 20 U.S.C. 1400 et seq.

42. 20 U.S.C. 1681-1687, as amended by the Civil Rights Restoration Act of 1987, codified at 20 U.S.C. 1687.

43. 34 C.F.R. 86.1-86.70.

44. See, for example, Massachusetts General Laws Ann. ch. 76, sec. 5.

45. 42 U.S.C. 2000e.

46. See K. Connor and E. Vargyas, "The Legal Implications of Gender Bias in Standardized Testing," Berkeley Women's Law Journal (1992), pp. 13-89, for an excellent analysis of gender discrimination law as it applies to testing.

47. Id.

48. Franklin v. Gwinnett County Public Schools, 112 S. Ct. 1028 (1992).

49. 644 F.2d at 403, emphasis in original.

50. 644 F.2d 397, 404.

51. See Meyer v. Nebraska, 262 U.S. 390 (1923).

52. See Wisconsin v. Yoder, 406 U.S. 205 (1972).

53. See West Virginia State Board of Education v. Barnette, 319 U.S. 624 (1943).

54. Brookhart v. Peoria, 697 F.2d 182 (7th Cir. 1982). See Anderson v. Banks, 520 F. Supp. 472 (S.D. Ga. 1981), appeal from subsequent order dismissed sub nom. Johnson v. Sikes, 730 F.2d 644 (11th Cir. 1984).

55. E. Baker and R. Stites, "Trends in Testing in the USA," in The Politics of Curriculum and Testing, S. H. Fuhrman and B. Malen (eds.) (1991), pp. 148-149.

56. "Court Orders Testing Service to Release Disputed Scores," The Chronicle of Higher Education, September 2, 1992.

57. Raising Standards for American Education: A Report to Congress, the Secretary of Education, the National Education Goals Panel, and the American People. The National Council on Education Standards and Testing (hereinafter NCEST), Washington, D.C., 1992, p. 7.

58. Merriken v. Cressman, 364 F. Supp. 913 (E.D. Pa. 1973).

59. 20 U.S.C. 1232g et seq.; 34 C.F.R. Part 99.

60. Wisconsin v. Yoder, 406 U.S. 205 (1972).

61. L. Tribe, American Constitutional Law (2nd ed.) (1988), Mineola, NY: The Foundation Press, Inc., pp. 785-1061; M. Yudof, D. Kirp, T. VanGeel, and B. Levin, Educational Policy and the Law (2nd ed.) (1982), Berkeley, CA: McCutchan Publishing, pp. 205-212.

62. NCEST, p. 6.

63. NCEST, pp. 5, 6.

64. E. Baker and R. Stites (1991). Trends in testing in the USA. Politics of Education Association Yearbook 1990 (p. 152). London: Taylor & Francis.

65. J. Kozol, Savage Inequalities: Children in America's Schools. New York: Crown Publishers, Inc. (1991), p. 3.

66. Id., p. 4.

67. Id.

68. NCEST Implementation Task Force, p. G-7.

69. Id., p. G-13.

70. Id.

71. NCEST Standards Task Force Report, p. E-13.

72. Id., p. E-15.

73. Id.

74. Lonnie Harp, "Texas Finance Bill Signed Into Law, Challenges Anticipated," Education Week, 9 June 1993; Lonnie Harp, "Impact of Texas Finance Law, Budget Increase Gauged," Education Week, 16 June 1993; Millicent Lawton, "Alabama Judge Sets October Deadline for Reform Remedy," 23 June 1993.

75. 411 U.S. 1 (1973).

76. See, e.g., E. T. Connors, Educational Tort Liability and Malpractice, 1981, pp. 148-158.

77. W. Clune, "Educational policy in a situation of uncertainty; or, how to put eggs in different baskets," in Fuhrman and Malen, op. cit., pp. 132-133.

Page 175
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 176
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 177
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 178
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 179
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 180
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 181
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 182
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 183
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 184
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 185
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 186
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 187
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 188
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 189
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 190
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 191
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 192
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 193
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 194
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 195
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 196
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 197
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 198
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 199
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 200
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 201
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 202
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 203
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 204
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 205
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 206
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 207
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 208
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 209
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 210
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 211
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 212
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 213
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 214
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 215
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 216
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 217
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 218
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 219
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 220
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 221
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 222
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 223
Suggested Citation:"Commissioned Papers." National Research Council. 1993. Measuring What Counts: A Conceptual Guide for Mathematics Assessment. Washington, DC: The National Academies Press. doi: 10.17226/2235.
×
Page 224