Quality Standards for Performance Assessments
Standards for educational achievement have been developed that delineate the values and desired outcomes of educational programs in ways that are transparent to stakeholders and that provide guidance for curriculum development, instruction, and assessment. In addition, as described in Chapter 3, the measurement profession has developed a set of standards for the quality control of educational assessments. The Standards for Educational and Psychological Testing (American Educational Research Association [AERA] et al., 1999) provide a basis for evaluating the extent to which assessments reflect sound professional practice and are useful for their intended purposes.
This chapter highlights the purposes of assessment and the uses of assessment results that Pamela Moss presented in her overview of the Standards. The discussion then focuses on psychometric qualities examined in the Standards that must be considered in developing and implementing performance assessments. As mentioned in Chapter 3, Moss alluded to a number of measurement concepts during her workshop presentation. To assist readers who might be unfamiliar with the measurement issues included in the Standards, background information is provided on these issues.
USES FOR ASSESSMENT RESULTS
Assessments can be designed, developed, and used for different purposes, two of which—accountability and instruction—are particularly relevant to this report. As noted by several participants at the workshop, these two purposes are not always compatible, as they are concerned with different kinds of decisions and with collecting different kinds of information. Assessments for classroom instructional purposes are typically low stakes, that is, the decisions to be made are not major life-changing ones, relatively small numbers of individuals are involved, and incorrect decisions can be fairly easily corrected. Assessments for accountability, on the other hand, are usually high stakes: The viability of programs that affect large numbers of people may be at stake, resources are allocated on the basis of performance outcomes, and incorrect decisions regarding these resource allocations may take considerable time and effort to reverse—if, in fact, they can be reversed.
Assessment for instructional purposes is designed to facilitate instructional decisions, but instructional decision making is not the primary focus of assessments for accountability purposes. Assessments for instructional purposes may also include tasks that focus on what is meaningful to the teacher and the school or district administrator. But these particular tasks are not generally useful to external evaluators who want to make comparisons across districts or state programs. Hence, there is a trade-off in the kinds of information that can be gleaned from assessments for instructional purposes and assessments for accountability purposes. Assessments that are designed for instructional purposes need to be adaptable within programs and across distinct time points, while assessments for accountability purposes need to be comparable across programs or states.
Assessments for these two purposes also differ in the unit of analysis. When assessments are to be used for instructional purposes, the individual student is typically the unit of analysis. The resulting reported scores need to be sensitive to relatively small increments in individual achievement and to individual differences among students. For the purpose of accountability, the primary unit of analysis is likely to be larger (the class, the program, or the state). Assessments designed for this purpose need to be sensitive, not to individual differences among students but to differences in aggregate student achievement across groups of students (as measured by average achievement or by percentages of students scoring above some level). Because of these differences, the ways in which the quality standards apply to instructional and accountability assessments also differ.
While classroom instructional assessment is important in adult literacy programs, the primary concern of this workshop was with the development of useful performance assessments for the purpose of accountability across programs and across states because that is what the National Reporting System (NRS) requires. The discussion that follows focuses on issues raised by Moss in her presentation that are of concern in meeting quality standards in the context of high-stakes accountability assessment in adult education.
QUALITIES FOR PERFORMANCE ASSESSMENTS IN THE CONTEXT OF ADULT LITERACY
The Standards provide guidance for the development and use of assessments in general. However, discussion at the workshop focused on the ways in which these quality standards apply to, and are prioritized in, performance assessment, particularly in the context of adult education. The four qualities that were highlighted by Moss and others at the workshop are discussed in general terms and then with reference to performance assessment in adult education. These qualities are reliability, validity, fairness, and practicality.
Several points need to be kept in mind. First, the way these qualities are prioritized depends on the settings and purposes of the assessment. Thus, for a low-stakes classroom assessment for diagnosing students’ areas of strength and weakness, concerns for authenticity and educational relevance may be more important than more technical considerations, such as reliability, generalizability, and comparability. For a high-stakes external accountability assessment, higher priority should be given to technical considerations. Much greater care will need to be taken, and more resources will need to be allocated, to ensure that assessments are reliable, valid, and comparable. Nevertheless, even though the qualities may be prioritized differently, all of them are relevant and need to be considered for every assessment.
Second, these qualities need to be considered at every stage of assessment development and use. Test publishers should not wait to determine how well assessments meet these quality standards until after they are in use. Rather, consideration of these standards should inform every decision that is made, from the beginning of test design to final decision making based on the assessment results.
Finally, there are costs associated with achieving quality standards in assessment. Differences in the priorities placed on the various quality standards will be reflected in the amounts and kinds of resources that are needed to achieve these standards. Thus, in any specific assessment situation, there are inevitable trade-offs in allocating resources so as to optimize the desired balance among the qualities.
Reliability

Reliability is defined in the Standards (AERA et al., 1999:25) as “the consistency of . . . measurements when the testing procedure is repeated on a population of individuals or groups.” Any assessment procedure consists of a number of different aspects, sometimes referred to as “facets of measurement.” Facets of measurement include, for example, different tasks or items, different scorers, different administrative procedures, and different occasions when the assessment occurs. A reliable assessment is one that is consistent across these different facets of measurement. Inconsistencies across the different facets of measurement lead to measurement error or unreliability. A reliable assessment is also one that is relatively free of measurement error. The fundamental meaning of reliability is that a given test taker’s score on an assessment should be essentially the same under different conditions—whether he or she is given one set of equivalent tasks or another, whether his or her responses are scored by one rater or another, whether testing occurs on one occasion or another. For additional information on reliability, the reader is referred to Brennan (2001), Feldt and Brennan (1993), National Research Council (NRC) (1999b), Popham (2000), and Thorndike and Hagen (1977). For a discussion on reliability in the context of performance assessment see Crocker and Algina (1986); Dunbar, Koretz and Hoover (1991); NRC (1997); and Shavelson, Baxter and Gao (1993). And for information on reliability in the context of portfolio assessment, see Reckase (1995). For a discussion of reliability in the context of language testing, see Bachman (1990), and Bachman and Palmer (1996).
Evaluating the Reliability of Performance Assessments
Evaluating the reliability of a given assessment requires development of a plan that identifies and addresses the specific issues of most concern. This plan will include both logical analysis and the collection of information or data. Multiple sources of evidence should be obtained, depending on the claims to be supported. Typically, the evaluation of reliability in performance assessments aims to answer five distinct but interrelated questions:
What reliability issues are of concern in this assessment?
What are the potential sources and kinds of error in this assessment?
How reliable should scores from this assessment be?
How can the reliability of the scores be estimated?
How can reliability be increased?
Identifying Reliability Issues of Concern in Performance Assessment
In most educational settings, there are two major reliability issues of concern. One area of concern is the reliability of the scores from the assessments. Unreliable assessments, with large measurement errors, do not provide a basis for making valid score interpretations or reliable decisions. The second area of concern is the reliability of the decisions that will be made on the basis of the assessment results. These decisions may be about individual students (e.g., placement, achievement, advancement) or about programs (e.g., allocation of resources, hiring and retention of teachers). When assessments are used in decision making, errors of measurement can lead to incorrect decisions. Because these errors of measurement are not equally large across the score distribution (i.e., at every score level), decisions based on cut scores set at different points on the scale may differ in their reliability. The reader is referred to Anastasi (1988), Crocker and Algina (1986), and NRC (1999b) for additional discussion on the reliability of decisions based on test scores.
There are two types of incorrect decisions or classification errors. False positive classification errors occur when a student or a program has been mistakenly classified as having satisfied a given level of achievement. False negative classification errors occur when a student or program has been mistakenly classified as not having satisfied a given level of achievement. These classification errors have costs associated with them, but the costs may not be the same for false negative errors and false positive errors (Anastasi, 1988; NRC, 2001b). For example, what are the human and material resource costs of continuing to fund a program that is not meeting its objectives, even though, according to the assessment results, it appears to be performing very well? Alternatively, what is the cost of closing down a program that is, in fact, achieving its objectives, but, according to assessment standards, appears not to be? The potential for these and other types of errors must be considered and prioritized in determining acceptable reliability levels.
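As a rough illustration of how measurement error near a cut score produces both kinds of classification error, the following sketch simulates examinees whose true proficiency is known and classifies their observed (error-laden) scores against a cut. The cut score, error size, and ability range are all invented for illustration, not drawn from any actual assessment:

```python
import random

random.seed(0)

CUT = 60.0   # hypothetical cut score on a 0-100 scale
SEM = 5.0    # assumed standard error of measurement

# Simulate examinees with known true proficiency, add measurement
# error, and classify each observed score against the cut.
false_pos = false_neg = 0
n = 10_000
for _ in range(n):
    true_score = random.uniform(40, 80)           # assumed ability range
    observed = true_score + random.gauss(0, SEM)  # score + measurement error
    if observed >= CUT and true_score < CUT:
        false_pos += 1   # classified as passing, but should not have been
    elif observed < CUT and true_score >= CUT:
        false_neg += 1   # classified as failing, but should not have been

print(f"false positive rate: {false_pos / n:.3f}")
print(f"false negative rate: {false_neg / n:.3f}")
```

Even with a modest error of measurement, a nontrivial share of examinees whose true proficiency lies near the cut are misclassified in each direction.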
Identifying Potential Sources and Kinds of Error in Performance Assessment
Because most performance assessments include several different facets of measurement (e.g., tasks, forms, raters, occasions), a logical analysis of the potential sources of inconsistency or measurement error should be made in order to ascertain the kinds of data that need to be collected. In many performance assessments, the considerable variety of tasks that are presented makes inconsistencies across tasks a potential source of measurement error (Brennan and Johnson, 1995; NRC, 1997). Another potential source of measurement error arises from inconsistencies in ratings. As mentioned previously, scoring performance assessment relies on human judgment. Inevitably, unless the individuals who are rating test takers’ performances are well trained, subjectivity will be a factor in the scoring process. Another source of inconsistency might be administrative procedures that differ across programs or states.
Determining How Reliable Scores from Given Performance Assessments Should Be
The level of reliability needed for any assessment will depend on two factors: the importance of the decisions to be made and the unit of analysis. Because most classroom assessment for instructional purposes is relatively low stakes, lower levels of reliability are considered acceptable. Hence, relatively few resources need to be expended in collecting reliability evidence for a low-stakes assessment. On the other hand, external assessments for accountability purposes, especially for individuals or small units, are relatively high stakes. Very high levels of reliability are needed when high-stakes decisions are based on assessment results. Considerable resources need to be expended to collect evidence to support claims of high reliability for these assessments.
When students’ scores are used to make decisions about individual students, the reliability of these scores will need to be estimated. Estimating reliability is not a complex process, and appropriate procedures for this can be found in standard measurement textbooks (e.g., Crocker and Algina, 1986; Linn, Gronlund, and Davis, 1999; Nitko, 2001). Decisions about programs are usually based on the average scores of groups of students, rather than individuals. The reliability of these average scores will generally be better than that of individual scores because the errors of measurement will be averaged out across students. Thus, when decisions about programs are based on group average scores, higher levels of reliability can be expected than would be typically obtained from the individual scores upon which the group averages are based. Again, procedures are described in standard measurement texts.
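The averaging-out of measurement error can be demonstrated with a small simulation. The individual error of measurement, class size, and score scale below are assumed purely for illustration; the point is that the error in a class average shrinks roughly in proportion to the square root of the class size:

```python
import random
import statistics

random.seed(1)

SEM = 8.0  # assumed standard error of measurement for one student
N = 25     # hypothetical class size

# Replicate a class average many times: each student's observed score
# is a fixed true score plus fresh random measurement error.
true_scores = [random.uniform(200, 260) for _ in range(N)]  # assumed scale
class_means = []
for _ in range(2000):
    observed = [t + random.gauss(0, SEM) for t in true_scores]
    class_means.append(statistics.mean(observed))

# The spread of the class mean due to measurement error alone is close
# to SEM / sqrt(N), far smaller than the individual SEM.
sd_of_mean = statistics.stdev(class_means)
print(f"individual SEM: {SEM:.1f}")
print(f"SD of class mean: {sd_of_mean:.2f} (theory: {SEM / N ** 0.5:.2f})")
```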
Measurement error is only one type of error that arises when decisions are based on group averages. If the evaluation of program effectiveness is based on a sample of classes or programs rather than the entire population of such groups, the amount of sampling error must be considered. Sampling error can be considerable even when the group average scores are highly reliable. This error results from variation across groups or from year to year in terms of how well the groups represent the population from which they are sampled. If the groups do not adequately represent the population, the group average scores may be biased. Even if the groups represent the populations, it may be that the sample is such that there is a great deal of variability in the results. In either case, decisions based on these group average scores may be in error.
Another issue arises when class or program average gain scores are used as an indicator of program effectiveness (AERA et al., 1999, Standard 13.17). “Gain score” refers to the change in scores from pretest to posttest. Even though the reliabilities of group gain scores might be expected to be larger than those obtained from individual gain scores, the psychometric literature has pointed out a dilemma concerning the reliability of change scores (see the discussion in Harris, 1963, for example).1 One solution to the dilemma seems to be to focus on the accuracy of change measures, rather than on reliability coefficients in and of themselves. Nevertheless, the use of gain scores as indicators of change is a controversial issue in the measurement literature, and practitioners would be well advised to consult a measurement specialist or to review the technical literature on this subject (e.g., Zumbo, 1999) before making decisions based on gain scores.
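One way to see the dilemma is through the classical formula for the reliability of a difference score, which (under the simplifying assumption of equal pretest and posttest variances) depends on the two tests' reliabilities and on the pretest-posttest correlation. The coefficients below are illustrative values, not estimates from any actual adult education assessment:

```python
def gain_score_reliability(r_pre: float, r_post: float, r_pre_post: float) -> float:
    """Classical reliability of a difference (gain) score, assuming the
    pretest and posttest score variances are equal."""
    return ((r_pre + r_post) / 2 - r_pre_post) / (1 - r_pre_post)

# Two reasonably reliable tests whose scores correlate highly with each
# other (as pretests and posttests usually do) yield a much less
# reliable gain score.
print(f"{gain_score_reliability(0.85, 0.85, 0.70):.2f}")  # -> 0.50
print(f"{gain_score_reliability(0.85, 0.85, 0.40):.2f}")  # -> 0.75
```

The higher the pretest-posttest correlation, the lower the reliability of the gain, which is one reason the literature urges caution with gain scores.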
Estimating the Reliability of Scores
There is a wide range of well-defined approaches to estimating the reliability of assessments, both for individuals and for groups; these are discussed in general in the Standards, while detailed procedures can be found in measurement textbooks (e.g., Crocker and Algina, 1986; Linn et al., 1999; Nitko, 2001). These approaches include calculating reliability coefficients and standard errors of measurement based on classical test theory (e.g., test-retest, parallel forms, internal consistency), calculating generalizability and dependability coefficients based on generalizability theory (Brennan, 1983; Shavelson and Webb, 1991), calculating the criterion-referenced dependability and agreement indices (Crocker and Algina, 1986), and estimating information functions and standard errors based on item response theory (Hambleton, Swaminathan, and Rogers, 1991). In general, the specific approaches that should be used depend on the specific assessment situation and the unit of analysis and should address the potential sources of error that have been identified. No single approach will be appropriate for all situations. To determine the appropriate approach, consultation with professional measurement specialists is important.
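As one concrete example of a classical internal-consistency estimate, Cronbach's alpha can be computed directly from a matrix of scores. The sketch below treats four raters' scores for five performances as the "items"; the numbers are made up for illustration:

```python
def cronbach_alpha(item_scores):
    """Internal-consistency (alpha) estimate from a score matrix
    (rows = examinees, columns = items, tasks, or raters)."""
    n_items = len(item_scores[0])

    def var(xs):  # population variance, as in the usual formula
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [var([row[i] for row in item_scores]) for i in range(n_items)]
    total_var = var([sum(row) for row in item_scores])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)

# Four raters scoring five performances (hypothetical data): the more
# consistently the columns rank the examinees, the higher alpha.
scores = [
    [4, 4, 5, 4],
    [2, 3, 2, 2],
    [5, 5, 5, 4],
    [3, 3, 4, 3],
    [1, 2, 1, 2],
]
print(f"alpha = {cronbach_alpha(scores):.2f}")
```

This is only one of the approaches listed above; generalizability theory, for example, would instead partition the error among tasks, raters, and occasions simultaneously.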
Determining How Reliability Can Be Increased
When the estimates of reliability are not sufficient to support a particular inference or score use, this may be due to a number of factors. One set of factors has to do with the size and nature of the group of individuals on which the reliability estimates are based. If the groups used to collect data for estimating reliability either are too small or do not adequately represent the groups for which the assessments are intended, reliability estimates may be biased. If this is the case, the test developer or user will need to collect data from other larger and more representative groups. In most cases, however, low reliability can be traced directly to inadequate specifications in the design of the assessment or to failure to adhere to the design specifications in the creating and writing of assessment tasks. For this reason, the single most important step in ensuring acceptable levels of reliability is to design the assessment carefully and to adhere to this design throughout the test development process. As described in Chapter 3, the design process involves the following: clear and detailed descriptions of the abilities to be assessed and of the characteristics of test takers, clear and detailed task specifications for the assessment, clear and standardized administrative procedures, clear and understandable scoring procedures and criteria, and sufficient and effective training and monitoring of raters. The training of raters may have an additional benefit—it may tie in with professional development for teachers in adult education programs. When reliability estimates are low, each step in the development process should be revisited to identify potential causes and ways to increase reliability.
Validity

Validity is defined in the Standards as “the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests” (AERA et al., 1999:9). Validity is a quality of the ways in which scores are interpreted and used; it is not a quality of the assessment itself. Validation is a process that “involves accumulating evidence to provide a sound scientific basis for the proposed score interpretations” (AERA et al., 1999:9). As with building support for claims about reliability, validation involves both the development of a logical argument and the collection of relevant evidence. The specific purposes for which the assessment is intended will determine the particular validation argument that is framed and the claims about score-based inferences and uses that are made in this argument. And the claims that are made in the validation argument will, in turn, determine the kinds of evidence that need to be collected. For more information, see Messick (1989, 1995) and NRC (1999b). For an approach to framing a validation argument for language tests, see Bachman and Palmer (1996).
Three types of claims can be articulated in a validation argument. First, claims about score-based interpretations are derived from the explicit definition of the constructs, or abilities, to be measured; these claims argue that the test scores are reasonable indicators of these abilities, and they pertain to the construct validity of score interpretations. Second, claims about intended uses are twofold: they include the claim about construct validity and they argue that the construct or ability is relevant to the intended purpose, and that the assessment is useful for this purpose. Third, claims about the consequences of test use include an argument that the intended consequences of test use actually occur and that possible unintended or unfavorable consequences do not occur.
Many different kinds of evidence can be collected to support the claims made in the validation argument. The kinds of evidence that are relevant depend on the specific claims. No single type of evidence will be sufficient for supporting all kinds of claims or for supporting a given claim for all times, situations, and groups of test takers. The Standards discusses the following sources of evidence that support a validation argument:
Evidence based on test content. Evidence that the test content is relevant to and representative of the content domain to be assessed can be collected through expert judgments and through logical and empirical analyses of assessment tasks and products.
Evidence based on response processes. Evidence that the assessment task engages the processes entailed in the construct can be collected by observing test takers take assessment tasks and questioning them about the processes or strategies they employed while performing the assessment task, or by various kinds of electronic monitoring of test-taking performance.
Evidence based on internal structure. Evidence that the observed relationships among the individual tasks or parts of the assessment are as specified in the construct definition can be collected through various kinds of quantitative analyses, including factor analysis and the investigation of dimensionality and differential item functioning. (See Comrey and Lee, 1992; Crocker and Algina, 1986; Cureton and D’Agostino, 1983; Gorsuch, 1983.)
Evidence based on relations to other variables. Evidence that the scores are related to other indicators of the construct and are not related to other indicators of different constructs needs to be collected. The relationship between test scores and these other indicators provides criterion validity information. When the indicators reflect performance at the same time as the testing, this provides evidence of concurrent validity. When the indicators are gathered at some future time after the test, this provides evidence of predictive validity. When data for these analyses are collected, the accuracy and relevance of the indicators used in the analyses are of primary concern. An additional consideration in some situations is the extent to which evidence based on the relationship between test scores and other variables generalizes to another setting or use. That is, the evidence has been gathered for a particular group or setting, and it cannot be assumed that it will generalize to other groups or settings.
Evidence based on consequences of testing. Evidence that the assessment will have beneficial outcomes can be collected by studies that follow test takers after the assessment or that investigate the impact of the assessment and the resulting decisions on the program, the education system, and society at large. Evidence about unintended consequences of assessment can also be collected in this way. In this context, for example, accountability requirements may well impede program functioning, or they may conflict with client goals. Another kind of consequence that needs to be considered is impact on the educational processes—teaching and learning. One of the arguments made in support of performance assessments is that they are instructionally worthy, that is, they are worth teaching to (AERA et al., 1999:11-14).
Specific Validity Concerns in Performance Assessment in Adult Education
In addition to these general validity considerations, a number of specific concerns arise in the context of accountability assessment in adult education: (1) the comparability of assessments across programs and states, (2) the relative insensitivity of the reporting scales of the NRS to small gains, and (3) difficulties in interpreting gain scores.
Comparability of Accountability Assessments
If performance assessments are to be used to make comparisons across programs and states, these assessments must themselves be comparable. That is, if assessments are to be compared, an argument needs to be framed for claiming comparability, and evidence in support of this claim needs to be provided. Several general types of comparability and associated ways of demonstrating comparability of assessments have been discussed in the measurement literature (e.g., Linn, 1993; Mislevy, 1992; NRC, 1999c). These ways of making assessment results comparable are referred to as linking methods. The descriptions below draw especially on the presentation by Wendy Yen and are further described in Linn (1993), Mislevy (1992), and NRC (1999c).
Equating is the most demanding and rigorous, and thus the most defensible, type of linking. It is reserved for situations in which two or more forms of a single test have been constructed according to the same blueprint. The forms adhere to the same test specifications, are of about the same difficulty and reliability, are given under the same standardized conditions, and are to be used for the same purposes. Scores and score interpretations from assessments that are equated can be used interchangeably so that it is a matter of indifference to the examinee which form or
version of the test he or she receives. Equating is carried out routinely for new versions of large-scale standardized assessments.
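A minimal sketch of one simple equating method, linear equating, may clarify the idea: a score on form X is mapped to the form Y scale so that it sits the same number of standard deviations from the mean. The score distributions below are invented, and operational equating programs use far more elaborate data-collection designs (anchor items, equivalent groups, equipercentile methods) than this illustration:

```python
import statistics

def linear_equate(score_x, form_x_scores, form_y_scores):
    """Map a form X score to the form Y scale so that it lies the same
    number of standard deviations from the mean (linear equating)."""
    mx, sx = statistics.mean(form_x_scores), statistics.pstdev(form_x_scores)
    my, sy = statistics.mean(form_y_scores), statistics.pstdev(form_y_scores)
    return my + sy * (score_x - mx) / sx

# Made-up score distributions from two parallel forms given to
# equivalent groups: form Y ran slightly harder than form X.
form_x = [55, 60, 65, 70, 75]
form_y = [50, 55, 60, 65, 70]
print(linear_equate(65, form_x, form_y))  # an average form X score maps
                                          # to the average form Y score
```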
Calibration is a less rigorous type of linking. If two assessments have the same framework but different test specifications (including different lengths) and different statistical characteristics, then linking the scores for comparability is called calibration. The tests measure the same content and skills but do so with different levels of accuracy and different reliability. Unlike equating, which directly matches scores from different test forms, calibration relates scores from different versions of a test to a common frame of reference and thus links them indirectly. Calibration is commonly used in several situations. Sometimes a short form of a test is used for screening purposes, and its scores are calibrated with scores from the longer test. For example, calibration could be used to estimate, on the basis of a short assessment, the percentage of students in a program or in a state who would achieve a given standard if they were to take a longer, more reliable assessment. Sometimes tests designed for different grade levels are calibrated to a common scale, a process referred to as vertical equating.
Projection, or prediction, is used to predict scores for one assessment based on those for another. There is no expectation that the content or constructs assessed on the two tests are similar, and the tests may have different levels of reliability. The statistical procedure for projection is regression analysis. It is important to note that projecting test A onto test B produces a different result from projecting test B onto test A. A limitation of projection is that the predictions that are obtained are highly dependent on the specific contexts and groups on which they are based. Additional studies to cross-validate these predictions are necessary if they are to be used with other groups of examinees because the relationships can change over time or in response to policy and instruction.
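The asymmetry of projection can be shown with a small ordinary-least-squares sketch. The paired scores below are hypothetical examinees who took both tests; the point is that the regression of B on A and the regression of A on B are different lines unless the correlation is perfect:

```python
def regression_line(xs, ys):
    """Ordinary least-squares slope and intercept for predicting y from x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up scores for examinees who took both test A and test B.
a = [10, 12, 15, 18, 20]
b = [40, 41, 47, 49, 55]

# Projecting A onto B and B onto A give different lines: the two
# regressions are not inverses of each other unless r = 1.
slope_ab, int_ab = regression_line(a, b)  # predict B from A
slope_ba, int_ba = regression_line(b, a)  # predict A from B
print(f"B from A: B = {slope_ab:.2f}*A + {int_ab:.2f}")
print(f"A from B: A = {slope_ba:.2f}*B + {int_ba:.2f}")
```

In addition, because the fitted line reflects only the group on which it was estimated, applying it to a different group (the cross-validation concern noted above) can produce systematically wrong predictions.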
Statistical moderation is used to align the scores from one assessment (test A) to scores from another assessment (test B). There is no expectation that tests A and B measure the same content or constructs, but the desire is to have scores that are in some sense comparable. With statistical moderation, the aligning process is based on some common assessment taken by both groups of examinees (test A and test B test takers).
Social moderation is a nonstatistical approach to linking. Like statistical moderation, it is used when examinees have taken two different assessments, and the goal is to align the scores from the two assessments. Unlike statistical moderation, the basis for linking is the judgment of experts, common standards, and exemplars of performance that are aligned to these standards. Social moderation replaces the statistical and measurement requirements of the previous approaches with consensus among experts on common standards and on exemplars of performance. The resulting links (e.g., that a score of a on test A is roughly comparable to a score of b on test B) are only valid for making very general comparisons. The approach is often used to align students’ ratings on performance assessment tasks. More relevant to this report is the use of social moderation to verify samples of student performances at various levels in the education system (school, district, state) and to provide an audit function for accountability.
Equating, calibration, or statistical moderation is typically used in high-stakes accountability systems. Social moderation is generally not considered adequate for assessments used for high-stakes accountability decisions.
The extent to which states’ programs are aligned with the NRS standards is not known and was not the primary focus of this workshop. Further, although there may be states in which programs are consistent across the state, there is also the potential for lack of comparability of assessments across adult education programs and between states. This potential lack of comparability prompted workshop participants to raise a number of concerns, including the following:
the extent to which different programs and states define and cover the domain of adult literacy and numeracy education in the same way;
the consistency with which different programs and states are interpreting the NRS levels of proficiency;
the consistency, across programs and across states, in the kinds of tasks that are being used in performance assessments for accountability purposes; and
the extent to which these different kinds of assessments are aligned with the NRS standards.
These potential differences in the assessments used in adult education programs mean that none of the statistical procedures for linking described above are, by themselves, likely to be possible or appropriate. Social moderation, however, may provide a basis for framing an argument and supporting a claim about the comparability of assessments across programs and states.
Linn (1993) provides examples of uses of social moderation that are relevant to the context of accountability assessment in adult education, while Mislevy (1995) discusses approaches to linking, including social moderation, in the specific context of assessments of adult literacy. In his workshop presentation, Henry Braun gave two examples of what he calls “cross-walks” that use social moderation as an approach to linking scores from different assessments so they can support claims for comparability. He provided some specific suggestions for how this might be accomplished through the collaboration of various stakeholders, including publishers and state adult education departments.
All three experts call for certain elements to be present if the social moderation process is to gain acceptance among stakeholders. First, there must be an agreed-upon standard, or set of criteria, which provides the substantive basis for the moderation (i.e., for the process of aligning scores from different assessments). Second, there needs to be a pool of experts who are familiar with the content and context, the moderation procedure, and the criteria. Third, there must be a pool of exemplar student performances or products (benchmark performances) that the experts agree are aligned to different levels on the standard.
In the adult education context, the NRS can be considered the common standard, and the group of experts might include adult education teachers, program directors, state adult education administrators, test publishers, and external experts in the areas of adult education, literacy, and measurement. Braun suggested that the quality and comparability of the assessments could be improved by relying on test publishers’ help. Publishers or states interested in developing assessments for adult education could be asked to state explicitly how the assessments relate to the framework, whether it is the NRS framework or the Equipped for the Future (EFF) framework, and to clearly document the measurement properties of their assessments.
Large Bands and Small Gains: The Relative Insensitivity of NRS Scales to Small Gains in Proficiency
The effectiveness of adult education programs is evaluated in terms of the percentages of students whose scores increase at least one NRS level from pretest to posttest. But, as Braun pointed out, two characteristics of the NRS scales create difficulties for their use in reporting gains in achievement. First, the NRS is essentially an ordinal scale2 that breaks up what is, in fact, a continuum of proficiency into six levels that are not necessarily evenly spaced. An ordinal scale groups people into categories, and Braun cautioned that when this happens, there is always the possibility that some people will be grouped unfairly and others will be given an advantage by the grouping. In addition, although many students may make important gains in terms of their own individual learning goals, these gains may not move them from one NRS level to the next, and so they would be recorded as having made no gain. Indeed, given the breadth of the NRS scale intervals, the average gain may turn out to be zero unless many more scale points are differentiated within levels. Furthermore, the criterion for program effectiveness is a certain percentage of students who gain at least one NRS level, but many students are likely to achieve only relatively small gains in their limited time in adult education programs. This situation may result in individual programs devising ways in which to “game” the system; for example, they might admit or test only those students who are near the top of an NRS scale level. As Braun said, “We need to begin to develop some serious models for continuous improvement so we avoid the rigidity of a given system and the inevitable gamesmanship that would then be played out in order to try to beat the system.”
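The way a broad ordinal banding scheme records substantial within-band improvement as zero gain can be sketched with a toy example. The cut points below are hypothetical illustrations, not the actual NRS level boundaries:

```python
import bisect

# Hypothetical cut points dividing a 0-100 proficiency continuum
# into six ordinal levels (NOT the real NRS boundaries).
CUTS = [20, 40, 60, 75, 90]

def level(score):
    """Map a continuous proficiency score to an ordinal level 1-6."""
    return bisect.bisect_right(CUTS, score) + 1

# A student who improves substantially within one band...
pre, post = 42, 58
print(level(pre), level(post))    # both scores fall in level 3
print(level(post) - level(pre))   # recorded gain: 0 levels

# ...while a tiny gain that happens to straddle a cut point
# counts as a full level.
print(level(59), level(61))       # levels 3 and 4
```

The sketch also shows the opening for gamesmanship Braun warned about: a program that admits or tests only students sitting just below a cut point maximizes recorded level gains without any corresponding difference in actual learning.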
Braun raised another complicating issue: The NRS educational functioning levels are not unidimensional but are defined in terms of many skill areas (literacy, reading, writing, numeracy, functional and workplace). Although a student might make excellent gains in one area, if he or she makes less impressive gains in the area that was lowest at intake, the student cannot increase a functioning level according to the DOEd guidelines (2001a). Braun noted that the levels can also affect program evaluation. For example, because of a program’s particular resources and teaching expertise or the particular needs of its clientele, it may do an excellent job at teaching reading, but the students’ overall progress is not sufficient to move them from one NRS level to the next. As a result, the program would receive no credit for its students’ impressive gains in reading.
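A simplified model of this multidimensional rule can be sketched as follows. The skill areas and level numbers are invented for illustration, and the model assumes, per the description above, that the overall functioning level is governed by the student's lowest skill area:

```python
# Hypothetical skill-area levels at intake and posttest, on a 1-6 scale.
# Simplifying assumption for illustration: the overall functioning level
# is the minimum across skill areas, so the lowest area at intake governs.
intake   = {"reading": 3, "writing": 2, "numeracy": 2}
posttest = {"reading": 5, "writing": 3, "numeracy": 2}

def overall(scores):
    """Overall functioning level under the lowest-area rule."""
    return min(scores.values())

print(overall(intake), overall(posttest))  # 2 2 -> no recorded level gain,
# despite a two-level gain in reading and a one-level gain in writing
```

Under this rule, the program's success in teaching reading is invisible in the reported outcome, which is exactly the evaluation problem Braun described.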
An additional concern is that the kinds of performance assessments being envisioned may be even less sensitive to small developmental increments than some assessments already in use. Performance assessment tasks tend to be more cognitively demanding, educationally relevant, and authentic to real-life situations, which means they are usually not designed to focus either on small increments or on the component skills and abilities that may contribute to successful performance on the task as a whole.
Difficulties in Interpreting Gain Scores Due to the Effects of Instruction
Some of the measurement issues in using gain scores as indicators of student progress have been discussed above. In addition to these measurement issues, a number of other problems make it difficult to attribute score gains to the effects of the adult education program. Braun explained that the fundamental problem is that there are a number of factors in the students’ environment, other than the program itself, which might contribute to their gains on assessments. Most students who are English-language learners are living in an environment in which they are surrounded by English. Many are also working at jobs where they are exposed to materials in English and required to process both written language and numerical information in English. The amount of this exposure varies greatly from student to student and from program to program. Thus, it is difficult to know the extent to which observed gain scores are due to the program rather than to various environmental factors.
To rigorously study the effects of adult education on literacy, it would be necessary to distinguish its effects from those of the environment. Furthermore, differences in the home environments of students, as well as any preexisting individual differences in students as they enter an adult education program, would need to be controlled. This would require an experiment in which individuals from the adult population were selected at random, with some assigned at random to adult education classes while the others (the comparison group) simply continued with their lives and did not pursue adult education. Although a few experimental studies have been conducted (St. Pierre et al., 1995), there are obvious reasons—practical, pedagogical, and ethical—for not implementing this kind of experimental control. First, students in adult education programs are largely self-selected, and it would be impractical to try to obtain a random sample of adults to attend adult education classes. Second, if the adult education classes included students who were randomly selected rather than people who had chosen to take the classes, there would be major consequences for the ways in which the adult education classes were taught. Finally, denying access to adult education to the individuals in the comparison group would raise serious ethical questions about equal access to the benefits of our education system. Thus, it is neither possible nor desirable to conduct studies in educational settings with the level of experimental control expected in a laboratory. This lack of control makes it extremely difficult to distinguish between the effects of the adult education program and the effects of the environment.3
The Standards discusses four aspects of fairness: (1) lack of bias, (2) equitable treatment in the testing process, (3) equality in outcomes of testing, and (4) opportunity to learn (AERA et al., 1999:74-76).
Lack of Bias
The Standards defines bias as occurring when scores have different meanings for different groups of test takers, and these differences are due to deficiencies in the test itself or in the way it is used (AERA et al., 1999:74). Bias may be associated with the inappropriate selection of test content; for example, the content of the assessment may favor students with prior knowledge or may not be representative of the curricular framework upon which it is based (Cole and Moss, 1993; NRC, 1999b). Potential sources of bias can be identified and minimized in a variety of ways including: (1) judgmental review by content experts, and (2) statistical analyses to identify differential functioning of individual items or tasks or to detect systematic differences in performance across different groups of test takers.
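One such statistical screen, a stratified comparison of item pass rates in the spirit of differential item functioning (DIF) analysis, can be sketched with fabricated data. A real analysis would use a formal procedure such as Mantel-Haenszel; this sketch shows only the core idea of comparing groups within matched total-score strata:

```python
from collections import defaultdict

# Fabricated examinee records for illustration:
# (group, total test score, passed the item under scrutiny)
records = [
    ("ref", 10, 1), ("ref", 10, 1), ("ref", 10, 0), ("focal", 10, 0),
    ("focal", 10, 0), ("focal", 10, 1), ("ref", 20, 1), ("focal", 20, 1),
]

# Tally item passes and counts per group within each total-score stratum.
strata = defaultdict(lambda: {"ref": [0, 0], "focal": [0, 0]})  # [passes, n]
for group, total, passed in records:
    cell = strata[total][group]
    cell[0] += passed
    cell[1] += 1

# A large gap in pass rates between groups of comparable overall
# proficiency flags the item for judgmental review.
for total, cells in sorted(strata.items()):
    for group in ("ref", "focal"):
        passes, n = cells[group]
        print(total, group, round(passes / n, 2))
```

Matching on total score is what distinguishes a DIF screen from a raw group comparison: it separates possible item bias from true group differences in the skill being measured.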
Equitable Treatment in the Testing Process
All test takers should be given a comparable opportunity to demonstrate their level of the skills and knowledge measured by the assessment (NRC, 1999b). In most cases, standardization of assessments and administrative procedures will help ensure this. However, some aspects of an assessment may pose a particular challenge to certain groups of test takers, such as those with a disability or those whose native language is not English. In these cases, specific accommodations, or modifications of the standardized assessment procedures, may result in more useful assessments. All test takers also need to be given an equal opportunity to prepare for and familiarize themselves with the assessment and assessment procedures. Finally, the reporting of assessment results needs to be accurate, informative, and treated confidentially for all test takers.
Equality in Outcomes of Testing
Unequal performance across different population groups on a given assessment is not necessarily the result of unfair assessment. Differential test performance across groups may, in fact, be due to true group differences in the skills and knowledge being assessed; the assessment simply reflects these differences. Alternatively, differential group performances may reflect bias in the assessment. When differences occur, there should be heightened scrutiny of the test content, procedures, and reporting (NRC, 1999b). If there is strong evidence that the assessment is free of bias and that all test takers have been given fair treatment in the assessment process, then conditions for fairness have been met. The reader is referred to Bond (1995) and Cole and Moss (1993) for additional information on bias and fairness in testing in general and to Kunnan (2000) for discussions of fairness in language testing.
Opportunity to Learn
In educational settings, many assessments are intended to evaluate how well students have mastered material that has been covered in formal instruction. If some test takers have not had an adequate opportunity to learn these instructional objectives, they are likely to get low scores. These low scores differ in meaning from low scores that result from a student’s having had the opportunity to learn and having failed to learn. Interpreting both types of low scores as if they mean the same thing is fundamentally unfair. In the context of adult literacy, where there are extreme variations in the amount of time individual students attend class (e.g., 31 hours per student per year in the 10 states with the lowest average and up to 106 hours per student among the 10 states with the highest average), the fairness of using assessments that assume attendance over a full course of study becomes a crucial question.
Three problematic issues need to be considered with respect to this conception of fairness. First, opportunity to learn is a matter of degree. In addition, in order to measure some outcomes, it may be necessary to present students with new material. Second, even though the assessment may be based on a well-defined curricular content domain, it will nonetheless be only a sample of the domain. It may not be possible to determine the exact content coverage of a student’s assessment. Finally, in many situations, it is important to ensure that any credentials awarded reflect a given level of proficiency or capability.
In the context of adult literacy assessment, the issues discussed above— comparability of assessments, insensitivity of the NRS functioning levels to small increments in learning, and the use of gain scores—are also fairness issues. If different assessments are used in different programs and different states, one may well question whether they favor some test takers over others, and whether all test takers are given comparable treatment in the testing process. If gain scores are used to evaluate program effectiveness, the relative insensitivity of the NRS levels may be unfair to students and programs that are making progress within but not across these levels.
Several of the workshop participants pointed out that issues of fairness, like those of validity, need to be addressed from the very beginning of test design and development. In addition, there is considerable potential for professional development in helping teachers understand that fairness includes making learners aware of the kinds of assessments they will encounter and ensuring that these assessments are aligned with their instructional objectives.
Finally, an overriding quality that needs to be considered is practicality or feasibility. Attaining each of the above quality standards in any assessment carries with it certain costs or required resources. To the extent that the resources are available for the design, development, and use of an assessment, the assessment can be said to be practical or feasible. Practicality concerns the adequacy of resources and how these are allocated in the design, development, and use of assessments. The resources to be considered are human resources, material resources, and time. Human resources include test designers, test writers, scorers, test administrators, data analysts, and clerical support. Material resources include space (rooms for test development and test administration), equipment (word processors, tape and video recorders, computers, scoring machines), and materials (paper, pictures, audio- and videotapes or disks, library resources). Time resources include the time available for the design, development, pilot testing, and other aspects of assessment development; assessment time (the time available to administer the assessment); and scoring and reporting time. Obviously, all these resources have cost implications as well.
In most assessment situations, these resources will not be unlimited. Thus, there will be inevitable trade-offs in balancing the quality standards discussed above with what is feasible with the available resources. Braun discussed a trade-off between validity and efficiency in the design of performance assessments. There may be a gain in validity because of better construct representation, as well as authenticity and more useful information. However, there is a cost for this in terms of the expense of developing and scoring the assessment, the amount of testing time required, and lower levels of reliability. The reader is referred to Bachman and Palmer (1996) for a discussion of issues in assessing practicality and balancing the qualities of assessments in language tests.
Bob Bickerton spoke about practicality issues in the adult education environment. He noted that the limited hours many ABE students attend class directly affect the practicality of obtaining the desired score gains: students are unlikely to persist long enough to be posttested and, even if they are, are unlikely to show a gain as measured by the NRS. John Comings said his research indicated that for a student to achieve a 75 percent likelihood of making a gain of one grade level equivalent or one student performance level, he or she would have to receive 150 hours of instruction (Comings, Sum, and Uvin, 2000). Bickerton added that Massachusetts has calculated that it takes an average of 130 to 160 hours to complete one grade level equivalent or student performance level (see SMARTT ABE http://www.doe.mass.edu/acls [April 29, 2002]). The NRS defines six ABE levels and six ESOL levels. A comparison of the NRS levels with currently available standardized tests indicates that each NRS level spans approximately two grade level equivalents or student performance levels. Bickerton noted that it could take up to double the 150 hours mentioned above to complete one NRS level for students who, on average, receive instruction for a total of just 66 to 86 hours per year (DOEd, 2001c). These issues of practicality or feasibility are of particular concern in the development and use of performance assessments in adult education. Chapters 5 and 6 discuss these issues in greater detail.