7 Judging the Quality and Utility of Assessments
Pages 181-232

From page 181...
... Next we review methods for evaluating the fairness of instruments, and finally we present three scenarios illustrating how the process of selecting assessment instruments can work in a variety of early childhood care and educational assessment circumstances. Many tests and other assessment tools are poorly designed.
From page 182...
... A special kind of validity evidence relates to the consistency of the assessment -- this may be consistency over repeated assessment or over different versions or forms of the assessment. This is termed reliability evidence.
From page 183...
... It is not always easy, however, to identify a suitable or adequate criterion. When one considers criterion-related validity evidence, for example, the size of the correlation between test scores and the criterion can differ across settings, contexts, or populations, suggesting that a measure be validated separately for every situation, context, or population for which it may be used.
From page 184...
... The content model of validation seeks to provide a basis for validation without appealing to external criteria. The process of establishing content validity involves establishing a rational link between the procedures used to generate the test scores and the proposed interpretation or use of those scores (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999; Cronbach, 1971; Kane, 2006)
From page 185...
... Other forms of validity evidence -- such as empirical evidence based on relationships between scores and other variables -- are also essential. The current shift in emphasis toward learning standards and aligned assessments does not alter this necessity for additional forms of validity evidence, and the growing consequences of assessments increase the importance of empirical evidence (Koretz and Hamilton, 2006)
From page 186...
... Following the Standards, we use the term "construct" more broadly as "the concept or characteristic that a test is designed to measure" (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999, p.
From page 187...
... An interpretive argument specifies the proposed interpretations and uses of test results. This argument consists of articulating the inferences and assumptions that link the observed behavior or test performance to the conclusions and decisions that are to be based on that behavior or performance.
From page 188...
... The interpretive argument may also involve highly technical inferences and assumptions (e.g., scaling, equating)
From page 189...
... for the developers to have a confirmationist bias since they are trying to make the assessment system as good as it can be. The development stage thus has a legitimate confirmationist bias: its purpose is to develop an assessment procedure and a plausible interpretive argument that reflects the proposed interpretations and uses of test scores.
From page 190...
... If the development stage has not delivered an explicit, coherent, detailed interpretive argument linking observed behavior or performance to the proposed interpretation and uses, then the development stage is considered incomplete, and thus a critical evaluation of the proposed interpretation is premature (Kane, 2006)
From page 191...
... In a very fundamental sense, as is the case in science, one never "proves" or "confirms" the assessment hypothesis -- rather, the successful assessment hypothesis is tested and escapes being disconfirmed. (The term assessment hypothesis is used here to refer to the hypothesis that specifies what the intended meaning of the observed score is, i.e., what the assessment instrument is intended to measure.)
From page 192...
... To compose the evidence based on an assessment's content, the measurer must engage in "an analysis of the relationship between a test's content and the construct it is intended to measure" (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999, p.
From page 193...
... We refer to this internal structure as the construct. This is what has been described above in the section on construct validity.
From page 194...
... Rather than using assessment instruments to evaluate the effectiveness of interventions, psychometricians use interventions as one means to evaluate the validity of assessments. For example, evidence of validity for a specific instrument of social skills is obtained when intervention effects on that instrument emerge from interventions designed to improve social skills.
From page 195...
... They may also have a corrupting influence, since the motivation to misuse or misrepresent test scores can be compelling.
From page 196...
... Such impact, if it occurs, may not in itself diminish the validity of an assessment score, although it raises issues surrounding test use. If, however, a consequence of an assessment results from a threat to assessment validity -- for example, construct-irrelevant variance, such as children's language skills affecting their performance on a test intended to measure only quantitative reasoning, so that English language learners score lower as a group than other children on that test -- then the social consequence is clearly linked to validity.
From page 197...
... The program may or may not specify targets for attaining particular developmental levels on its intended outcomes. If the program has specific developmental outcome targets, then questions that should be asked in relation to the assessment instrument include (a)
From page 198...
... for polytomous responses. As described above, there are many sources of measurement error beyond a single administration of an instrument. Each such source could be the basis for calculating a different reliability coefficient.
From page 199...
... The two alternate copies of the instrument are administered, and the two sets of scores are then correlated to produce the alternate forms reliability coefficient. This coefficient is particularly useful as a means of evaluating the consistency with which the test has been developed.
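The correlational procedure just described is simple to carry out. The sketch below is a minimal illustration, not drawn from the report, of computing an alternate-forms reliability coefficient as the Pearson correlation between children's scores on the two forms; the scores shown are hypothetical.

```python
import numpy as np

def alternate_forms_reliability(form_a_scores, form_b_scores):
    """Pearson correlation between scores on two alternate forms of a test,
    taken here as the alternate-forms reliability coefficient."""
    a = np.asarray(form_a_scores, dtype=float)
    b = np.asarray(form_b_scores, dtype=float)
    return np.corrcoef(a, b)[0, 1]

# Hypothetical scores for the same ten children on the two forms
form_a = [12, 15, 9, 20, 14, 11, 18, 16, 10, 13]
form_b = [13, 14, 10, 19, 15, 12, 17, 18, 9, 14]

print(f"Alternate-forms reliability: {alternate_forms_reliability(form_a, form_b):.2f}")
```

The same correlational logic underlies the other reliability coefficients mentioned above; what changes is the source of measurement error (occasions, raters, forms) over which consistency is examined.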
From page 200...
... It may be better to investigate false positive and false negative rates in a region near the cut score.

Measurement Choices: Direct Assessment and Observation-based Assessment

Choosing what type of assessment to use is a critical decision for the design of an early childhood program evaluation or an accountability system.
From page 201...
... Direct assessments, however, have been used far more frequently in large-scale research projects, such as the Early Childhood Longitudinal Study; program evaluations, such as the evaluation of Early Head Start; and accountability efforts, such as the Head Start National Reporting System. Consequently, more is known about both the strengths and weaknesses of this approach.
From page 202...
... • Measurement error may not be randomly distributed across programs if some classrooms typically use more direct questioning, like that found in a standardized testing situation. These problems may not be shown in traditional ways of assessing validity, which compare children's performance on one type of direct assessment with their performance on a similarly structured test -- so-called external validity evidence.
From page 203...
... (Assessors of direct assessments need to be trained as well, but the protocol may be more straightforward.) • The assessment needs to contain well-defined rubrics and scoring guides.
From page 204...
... To ensure the outcomes data were valid and reliable, the evaluation team provided initial, booster, and follow-up training until mastery was reached; supervised caregiver assessments during a set week each quarter; and once a year conducted random, authentic assessments on children as a concurrent validation of teacher and parent assessments. Although we have presented direct assessments and observation-based assessments as distinct choices in the paragraphs above, a more recent perspective sees them as constituting different parts of an assessment system or net (Wilson, 2005; Wilson and Adams, 1996)
From page 205...
... The judicious deployment of such a combination allows the different assessment types to "bootstrap" one another in terms of validity, going a long way to helping establish (a) whether the direct assessments did indeed suffer from problems of unfamiliarity and (b)
From page 206...
... That is, the items should show no evidence of bias due to DIF (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999, p.
From page 207...
... Second, one must be careful to distinguish between DIF and item bias. For one thing, it is possible that a test may include two items that exhibit DIF between two groups, but in opposite directions, so that they tend to "cancel out." Also, DIF may not always be a flaw, since it could be due to "a kind of multidimensionality that may be unexpected or may conform to the test framework" (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999, p.
From page 208...
... children (at the same overall cognitive development status) to get the comparative item correct.
From page 209...
... -- that is, for respondents at the same level of cognitive development, approximately 1 U.S. child for every 3 Chinese children would be predicted to get the item correct.
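The roughly one-to-three disparity described here can be made concrete with a simple item response model. The sketch below is a minimal illustration, assuming a Rasch-type (one-parameter logistic) model and a hypothetical DIF shift of about 1.6 logits between the two group calibrations chosen to reproduce that disparity; it is not the analysis used in the studies the report draws on.

```python
import math

def p_correct(ability, difficulty):
    """Rasch (one-parameter logistic) model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical calibrations: the item is about 1.6 logits harder for group A
# than for group B, a DIF shift chosen only to illustrate the disparity.
ability = 0.0             # children matched at the same developmental level
difficulty_group_b = 0.0  # calibration for group B (hypothetical)
difficulty_group_a = 1.6  # calibration for group A (hypothetical)

p_b = p_correct(ability, difficulty_group_b)
p_a = p_correct(ability, difficulty_group_a)
print(f"Group B: {p_b:.2f}, Group A: {p_a:.2f}, "
      f"roughly 1 group A child correct for every {p_b / p_a:.0f} group B children")
```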
From page 210...
... Note that a technical solution is available here -- the measurer can use the two different calibrations for the two groups, but this is seldom a chosen strategy, as it involves complex issues of fairness and interpretation.

Validity Generalization

As described earlier, the validity of inferences based on test data is critical to the selection and use of tests. Test users need to know the extent to which the inferences that they make on the basis of test performance would still apply if the child had been tested by a different examiner, on a different day of the week, in a different setting, using an alternate form of the same test, or even using a different assessment of the same skill or ability.
From page 211...
... Interest often centers on the role of specific domains of assessment in predicting job performance more than on the validity evidence for specific tests. However, the techniques of validity generalization can also be used to study the validity evidence for specific tests and the use of specific tests in different populations.
From page 212...
... Practically speaking, this is based on a test to determine that the variability in observed validity coefficients is not different from zero once sampling error and other methodological artifacts have been controlled. Thus, the inference of validity generalization is tantamount to accepting a null hypothesis in statistical hypothesis testing.
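The logic of this comparison can be illustrated with a small calculation. The sketch below is a minimal example in the spirit of the widely used Hunter-Schmidt procedure: it compares the observed variance of validity coefficients across studies with the variance expected from sampling error alone. The correlations and sample sizes are hypothetical, and actual validity-generalization analyses typically correct for additional artifacts such as range restriction and criterion unreliability.

```python
import numpy as np

# Hypothetical validity coefficients (r) and sample sizes (n) from five studies
r = np.array([0.30, 0.25, 0.35, 0.28, 0.33])
n = np.array([120, 85, 200, 150, 95])

# Sample-size-weighted mean validity coefficient
r_bar = np.sum(n * r) / np.sum(n)

# Observed (weighted) variance of the coefficients across studies
var_obs = np.sum(n * (r - r_bar) ** 2) / np.sum(n)

# Variance expected from sampling error alone (a common approximation)
var_error = (1 - r_bar ** 2) ** 2 / (np.mean(n) - 1)

# Residual variance near zero is the usual basis for a claim of validity
# generalization: observed variability is consistent with sampling error alone.
var_residual = max(var_obs - var_error, 0.0)

print(f"mean r = {r_bar:.3f}, observed variance = {var_obs:.5f}, "
      f"sampling-error variance = {var_error:.5f}, residual = {var_residual:.5f}")
```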
From page 213...
... If the number of studies in the meta-analysis is small, if the variability in the validity coefficients is small, or if the sample sizes of the included studies are small, power may be low for the test of variability in the validity coefficients. A complete discussion of the validity generalization literature or the use of meta-analysis to study validity generalization is beyond the scope of this volume.
From page 214...
... Finally, we consider the situation in which the local school board of a large urban school district has decided to incorporate child assessments into its evaluation of the district's new preschool initiative aimed at improving children's school readiness, socioemotional development, and physical health. All of the scenarios are fictitious and any resemblance to actual people or programs is entirely coincidental.
From page 215...
... She is especially concerned to know the overall language skills, not just the English language skills, of the English language learners. This will help her teachers provide the necessary visual and linguistic supports to their children and opportunities to develop language skills through their interactions with the teacher, the environment, and the other children, as well as to measure their progress. (See Appendix D for a list and descriptions of useful sources of information on instruments.)
From page 216...
... They discuss who will administer and score the assessments, who will interpret the assessments, what specific decisions will be made on the basis of the assessment results, when these decisions will need to be made and how often they will be reviewed and possibly revised, which children will participate in the assessments, and what the characteristics of these children are: their ages, their race/ethnicity, their primary language, their socioeconomic status, and other aspects of their background and culture that might affect the assessment of their language skills.
From page 217...
... They arrange the criteria in the following order: (1) measures some or all of the language skills of interest, (2)
From page 218...
... (Two committees confronted with the same information may make different decisions about the disposition of such tests, and there is no single right answer to the number of tests to consider for more detailed review.) Thus, at this stage there are at least three groups of assessments: those for which additional review information will be sought, those that have been clearly rejected because of one or more "No" responses on the primary dimensions, and those that are seen as less desirable than tests in the top group but that are nevertheless not clearly rejected.
From page 219...
... The review materials on each test will be examined to ensure that the test supports the kinds of inferences that Ms. Conway and her teachers wish to make about their children's language skills and development.
From page 220...
... For example, a test that appears strong on all other criteria may have no information on its functioning for language-minority children. Specifically, the published information may not discuss the issue of test bias, and there may be no normative information or validity studies that focus on the use of the test with this population.
From page 221...
... This review will typically involve a thorough and direct examination of test items and administration procedures, review of the rationale behind the format of the test and the construction of test items, and a complete reading of the administration guidelines and scoring procedures and information on the interpretation of test scores. The committee may also
From page 222...
... Selecting Tests for Multiple Related Entities

In this scenario we consider a consortium of early childhood programs that seeks to establish an assessment system to guide instructional decisions that can be used across all programs in the consortium. The process is similar in many respects to the process followed by Ms.
From page 223...
... It is critical that the committee that will clarify the purposes of assessment, gather and review test information, and ultimately select the test be expanded to include representation from across the consortium. It may not be possible for every member program to have a representative on the committee, but some process should be put in place to ensure that the differing needs and populations across the member programs of the consortium are adequately represented.
From page 224...
... This list is not exhaustive, but it highlights some of the additional challenges that arise when more than one entity is involved in the testing enterprise. Another major difference between the current scenario and the Honeycomb scenario is the focus on using assessment results to guide instructional decisions.
From page 225...
... Finally, unlike the Honeycomb scenario, which focused on status at entry relative to national norms, the focus on using assessment to guide instruction suggests that the members of the consortium might well be interested in, and best be served by, a locally developed assessment. To the extent that the standards and instructional decisions are mostly local, it is far more likely that a locally developed assessment, tailored to reflect local standards and approaches to instruction, will meet the needs of the consortium.
From page 226...
... Because the process of gathering information, reviewing it, and selecting among the tests is essentially the same as in the first scenario, that information is not repeated here.

Selecting Tests in a Program Evaluation Context

Finally, we consider Novatello School District, a large urban school district in which the school board has decided to incorporate child assessments into the evaluation of its new preschool initiative, which is aimed at improving children's school readiness, socioemotional development, and physical health.
From page 227...
... Reliance on child assessments in program evaluations carries an explicit assumption that differences between programs in child outcomes at the end of the year can be attributed to differences in
From page 228...
... Failure to account for such differences will negatively affect the validity of inferences about differences in program quality that are based on differences in child outcomes. In the current context, two factors that could affect the validity of inferences about programs based on child assessment results are the primary language of the child and the language of instruction used in the preschool program.
From page 229...
... They must consider whether they are collecting child assessments for purposes other than program evaluation, such as to assess the different educational needs of entering children, to monitor learning and progress, and to make instructional decisions regarding individual children. If their singular purpose is program evaluation, then it is not necessary to assess all children at all occasions; rather, a sampling strategy could be employed to reduce the burden of the assessment on children and programs, while still ensuring accurate estimation of the entry characteristics of the child population and program outcomes.
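The kind of sampling strategy alluded to here can be sketched simply. The example below is a minimal, hypothetical illustration of drawing a random sample of children within one program and estimating the program-level mean outcome with its standard error; it is not a design prescribed by the report, and an actual design would also need to address stratification across classrooms, consent, and nonresponse.

```python
import random
import statistics

def estimate_program_mean(child_scores, sample_fraction=0.5, seed=0):
    """Estimate a program's mean outcome from a random sample of its children."""
    rng = random.Random(seed)
    k = max(2, int(len(child_scores) * sample_fraction))
    sampled = rng.sample(child_scores, k)
    mean = statistics.mean(sampled)
    se = statistics.stdev(sampled) / (len(sampled) ** 0.5)
    return mean, se

# Hypothetical assessment scores for all children enrolled in one program
all_scores = [88, 92, 75, 81, 95, 70, 84, 90, 78, 86, 73, 89]
mean, se = estimate_program_mean(all_scores, sample_fraction=0.5)
print(f"Estimated program mean: {mean:.1f} (standard error {se:.1f}) from a 50% sample")
```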
From page 230...
... Unlike the consortium context, in which aggregation of data and centralized reporting were an option to be discussed and decided on by the members of the consortium, the program evaluation context by definition requires that child assessment results will flow to a centralized repository and reporting authority. Precisely what information will be centralized and stored and the process whereby such information will flow to the central agency can be a matter of discussion, but clearly there must be some centralization of child assessment results.
From page 231...
... Regular review of the stated purposes of assessment, along with regular review of the strengths and weaknesses of the assessment system and consideration of alternatives -- some of which may not have been available at the time of the previous review -- can ensure that the individual assessments and the entire assessment system remain effective and efficient for meeting the organization's current purposes. If the process for selecting tests in the first place is rigorous and principled, the review and evaluation process will be greatly simplified.

