
5 Measurement Considerations
Pages 93-106

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text from each page of the chapter.


From page 93...
... Packet Tracer, and the scenario-based learning strategy described by Louise Yarnall -- are designed primarily for formative purposes. That is, the assessment results are used to adapt instruction so that it best meets learners' needs.
From page 94...
... A number of assessments used for high-stakes decisions were discussed by workshop presenters, including the Multistate Bar Exam used to award certification to lawyers, the situational judgment test used for admitting Belgian students to medical school, the tests of integrity used for hiring job applicants, and some of the assessment center strategies used to make hiring and promotion decisions. For the workshop, the committee arranged for two presentations to focus on technical measurement issues, particularly as they relate to high-stakes uses of summative assessments.
From page 95...
... The construct definition helps the test developer determine how to measure it. She cautioned that for the skills covered in this workshop, developing such definitions and operationalizing them to produce test items can be challenging.
From page 96...
... Response modes might include choosing from among a set of provided options, providing a brief written answer, providing a longer written answer such as an essay, providing an oral answer, performing a task or demonstrating a skill, or assembling a portfolio of materials. Response modes are typically categorized as "selected-response" or "constructed-response," and constructed-response items are further categorized as "short-answer constructed-response," "extended-answer constructed-response," and "performance-based tasks." Response modes also include behavior checklists, such as those described by Candice Odgers to assess conduct disorders, which may be completed by the test taker or by an observer.
From page 97...
... Despite the resources required, several currently operating large standardized testing programs make use of performance-based tasks. As described in Chapter 2, the Multistate Bar Exam includes a performance-based component with a written response and is administered to approximately 40,000 candidates each year.
From page 98...
... A third example is the portfolio component of the assessment used to award advanced-level certification for teachers by the National Board for Professional Teaching Standards (NBPTS).
From page 99...
... Reliability concerns the precision of test scores, and, as explained in more detail later in this section, the level of precision needed depends on the intended uses of the scores and the consequences associated with those uses (see also American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999, pp.
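As a rough illustration of what score precision means here (a standard textbook relationship, not a figure presented at the workshop): the standard error of measurement can be estimated as SEM = SD × √(1 − reliability). With a score standard deviation of 10 points and a reliability coefficient of .90, the SEM is about 3.2 points, so an observed score could plausibly shift by several points on retesting. Whether that margin is acceptable depends on how the score will be used; a high-stakes pass/fail decision near a cut score demands more precision than formative feedback does.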
From page 100...
... When humans score examinee responses, they must make subjective judgments based on comparing the scoring guide and criteria to a particular test taker's performance. This introduces the possibility of scoring error associated with human judgment, and it is important to estimate the impact of this source of error on test scores.
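As a concrete illustration of estimating rater-associated error (a hypothetical sketch, not a procedure described at the workshop; the raters, scores, and function names are invented), the snippet below computes two common agreement indices for a pair of human scorers: simple percent agreement and Cohen's kappa, which corrects for agreement expected by chance.

    # Hypothetical sketch: how much do two human raters agree when
    # scoring the same constructed responses? Data below are invented.
    from collections import Counter

    def percent_agreement(scores_a, scores_b):
        # Proportion of responses given the same score by both raters.
        matches = sum(a == b for a, b in zip(scores_a, scores_b))
        return matches / len(scores_a)

    def cohens_kappa(scores_a, scores_b):
        # Chance-corrected agreement between two raters.
        n = len(scores_a)
        observed = percent_agreement(scores_a, scores_b)
        # Agreement expected if the raters scored independently at their base rates.
        counts_a, counts_b = Counter(scores_a), Counter(scores_b)
        expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
        return (observed - expected) / (1 - expected)

    # Two hypothetical raters scoring ten essays on a 0-4 rubric.
    rater_1 = [3, 2, 4, 1, 3, 2, 0, 4, 3, 2]
    rater_2 = [3, 2, 3, 1, 3, 2, 1, 4, 3, 3]

    print(percent_agreement(rater_1, rater_2))       # 0.7
    print(round(cohens_kappa(rater_1, rater_2), 2))  # 0.6

Low values on such indices would signal that rater judgment is contributing substantial error to the reported scores, which matters most when those scores feed high-stakes decisions.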
From page 101...
... Additional information about classification consistency can be found in the Standards (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999, p.
From page 102...
... Others may find inappropriate shortcuts that work to invalidate the test results, such as finding out the test questions beforehand, copying from another test taker, or bringing disallowed materials, such as study notes, into the test administration. These types of behaviors can produce scores that are not accurate representations of the students' true skills.
From page 103...
... A related issue is construct-irrelevant variance. Problems with construct-irrelevant variance occur when something about the test questions or administration procedures interferes with the assessment of the intended construct.
From page 104...
... The Standards (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999, pp.
From page 105...
... Is it more important to have authentic test items or to meet high reliability standards? Test developers often face competing priorities and will need to make trade-offs.

