Evaluation of the Voluntary National Tests, Year 2: Final Report (1999)


3
Item Quality and Readiness

The primary focus of this section is the extent to which the VNT test items are likely to provide useful information to parents, teachers, students, and others about whether students have mastered the knowledge and skills specified for basic, proficient, or advanced performance in 4th-grade reading and 8th-grade mathematics. The information provided by any set of items will be useful only if it is valid, meaning that the items measure the intended areas of knowledge and do not require extraneous knowledge or skills. In particular, test items should not require irrelevant knowledge or skills that might be more available to some ethnic, racial, or gender groups than to others: that is, they should not be biased. Test information also will be useful only if it is reliable, meaning that a student taking alternate forms of the test on different occasions is very likely to achieve the same result.

The committee's review of the quality of the VNT items thus addresses four of Congress' charges for our evaluation: (1) the technical quality of the items; (2) the validity, reliability, and adequacy of the items; (4) the degree to which the items provide valid and useful information to the public; and (5) whether the test items are free from racial, cultural, or gender bias. The NRC's Phase I report (National Research Council, 1999b) included only a very limited evaluation of item quality. No empirical data on item functioning were available, and, indeed, none of the more than 3,000 items that had been written had been through the contractor's entire developmental process or NAGB's review and approval process. Our review of items in relatively early stages of development suggested that considerable improvement was possible, and the contractor's plans called for procedures that made further improvements likely.

This review of VNT items initially addressed two general questions related to item quality:

  1. Does it seem likely that a sufficient number of items will be completed in time for inclusion in a spring 2000 pilot test?

  2. Are the completed items judged to be as good as they can be prior to the collection and analysis of pilot test data? Are they likely to provide valid and reliable information for parents and teachers about students' reading or math skills?


In addressing these questions, the committee was led to two additional questions relating to item quality:

  1. Do the NAEP descriptions of performance for each achievement level provide a clear definition of the intended domains of test content?

  2. How completely will the items selected for each test form cover the intended test content domains?

To answer these questions, the committee reviewed the following documents from NAGB and the prime contractor, American Institutes for Research (AIR):

  • Reading and math test specification matrices (National Assessment Governing Board, 1998b; 1998c)

  • Report on the Status of the Voluntary National Tests Item Pools (American Institutes for Research, 1999f)

  • Flowchart of VNT New Item Production Process (American Institutes for Research, 1999d)

  • VNT: Counts of Reading Passages Using Revised Taxonomies, June 24, 1999 (American Institutes for Research, 1999k)

  • Final Report of the Study Group Investigating the Feasibility of Linking Scores on the Proposed VNT and NAEP (Cizek et al., 1999)

  • VNT in Reading: Proposed Outline for the Expanded Version of the Test Specifications (American Institutes for Research, 1999n)

  • VNT in Mathematics: Proposed Outline for the Expanded Version of the Test Specifications (American Institutes for Research, 1999m)

  • Cognitive Lab Report: Lessons Learned (American Institutes for Research, 1999a)

  • Training Materials for VNT Protocol Writing (American Institutes for Research, 1999j)

  • VNT: Report on Scoring Rubric Development (American Institutes for Research, 1998o)

  • Cognitive Lab Report (American Institutes for Research, 1998d)

  • VNT Interviewer Training Manual (American Institutes for Research, 1999o)

  • Technical Specifications, Revisions as of June 18, 1999 (American Institutes for Research, 1999i)

In addition, committee and staff members examined item folders at the contractor's facility and reviewed information on item status provided by AIR in April. During our April meeting, committee members and a panel of additional reading and mathematics assessment experts reviewed and rated samples of 120 mathematics items and 90 reading items. Updated item status data, including more specific information on the new items being developed during 1999, were received in July and discussed at our July meeting. The committee's review of item quality did not include separate consideration of potential ethnic or gender bias. The contractor's process for bias review in year 1 was reviewed in the Phase I report (National Research Council, 1999b) and found to be satisfactory, and no new bias reviews have been conducted. (The committee does have suggestions in Chapter 4 for how pilot test data might be used in empirical tests of ethnic and gender bias.)

The remainder of this chapter describes the committee's review, findings, and recommendations relative to each of the four item quality questions listed above.


ITEM DEVELOPMENT

As noted above, the committee reviewed item development status at two different times in 1999. In April we received information on the status of items that were developed in prior years for use in selecting a sample of completed items for our review. In July we received updated information, including information on the new items written in 1999 to supplement the previous item pool.

Item Status as of April 1999

The VNT Phase I evaluation report suggested a need for better item tracking information. At our February 1999 workshop, the contractor presented plans for an improved item status tracking system (American Institutes for Research, 1999f). We subsequently met with NAGB and the contractor's staff to make arrangements for identifying and obtaining access to the items needed for our review. The contractor provided additional information on the item tracking database and a copy of key information in the database for our use in reviewing the overall status of item development and in selecting a specific sample of items for review. We also visited the contractor facilities and were allowed access to the system for storing hard-copy results of the item development and review for each item. We examined the item folders for a small sample of items and found that the information was generally easily found and well organized.

Our primary concern in examining the item status information was to determine how far along each item was in its development process and how far it had yet to go. We were interested in identifying a sample of completed items so that we could assess the quality of items that had been through all of the steps in the review process. We also wanted to assess whether it was likely that there would be a sufficient number of completed items in each content and format category in time for a spring 2000 pilot test.

The contractor suggested that the most useful information about item status would be found in two key fields in the database for each item. The first field indicated whether consensus had been reached in matching the item to NAEP achievement levels: if this field was blank, the item had not been included in the achievement-level matching and was not close to being completed. The second field indicated whether the item had been through a "scorability review" and, if so, whether further edits were indicated. The scorability review is a separate step in the contractor's item development process that involves expert review of the scoring rubrics developed for open-ended items to identify potential ambiguities in the rules for assigning scores to them. A third key field was added to the database, at our request, to indicate whether or not the item had been reviewed and approved by NAGB's subject area committees.
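
The committee's use of these fields can be pictured with a brief sketch. The field names and the order of the checks below are hypothetical stand-ins for the contractor's actual database schema; the sketch simply shows how the flags described above could be combined into the completion categories used in Tables 3-1 and 3-2.

```python
# Illustrative sketch only: field names and precedence are assumptions,
# not the contractor's actual item-tracking schema.
from collections import Counter

def completion_category(item: dict) -> str:
    """Classify one item record into a Table 3-1/3-2 style status column."""
    if not item.get("achievement_level_match"):   # blank field: never matched to levels
        return "awaiting achievement-level matching"
    if item.get("scorability_edits_needed"):      # open-ended scoring rubric needs edits
        return "awaiting scoring edits"
    if item.get("in_1999_cog_labs"):              # scheduled for 1999 cognitive labs
        return "in 1999 cognitive labs"
    if not item.get("nagb_approved"):             # third field added at the committee's request
        return "awaiting NAGB review"
    return "fully ready"

def status_counts(items: list[dict]) -> Counter:
    """Tally items by format, content strand, and completion category."""
    return Counter(
        (item["format"], item["strand"], completion_category(item)) for item in items
    )
```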

The committee reviewed the revised database to determine the number of items at various levels of completeness for different categories of items. Table 3-1 shows levels of completeness for mathematics items by item format and content strand. Table 3-2 shows the same information for reading items, by stance and item format. As of April 1999, only one-sixth (16.6%) of the required mathematics items and one-eighth (12.3%) of the required reading items were completed. In addition, at least 161 new mathematics items were required to meet item targets for the pilot test. The contractor indicated that 200 new mathematics items were being developed in 1999; however, they could not, at that time, give us an exact breakdown of the number of new items targeted for each content and item format category.

For reading, the situation is more complicated. Current plans call for 72 passages to be included in the pilot test. Each passage will be included in two distinct pilot test forms, with a different set of questions about the passage in each form. This design will increase the probability that at least one set (or perhaps a composite of the two different sets) will survive item screening in the pilot test. As of April, there were no passages for which both item sets had completed the review and approval process. Table 3-3 shows the number of passages at each major stage of review and development, the number of passages for which additional items will be needed, and the number of additional passages that will be needed. One further issue in reading is that many of the passages have word counts that are outside the length limits indicated by the test specifications. In most cases, these discrepancies are not large, and NAGB may elect to expand the limits to accommodate these passages. Alternatively, NAGB might elect to enforce limits on the total length of all passages in a given test form, allowing somewhat greater variation in the length of individual passages than is implied by current specifications. Strict adherence to current length limits would mean that considerably more passage selection and item development would be needed in reading.

TABLE 3-1 Mathematics Item Status (as of April 1999)

| Item Format^a | Content Strand | Needed for Pilot | Fully Ready | Awaiting NAGB Review | Awaiting Ach. Level Matching | In 1999 Cog Labs | Awaiting Scoring Edits | Total Items Written | Items Needed |
|---|---|---|---|---|---|---|---|---|---|
| ECR | Algebra and functions | 18 | 1 | 0 | 0 | 0 | 6 | 7 | 11 |
| | Geometry and spatial sense | 18 | 0 | 1 | 0 | 3 | 4 | 8 | 10 |
| | Other | None | 1 | 1 | 0 | 5 | 13 | 20 | 0 |
| | Subtotal | 36 | 2 | 2 | 0 | 8 | 23 | 35 | 21 |
| SCR/3 points | Algebra and functions | 18 | 6 | 1 | 0 | 4 | 15 | 26 | 0 |
| | Data analysis, statistics, and probability | 18 | 1 | 5 | 0 | 11 | 8 | 25 | 0 |
| | Geometry and spatial sense | 18 | 0 | 2 | 0 | 8 | 16 | 26 | 0 |
| | Measurement | 18 | 8 | 10 | 1 | 13 | 9 | 41 | 0 |
| | Number | 36 | 7 | 10 | 1 | 11 | 14 | 43 | 0 |
| | Subtotal | 108 | 22 | 28 | 2 | 47 | 62 | 161 | 0 |
| SCR/2 points | Algebra and functions | 18 | 1 | 1 | 0 | 1 | 1 | 4 | 14 |
| | Data analysis, statistics, and probability | 18 | 0 | 6 | 0 | 2 | 1 | 9 | 9 |
| | Geometry and spatial sense | 18 | 2 | 4 | 0 | 4 | 7 | 17 | 1 |
| | Measurement | None | 2 | 4 | 0 | 4 | 1 | 11 | 0 |
| | Number | 18 | 1 | 2 | 0 | 3 | 1 | 7 | 11 |
| | Subtotal | 72 | 6 | 17 | 0 | 14 | 11 | 48 | 35 |
| GR | Algebra and functions | None | 1 | 7 | 1 | 1 | 0 | 10 | 0 |
| | Data analysis, statistics, and probability | 18 | 6 | 21 | 2 | 4 | 0 | 33 | 0 |
| | Geometry and spatial sense | 18 | 5 | 21 | 1 | 2 | 0 | 29 | 0 |
| | Measurement | 36 | 0 | 14 | 5 | 7 | 0 | 26 | 10 |
| | Number | 36 | 5 | 25 | 1 | 3 | 0 | 34 | 2 |
| | Subtotal | 108 | 17 | 88 | 10 | 17 | 0 | 132 | 12 |
| MC | Algebra and functions | 198 | 26 | 99 | 15 | 4 | 0 | 144 | 54 |
| | Data analysis, statistics, and probability | 108 | 11 | 71 | 1 | 4 | 0 | 87 | 21 |
| | Geometry and spatial sense | 126 | 38 | 64 | 1 | 5 | 0 | 108 | 18 |
| | Measurement | 126 | 11 | 137 | 1 | 8 | 0 | 157 | 0 |
| | Number | 198 | 46 | 222 | 1 | 13 | 0 | 282 | 0 |
| | Subtotal | 756 | 132 | 593 | 19 | 34 | 0 | 778 | 93 |
| Total | | 1,080 | 179 | 728 | 31 | 120 | 96 | 1,154 | 161 |

^a ECR = extended constructed response; SCR = short constructed response; GR = gridded; MC = multiple choice.


TABLE 3-2 Reading Item Status (as of April 1999)

| Items | Needed for Pilot | Fully Ready | NAGB Review | Cognitive Labs | Scoring Rubric Edits | Total Written | New Items Needed^a |
|---|---|---|---|---|---|---|---|
| By Stance | | | | | | | |
| Initial understanding | 130 | 15 | 125 | 29 | 6 | 175 | |
| Develop interpretation | 572 | 77 | 597 | 62 | 42 | 778 | |
| Reader-text connection | 108 | 5 | 67 | 23 | 29 | 124 | |
| Critical stance | 270 | 36 | 219 | 33 | 27 | 315 | |
| Subtotal | 1,080 | 133 | 1,008 | 147 | 104 | 1,392 | 0 |
| By Item Format^b | | | | | | | |
| ECR | 48 | 1 | 23 | 19 | 31 | 74 | |
| SCR | 192 | 20 | 150 | 53 | 55 | 278 | |
| MC | 840 | 112 | 835 | 75 | 18 | 1,040 | |
| Subtotal | 1,080 | 133 | 1,008 | 147 | 104 | 1,392 | 0 |

^a See text and Table 3-3.

^b ECR = extended constructed response; SCR = short constructed response; MC = multiple choice.

TABLE 3-3 Reading Passage Review Status (as of April 1999)

| Passage Type | Completed NAGB Review: Both Sets | Completed NAGB Review: One Set | Completed NAGB Review and Edits: Both Sets | Completed NAGB Review and Edits: One Set | Needs More Items: Both Sets | Needs More Items: One Set | Total Passages Written | Passages with Length Issues^a | Additional Passages Needed |
|---|---|---|---|---|---|---|---|---|---|
| Long literary | 2 | 5 | 11 | 5 | 7 | 5 | 23 | 3 | 0 |
| Medium literary | 0 | 3 | 8 | 2 | 0 | 2 | 10 | 0 | 2 |
| Short literary^b | 6 | 0 | 10 | 1 | 0 | 1 | 11 | 7 | 1 |
| Medium information^c | 0 | 9 | 9 | 5 | 0 | 5 | 14 | 11 | 0 |
| Short information | 5 | 3 | 11 | 0 | 0 | 0 | 11 | 10 | 1 |
| Total | 13 | 20 | 49 | 13 | 7 | 13 | 69 | 31 | 4 |

^a The seven long literary passages needing more items for both sets appear to have been developed as medium literary passages.

^b One short literary passage is too short (< 250 words) and six are between short and medium length. All of the short information passages with length problems are between 300 and 350 words, which is neither short nor medium. Two additional short information passages were classified as medium information due to length, but they have no paired passage or intertextual items.

^c Medium information entries are passage pairs plus intertextual questions.


Updated Status, Including New Items

NAGB commissioned a group of scholars, designated as the Linkage Feasibility Team (LFT), to provide advice on how best to link scores on the VNT to the NAEP score scale and achievement level cutpoints (see discussion in Section 4). The LFT report, which was presented to NAGB at its May 1999 meeting, included a number of recommendations for changing the VNT test and item specifications to increase consistency with NAEP. For reading, the report recommended:

  • increasing passage lengths;

  • using text mapping procedures to ensure reading questions assess appropriate skills, not just surface level information;

  • including more constructed response questions; and

  • editing reading passages to eliminate "choppiness."

For mathematics, the recommendations included:

  • increasing the number of constructed-response items to ensure that higher-order thinking skills are assessed;

  • making the decision about calculator use and about use of gridded and drawn-response items; and

  • redoing the content classifications of items.

Subsequently, AIR issued revised test specifications with updated counts of the number of items by content and format category to be included in each section of each test. The most significant change was that "gridded" items were eliminated from the mathematics tests because NAEP tryouts of this format type indicated that students had difficulties in filling out the grids appropriately. Gridded items developed for the VNT are being revised to be either 2- or 3-point constructed-response items, or distractors are being created to convert them into multiple-choice items. Other issues, most notably passage length limits, have not been fully resolved as this report is being completed, but further changes in the item and test specifications appear unlikely.

Mathematics

New information on the status of the mathematics items was received in July. The new file contained information on 202 items that were not included in the file received in April. Of these, 178 had "development year" set to 1999 and 24 had development year values of 1997 or 1998. One item from the April 1999 file had been dropped. In total, the number of active mathematics items had increased from 1,154 to 1,355.

The July file contained flags indicating which reviews had been completed, but it did not have information on the outcome of each review. In April, 217 items had been approved by NAGB "as is" and another 152 had been approved "with edits." Of the 217 not requiring further edits, 12 were scheduled for cognitive labs, and 26 had been flagged for edits in the scorability review, leaving 179 fully completed items (Table 3-1). The July file shows that 10 of the additional (pre-1999) items had been reviewed by NAGB, but the outcome of the review was not indicated. The April file also showed 5 items flagged as "drop" and 1 flagged as "revise and review again" in the NAGB review. These items are still on the current version of the file, but it is unclear whether they have been reviewed again.

Of the 1,355 active mathematics items in the July file, 179 were fully complete and 1,176 items required further review. At the August 1999 NAGB meeting, the contractor indicated that 1,100 mathematics items would be reviewed by NAGB's appropriate subject-area committee between September and November of 1999. This plan suggests that virtually all of the 1,344 currently active items that had not been fully approved were expected to survive remaining AIR reviews and pass to NAGB for its final review. Table 3-4 shows the distribution of the "currently active" items by content strand and item format, compared with the number required for the pilot test. These results are subject to change depending on NAGB decisions regarding test specifications and on how the gridded items are rewritten and reclassified.

Reading

The number of reading passages has been increased from 95 to 108; see Table 3-5. However, there is still a considerable lack of clarity about passage length requirements, with many of the medium-length information passages flagged as either too short or too long. It is likely that NAGB will consider the length of passage pairs, so that combining short and long passages may be acceptable. Also, six of the long literary passages were reclassified as information passages and as such are unusable under the current test specifications. Overall, there are 85 fully acceptable passages. This leaves a shortage of three passages in the medium literary category, but there are eight additional literary passages that are just a few words over the medium length limit.

The number of reading items has been increased from 1,392 to 1,848. The new items have not yet been extensively reviewed, so it is not possible to update the completion figures included in Table 3-2. Table 3-6 shows the number of active items by stance and item format. NAGB has reviewed all of the active passages and plans to review approximately 1,650 items between September and November 1999 in order to have 72 passages and a total of 1,104 appropriately distributed items for use in the pilot test.

In reviewing updated item information, the committee also noted that, as shown in Table 3-6, virtually all of the items designed to measure the "initial understanding" stance were multiple choice, while almost all of the items measuring the "reader/text" stance were constructed response. While this may be a logical approach, the committee has not seen a rationale for this differential use of item formats by reading stance and is not aware that this design has been specifically reviewed by reading content experts.

Findings and Recommendations

The item tracking system has been significantly improved since it was reviewed in the Phase I evaluation report (National Research Council, 1999b). Information on the new (1999) items and information on the results (or at least the occurrence) of various reviews for all items has been added to the database.


TABLE 3-4 Mathematics Item Status (as of July 7, 1999)

| Item Format^a | Content Strand | Needed for Pilot | Active as of April | Active as of July | Additional Needed |
|---|---|---|---|---|---|
| ECR | Algebra and functions | 18 | 7 | 9 | 9 |
| | Geometry and spatial sense | 18 | 8 | 10 | 8 |
| | Other | None | 20 | 21 | 0 |
| | Subtotal | 36 | 35 | 40 | 17 |
| SCR (3 points) | Algebra and functions | 18 | 26 | 41 | 0 |
| | Data analysis, statistics, and probability | 18 | 25 | 33 | 0 |
| | Geometry and spatial sense | 18 | 26 | 33 | 0 |
| | Measurement | 18 | 41 | 46 | 0 |
| | Number | 36 | 43 | 44 | 0 |
| | Subtotal | 108 | 161 | 197 | 0 |
| SCR (2 points)^b | Algebra and functions | 18 | 4 | 19 | 0 |
| | Data analysis, statistics, and probability | 36 | 9 | 46 | 0 |
| | Geometry and spatial sense | 18 | 17 | 46 | 0 |
| | Measurement | 18 | 11 | 39 | 0 |
| | Number | 36 | 7 | 45 | 0 |
| | Subtotal | 126 | 48 | 195^b | 0 |
| GR^b | Algebra and functions | None | 10 | 0 | 0 |
| | Data analysis, statistics, and probability | None | 33 | 0 | 0 |
| | Geometry and spatial sense | None | 29 | 0 | 0 |
| | Measurement | None | 26 | 0 | 0 |
| | Number | None | 34 | 0 | 0 |
| | Subtotal | None | 132 | 0 | 0 |
| MC | Algebra and functions | 180 | 144 | 199 | 0 |
| | Data analysis, statistics, and probability | 108 | 87 | 113 | 0 |
| | Geometry and spatial sense | 162 | 108 | 119 | 43 |
| | Measurement | 162 | 157 | 185 | 0 |
| | Number | 198 | 282 | 307 | 0 |
| | Subtotal | 810 | 778 | 923 | 43 |
| Total | | 1,080 | 1,154 | 1,355 | 60 |

^a ECR = extended constructed response; SCR = short constructed response; GR = gridded; MC = multiple choice.

^b All of the gridded items were combined with the 2-point SCR items. Some of these items may be converted to MC items; however, there would still be a shortage of at least 15 geometry and spatial items for MC and 2-point SCR combined.


TABLE 3-5 Count of VNT Reading Passages by Type and Length (July 1999)

| Type and Length | Total Needed | Total Written | Satisfactory Item Sets | Questionable Item Sets |
|---|---|---|---|---|
| Short literary | 12 | 13 | 13 | 0 |
| Medium literary | 12 | 17 | 9 | 8^a |
| Long literary | 12 | 13 | 13 | 0 |
| Short information | 12 | 25 | 20 | 5^a |
| Medium information pairs (1st of 2) | 12 | 17 | 15 | 2^b |
| Medium information pairs (2nd of 2) | 12 | 17 | 15 | 2 |
| Long information | 0 | 6 | 0 | 6^c |
| Total passages | 72 | 108 | 85 | 23 |

^a Word count exceeds the limit.

^b Too few extended constructed response or multiple choice items.

^c Passages previously classified as "long literary" passages.

TABLE 3-6 Reading Items by Stance and Format^a (as of July 1999)

| Stance | MC | SCR (2 points) | SCR (3 points) | ECR | Total | Needed for Pilot^b | Ratio^c |
|---|---|---|---|---|---|---|---|
| Initial understanding | 221 | 0 | 0 | 0 | 221 | 132.5 | 1.67 |
| Developing an interpretation | 789 | 116 | 52 | 45 | 1,002 | 585.1 | 1.71 |
| Reader/text interaction | 11 | 122 | 32 | 34 | 199 | 110.4 | 1.80 |
| Critical stance | 289 | 114 | 17 | 6 | 426 | 276.0 | 1.54 |
| Total | 1,310 | 352 | 101 | 85 | 1,848 | 1,104 | 1.67 |
| Needed for pilot test^d | 876 | 120 | 60 | 48 | 1,104 | | |
| Ratio^c | 1.50 | 2.93 | 1.68 | 1.77 | 1.67 | | |

^a ECR = extended constructed response; SCR = short constructed response; MC = multiple choice.

^b Distribution by stance is specified in the framework as initial understanding 12%; developing an interpretation 53%; reader/text interaction 10%; critical stance 25%.

^c Ratio = total/needed for pilot.

^d Distribution by format is based on the revised table of specifications distributed at the August 1999 NAGB meeting.

The committee is concerned, however, that the information in the database is not being used effectively by NAGB and its contractor. A key example of our concern is that the item development subcontractors were given specifications for additional items without reference to item bank information on shortages in specific content and format categories. As a consequence, it appears that the contractor will still be a few items short of goals for the pilot test in one or two of the mathematics item categories. For reading, the contractor has not been able (or has not been asked) to produce status counts that reflect the ties between items and passages. For each passage, NAGB will need to know where all of the associated items are in the review process. Currently, there is no field in the database for passages that shows whether one or both of the two distinct item sets have passed each review stage.

RECOMMENDATION 3.1 NAGB should require regular item development status reports that indicate the number of items at each stage in the review process by content and format categories. For reading, NAGB should also require counts at the passage level that indicate the status of passage reviews and the completeness of all of the associated items.

There are a large number of items scheduled for content, readability, achievement level, sensitivity, bias, and final NAGB review between August and November 1999. For each test, the contractor has developed more than the minimum number of required items in the event that some items do not survive all of these reviews. For the mathematics test, 1,080 of the current 1,344 items need to survive; for the reading test, 72 of the 126 current passages need to survive with two distinct item sets for use in the pilot. Plans are in place to complete each of the required review steps. In our interim report (National Research Council, 1999c), we recommended that the review process be accelerated to allow more time for AIR to respond to the reviews, and NAGB is now prepared to start its final review sooner than previously planned (September rather than November).

There is a sufficient overage of items for each test so that, assuming that the reviews are completed as scheduled, it should be possible to assemble 18 distinct forms of the mathematics test and 24 distinct forms of the reading test from the items surviving these reviews. Given that the number of mathematics items in some categories is already less than 18 times the number specified for each form, it is unlikely that each of the pilot test forms will exactly match the specifications for operational VNT forms, unless some items are included in multiple pilot test forms. In Chapter 4, we raise the question of whether additional extended constructed-response items should be included in the pilot test. Small shortages in the number of pilot test items in some item content and format categories might be tolerated or even planned for in order to accommodate potentially greater rates of item problems in other categories. However, the contractor has no basis for estimating differential rates at which items of different types will be dropped on the basis of pilot test results.

RECOMMENDATION 3.2 The rates at which each of the different item types survives each stage from initial content reviews through analyses of pilot test data should be computed. This information should be used in setting targets for future item development.

The contractor expects that, because of cognitive laboratory review, the survival rate for extended constructed-response items will be similar to that for other item types. Information from the current reviews and from the pilot test about the survival rates for different item types will provide both VNT and other test developers a better basis for estimating item survival rates in the future.
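
The computation called for in Recommendation 3.2 is straightforward; a minimal sketch follows. The stage names and record layout are illustrative assumptions rather than the contractor's tracking format; the point is simply that survival can be tallied for each item type at each successive review stage.

```python
# Sketch of the survival-rate computation in Recommendation 3.2. Stage names
# and the record format are assumptions made for illustration only.
from collections import defaultdict

STAGES = ["content review", "bias/sensitivity review", "NAGB review", "pilot test analysis"]

def survival_rates(items):
    """items: iterable of dicts with 'type' (e.g., 'MC', 'SCR', 'ECR') and
    'last_stage_passed' (index into STAGES; -1 if the item passed none)."""
    entered = defaultdict(int)
    survived = defaultdict(int)
    for item in items:
        for stage in range(len(STAGES)):
            if item["last_stage_passed"] >= stage - 1:   # item reached this stage
                entered[(item["type"], stage)] += 1
            if item["last_stage_passed"] >= stage:       # item passed this stage
                survived[(item["type"], stage)] += 1
    return {key: survived[key] / entered[key] for key in entered}
```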

ITEM QUALITY

Assessing the quality of the VNT items was central to the committee's charge. The committee conducted a thorough study of the quality of VNT items that had been completed, or were nearly completed, at the time of our April 1999 workshop. Our review involved sampling available items, identifying additional content experts to participate in reviewing the items, developing rating procedures, conducting the item rating workshop, and analyzing the resulting data. A brief description of each of these steps is presented here, followed by the committee's findings and recommendations. More complete details of our item quality study can be found in Hoffman and Thacker (1999).


Evaluation Process

Sampling Completed Items

Using the item status information available for April 1999, we selected items to review, seeking to identify a sample that closely represented the content and format requirements for an operational test form. To assure coverage of the item domains, we sampled twice as many items as required for a form. Our sample thus included 120 mathematics items and 12 reading passages with 90 reading items, plus a small number of additional items to be used for rater practice sessions. Within each content and item format category, we sampled first from items that had already been approved "as is" by the NAGB review; in some cases, we had to sample additional items that had not yet been reviewed by NAGB but had been through the other review steps. We concentrated on items that had been included in the 1998 achievement-level matching exercise, did not have further edits suggested by the scorability review, and were not scheduled for inclusion in the 1999 cognitive laboratories. For reading, we first identified passages that had at least one completed item set. For medium-length informational passages, we had to select passage pairs together with intertextual item sets that were all relatively complete.
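
The selection just described can be pictured with a small sketch. The status labels, field names, and preference order are assumptions for illustration, not a reproduction of the actual selection procedure documented in Hoffman and Thacker (1999).

```python
# Illustrative sketch only: drawing a review sample from one content-by-format
# cell, preferring items that are furthest along in the review process.
import random

STATUS_PREFERENCE = ["approved as is", "awaiting NAGB review", "awaiting edits or cognitive labs"]

def sample_cell(items_in_cell, n_needed, seed=0):
    """Draw n_needed items from a single content/format cell, most complete first."""
    rng = random.Random(seed)
    chosen = []
    for status in STATUS_PREFERENCE:
        pool = [item for item in items_in_cell if item["status"] == status]
        rng.shuffle(pool)
        chosen.extend(pool[: n_needed - len(chosen)])
        if len(chosen) == n_needed:
            break
    return chosen
```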

Table 3-7 shows the numbers of selected mathematics and reading items by completion status. Given the two-stage nature of the reading sample (item sets sampled within passage), we ended up with a smaller number of completed reading items than mathematics items. In our analyses, we also examined item quality ratings by level of completeness. (Additional details on the procedures used to select items for review can be found in Hoffman and Thacker [1999].) The items selected for review are a large and representative sample of VNT items that were then ready or nearly ready for pilot testing, but they do not represent the balance of the current VNT items, which are still under development.

TABLE 3-7 Items for Quality Evaluation by Completion Status

| Subject | Approved by NAGB | Awaiting NAGB Review | Awaiting Edits or Cognitive Labs | Total Items Sampled |
|---|---|---|---|---|
| Mathematics | 100 | 17 | 3 | 120 |
| Reading | 31 | 50 | 9 | 90 |

Expert Panel

Our overall conclusions about item quality are based primarily on ratings provided by panels of five mathematics experts and six reading experts with a variety of backgrounds and perspectives, including classroom teachers, test developers, and disciplinary experts from academic institutions:


Mathematics

Pamela Beck, Test Developer; New Standards Mathematics Reference Exam, University of California, Oakland
Jeffrey Choppin, Teacher; Benjamin Banneker Academic High School, Washington, DC
Thomas Cooney, Committee Member and Professor of Mathematics, University of Georgia, Athens
Anna Graeber, Disciplinary Expert; Department of Curriculum and Instruction, University of Maryland, College Park
Catherine Yohe, Teacher; Williamsburg Middle School, Arlington, Virginia

Reading

Gretchen Glick, Test Developer; Defense Manpower Data Center, Seaside, California
John Guthrie, Committee Member and Professor, Department of Human Development, University of Maryland, College Park
Marjorie Lipson, Committee Member and Professor, Department of Education, University of Vermont, Burlington
Rosemarie Montgomery, Teacher/Disciplinary Expert; Retired English Teacher, Pennsylvania
Gale Sinatra, Disciplinary Expert; Department of Educational Studies, University of Utah, Salt Lake City
John Tanner, Test Developer; Assessment and Accountability, Delaware Department of Education, Dover

We allocated a total of 6 hours to the rating process, including initial training and post-rating discussion. Based on experience with the 1998 item quality ratings, we judged that this time period would be sufficient for each expert to rate the number of items targeted for a single VNT form, 60 math items or 45 reading items with associated passages.

Comparison Sample of NAEP Items

In addition to the sampled VNT items, we identified a supplemental sample of released NAEP 4th-grade reading and 8th-grade mathematics items for inclusion in the rating process, for two reasons. First, content experts will nearly always have suggestions for ways items might be improved. A set of items would have to be truly exemplary for a diverse panel of experts to have no suggestions for further improvement. Use of released and final NAEP items provides a reasonable baseline against which to compare the number of changes suggested for the VNT items. Second, NAGB has been clear and consistent in its desire to make the VNT as much like NAEP as possible; NAEP items thus provide a very logical comparison sample, much more appropriate than items from other testing programs. We also note that NAEP items provide the basis for a fairly stringent comparison because they have been administered to large samples of students, in contrast to the pre-pilot VNT items. In all, we sampled 26 NAEP math items and 3 NAEP reading passages with a total of 30 reading items.

We used released NAEP items, but we masked the identity of all items so that raters would not know which items were NAEP and which were VNT. Several of our raters were sufficiently familiar with NAEP that it may not have been possible for them to be fully blind to item source, but we did make every possible effort to remove clues to each item's source.

Rating Booklet Design

In assigning items to rater booklets, we tried to balance the desire to review as many items as possible with the need to provide raters with adequate time for the review process and to obtain estimates of rater consistency levels. We assigned items to one of three sets: (a) those rated by all raters (common items), (b) those rated by two raters (paired items), and (c) those rated by only one rater (single items). Booklets were created (a different one for each rater) so as to balance common, paired, and single items across the books. Common item sets were incorporated into the review process in order to obtain measures of rater agreement and to identify outliers, those who consistently rated higher or lower than others.

For mathematics, each booklet contained three sets of common VNT items, targeted for three time slots: the beginning of the morning session (five items), the end of the morning session (ten items), and the end of the afternoon session (five items). For reading, the need to present items within passages constrained the common set of items to two VNT passages. These were targeted for presentation at the beginning (6 items) and end (11 items) of the morning rating sessions. The remaining VNT and NAEP items were assigned to either one or two raters. We obtained two independent ratings on as many items as possible, given the time constraints, in order to provide further basis for assessing rater consistency. The use of multiple raters also provided a more reliable assessment of each item, although our primary concern was with statistical inferences about the whole pool of items and not about any individual items. The items assigned to each rater were balanced insofar as possible with respect to content and format categories. (Further details of the booklet design can be found in Hoffman and Thacker [1999].)
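
A simplified sketch of this booklet design follows. The counts for the paired and single sets and the round-robin assignment rule are assumptions for illustration; the actual design also balanced content and format categories, as described in Hoffman and Thacker (1999).

```python
# Sketch of the booklet design: all raters see the common items, each paired
# item goes to two raters, and each single item to one rater. The paired and
# single counts below are illustrative, not the actual design.
from itertools import cycle

def build_booklets(raters, common_items, paired_items, single_items):
    booklets = {r: list(common_items) for r in raters}   # every rater gets the common set
    rater_cycle = cycle(raters)
    for item in paired_items:                            # two independent ratings each
        for _ in range(2):
            booklets[next(rater_cycle)].append(item)
    for item in single_items:                            # one rating each
        booklets[next(rater_cycle)].append(item)
    return booklets

# Example with five mathematics raters and 20 common items (5 + 10 + 5):
books = build_booklets(
    raters=["R1", "R2", "R3", "R4", "R5"],
    common_items=[f"C{i}" for i in range(1, 21)],
    paired_items=[f"P{i}" for i in range(1, 31)],    # illustrative count
    single_items=[f"S{i}" for i in range(1, 41)],    # illustrative count
)
# Each booklet then holds the 20 common items plus 20 assigned items.
```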

Rating Task

The rating process began with general discussion among both rating panels and committee members to clarify the rating task. There were two parts of the rating task. First, raters were asked to provide a holistic rating of the extent to which the item provided good information about the skill or knowledge it was intended to measure. The panels started with a five-point scale, with each level tied to a policy decision about the item, roughly as follows:

  1. flawed and should be discarded;

  2. needs major revision;

  3. acceptable with only minor edits or revisions;

  4. fully acceptable as is; or

  5. exceptional as an indicator of the intended skill or knowledge.

The panel of raters talked, first in a joint session, and later in separate sessions by discipline, about the reasons that items might be problematic or exemplary. Two kinds of issues emerged during these discussions. The first concerned whether the content of the item matched the content frameworks. For the mathematics items, the panel agreed that when the item appeared inappropriate for the targeted content strand, it would be given a code no higher than 3. For reading, questions about the target ability would be flagged in the comment field but would not necessarily constrain the ratings.


The second type of issue was described as craftsmanship. Craftsmanship concerns whether the item stem and response alternatives are well designed to distinguish between students who have or do not have the knowledge and skill the item was intended to measure. Items with obviously inappropriate incorrect choices are examples of poor craftsmanship.

The second part of the rating task involved providing comments to document specific concerns about item quality or specific reasons that an item might be exemplary. Major comment categories were identified in the initial panel discussion, and specific codes were assigned to each category to facilitate and standardize comment coding by the expert panelists.

After working through a set of practice items, each panel discussed differences in the holistic ratings or in the comment categories assigned to each item. Clarifications to the rating scale categories and to the comment codes were documented on flip-chart pages and taped to the wall for reference during the operational ratings. Table 3-8 lists the primary comment codes used by the panelists and provides a count of the frequency with which each of the codes was used by each of the two panels.

TABLE 3-8 Comment Coding for Item Rating

| Code | Explanation | Frequency of Use: Mathematics^a | Frequency of Use: Reading^a |
|---|---|---|---|
| Content | | | |
| AMM | Ability mismatch (refers to mathematics content ability classifications) | 17 | 0 |
| CA | Content category is ambiguous: strand or stance uncertain | 4 | 4 |
| CAA | Content inappropriate for target age group | 2 | 2 |
| CE | Efficient question for content: question gives breadth within strand or stance | 3 | 0 |
| CMM | Content mismatch: strand or stance misidentified | 19 | 24 |
| CMTO | More than one content category measured | 8 | 2 |
| CR | Rich/rigorous content | 4 | 13 |
| CRE | Context reasonable | 0 | 3 |
| CSL^b | Content strand depends on score level | 0 | 0 |
| S | Significance of the problem (versus trivial) | 12 | 1 |
| Craftsmanship | | | |
| ART | Graphic gives away answer | 0 | 1 |
| B | Bias, e.g., gender, race, etc. | 5 | 0 |
| BD | Back-door solution possible: question can be answered without working the problem through | 16 | 0 |
| DQ | Distractor quality | 32 | 55 |
| II | Item interdependence | 0 | 1 |
| MISC | Miscellaneous, multiple | 1 | 1 |
| RR | Rubric, likelihood of answer categories: score levels do not seem realistically matched to expected student performance | 6 | 4 |
| STEM | Wording in stem | 0 | 16 |
| TD | Text dependency: question and text are too closely or loosely associated | 3 | 13 |
| TL | Too literal (correct answer matches a text sentence) | 0 | 17 |
| TQ | Text quality | 14 | 1 |
| VOC | Vocabulary difficulty | 0 | 3 |

^a Used for VNT items only.

^b Used only on two NAEP items.


Comment codes were encouraged for highly rated items as well as poorly rated items; however, the predominant usage was for items rated below acceptable. (See Hoffman and Thacker [1999] for a more complete discussion of the comment codes.)

Item Quality Rating Results

Agreement Among Panelists

In general, agreement among panelists was high. Although two panelists rating the same item gave the same rating only 40 percent of the time, they were within one scale point of each other approximately 85 percent of the time. In many of the remaining 15 percent of the pairs of ratings where panelists disagreed by more than one scale point, quality rating differences stemmed from different interpretations of test content boundaries rather than from specific item problems. In other cases, one rater gave the item a low rating, apparently having detected a particular flaw that was missed by the other rater.
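
The agreement figures reported above can be computed directly from the paired ratings. The data format in the sketch below is an assumption; the calculation itself is simply exact agreement and agreement within one scale point.

```python
# Sketch of the rater agreement summary: exact agreement and agreement within
# one point on the five-point scale, for items rated by two panelists.
def agreement_summary(paired_ratings):
    """paired_ratings: list of (rating_a, rating_b) tuples on the 1-5 scale."""
    n = len(paired_ratings)
    exact = sum(a == b for a, b in paired_ratings)
    within_one = sum(abs(a - b) <= 1 for a, b in paired_ratings)
    return {"exact": exact / n, "within_one": within_one / n}

# e.g., agreement_summary([(3, 3), (4, 3), (2, 4)]) -> {'exact': 0.33..., 'within_one': 0.67...}
```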

Overall Evaluation

The results were generally positive: 59 percent of the mathematics items and 46 percent of the reading items were judged to be fully acceptable as is. Another 30 percent of the math items and 44 percent of the reading items were judged to require only minor edits. Only 11 percent of the math items and 10 percent of the reading items were judged to have significant problems.

There were no significant differences in the average ratings for VNT and NAEP items. Table 3-9 shows mean quality ratings for VNT and for NAEP reading and mathematics items and the percentages of items judged to have serious, minor, or no problems. Average ratings were 3.4 for VNT mathematics and 3.2 for VNT reading items, both slightly below the 3.5 boundary between minor edits and acceptable as is. For both reading and mathematics items, about 10 percent of the VNT items had average ratings that indicated serious problems. The proportion of NAEP items judged to have similarly serious problems was higher for mathematics (23 percent) and lower for reading (3 percent).

TABLE 3-9 Quality Ratings of Items

| Subject and Test | Number of Items Rated^a | Mean | S.D. | Percentage of Items with Scale Means Less Than 2.5^b | Percentage of Items with Scale Means of 2.5 to 3.5^c | Percentage of Items with Scale Means of At Least 3.5^d |
|---|---|---|---|---|---|---|
| Mathematics | | | | | | |
| VNT | 119 | 3.4 | 0.7 | 10.9 | 30.3 | 58.8 |
| NAEP | 25 | 3.1 | 0.9 | 23.1 | 30.8 | 46.2 |
| Reading | | | | | | |
| VNT | 88 | 3.2 | 0.7 | 10.2 | 44.3 | 45.5 |
| NAEP | 30 | 3.2 | 0.5 | 3.3 | 50.0 | 46.7 |

^a Two VNT reading items, one VNT mathematics item, and one NAEP mathematics item were excluded due to incomplete ratings.

^b Items that at least need major revisions to be acceptable.

^c Items that need minor revisions to be acceptable.

^d Items that are acceptable.


The relatively high number of NAEP items flagged by reviewers as needing further work, particularly in mathematics, suggests that the panelists had high standards for item quality. Such standards are particularly important for a test such as the VNT. In NAEP, a large number of items are included in the overall assessment through matrix sampling. In the past, items have not been subjected to large-scale tryouts prior to inclusion in an operational assessment, and it is not uncommon for problems to be discovered after operational use so that the item is excluded from scoring. By contrast, a relatively small number of items will be included in each VNT form, and scores for individuals will be based on those few items, so the standards must be high for each one.

Evaluation of Different Types of Items

There were few overall differences in item quality ratings for different types of items, that is, by item format or item strand or stance. For the reading items, however, there was a statistically significant difference between items that had been reviewed and approved by NAGB and those that were still under review, with the items reviewed by NAGB receiving higher ratings. Table 3-10 shows comparisons of mean ratings by completeness category for both mathematics and reading items.

TABLE 3-10 VNT Item Quality Means by Completeness Category

| Subject and Review Status | Number of Items Rated | Mean^a | S.D. | Percentage of Items with Scale Means Less Than 2.5^b | Percentage of Items with Scale Means of 2.5 to 3.5^c | Percentage of Items with Scale Means of At Least 3.5^d |
|---|---|---|---|---|---|---|
| Mathematics | | | | | | |
| Review completed | 99 | 3.4 | 0.8 | 12.1 | 31.3 | 56.6 |
| Review in progress | 20 | 3.5 | 0.5 | 5.0 | 25.0 | 70.0 |
| Reading | | | | | | |
| Review completed | 31 | 3.4 | 0.6 | 3.2 | 41.9 | 54.8 |
| Review in progress | 57 | 3.1 | 0.7 | 14.0 | 45.6 | 40.3 |

^a Reading means are significantly different at p < .05.

^b Items that need major revisions to be acceptable.

^c Items that need minor revisions to be acceptable.

^d Items that are acceptable.

Specific Comments

The expert raters used specific comment codes to indicate the nature of the minor or major edits that were needed for items rated as less than fully ready (see Hoffman and Thacker, 1999). For both reading and mathematics, the most frequent comment overall, particularly for items judged to require minor edits, was "distractor quality," and this was true for NAEP as well as VNT items. In discussing their ratings, the panelists were clear that this code was used when one or possibly more of the incorrect (distractor) options on a multiple-choice item was highly implausible and likely to be easily eliminated by respondents. This code was also used if two of the incorrect options were so similar that, if one were correct, the other would have to be correct as well, allowing both to be eliminated. Other distractor quality problems included nonparallel options or other features that might make it possible to eliminate one or more options without really understanding the underlying concept.


For both reading and mathematics items, the second most frequent comment code was "content mismatch." In mathematics, this code might indicate an item classified as an algebra or measurement item that seemed to be primarily a measure of number skills. In reading, this code was likely to be used for items classified as critical stance or developing an interpretation that were relatively literal or that seemed more an assessment of initial understanding. Reading items that were highly literal were judged to assess the ability to match text string patterns rather than gauging the student's understanding of the text. As such, they were not judged to be appropriate indicators of reading ability. In both cases, the most common problem was with items that appeared to be relatively basic although assigned to a more advanced content area.

For mathematics items, another frequent comment code was "backdoor solution," meaning that it might be possible to get the right answer without really understanding the content that the item is intended to measure. An example is a rate problem that is intended to assess students' ability to convert verbal descriptions to algebraic equations. For example, suppose two objects are travelling in the same direction at different rates of speed, with the faster object following the slower one, and the difference in speeds is 20 miles per hour, and the initial difference in distance is also 20 miles. Students could get to the answer that it would take 1 hour for the faster object to overtake the slower one without ever having to create either an algebraic or graphical representation of the problem. The expert mathematics panelists also coded a number of items as having ambiguous ability classifications. Items coded as problem solving seemed sometimes to assess conceptual understanding, while other items coded as tapping conceptual understanding might really represent application. By agreement, the panelists did not view this as a significant problem for the pilot test, so many of the items flagged for ability classifications were rated as fully acceptable.
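
Written out as equations (a reconstruction of the panelists' example, not an actual VNT item), the contrast is between the intended algebraic setup and a single division that bypasses it:

```latex
% Reconstructed illustration of the "back-door solution" example; not an actual VNT item.
% Intended approach: translate the verbal description into an equation and solve for t.
%   faster object:  d_f(t) = v t
%   slower object:  d_s(t) = 20 + (v - 20) t   % 20-mile head start, 20 mph slower
\[
  v t = 20 + (v - 20)\,t \;\Longrightarrow\; 20\,t = 20 \;\Longrightarrow\; t = 1 \text{ hour},
  \qquad\text{versus the back-door route}\qquad
  t = \frac{20 \text{ miles}}{20 \text{ miles per hour}} = 1 \text{ hour}.
\]
```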

For reading items, the next most frequent code was "too literal," meaning that the item did not really test whether the student understood the material, only whether he or she could find a specific text string within the passage.

Conclusions and Recommendations

With the data from item quality rating panels and other information provided to the committee by NAGB and AIR, the committee reached a number of conclusions about current item quality and about the item development and review process. We stress that there are still no empirical data on the performance and quality of the items when they are taken by students, and so the committee's evaluation is necessarily preliminary.

Most testing programs collect empirical (pilot test) item data at an earlier stage of item development than has been the case with the VNT. The key test of whether items measure intended domains will come with the administration of pilot test items to large samples of students. Data from the pilot test will show the relative difficulty of each item and the extent to which item scores provide a good indication of the target constructs as measured by other items. These data will provide a more solid basis for assessing the reliability and validity of tests constructed from the VNT item pool.

We conclude that the quality of the completed items is as good as that of a comparison sample of released NAEP items. Item quality is significantly improved in comparison with the items reviewed in preliminary stages of development a year ago.

As described above, the committee and other experts reviewed a sample of items that were ready or nearly ready for pilot testing. Average quality ratings for these items were near the boundary between "needs minor edits" and "use as is" and were as high as or higher than ratings of samples of released NAEP items. The need for minor edits does not affect the readiness of items for pilot testing.


Although quality ratings were high, the expert panelists did have a number of suggestions for improving many of the items. One frequent concern was with the quality of the distractors (incorrect options) for multiple-choice items. While distractor problems were coded as a minor editorial problem, such problems can seriously degrade the quality of information obtained during pilot testing. One example that typifies the kinds of problems that stem from poor distractor quality would be not including "2a" as an option for a question asking the value of "a times a." Clearly, such an omission would affect item difficulty estimates and might lead to a conclusion that the item was easy, when in fact it was not.

The match to particular content or skill categories was also a frequent concern. More serious flaws included reading items that were too literal or mathematics items that did not reflect significant mathematics, possibly because they had back-door solutions. However, the rate for flagged items was not higher than the rate at which released NAEP items were similarly flagged. Many of the minor problems, particularly distractor quality issues, are also likely to be found in the analysis of pilot test data.

RECOMMENDATION 3.3 Item quality concerns identified by reviewers, such as distractor quality and other "minor edits," should be carefully addressed and resolved by NAGB and its contractor prior to inclusion of any items in pilot testing.

In the best of circumstances, items to be pilot tested should be as perfected as possible so that the student response data will lead to minimal changes. The uncertainty surrounding the VNT and the rapid development schedules provide very little time for further testing of edited items or for evaluating the effects of changes in items on the test forms as a whole. It is reasonable to assume that the more perfected the piloted items are, the higher the item survival rate will be. It will also be easier to assemble all operational VNT test forms to meet the same statistical specifications if items are not revised following pilot testing.

MATCHING VNT ITEMS TO NAEP ACHIEVEMENT-LEVEL DESCRIPTIONS

In the interim Phase I evaluation report (National Research Council, 1998a:6), the NRC recommended "that NAGB and its contractors consider efforts now to match candidate VNT items to the NAEP achievement-level descriptions to ensure adequate accuracy in reporting VNT results on the NAEP achievement-level scale." This recommendation was included in the interim report because it was viewed as desirable to consider this matching before final selection of items for inclusion in the pilot test. The recommendation was repeated in the final Phase I report (National Research Council, 1999b:34): "NAGB and the development contractor should monitor summary information on available items by content and format categories and by match to NAEP achievement-level descriptions to assure the availability of sufficient quantities of items in each category."

Although the initial recommendation was linked to concerns about accuracy at different score levels, the Phase I report was also concerned about the content validity of achievement-level reporting for the VNT. All operational VNT items would be released after each administration, and if some items appeared to measure knowledge and skills not covered by the achievement-level descriptions, the credibility of the test would suffer. There will also be a credibility problem if some areas of knowledge and skill in the achievement-level descriptions are not measured by any items in a particular VNT form, but this problem is more difficult to address in advance of selecting items for a particular form. Finally, there also would be validity questions if a student classified at one achievement level answered


correctly most questions matched to a higher level or missed most questions that matched the achievement description for a lower level.

Contractor Workshop

In fall 1998 the test development contractor assembled a panel of experts to match then-existing VNT items to the NAEP achievement levels. Committee and NRC staff members observed these ratings, and the results were reported at the committee's February workshop (American Institutes for Research, 1999b). The contractor's main goal in matching VNT items to NAEP achievement levels was to have an adequate distribution of item difficulties to ensure measurement accuracy at key scale points. The general issue was whether item difficulties matched the achievement-level cutpoints. However, there was no attempt to address directly the question of whether the content of the items was clearly related to the "descriptions" of the achievement levels. The expert panelists were asked which achievement level the item matched, including a "below basic" level for which there is no description; they were not given an option of saying that the item did not match the description of any of the levels.

In matching VNT items to achievement levels, the treatment of constructed-response items with multiple score points was not clarified. The score points do not correspond directly to achievement levels, since scoring rubrics are developed and implemented well before the achievement-level descriptions are final and the cutpoints are set. Nonetheless, it is possible, for example, that "basic" or "proficient" performance is required to achieve a partial score, while "advanced" performance is required to achieve the top score for a constructed-response item. Committee members who observed the process believed that multipoint items were generally rated according to the knowledge and skill required to achieve the top score.

The results of the contractor's achievement-level matching varied by subject. For reading, there was reasonably good agreement among judges, with two of the three or four judges agreeing on a particular level for 94 percent of the items. Only 4 of the 1,476 reading items for which there was agreement were matched to the "below basic" level. About half of the items were matched to the proficient level, a quarter to the basic level, and a quarter to the advanced level. Based on these results, the contractor reports targeting the below basic level as an area of emphasis in developing further reading items.

For mathematics, there was much less agreement among the judges: it was common for the three or four panelists each to select a different achievement level (of the four possible). In addition, roughly 10 percent of the mathematics items were matched to the "below basic" level, for which there was no written description.
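
To make the agreement statistics concrete, the following sketch shows one way such an agreement rate could be tallied. It is illustrative only: the item identifiers, level labels, and data layout are hypothetical, and it does not reproduce the contractor's actual procedure.

```python
# Illustrative sketch: share of items on which at least two raters assigned
# the same achievement level. Data layout (dict of item ID -> list of level
# labels) is hypothetical.
from collections import Counter

def agreement_summary(ratings_by_item, min_agree=2):
    """Return the proportion of items with at least `min_agree` raters in
    agreement, plus the modal level for each item (None if no agreement)."""
    agreed = 0
    modal_levels = {}
    for item_id, levels in ratings_by_item.items():
        level, count = Counter(levels).most_common(1)[0]
        if count >= min_agree:
            agreed += 1
            modal_levels[item_id] = level
        else:
            modal_levels[item_id] = None  # no two raters agreed
    return agreed / len(ratings_by_item), modal_levels

# Hypothetical example: three raters per item, four possible levels.
sample = {
    "R-0001": ["basic", "basic", "proficient"],
    "R-0002": ["below basic", "basic", "advanced"],
}
rate, modes = agreement_summary(sample)
print(f"{rate:.0%} of items had at least two raters in agreement")
```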

To begin to address the content validity concerns about the congruence of item content with the achievement-level descriptions, we had our reading and mathematics experts conduct an additional item rating exercise. After the item quality ratings, they matched the content of a sample of VNT items to descriptions of the skills and knowledge required for basic, proficient, or advanced performance. The descriptions used in this exercise were a tabular rearrangement of the wording of the descriptions approved by NAGB. Appendix B shows the NAGB-approved achievement-level descriptions for 4th-grade reading and 8th-grade mathematics and the reorganization of these descriptions used by the panelists. Panelists were asked whether the item content matched any of the achievement-level descriptions and, if so, which ones. Thus, for multipoint items it was possible to say that a single item tapped basic, proficient, and advanced skills.

In general, although the panelists were able to see relationships between the content of the items and the achievement-level descriptions, they had difficulties in making definitive matches. In mathematics, the few items judged not to match any of the achievement-level descriptions were items that the panelists had rated as flawed because they were too literal or did not assess significant mathematics.

The panelists expressed concern about the achievement-level descriptions to which the VNT items were matched. The current descriptions appear to imply a hierarchy among the content areas that the panelists did not endorse. In reading, for example, only the advanced achievement-level description mentions critical evaluation of text, which might imply that all critical stance items are at the advanced level. A similar reading of the descriptions could lead one to believe that initial interpretation items should mostly be at the basic level. The panelists pointed out, however, that by varying passage complexity and the subtlety of distinctions among response options, it is quite possible to construct very difficult initial interpretation items or relatively easy critical stance items. They noted, for example, an item that used a very simple literary device (capitalization of all letters of one word); the item would have to be classified as advanced, because literary devices appear only in the advanced achievement-level description. Raters were dismayed at the prospect of categorizing such a simplistic item as advanced. A better approach for the VNT might be to develop descriptions of basic, proficient, and advanced performance for each of the reading stances and to describe the passage complexity and the fineness of distinctions that students would be expected to handle at each level. This approach would provide more useful information to parents and teachers about students' skills.

For mathematics, there were similar questions about whether mastery of the concepts described under advanced performance necessarily implies that students can also perform adequately all of the skills described as basic. Here, too, the panelists suggested that it would be useful, at least for informing instruction, to describe more specific expectations within each of the content strands rather than relying on relatively "content-free" descriptions of problem-solving skills.

The committee was concerned about how completely items in the VNT item pool cover all areas of the content and achievement-level descriptions. Given the relatively modest number of completed items, this question cannot be answered at this time. In any event, the primary concern is with the completeness of coverage in a given test form, not in the pool as a whole. The current content specifications will ensure coverage at the broadest level, but assessment of completeness of coverage at more detailed levels must await more complete test specifications or the assembly of actual forms.
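
A coverage check of this kind can be stated simply: cross-tabulate the available items by content category and matched achievement level and flag empty or thin cells. The sketch below illustrates the idea; the stance labels, field names, and sample records are hypothetical and are not drawn from the actual item pool.

```python
# Illustrative sketch: cross-tabulate an item pool (or a single form) by
# content category and matched achievement level, and list empty cells.
from collections import Counter
from itertools import product

def coverage_table(items, categories, levels):
    """Return cell counts and a list of (category, level) cells with no items."""
    counts = Counter((it["stance"], it["level"]) for it in items)
    gaps = [(c, l) for c, l in product(categories, levels) if counts[(c, l)] == 0]
    return counts, gaps

# Hypothetical labels and records for illustration only.
stances = ["initial understanding", "developing interpretation",
           "reader/text connection", "critical stance"]
levels = ["basic", "proficient", "advanced"]
pool = [
    {"id": "R-0001", "stance": "critical stance", "level": "advanced"},
    {"id": "R-0002", "stance": "initial understanding", "level": "basic"},
]
counts, gaps = coverage_table(pool, stances, levels)
for cell in gaps:
    print("no items for", cell)
```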

The committee did not attempt to address the issue of the validity of the achievement-level descriptions as they are used in NAEP. A number of prior reviews have questioned the process used to develop the NAEP achievement levels, both the scale points that operationally divide one level from the next and the descriptions of the knowledge and skills associated with performance at the basic, proficient, and advanced levels (National Research Council, 1999a; National Academy of Education, 1993). Other experts have defended the process used in developing NAEP achievement levels (see, e.g., Hambleton et al., 1999). The issue of whether the standards are too high or too low is a matter of NAGB policy and not something the committee considered within its charge. Rather, the committee focused on whether the content of the VNT items appeared to match the descriptions developed by NAGB for reporting results by achievement levels.

Conclusions and Recommendations

In reviewing efforts by NAGB and its contractor to match VNT items to NAEP achievement-level descriptions, the committee's overall conclusion is that these efforts have been helpful in ensuring a reasonable distribution of item difficulty for the pilot test item pool, but they have not yet begun to meet the need to ensure a match of item content to the descriptions of performance at each achievement level.

As described above, the achievement-level matching conducted by the development contractor focused on item difficulty and did not allow the raters to identify items that did not match the content of any of the achievement-level descriptions. Also, for mathematics, there was considerable disagreement among the contractor's raters about the achievement levels to which items were matched.

The committee's own efforts to match item content to the achievement-level descriptions led to more concern with the achievement-level descriptions than with item content. The current descriptions do not provide a clear picture of performance expectations within each reading stance or mathematics content strand. The descriptions also imply a hierarchy among skills that does not appear reasonable to the committee.

The match between item content and the achievement-level descriptions, and the clarity of the descriptions themselves, will be particularly critical for the VNT. Current plans call for releasing all of the items in each form immediately after their use. Unlike NAEP, the VNT will provide individual scores to students, parents, and teachers, which will lead to scrutiny of the results to see how a higher score might have been obtained. The achievement-level descriptions will have greater immediacy for teachers seeking to focus instruction on the knowledge and skills outlined as essential for proficiency in reading at grade 4 or mathematics at grade 8. Both the personalization of the results and the availability of the test items suggest very high levels of scrutiny and the consequent need to ensure that the achievement-level descriptions are clear and that the individual items are closely tied to them.

RECOMMENDATION 3.4 The contractor should continue to refine the achievement-level matching process to include the alignment of item content to achievement-level descriptions, as well as the alignment of item difficulty to the achievement-level cutpoints.

RECOMMENDATION 3.5 The achievement-level descriptions should be reviewed for usefulness in describing specific knowledge and skill expectations to teachers, parents, and others with responsibility for interpreting test scores and promoting student achievement.

The most justifiable scientific model of reading at grade 4 consists of a set of lower-level and higher-level processes operating together. Basic reading comprehension requires both higher and lower processes (Kintsch, 1998). The processes are interactive. Processes such as word recognition, recalling word meanings, and understanding sentences are necessary prerequisites for comprehension and the construction of knowledge from text (Lorch and van den Broek, 1997). In addition, higher-level processes of using background knowledge, making inferences, and evaluating new information are central to comprehension (Graesser, Singer, and Trabasso, 1994). Furthermore, these higher-level processes can also facilitate lower-level processes. Higher- and lower-level processes influence each other through top-down and bottom-up mechanisms (Anderson and Pearson, 1984). Therefore, tests should represent the higher-level processes of using knowledge, making inferences, and judging critically at all achievement levels. These higher-level processes should be present at the basic level as well as the proficient and advanced levels of the VNT and NAEP.

It is not justified to state that students at the basic level of NAEP have sentence comprehension or initial understanding but not critical evaluation in reading. Rather, students at the basic level have relatively less developed competencies in all processes, including word recognition, making inferences, knowledge use, and critical evaluation, which can be applied to relatively simple texts. Students at the advanced level have acquired these same reading competencies to an expert level. They possess more complex forms of these competencies, and they can use them to comprehend more complex texts. The descriptions of achievement levels should reflect the widely accepted interactive model of reading.

DOMAIN COVERAGE

A key question about the quality of the VNT items, in addition to their individual fit to the test frameworks, is whether in the aggregate they cover the intended frameworks completely. Given the committee's concern about unintended implications about content categories and proficiency levels, we decided to assess whether there is good coverage of each content, process, and proficiency category. Are there relatively advanced items on developing initial interpretations in reading or computation in mathematics? Are there more basic items on developing a critical stance (reading) or in probability and statistics (mathematics)? Unfortunately, the committee cannot answer these questions at this time because of the time constraints on our work. New items, written to fill perceived gaps in the domain coverage, had not yet been reviewed, and a relatively high proportion of the original items had not yet been fully completed.

Importance of Coverage

The question of domain coverage that concerns the committee is not just a matter of whether the item bank, as a whole, covers all of the intended content, process, and proficiency categories. The key question is whether each and every test form includes a reasonable sampling of items from each of these categories. This is an important question because the planned release of all of the test items after operational use will communicate the intended domain to teachers, parents, curriculum developers, and others much more concretely than the more general descriptions included in the test frameworks and test and item specifications.

At its final meeting, the committee reviewed a document from AIR entitled "Technical Specifications, Revisions as of June 18, 1999." This document outlines criteria and procedures for selecting items to be administered and for assembling forms from these items. In it, the contractor specifies the acceptable ranges for p-values (item difficulty estimates) and biserial correlations (for item scores with total test scores and for distractors with the total test score). The test blueprint is also specified for reading and mathematics. The contractor notes: "After all forms are assembled, the final evaluation is conducted for all forms at the form level to determine whether all the forms are parallel and meet the form assembly criteria" (American Institutes for Research, 1999i:11).
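
The following sketch illustrates the kind of statistical screening described: filtering items whose difficulty or discrimination falls outside acceptable ranges before form assembly. The numeric bounds, field names, and sample records are placeholders for illustration only; they are not the contractor's actual criteria.

```python
# Illustrative sketch: screen items against hypothetical p-value and
# item-total biserial acceptance ranges. Bounds are placeholders.
def screen_items(items, p_range=(0.20, 0.90), min_biserial=0.30):
    """Split items into those meeting the statistical criteria and those
    flagged for review or revision."""
    accepted, flagged = [], []
    for it in items:
        ok_p = p_range[0] <= it["p_value"] <= p_range[1]
        ok_r = it["biserial"] >= min_biserial
        (accepted if ok_p and ok_r else flagged).append(it["id"])
    return accepted, flagged

# Hypothetical pilot statistics for two items.
pilot = [
    {"id": "M-0101", "p_value": 0.55, "biserial": 0.42},
    {"id": "M-0102", "p_value": 0.96, "biserial": 0.18},  # too easy, weak discrimination
]
keep, review = screen_items(pilot)
print("accepted:", keep, "flag for review:", review)
```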

The committee stresses that such an evaluation of forms is an essential part of the process and should be given a substantial amount of time, expertise, and resources. It may be advisable to have an expert panel with content and psychometric members as well as teachers evaluate the forms for both the reading and the mathematics tests. Content panels involved in item revision and form construction should include psychometricians, curricular specialists, and teachers. For mathematics items, the panel should also include mathematics educators and college or university mathematics faculty. For reading items, reading educators and reading researchers should be included. The cognitive labs might be considered as sources for review and revision of forms. Multiple forms should be examined simultaneously to ensure that the content frameworks and achievement levels are comparably represented.

The stated purpose (National Assessment Governing Board, 1999e:5) of the VNT is "to measure individual student achievement in 4th grade reading and 8th grade mathematics, based on the content and rigorous performance standards of the National Assessment of Educational Progress (NAEP), as set by the National Assessment Governing Board (NAGB)." The intended use (p. 9) is "to provide information to parents, students, and authorized educators about the achievement of the individual student in relation to the content and the rigorous standards for the National Assessment, as set by the National Assessment Governing Board for 4th grade reading and 8th grade mathematics."

There is reason to be concerned that the VNT, in its emerging development, may result in an assessment that is not challenging enough to meet the stated purpose and intended use of the test. This concern was expressed by NAGB's Linking Feasibility Team (Cizek et al., 1999:60): "Compared to the NAEP, the VNT-R [VNT reading test] appears to have a disproportionate number of questions that ask for trivial or insignificant information." Furthermore, "more constructed response questions should be added to the VNT-R. This will increase the number of higher order thinking items on the test" (Cizek et al., 1999:91).

Conclusions and Recommendations

The results of our item review pointed to similar areas of concern about domain coverage (see Hoffman and Thacker, 1999). Although the ratings of the VNT and NAEP items were generally similar, 14.5 percent of the panelists' comments on VNT items were coded as "too literal," while none of the NAEP items were coded this way. The majority of items for the stances labeled "reader/text connection" and "critical stance" were rated as involving at least some difficulty (67% and 59%, respectively; see Hoffman and Thacker, 1999:Table 14). Similarly, in the qualitative reviews, reading panelists noted that many items were merely fact-finding from the passage and did not really match any of the stances (see Hoffman and Thacker, 1999:32). For reading, when items were problematic, the most frequent comments had to do with "content rigor" (46.7% of items rated "2" and 33.3% of those rated "3") and "too literal" (26% of those rated "2" and 70.4% of those rated "3"). The other most frequently named concern was "distractor quality" (25.9% of those rated "2" and 70.4% of those rated "3").

For mathematics, panelists commented that there seemed to be many easy items with ratings of 4; the pool of completed items appeared to consist largely of items that were either easy or did not assess significant mathematics (Hoffman and Thacker, 1999:31). If the VNT is to be a useful assessment, it must provide information not otherwise available, particularly in areas where there have been challenges to the rigor of the state standards.

RECOMMENDATION 3.6 Test blueprints should be expanded to indicate the expected number of items at each achievement level for each content area (reading stance or mathematics content strand) for each form of the test. Insofar as possible, items at each achievement level should be included for each content area.
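
To illustrate how an expanded blueprint of this kind could be used in form assembly, the sketch below compares the items in a draft form against target counts by content strand and achievement level and reports any shortfalls. The target counts, strand labels, and item records are hypothetical, not taken from the VNT specifications.

```python
# Illustrative sketch: check an assembled form against hypothetical blueprint
# targets stated as expected item counts per (content strand, achievement level).
from collections import Counter

def blueprint_shortfalls(form_items, targets):
    """Return the cells where the form has fewer items than the blueprint requires."""
    actual = Counter((it["strand"], it["level"]) for it in form_items)
    return {cell: need - actual[cell]
            for cell, need in targets.items()
            if actual[cell] < need}

# Hypothetical targets and draft form.
targets = {("algebra", "basic"): 3, ("algebra", "advanced"): 2,
           ("probability and statistics", "basic"): 2}
form = [
    {"id": "M-0201", "strand": "algebra", "level": "basic"},
    {"id": "M-0202", "strand": "algebra", "level": "advanced"},
]
for cell, missing in blueprint_shortfalls(form, targets).items():
    print(f"form is short {missing} item(s) for {cell}")
```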

