
4

Item Development

This chapter reviews NAEP’s costs for item development and then considers two ways to reduce them: automated and structured item development and changing the mix of item types. Pilot administration costs, which are a large component of item development costs, are partially addressed in the next two chapters, which cover test administration.

CURRENT COSTS

Test item development for NAEP is expensive. The costs for item creation and review range from $1,000 to $2,500 for selected-response items, from $1,500 to $3,500 for constructed-response items, and from $6,000 to $20,000 for scenario-based task items.1 With a typical distribution across these three types and taking the midpoint of the ranges, average per-item costs for creation and review are about $3,700.2

___________________

1 NCES response to Q68a. The panel follows NCES in describing cost differences in item development in terms of these three types of items. However, it has been noted that scenario-based tasks are not actually an item type, but are instead a way of grouping and contextualizing a set of items, each of which may require either selected or constructed responses. Thus, the panel’s references to the cost associated with “scenario-based task items” should be understood to refer to the cost of items that are developed as part of a contextualized group of items in a scenario-based task that may require either selected or constructed responses.

2 Taking the midpoint of each range implies an average cost of $1,750 for selected-response items, $2,500 for constructed-response items, and $13,000 for scenario-based items. The NCES response to Q68a suggests the following rough distribution of item types: 45 to 55 percent selected-response items, 30 to 40 percent constructed-response items, and 12 to 17 percent scenario-based items. Using a distribution of 50, 35, and 15 percent, respectively, for the three types of items (roughly the midpoints of the three ranges) produces the weighted average item creation and review cost of $3,700.


NAEP’s item costs are substantially higher than those in other testing programs. Published figures are generally not available, but the few that the panel found and the experience of several panel members suggest that typical industry costs for creating, reviewing, and pilot testing items range from a few hundred dollars per item for selected-response or short constructed-response items to less than $3,000 (also see Rudner, 2007). It is not surprising that the more unusual scenario-based task items are more expensive than selected-response or short constructed-response items, but the high cost of the more common item types suggests an unusually high overall cost structure that is separate from NAEP’s use of innovative item types.

These costs for NAEP’s items do not include the cost of pilot administration to test the items before use, which ranges from $25,000 to $35,000 for selected-response items, from $35,000 to $45,000 for constructed-response items, and from $45,000 to $55,000 for scenario-based task items.3 Again, with a typical distribution across these three types and taking the midpoint of the ranges, these average per-item costs are roughly $36,500 for pilot administration.4 These costs are also much higher than pilot testing costs for other assessments (which are addressed in Chapter 5).
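
The arithmetic behind these two per-item averages can be reproduced directly from the range midpoints and the assumed 50/35/15 mix of item types described in the footnotes; the short sketch below is illustrative only and simply restates that weighted-average calculation.

```python
# Reproduce the weighted-average per-item costs cited in the text, using the
# range midpoints and the assumed 50/35/15 mix of item types
# (selected response, constructed response, scenario-based task).
mix = {"selected": 0.50, "constructed": 0.35, "scenario": 0.15}

creation_midpoints = {"selected": 1_750, "constructed": 2_500, "scenario": 13_000}
pilot_midpoints = {"selected": 30_000, "constructed": 40_000, "scenario": 50_000}

avg_creation = sum(mix[t] * creation_midpoints[t] for t in mix)
avg_pilot = sum(mix[t] * pilot_midpoints[t] for t in mix)

print(f"average creation/review cost per item: ${avg_creation:,.0f}")    # ~$3,700
print(f"average pilot administration cost per item: ${avg_pilot:,.0f}")  # ~$36,500
```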

Although there is variation across subjects and grades, NAEP assessments include about 200 items, which are typically used across four administrations.5 Thus, a typical assessment will on average require 50 new items each time it is given. As noted in Chapter 2, NAEP administers roughly 22 assessments in a 4-year period, but 6 of these will be long-term trend NAEP in reading and mathematics, for which no new items are developed. Thus, a 4-year period typically involves developing roughly 50 new items for each of 16 assessments, or 800 items. In addition, extra items sometimes need to be developed, for example, when a new framework requires a new type of item or covers content that was not previously assessed.6 Over the next few years, a somewhat higher proportion of new items may be required, if the items in long-term trend NAEP are updated in its transition to digital administration and if the scheduled framework updates

___________________

3 These per-item pilot administration costs likely include apportionment of some fixed program costs (such as planning and equipment setup). While these per-item costs serve a useful discussion purpose, readers are cautioned against assuming that addition or removal of items will add or save costs in the full increments suggested by the unit costs.

4 NCES response to Q68a.

5 NCES responses to Q11, Q54, and Q55.

6 NCES response to Q55.


result in new construct demands.7 NCES estimates that an additional 100 items per year will be needed to support new frameworks and other special purposes.8 As a result, the panel estimates that NAEP needs to develop roughly 300 new items per year across all assessments. Finally, because roughly half of new items are rejected during piloting, it is necessary to develop twice as many new items as are ultimately needed: roughly 600 items per year.9
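
The panel’s figure of roughly 600 items per year follows directly from these counts; the sketch below restates that arithmetic, with the item counts and the 50 percent pilot rejection rate taken from the text.

```python
# Rough item-count arithmetic from the text: about 200 items per assessment,
# reused across 4 administrations; 16 of 22 assessments in a 4-year cycle need
# new items; plus ~100 extra items per year for new frameworks; and roughly
# half of piloted items are rejected, so development must be doubled.
items_per_assessment = 200
administrations_per_item_pool = 4
assessments_needing_new_items = 16   # 22 total minus 6 long-term trend assessments
extra_items_per_year = 100
pilot_rejection_rate = 0.5

new_items_per_assessment = items_per_assessment / administrations_per_item_pool    # 50
new_items_per_4_years = new_items_per_assessment * assessments_needing_new_items   # 800
needed_per_year = new_items_per_4_years / 4 + extra_items_per_year                 # 300
developed_per_year = needed_per_year / (1 - pilot_rejection_rate)                  # 600
print(developed_per_year)
```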

Item development is covered by one contract in the NAEP Alliance.10 The estimated annual average cost of this contract is $16.3 million, which is 9.3 percent of NAEP’s budget.11 At $3,700 per item, the creation of 600 items per year will cost $2.2 million. In addition, there will be pilot administration costs of roughly $21.9 million. However, only 10 percent of the pilot administration costs are covered in the item development contract, with the remaining costs for piloting new items supported by a variety of other contracts.12 Thus, the item creation and pilot administration costs attributed to the item development contract are roughly $4.4 million, 2.5 percent of NAEP’s budget, and roughly $11.9 million of the item development contract is not reflected in the per-unit costs of developing items.
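
The attribution of these costs to the item development contract can likewise be reconstructed from the figures above; in the sketch below, the total NAEP budget is inferred from the statement that the $16.3 million contract represents 9.3 percent of the budget.

```python
# How the per-item costs map onto the item development contract, per the text.
developed_per_year = 600
creation_cost = developed_per_year * 3_700     # ~$2.2 million per year
pilot_cost = developed_per_year * 36_500       # ~$21.9 million per year
pilot_share_in_contract = 0.10                 # only 10% of piloting is in this contract

contract_total = 16.3e6                        # annual item development contract cost
naep_budget = contract_total / 0.093           # implied by the 9.3 percent figure

attributed = creation_cost + pilot_share_in_contract * pilot_cost   # ~$4.4 million
unexplained = contract_total - attributed                           # ~$11.9 million
print(f"attributed to item development: ${attributed / 1e6:.1f}M "
      f"({attributed / naep_budget:.1%} of NAEP's budget)")
print(f"not reflected in per-item costs: ${unexplained / 1e6:.1f}M")
```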

NCES reports that there are other activities in the item development contract, including “preparation work prior to, during and after operational administration (e.g., Block Assembly), translating assessment content for the Bilingual accommodations and the mathematics Puerto Rico assessment, survey questionnaire development, Alliance-wide collaboration and planning, NAEP Integrated Management Systems (IMS) support, support for Governing Board meetings, and administrative costs.”13 The panel does not understand how these other activities can account for the vast majority of the costs in the item development contract.

___________________

7 NCES response to Q66a. The NAGB schedule calls for “new frameworks for mathematics and reading in 2026, science in 2028 and civics, U.S. history and writing in 2030.”

8 NCES response to Q66a.

9 NCES communication at the panel’s June 7, 2021 meeting. Sometimes items are rejected after piloting, but when this happens the item “will remain in the item inventory for revision and potential future pilot.” NCES response to Q12. This point was added after a prepublication version of the report was provided to the Institute of Education Sciences, NCES, and NAGB. The correction altered the estimates of item creation and pilot administration costs, as well as the estimates of potential savings to administration costs from local administration and longer testing time. These changes were made throughout the report.

10 The item development contract covers the following activities: “Develops cognitive items, scoring rubrics and survey questions; assists in the training of scorers; conducts cognitive interviews/small-scale pilots of items, rubrics, and survey questions; translates items and survey questions; and conducts item reviews” (NCES response to Q33).

11 See Table 2-2 in Chapter 2.

12 NCES answers to follow-up questions about evidence-centered design task models and item development costs (personal communication, June 24, 2021).

13 NCES response to Q68g.


RECOMMENDATION 4-1: The National Center for Education Statistics should examine the costs and scope of work in the item development contract that are not directly related to item development and pilot administration and explore possibilities for changes that would reduce costs.

AUTOMATED AND STRUCTURED ITEM DEVELOPMENT

Automatic item generation refers to the use of computer-based algorithms that produce test items or assist in their production (Gierl and Haladyna, 2013; Irvine, 2002).14 The item generation process involves three steps. In step 1, the content for item generation is identified using design principles, guidelines, and data that highlight the knowledge, skills, and abilities required to solve problems and perform tasks in a specific domain. The content needs to be organized and structured in a logical manner that can promote item generation. In step 2, an item model is developed to specify where the content must be placed to generate new items. In step 3, computer-based algorithms place the content specified in step 1 into the item model developed in step 2 to generate items. Selected-response questions tend to be more suited to automatic item generation than constructed-response questions, and more traditional selected-response questions are more suitable than complex selected-response questions.
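
As a minimal illustration of steps 2 and 3, the sketch below fills a single parameterized item model with varying numeric content to produce a family of traditional selected-response items; the stem, variable ranges, and distractor rules are invented for illustration and are not drawn from NAEP materials.

```python
import itertools
import random

# A toy item model for a traditional selected-response arithmetic item.
# The stem and variable ranges are invented; real item models would be derived
# from framework content, design guidelines, and domain analysis (step 1).
STEM = ("A class collects {a} cans of food each week for {b} weeks. "
        "How many cans does the class collect in total?")

def generate_item(a: int, b: int) -> dict:
    """Fill the item model (step 2) with specific content to produce an item (step 3)."""
    key = a * b
    # Distractors built from common errors: adding instead of multiplying, off-by-one week.
    distractors = {a + b, a * (b - 1), a * (b + 1)} - {key}
    options = random.sample(sorted(distractors), 3) + [key]
    random.shuffle(options)
    return {"stem": STEM.format(a=a, b=b), "options": options, "key": key}

# Generate a small family of structurally parallel items from one item model.
items = [generate_item(a, b) for a, b in itertools.product(range(12, 18), range(3, 7))]
print(len(items), "items generated from one item model")
print(items[0])
```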

Applications in a variety of contexts show that automatic item generation can lead to cost savings in developing traditional selected-response items (Bejar, 2019; Embretson and Kingston, 2018; Irvine, 2014; Kosh et al., 2019). Many of these efforts focus on mathematics items, but some of the work has included such domains as vocabulary or spatial ability.

Although NAEP includes some traditional selected-response items for which automatic item generation might be applied, those items are more prevalent in long-term trend NAEP, where new items are not generally created. Main NAEP, where new items are needed, often uses more complex item types, which are less amenable to automatic item generation. The program’s interest in scenario-based tasks further adds to the complexity of items and the resulting difficulties in trying to apply automatic item generation.

The deployment of automatic item generation procedures for a given item type requires a significant effort. It is more likely to lead to cost savings when a small number of item models are expected to be used, and

___________________

14 A term often used in connection with automated item generation is “cloning.” However, the term is based on an analogy that is not applicable to test development. A clone, by definition, is a duplicate or exact copy. The goal of automatic item generation is to produce psychometrically equivalent items, not exact duplicates.


reused, well into the future, as is the case in K–12 state assessments or in high-stakes admissions or credentialing exams. In an example of the threshold for cost effectiveness, Kosh and colleagues (2019) found that the investment in automatic item generation could be worthwhile within a narrow content area if more than 173–247 items were needed. This number of items is feasible for high-stakes assessments, such as admissions tests, that are administered frequently and for which reducing the exposure of items is necessary for security purposes. It is not feasible for NAEP, which needs only a few items in each narrow content area, though NAEP’s cost structure might result in a somewhat different number of items needed for automatic item generation to be cost-effective.
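
The underlying break-even logic can be sketched as follows; the fixed and per-item cost figures used here are illustrative assumptions, not published or NAEP values, and serve only to show how a threshold item count of this magnitude arises.

```python
# Illustrative break-even comparison for investing in an item model.
# The specific dollar figures are assumptions for this sketch, not NAEP costs.
def break_even_items(model_fixed_cost: float,
                     per_item_cost_with_model: float,
                     per_item_cost_traditional: float) -> float:
    """Number of items in one narrow content area at which the item-model
    investment pays for itself."""
    savings_per_item = per_item_cost_traditional - per_item_cost_with_model
    return model_fixed_cost / savings_per_item

# Example: a $300,000 model-development effort and $200 per generated item,
# versus $1,750 per traditionally written selected-response item.
n = break_even_items(300_000, 200, 1_750)
print(f"break-even at about {n:.0f} items in the content area")  # ~194 items
```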

Though the current state of the art in automatic item generation has limited applicability to NAEP, there are other options, some of which NCES has been considering. For example, NCES has been using principled approaches, such as evidence-centered design, to systematically lay out the chain of claims and evidence needed to build tasks that elicit the targeted knowledge and skills. The agency is using this approach to create task models for measuring the intended skills. NCES is also creating a library of reusable assessment components as part of its Benchmark Design System with hopes for operational use in the 2024 assessment. The reusable components will provide the building blocks and guidelines for generating new items and tasks.15

NCES could push this work further by applying some additional assessment design and engineering principles. Among them are the ideas of drawing from the detailed achievement-level descriptions to specify intended inferences and claims; better integrating the work of the experts who create NAEP frameworks with the experts who write items (as noted in Chapter 3); and applying many of the quality control processes to standardized item models instead of individual items to reduce review and pilot testing costs. We explain each below.

Drawing from Detailed Achievement-Level Descriptions

As described above, a principled approach begins by laying out the intended claims and inferences to be based on assessment results. The intended claims and inferences are then recast in terms of the types of evidence needed to support them. NAEP currently has two versions of achievement-level descriptions: a brief one- or two-sentence version that is typically reported with assessment results (and that most users are familiar with) and a longer, more detailed one that is used in test development. The longer version provides the kind of information needed for a principled approach since it specifies

___________________

15 NCES response to question related to evidence-centered design (personal communication, June 24, 2021).


what students should know and be able to do at each of the tested grade levels in each subject area. The descriptions reflect a progression from basic performance to advanced performance for each grade. They are intended to be cumulative within grade and coherent across grades. The evidence claims can then be used to outline the scope of knowledge and skills to be elicited from students at each achievement level.16

Integrating Framework Development and Item Creation

As described in Chapter 3 and recommended by other experts (e.g., Glaser, Linn, and Bohrnstedt, 1997; NCES, 2012), the content experts who create the assessment frameworks and the test development experts who write items have a great deal to offer each other. Working together and iteratively, the item developers can bring information about the art and science of measurement to framework development and the framework developers can bring information about the intentions of the frameworks to item development. As proposed in Recommendation 3-2 (in Chapter 3), implementing a change to integrate framework development and item creation will require NAGB and NCES to work together to create a structure that allows such collaboration. To some extent, NCES and NAGB already collaborate in this way,17 but they could refocus their work on task models, rather than individual items.

Thinking in Terms of Task Models

NAEP has made headway in defining task models. This approach to item development is appealing. It offers the potential to both decrease costs and increase the quality of item development, even without use of fully automatic item generation. Item review and other aspects of the quality control process can be streamlined. New items can be pre-calibrated without the cost of pilot testing. In addition, task models could be used to build in accessibility and address other issues of fairness and equity (see, e.g., Winter et al., 2018). Finally, items generated using task models can be evaluated for their ability to assess examinee knowledge and skills, providing evidence of the quality of the task model itself. NAEP’s use of automated processes of item generation could then evolve as the state of the art in automatic item generation evolves.
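
A task model of this kind might be represented as a structured record that carries its claim, evidence, variable features, calibration information, and completed reviews; the sketch below is hypothetical and does not reflect the actual structure of NCES’s Benchmark Design System.

```python
from dataclasses import dataclass, field

# Hypothetical task-model record; field names and values are illustrative only.
@dataclass
class TaskModel:
    claim: str                    # inference about students, drawn from detailed achievement-level descriptions
    evidence: str                 # observable behavior that supports the claim
    fixed_features: dict          # parts of the task held constant across generated items
    variable_features: dict       # slots that vary to produce new items
    irt_parameters: dict          # calibration shared by items built from this model
    reviews_completed: list = field(default_factory=list)  # quality control applied once, at the model level

model = TaskModel(
    claim="Proficient grade 4 students can interpret data presented in a bar graph.",
    evidence="Student selects the correct comparison between two graphed quantities.",
    fixed_features={"stimulus_type": "bar graph", "response_type": "selected response"},
    variable_features={"context": ["recycling", "library books"], "num_categories": [3, 4, 5]},
    irt_parameters={"difficulty_range": (-0.5, 0.5)},
    reviews_completed=["content", "fairness", "accessibility"],
)
```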

RECOMMENDATION 4-2: The National Assessment Governing Board and the National Center for Education Statistics should move toward using more structured processes for item development to both

___________________

16 For examples of the detailed achievement-level descriptions, see, for mathematics, https://nces.ed.gov/nationsreportcard/mathematics/achieve.aspx; for science, https://nces.ed.gov/nationsreportcard/science/achieve.aspx; and for reading, https://nces.ed.gov/nationsreportcard/reading/achieve.aspx.

17 This collaboration is implied in the documentation about the detailed achievement level. For example, see https://nces.ed.gov/nationsreportcard/mathematics/achieve.aspx.


decrease costs and improve quality. This work should include drawing from the detailed achievement-level descriptions to specify intended inferences and claims, better integrating the work of framework development and item creation, and carrying out critical aspects of review and quality control at the level of task models rather than at the level of individual items.

CHANGING THE MIX OF ITEM TYPES

NAEP currently uses a range of item types, including selected-response items, constructed-response items, and scenario-based tasks, as well as others. Different item types are well suited to different cognitive levels and content specifications, with more complex item types used to assess more complex skills. This alignment can be seen in the 4th-grade science item map, where seven of the eight items listed as above the NAEP Advanced cut score are constructed-response items.18

Despite this association between item types and the cognitive level and content of the items, the relation is not exact. As is often pointed out, selected-response or simpler constructed-response items can be used to assess cognitively complex material, although there are also many examples in which they do not.19 It is important to consider the full range of item types that can potentially be used to assess the different cognitive and content areas specified in the frameworks, rather than focusing on particular item types in the abstract.

The choice of item types is also influenced by factors other than the cognitive and content areas to be assessed, such as testing time and the costs of development, administration, and scoring. Changing the mix of item types could change NAEP’s average costs for item creation, pilot testing, test administration, and scoring. The average costs of the three item types discussed above imply that increasing the proportion of scenario-based items raises item development costs, while increasing the proportion of selected-response items lowers them. Similar relationships likely hold for test administration and scoring costs.
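
To illustrate how sensitive average development costs are to the mix, the sketch below combines the midpoint creation and pilot administration costs cited earlier under the current approximate mix and under one hypothetical alternative mix chosen purely for illustration.

```python
# Compare average per-item development cost (creation plus pilot administration)
# under the current approximate mix and under a hypothetical mix with more
# selected-response items. Per-item costs are the range midpoints cited earlier.
per_item_cost = {"selected": 1_750 + 30_000,
                 "constructed": 2_500 + 40_000,
                 "scenario": 13_000 + 50_000}

def average_cost(mix: dict) -> float:
    return sum(share * per_item_cost[t] for t, share in mix.items())

current_mix = {"selected": 0.50, "constructed": 0.35, "scenario": 0.15}
alternative_mix = {"selected": 0.65, "constructed": 0.25, "scenario": 0.10}  # hypothetical

print(f"current mix: ${average_cost(current_mix):,.0f} per item")
print(f"alternative mix: ${average_cost(alternative_mix):,.0f} per item")
```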

RECOMMENDATION 4-3: The National Assessment Governing Board should commission an analysis of the value and cost of different item types when multiple item types can measure the construct of

___________________

18 See https://www.nationsreportcard.gov/itemmaps/?subj=SCI&grade=4&year=2019.

19 There is recent research showing that selected-response items for which the selections are sourced from prior students’ constructed responses can produce items of comparable quality in some cases (Wang et al., 2019).


interest. A full range of potential item types should be included in this analysis. The analysis should develop a framework for considering the tradeoff between value and cost. The value considered should include both the item’s contribution to a score and its signal about the relevant components of the construct. The costs considered should include item development (both item creation and pilot administration), administration time, and scoring.

In addition to its implications for the cost of item development, this recommendation also relates to the costs for test administration and scoring, which are discussed in Chapters 5–7.
