
7

Item Scoring

Item scoring for NAEP is expensive because of the extensive use of constructed-response and scenario-based tasks that have, to date, required human scoring. This chapter first provides an overview of NAEP’s current scoring costs and then discusses automated scoring, an innovation that NAEP’s sponsors are pursuing to reduce the cost of scoring those tasks.

CURRENT COSTS1

Scoring for the state-level samples of reading and mathematics costs about $2.5 million per grade, or about $8 per assessed student.2 At the program’s average annual rate of 400,000 assessed students, scoring thus costs about $3.2 million per year across all assessments, which is 1.8 percent of NAEP’s budget.

Test scoring is covered by one contract in the NAEP Alliance, the scoring and dissemination contract.3 The estimated annual average cost of this contract is $8.3 million (see Table 2-2 in Chapter 2).

___________________

1 After a prepublication version of the report was provided to the Institute of Education Sciences, NCES, and NAGB, this section was edited to clarify the description of the costs of the scoring and dissemination contract that are not related to scoring.

2 NCES response to Q57b.

3 NCES response to Q33. The scoring and dissemination contract, also referred to as the materials, distribution, processing, and scoring contract, includes the following activities: “Prepares and packages all assessment and auxiliary materials; distributes assessment booklets and materials to the test administrators for each school; receives the materials from the schools; with [item development] and [design, analysis and reporting] contractor develops scoring training materials; and scores all assessments.”

The estimated annual average cost for the other activities in the contract is $5.1 million. As indicated in the title of the contract, it includes a set of activities related to materials, distribution, and processing.4 It is reasonable to expect that computer-based administration (discussed in Chapters 5 and 6) will eliminate most of these activities in the future. The contract also includes six activities related to management and reporting5 and one activity to support assessment administration.6 Finally, the contract includes an optional activity for unspecified special studies. The panel does not have a breakdown of costs across these different types of activities.

AUTOMATED SCORING OF CONSTRUCTED-RESPONSE ITEMS

Automated scoring7 refers to the “assignment of a score to a constructed response, produced by a test taker in response to a task or prompt, by means of a computational algorithm” (Bejar, 2011, p. 319).8 Automated scoring uses statistical and computational linguistic methods to model the scores assigned by human raters. The model focuses on specific features of students’ responses and uses those features to generate a score, intended to mimic the process used by human scorers. Automated scoring has been widely adopted in K–12 assessment, licensure, and certification programs and is one of the most recognized applications of machine learning in educational measurement (Foltz, Yan, and Rupp, 2020).9
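
To make the approach concrete, the short Python sketch below trains a simple feature-based model to reproduce human-assigned scores for a single constructed-response item. It is a minimal illustration of the general technique described above, not the engine NAEP uses: the responses, scores, and feature choices are hypothetical placeholders, and operational engines rely on much richer linguistic features and calibration procedures.

```python
# Minimal sketch of a feature-based automated-scoring model: it learns to
# reproduce human-assigned scores from textual features of responses.
# Illustrative only; the tiny inline training set is hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical human-scored training responses for one constructed-response item.
responses = [
    "The author uses the storm to show the character's inner conflict.",
    "Because the storm is loud.",
    "The storm mirrors the character's anger and foreshadows the argument.",
    "I don't know.",
]
human_scores = [2, 1, 3, 0]  # rubric scores assigned by trained human raters

# Word n-gram features stand in for the richer linguistic features
# (spelling, coherence, syntax) that operational engines extract.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    Ridge(alpha=1.0),
)
model.fit(responses, human_scores)

# Score an unseen response; an operational system would round and clip the
# prediction to the rubric scale and route flagged cases to human scorers.
print(model.predict(["The storm shows how upset the character feels."]))
```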

___________________

4 NCES response to Q69e. There are six contract activities listed related to materials, distribution, and processing: acquiring materials and supplies; spiral and bundle materials; distribute assessment materials to schools; track and receive assessment materials from schools; receipt control; and data capture and processing.

5 NCES response to Q69e. The six contract activities related to management and reporting are listed as follows: administrative reports; quality control; contractor meetings; information collections requests for Office of Management and Budget approval; technical documentation web page; and NAGB attendance, preparation, and support.

6 NCES response to Q69e. The activity to support assessment administration is described as follows: “State Service Center, State NAEP Coordinators, and State Testing Directors support.”

7 “Automated scoring” and “machine scoring” are sometimes used as equivalent terms. However, in this section, we distinguish automated scoring from machine scoring: automated scoring deals with unstructured input, such as unconstrained text, and machine scoring deals with structured input (e.g., math equations) or technology-enabled item inputs (e.g., ordered elements, drop-down boxes, and machine-enabled plots or graphs). With this distinction in mind, this section addresses automated scoring, not machine scoring, because NAEP already uses machine scoring for items that can be scored with other techniques; it is the items that allow unconstrained constructed responses that are still often routed for human scoring.

8 The implementation of that algorithm is referred to as a scoring engine.

9 See also An Overview of the Use of Automated Scoring Systems in Operational Assessments, AIR-ESSIN technical memorandum for task 14, 2020. Internal document provided to the panel by NCES and available in the project’s Public Access File.

Automated-scoring models have performed comparably to human raters10 when scoring short and long essays and constructed responses in reading comprehension and mathematics (Cahill et al., 2020; McGraw-Hill Education CTB, 2014; Partnership for Assessment of Readiness for College and Careers, 2015; Shermis and Hamner, 2013),11 though there is some evidence that the comparability between human and machine scoring is weaker for some groups, such as English-language learners.12 Automated scoring has also been successfully applied to mathematical expressions and equations entered using an equation editor and to graphing items completed using a graph interface (Fife, 2017).

NAEP has conducted studies to evaluate the feasibility of automated scoring of assessments of writing, reading, history, and civics.13 These studies found that currently available scoring engines can successfully mimic human scoring in writing—but not yet in the other subjects—using standards widely accepted within the field (Williamson, Xi, and Breyer, 2012).

The incorporation of automated scoring into NAEP offers a number of likely benefits, including faster scoring, improved score consistency within and across administrations, higher-quality scoring of items when combined with human scoring, increased information about student responses, and, potentially, cost savings. Importantly, automated-scoring models do not drift and can help ensure that scoring rubrics are applied consistently across years to support the estimates of trend. However, automated-scoring models require human monitoring to examine their performance, and models may need recalibration. Automated scoring also offers the potential to collect additional diagnostic information about student responses beyond a score: spelling, coherence, syntactic variation, and other linguistic features collected during the scoring process can provide more insight about student knowledge and skills. This is especially significant for a program that provides data that can support population-level inferences.

Many NAEP items in mathematics, reading, and writing may be amenable to automated scoring with available technologies. Importantly, the rapid improvements in recent years in computer algorithms and available data have the potential to further improve automated-scoring performance for existing and future item types (Ghosh, Klebanov, and Song, 2020; Mathias and Bhattacharyya, 2020; Riordan et al., 2020; Young et al., 2017). NAEP can expect to benefit from these improvements.

___________________

10 Human scoring performance is typically used as the standard for evaluating the performance of automated scoring engines because it is the obvious alternative.

11 See also Gregg, N., Young, M., and Lottridge, S. (2021, June). Examining Fairness in Automated Scoring, paper presented at the National Council on Measurement in Education, available in the project’s Public Access File.

12 Most of the work on group bias has been conducted on international university students who are non-native English speakers (Bridgeman, Trapani, and Attali, 2012; Burstein and Chodorow, 1999; Ramineni and Williamson, 2018).

13 2018 Auto Scoring Report in Reading, History, and Civics, Grades 4 and 8. Internal NCES Report provided to the panel and available in the project’s Public Access File.

But these probable benefits come with complications. Using automated scoring would add another layer to the scoring process that requires technical oversight. Automated scoring is also generally viewed with skepticism by the public (Wood, 2020), and it requires a program of validation to examine its effects on overall scoring and reporting. Careful planning related to technical oversight, public acceptance, and validation of its effects would be critical to the successful implementation of automated scoring. Given its national significance, NAEP is uniquely positioned to leverage industry and academic expertise and to serve as an exemplar of how to incorporate automated scoring into an assessment program.

Evaluating Items for Feasibility of Automated Scoring

Automated scoring may not be appropriate for all NAEP items; for some items, a human-only scoring approach (“hand scoring”) may still be needed. The performance of current scoring engines varies across items, both across and within item types: that is, models may not meet performance criteria for all items (McGraw-Hill Education CTB, 2014). Recognizing this limitation, NCES was conducting an open challenge to compare the performance of multiple scoring engines on NAEP reading assessment items at the time this report was being completed.14

Factors that influence both engine and human ability to score items include the depth of knowledge assessed by the item, the number of elicited concepts and the nature of the relationship between those concepts, the degree of variation in how concepts are described by examinees, the level of alignment between the item and the rubric, whether items stand alone or have dependencies, and the clarity of the item prompt and rubric (DiCerbo, Lai, and Ventura, 2020; Leacock and Zhang, 2014; Leacock, Messineo, and Zhang, 2013; Lottridge, Wood, and Shaw, 2018; Raczynski, Choi, and Cohen, 2021). Consideration of these factors during item creation can result in items that can be scored more successfully by both humans and automated scoring engines. The degree to which humans can score with high quality and agree with one another is also a driver of the level of agreement between scoring engines and humans (Patz, Lottridge, and Boyer, 2019; Wind et al., 2017).

___________________

14 See https://github.com/NAEP-AS-Challenge/info.

Time and Cost Savings

The addition of automated scoring in NAEP can reduce the number of responses that are hand scored, thereby decreasing both scoring time and cost. However, automated scoring comes with its own costs. Engine-related costs include obtaining a high-quality hand-scored sample (typically double-scored and resolved), engine training and validation, engine setup fees, and per-response or per-test scoring fees (Topol, Olson, and Roeber, 2014). Some hand-scoring activities will also still need to be done, particularly developing the rubric and training scorers to use it, hand scoring a subset of responses to train the model and monitor its performance, and monitoring the overall pattern of scores over time. There may also be additional costs for recalibrating models, for special studies of model performance, and for replacing any other hand-scoring activities that occur beyond score assignment (e.g., plagiarism detection).

Although NAEP scores many responses across many items, the number of responses per item is relatively low, ranging from 2,000 to 30,000 (NAEP, 2013). Items are included in test forms about four times, resulting in total response counts of 8,000 to 120,000 across the life of a typical item in main NAEP; items in long-term trend NAEP are used more often because they remain unchanged for a longer period. Items used in the mandated reading and mathematics assessments with state and urban district samples are at the top of this range of response counts; items used in assessments with national samples are at the bottom. In most implementations, automated-scoring models are trained for every item, so adding items increases costs.15 The NAEP response counts per item are near the threshold at which automated scoring typically becomes cost effective, around 30,000 responses, though the exact threshold depends on the savings per hand-scored response and the overall number of items scored automatically.
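
As a rough illustration of where such a threshold comes from, the sketch below works through a per-item break-even calculation in Python. All of the dollar figures and rates are hypothetical placeholders chosen only to show the shape of the calculation; they are not NAEP’s or any vendor’s actual costs.

```python
# Back-of-the-envelope break-even point for one item: automated scoring pays off
# once the hand scoring avoided exceeds the fixed engine costs plus the residual
# hand scoring retained for training and monitoring. All inputs are hypothetical.
def breakeven_responses(hand_cost_per_response=1.00,    # hypothetical hand-scoring cost
                        engine_fixed_cost=25_000.0,     # training, validation, setup per item
                        engine_cost_per_response=0.05,  # hypothetical per-response fee
                        monitoring_fraction=0.10):      # share still hand scored for monitoring
    saving_per_response = ((1 - monitoring_fraction) * hand_cost_per_response
                           - engine_cost_per_response)
    return engine_fixed_cost / saving_per_response

print(round(breakeven_responses()))  # about 29,400 responses with these placeholder inputs
```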

Criteria for Examining Fairness and Validity Issues

The quality of automated-scoring procedures needs to be evaluated in the same ways as human-scoring procedures (Bennett, 2011; Lottridge, Burkhardt, and Boyer, 2020; Williamson, Xi, and Breyer, 2012; Yan and Bridgeman, 2020). This evaluation should seek to determine the extent to which machine scores are reliable, fair, and valid for their intended uses and the inferences they support. Studies will be needed to compare machine scores and hand scores in terms of descriptive statistics (i.e., mean, standard deviation, and distribution), rates of agreement between automated and human scores at the item and test levels, and other measures of quality, and to determine whether these results vary with the training data used. These comparisons will need to be conducted for the full group of test takers and for test takers grouped by race and ethnicity, gender, English-learner status, disability status, family socioeconomic status, and other characteristics of interest.16
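
Such a comparison can be sketched in a few lines of Python. The example below computes mean differences, exact agreement, and quadratic weighted kappa for the full group and by subgroup; the score vectors and the two-level grouping variable are hypothetical, and an operational evaluation would apply published criteria such as those in Williamson, Xi, and Breyer (2012) to the resulting statistics.

```python
# Illustrative human-machine agreement check, overall and by subgroup, using
# exact agreement and quadratic weighted kappa (QWK). Data are hypothetical.
import numpy as np
from sklearn.metrics import cohen_kappa_score

human = np.array([2, 1, 3, 0, 2, 2, 1, 3, 0, 1])    # hand scores
machine = np.array([2, 1, 2, 0, 2, 3, 1, 3, 0, 1])  # engine scores for the same responses
group = np.array(["A", "A", "A", "A", "A",
                  "B", "B", "B", "B", "B"])          # e.g., English-learner status

def agreement_summary(h, m):
    return {
        "mean_difference": float(m.mean() - h.mean()),
        "exact_agreement": float((h == m).mean()),
        "qwk": cohen_kappa_score(h, m, weights="quadratic"),
    }

print("overall:", agreement_summary(human, machine))
for g in np.unique(group):
    mask = group == g
    print(f"group {g}:", agreement_summary(human[mask], machine[mask]))
```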

___________________

15 The use of item models for item creation—as discussed in Chapter 4—may allow automated-scoring models to be trained at the model level, rather than for individual items, which could result in further cost savings. While generic scoring models across items have been implemented in some contexts, this approach would need to be compared with the rubric requirements to ensure that scoring is valid.

Fairness is a particularly important issue to consider in evaluations, given that research has documented disparities related to machine learning and automated scoring (Corbett-Davies and Goel, 2018; Hutchinson and Mitchell, 2019). The panel highlights the criteria established in Williamson, Xi, and Breyer (2012), which are widely used. Seeking to answer the question, “Is it fair to subgroups of interest to substitute a human grader with an automated score?,” Williamson and colleagues outlined several group differences to examine: differences in the associations between automated and human scores across groups at the task, task type, and reported score levels; differences in the generalizability of automated scores by group; differences in the predictive ability of automated scoring; and differences in relation to the decisions made on the basis of the scores.17 In evaluating fairness, it is also important to examine whether humans are introducing bias and, if so, to introduce methods to correct the bias, such as improved training and monitoring.

Finally, while the concepts of machine learning and automated scoring are becoming increasingly familiar to the public, there is still considerable skepticism. Much distrust rests on the fact that computers do not “understand” language in the way humans do and that the mechanisms underlying automated scoring do not match how humans score (Page, 2003; Wood, 2020). These are reasonable criticisms that programs using automated scoring need to address. Wood (2020) offers seven recommendations that focus on the creation of public-facing documentation that explains how automated scoring works, how it is used in the program, and what evidence supports its performance, such as the results of comparisons with hand scoring. While NAEP does not report results at the examinee level, it is still critical to be able to explain the use of automated scoring to both technical and nontechnical audiences (Shermis and Lottridge, 2019).

___________________

16 Guidance and criteria for evaluation procedures are available in several publications, including Lottridge, Burkhardt, and Boyer (2020); Williamson, Xi, and Breyer (2012); and Yan and Bridgeman (2020). A broader approach to evaluation is discussed by Bejar (2011) and Bennett (2011), both of which present the view that automated scoring procedures should not be judged in isolation, without considering other aspects of the test and the testing context.

17 Other researchers suggest conducting differential item functioning analyses (Bridgeman, Trapani, and Attali, 2012; Shermis et al., 2017). If differences are identified, then it is important to investigate the source of those differences, both for human scorers and the engine (Ramineni and Williamson, 2018). See also Gregg, N., Young, M., and Lottridge, S. (2021, June), Examining Fairness in Automated Scoring, paper presented at the National Council on Measurement in Education, available in the project’s Public Access File.

ANTICIPATED COST REDUCTIONS FROM AUTOMATED SCORING

NCES plans to implement automated scoring in the near future, where feasible, for items in the reading and mathematics assessments in grades 4 and 8.18 These are the assessments with state-level samples that will provide sufficient responses over the four-test life of a typical item to make automated scoring cost effective.

Currently, 40 to 50 percent of the reading items and 25 percent of the mathematics items are hand scored.19 NCES estimates that automated scoring can be used for 70 percent of the hand-scored reading items and 40 percent of the hand-scored mathematics items.20 These figures are being empirically tested in the open challenge that is being conducted as this report is finalized and will be examined in future research studies.21 For the items that use automated scoring, hand scoring will continue for about 5 to 10 percent of responses to monitor the performance of automated scoring.22
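
The arithmetic behind these percentages can be made explicit. The short sketch below combines the hand-scored shares with the NCES feasibility estimates to show roughly what fraction of all items in each subject would move to automated scoring; it simply restates the figures quoted above and is not itself an NCES projection.

```python
# Combine the share of items that are hand scored with the NCES estimates of how
# many of those items are amenable to automated scoring (figures quoted above).
hand_scored_share = {"reading": (0.40, 0.50), "mathematics": (0.25, 0.25)}
amenable_share = {"reading": 0.70, "mathematics": 0.40}

for subject, (lo, hi) in hand_scored_share.items():
    a = amenable_share[subject]
    span = f"{lo * a:.0%}" if lo == hi else f"{lo * a:.0%} to {hi * a:.0%}"
    print(f"{subject}: roughly {span} of all items scored automatically, "
          "with 5 to 10 percent of their responses still hand scored for monitoring")
```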

NCES estimates that automated scoring will cut the cost of hand scoring in half for the reading and mathematics assessments in grades 4 and 8 starting in fiscal 2024,23 which would save approximately $2.5 million in scoring costs every 2 years. This reduction of $1.25 million in the annual average scoring cost represents 0.7 percent of NAEP’s budget. NCES estimates that the transition to online NAEP administration and automated scoring would require an investment of $2.5 million.24 This investment “include[s] proof of concept and field test studies for online administration in addition to special studies to examine the feasibility of automated scoring.”25
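
For reference, the projected savings can be annualized and expressed as a share of the budget using only the figures reported in this chapter, with the total budget inferred from the earlier statement that $3.2 million in scoring costs is 1.8 percent of the budget.

```python
# Annualize the projected savings and express them as a share of NAEP's budget,
# using only figures reported in this chapter.
total_budget = 3.2e6 / 0.018   # scoring costs of $3.2 million are 1.8% of the budget
annual_savings = 2.5e6 / 2     # $2.5 million saved every 2 years
print(f"annual savings: ${annual_savings:,.0f} "
      f"({annual_savings / total_budget:.1%} of the budget)")  # about 0.7%
```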

___________________

18 NCES response to Q57b.

19 NCES response to Q57b.

20 NCES response to Q57b.

21 NCES (personal communication, December 17, 2021).

22 NCES (personal communication, December 17, 2021).

23 NCES response to Q57b.

24 NCES (personal communication, January 14, 2022).

25 NCES response to Q57a. The NCES response to Q78 provides further detail about this work: The proof of concept will cost $80,000 and will “evaluate the use of automated scoring to score 2017 release NAEP grade 4 and 8 reading items.” A field test for $1–1.5 million will carry out a “duplicate ‘Shadow Score’ of 2019 NAEP Math & Reading items” using “the entire corpus of 285 constructed response mathematics and reading items.” In addition, NCES referred to ongoing special studies involving human double scoring ($400,000–600,000 each) that will monitor the accuracy of automated scoring and work to expand its use.

As would be expected given typical industry experience with the response-count threshold at which automated scoring becomes cost effective, NCES projections imply that automated scoring will yield only modest net cost savings over the next few years.

The cost savings projected by NCES do not currently reflect any use of automated scoring on other assessments with large state-level samples. Automated scoring may not be cost effective for these assessments because they are administered infrequently and so may not generate enough responses per item. It is quite unlikely that automated scoring would be cost effective for the assessments with only national-level samples.

RECOMMENDATION 7-1: The National Center for Education Statistics (NCES) should continue its work to implement automated scoring on the reading and mathematics assessments for grades 4 and 8, with the item types that current scoring engines can score accurately and consistently. NCES should also consider the use of automated scoring on other assessments administered to state-level samples. In addition to benefiting from modest net reductions in costs, NCES should work to leverage the potential of automated scoring to improve the speed of reporting, increase the information provided about open-ended responses, and increase the consistency and fairness of scoring over time.
