
4

Evolution of Test and Evaluation in Future AI-Based DAF Systems

4.1 INTRODUCTION

When the committee first set out to answer the questions driving this report, there was a healthy discussion about the study’s scope. At first, the questions asked appeared constrained, and the boundaries for areas of investigation seemed clear. However, after investigating each question, it became obvious to the committee that these questions could not be viewed in isolation. The areas being explored were as entangled with the complexity of the Department of the Air Force (DAF) bureaucracy as they were with the complexity of the technology. A common refrain in several data-gathering sessions was that the “DAF has a tiger by the tail”—an idiom for the unexpected and unintended consequences that come with bold moves. These unexpected and unintended consequences are not necessarily unwanted or unneeded, but potentially more impactful than the DAF has anticipated. The evolution required to effectively operationalize artificial intelligence (AI) will affect a significantly larger part of the DAF than seems obvious at first glance, as the committee expects AI to be embedded throughout the entire DAF over the next decade. This chapter discusses the actual scope of the impact of these advancements—not only on the test community but also on the requirements processes and DAF culture. The chapter also reviews trends in AI technology that illustrate how quickly the field is changing, and hence how important it will be to maintain a firm yet flexible grip on this tiger’s tail as AI-based systems emerge ever more rapidly across the DAF.


4.2 APPOINTING A DAF AI T&E CHAMPION

The magnitude of change this report suggests will require dedicated leadership, continuous oversight, and individual responsibility and accountability. This is best accomplished by formally designating a senior AI test and evaluation (T&E) official who reports to the Secretary of the Air Force, is responsive to the Chiefs of the Air and Space Forces, and has the necessary resources and authorities to implement DAF-wide changes.

The 2022 dual-hat designation of the 96th Operations Group Commander (96 OG/CC) as the chief of AI test and operations for the DAF Chief Data and AI Office (CDAO) is a positive and important step, and the committee views the 96 OG/CC as one of the primary beneficiaries of this report. However, as currently constituted, the chief of AI test and operations for the DAF CDAO does not have the authority to make the magnitude of changes across the DAF this committee believes necessary to enable AI T&E.

Finding 4-1: Currently, no single person below the level of the Secretary or the Chiefs of the Air and Space Forces has the requisite authority to implement DAF-wide changes to successfully test and evaluate AI-enabled systems.

For this reason, the committee recommends that the Secretary of the Air Force formally designate an overall DAF AI T&E champion at the general officer or senior executive service level in the DAF, and delegate to them the necessary authorities to make changes on behalf of the Secretary and Service Chiefs. This advocate should have breadth and depth of experience in both AI and T&E, to include extensive experience with human-systems integration and agile software T&E. This advocate should establish an AI governance structure that includes formally delineating AI T&E reporting relationships and roles and responsibilities across the Air Force Test Center (AFTC), the Air Force Operational Test and Evaluation Center (AFOTEC), the future U.S. Space Force Operational Test Agency (OTA), the DAF CDAO, and operational air, intelligence, C2, space, and cyber units.1 This process should include assessing what broader DAF-wide organizational and governance changes are needed to reflect the differences between AI T&E and T&E for all other Air Force systems and capabilities.

The AI T&E champion should be charged with implementing the DAF AI T&E vision, granted the requisite authorities and resources (to include personnel), and fully empowered to help realize that vision for the DAF. The DAF AI T&E champion should focus on new test designs for AI-enabled systems that incorporate the core systems engineering principles of non-AI-enabled systems while adding new elements that reflect the best AI T&E practices from academia, the private sector, and other government test organizations.

___________________

1 Because of the unique T&E expertise required, the committee does not propose dual-hatting the DAF CDAO as the DAF AI T&E champion. Given the centrality of data to AI testing, however, the offices of the AI T&E champion and CDAO will be inextricably linked.



Recommendation 4-1: The Secretary of the Air Force and chiefs of the Air and Space Forces should formally designate a general officer or senior civilian executive as the Department of the Air Force (DAF) artificial intelligence (AI) testing and evaluation (T&E) champion to address the unique challenges of T&E of AI systems identified above. This AI T&E advocate should have the requisite AI and T&E credentials, and should be granted the requisite authorities, responsibilities, and resources to ensure that AI T&E is integrated from program inception and appropriately funded, realizing the DAF AI T&E vision.

A successful model for appointing and empowering the AI T&E champion can be found in the DAF’s response to a previous National Academies study. In 2015, a study on the role of experimentation in the Air Force innovation life cycle2 recommended as its highest priority that a single individual at the top of the organization be responsible for “catalyzing” the desired outcome. That report emphasized the need for a singular authority responsible for “owning” the problem—and articulated that successful innovative organizations ensured that a “clearly identified individual was assigned responsibility for leading this work, was evaluated on their success in doing so, and woke up every workday focused on how to get it done better.” The DAF adopted this recommendation with great success.

General Mark Welsh, the then-Air Force Chief of Staff (CSAF), designated General Ellen Pawlikowski, then the AFMC Commander, to spearhead the innovation and experimentation effort. Gen. Pawlikowski instituted the strategic development planning and experimentation group (SDPE) to execute this responsibility. This group reported directly to Gen. Pawlikowski, and a new capability development council (CDC) reported to Gen. Welsh. Significantly, both of these institutions were chartered by the CSAF before the conclusion of the experimentation study. Notably, the SDPE continues to stimulate innovation across the Air Force—the next generation air dominance (NGAD) group is a salient example. Gen. Duke Richardson, the current AFMC commander, also recently established a digital transformation office (DTO) within AFMC, using a similar approach to rectify the shortfall in implementing an effective digital strategy in the Air Force.

This model is just one successful demonstration of the DAF identifying and empowering a champion who is able to implement the necessary changes effectively.

___________________

2 National Academies of Sciences, Engineering, and Medicine, 2016, The Role of Experimentation Campaigns in the Air Force Innovation Life Cycle, Washington, DC: The National Academies Press, https://doi.org/10.17226/23676.


4.3 ESTABLISHING AI T&E REQUIREMENTS

Throughout this study, one of the constant refrains this committee heard from speakers was the importance of formulating T&E requirements for AI capabilities that reflected the needs of end-users or operators, not only developers or testers.3 Yet the same speakers acknowledged the difficulty of defining comprehensive T&E requirements for software-centric capabilities whose “black box” performance under operational conditions can change continually as more and more data are ingested, that generate probabilistic or statistically predictable behavior rather than deterministic results, and whose performance can change significantly with every update to a fielded model.

Most current AI models do not learn by themselves in the field. They are trained and tested a priori and then deployed. They may be re-trained under operational conditions in an operational environment, in which case regression testing is required. Most AI models are per se deterministic in that, for example, a neural network has weights and thresholds and a method for combining the operations that is deterministic (i.e., the model is based on mathematical functions that operate in a predictable way). However, the data they ingest under operational conditions are stochastic, subject to environmental noise, sensor noise, data dropouts, faulty equipment and data collection, and environmental conditions. This “probabilistic” behavior is intrinsic to all sensing systems. What is unique to many AI models is that their behavior under these data corruption and stochastic input scenarios is not well understood at the theoretical level and often exhibits what are today seen as non-intuitive and brittle failure modes. At the same time, while overall model performance is expected to improve over time as more operational data are ingested, absorption of more data could also lead to significant reductions in performance if the new data are corrupted or poisoned or the AI model is subject to other forms of adversarial attacks (see Chapter 5). This would be particularly problematic if such attacks are undetected.
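To make this concrete, the following is a minimal, illustrative Python sketch (not drawn from the report) of a robustness sweep: a fixed, already-trained model is evaluated on a held-out test set under increasing levels of additive Gaussian noise, a stand-in for the sensor noise, dropouts, and data corruption described above, and any accuracy drop beyond an agreed margin is flagged as a candidate brittle failure mode. The model interface, noise model, and thresholds are assumptions for illustration only.

import numpy as np

def robustness_sweep(model, x_test, y_test, noise_levels=(0.0, 0.05, 0.1, 0.2), seed=0):
    # Evaluate a fixed (deterministic) model under increasing input corruption.
    # `model` is assumed to expose a scikit-learn-style predict() method.
    rng = np.random.default_rng(seed)
    accuracy = {}
    for sigma in noise_levels:
        x_noisy = x_test + rng.normal(0.0, sigma, size=x_test.shape)
        accuracy[sigma] = float(np.mean(model.predict(x_noisy) == y_test))
    return accuracy

def flag_brittle_modes(accuracy, max_drop=0.05):
    # Flag noise levels where accuracy falls more than max_drop below the
    # clean (sigma = 0.0) baseline -- candidate brittle failure modes that
    # warrant deeper characterization and regression testing after retraining.
    baseline = accuracy[0.0]
    return {sigma: acc for sigma, acc in accuracy.items() if baseline - acc > max_drop}

In an operational setting, the synthetic noise model would be replaced with mission-representative corruptions (compression artifacts, occlusion, sensor dropouts) agreed on by developers, testers, and end-users.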

The intersection of these two equally important considerations sets AI T&E apart from all previous DAF T&E. It leads to a fundamental and persistent challenge for AI T&E today: understanding what requirements to test against when evaluating standalone AI models, and what requirements to test against once one or more AI capabilities are integrated into a DAF system. As the National Security Commission on Artificial Intelligence (NSCAI) noted and as Project Maven demonstrated, the former is challenging enough; the latter introduces formidable new complexities that will require entirely new approaches to performing T&E of AI-enabled weapon systems or decision support systems—not only for AI added to fielded systems, but also for AI that is baked into new systems beginning with the design phase.4 With the current state of technology, AI T&E does not align conveniently with either T&E for traditional hardware weapon systems or T&E associated with DoD’s software acquisition pathway (although, generally, it is a closer fit to the latter than the former).5

___________________

3 One speaker noted that it was essential for AI developers to talk to operators or end-users at the beginning of a system’s design phase. This would not only allow developers to gain better insights into how a given AI-enabled capability would be used operationally, it would also help end-users gain a better understanding of the AI T&E process. The committee returns to this point later in this section.



This dilemma is a manifestation of the major differences between AI T&E and traditional T&E of hardware systems, which assesses and evaluates well-defined key performance parameters (KPPs) or, for information systems, net-ready KPPs (NR-KPP), and other largely static metrics6 driven by the joint capabilities integration and development system (JCIDS)7 and joint requirements oversight council (JROC) and established during system design and development. One of the dilemmas the AI test community must grapple with is to understand when more traditional KPPs or NR-KPPs should apply, and when more flexibility is required to avoid placing undue constraints on AI systems that are designed to meet end-user needs under operationally-relevant timelines. In other words, for AI capabilities the sponsoring organization must find the appropriate balance between overly broad and unnecessarily restrictive performance specifications, as the committee discusses in more detail below.

___________________

4 In its recommendations for AI T&E future actions, the national security commission on AI final report notes that “Progress on a common understanding of TEVV concepts and requirements is critical for progress in widely used metrics for performance. Significant work is needed to establish what appropriate metrics should be used to assess system performance across attributes for responsible AI according to applications/context profiles. (Such attributes, for example, include fairness, interpretability, reliability, and robustness.) Future work is needed to develop: (1) definitions, taxonomy, and metrics needed to enable agencies to better assess AI performance and vulnerabilities; and (2) metrics and benchmarks to assess reliability and intelligibility of produced model explanations. In the near term, guidance is needed on: (1) standards for testing intentional and unintentional failure modes; (2) exemplar datasets for benchmarking and evaluation, including robustness testing and red teaming; and (3) defining characteristics of AI data quality and training environment fidelity (to support adequate performance and governance),” p. 645.
The committee encourages the DAF to adopt these recommendations. See National Security Commission on Artificial Intelligence (NSCAI), 2021, National Security Commission on Artificial Intelligence Final Report, Arlington, VA, https://www.nscai.gov/wp-content/uploads/2021/03/Full-Report-Digital-1.pdf, p. 137.

5 DoD Instruction 5000.89 describes DoD-wide test and evaluation policies, processes, and procedures for urgent capability acquisition, middle tier of acquisition (MTA), major capability acquisition, software acquisition, and defense business systems (DBS). See U.S. Office of the Under Secretary of Defense for Research and Engineering, 2020, “DoD Instruction 5000.89: Test and Evaluation,” Washington, DC: Department of Defense, https://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodi/500089p.pdf. Defense acquisition of services does not require T&E policy and procedures. DoDI 5000.89 states that “For non-major defense acquisition programs (MDAPs) and for programs not on T&E oversight, these guiding principles should be used as a best practice for an integrated and effective T&E strategy,” p. 4. AI T&E is not discussed in DoDI 5000.89; accordingly, as currently written this instruction provides “guiding principles” for AI T&E, not definitive guidance. See U.S. Office of the Under Secretary of Defense for Acquisition and Sustainment, 2020, “DoD Instruction 5000.02: Operation of the Adaptive Acquisition Framework,” Washington, DC: Department of Defense, https://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodi/500002p.pdf. This Instruction addresses the use of the adaptive acquisition framework (AAF) in software acquisition. DoDI 5000.02 states explicitly that “Programs executing the software acquisition pathway are not subject to the Joint Capabilities Integration and Development System (JCIDS), and will be handled as specifically provided for by the Vice Chairman of the Joint Chiefs of Staff, in consultation with Under Secretary of Defense for Acquisition and Sustainment (USD(A&S)) and each service acquisition executive,” p. 3. It also notes that “Programs executing the software acquisition pathway will not be treated as major defense acquisition programs,” p. 3. See U.S. Office of the Under Secretary of Defense for Acquisition and Sustainment, 2020, “DoD Instruction 5000.87: Operation of the Software Acquisition Pathway,” Washington, DC: Department of Defense, https://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodi/500087p.pdf.



In general, AI requires greater integration between designers, testers, and operators or end-users to enable transparency of approach and outcome, as is common in the application of a DevSecOps process (see Recommendation 4-2). The differences between the two approaches need to be acknowledged in the near term. Still, all short-term solutions will continue to evolve over time through an iterative, interactive process as Air Force end-users and personnel within responsible test organizations gain more experience with writing AI-centric T&E requirements and with AI T&E processes and practices, and as AI T&E becomes more automated and test results become more explainable. The committee echoes the NSCAI’s recommendation to the military services to “establish a TEVV framework and culture that integrates testing as a continuous part of requirements specification, development, deployment, training, and maintenance and includes run-time monitoring of operational behavior.”8 Section 255 of the FY2020 National Defense Authorization Act (NDAA) established a “shift left” for software that requires, at a minimum, that T&E be incorporated into the software development life cycle. This policy would naturally extend to AI T&E, which will then need to go further to include the continuous T&E necessary for AI.

Conclusion 4-1: Compared to traditional T&E, AI T&E requires radically deeper continuous technical integration among designers, testers, and operators or end-users.

___________________

6 Such as critical technical parameters (CTP), critical intelligence parameters (CIP), key system attributes (KSA), interoperability requirements, and cybersecurity requirements. KPPs/NR-KPPs will still exist for AI-enabled systems, particularly in areas such as the safety and security of AI-enabled safety-critical systems.

7 For traditional hardware systems, the sponsoring service or agency enters the JCIDS process with a capabilities-based analysis (CBA); a doctrine, organization, training, materiel, leadership and education, personnel, facilities, and policy (DOTmLPF-P) analysis; other studies or analyses; or transition of rapidly fielded capability solutions.

8 NSCAI, 2021, National Security Commission on Artificial Intelligence Final Report, Arlington, VA, https://www.nscai.gov/wp-content/uploads/2021/03/Full-Report-Digital-1.pdf, p. 384.


Treating requirements for AI capabilities in the same manner as those for traditional hardware systems is likely to lead to unnecessary delays in development, acquisition, fielding, and sustainment.9 As AI is a software capability, it is essential for developers to be as flexible and agile as possible to allow fielding models and model updates on operationally-relevant timelines.10 Rather than applying the extreme rigor of and adhering to the extended timelines associated with the JCIDS requirements process, the preferred approach for AI-enabled capabilities is to link proposed solutions—whether provided by commercial vendors or DoD organizations—to existing JCIDS requirements while being sure to follow a DevSecOps or AIOps/MLOps development methodology. This will shorten development and fielding timelines considerably.11 One of AI’s most distinguishing features is the importance of relying on real- and near-real-time feedback from operational users, and ingesting operational data, to make rapid iterative improvements in fielded AI models via the agile methodology and CI/CD processes.

Recommendation 4-2: The Department of the Air Force should adopt a more flexible approach for acquiring artificial intelligence (AI)-enabled capabilities that whenever possible links proposed solutions to existing joint capabilities integration and development system requirements, and that follows a development, security, and operations or AI for information technology operations/machine learning operations development methodology.

The DoD Algorithmic Warfare Cross-Functional Team (Project Maven) used this approach when soliciting computer vision (CV) solutions to meet standing operational needs: members of the Maven team performed an exhaustive search of JCIDS databases to find existing requirements that had identified operational limitations and requested solutions that could augment, accelerate, and automate processing, exploitation, and dissemination of tactical and medium-altitude unmanned aircraft system (UAS) full-motion video. Once a commercial CV algorithm solution could be linked to an existing formal DoD requirement and translated into a request for proposal (RFP), the Maven T&E team established testable and verifiable performance measures for that algorithm, as described previously in this report.

___________________

9 Another risk that has not been sufficiently considered when “testing to requirements” in accordance with the JCIDS process is that AI systems that return better-than-expected testing results could be discarded for not meeting specific narrowly defined JCIDS-dictated requirements.

10 See for example, W. McHenry and M. Brown, 2022, “The 1960s Had Their Day: Changing DoD’s Acquisition Processes and Structures,” Real Clear Defense, December 5, https://www.realcleardefense.com/articles/2022/12/05/the_1960s_had_their_day_changing_dods_acquisition_processes_and_structures_868279.html. The authors emphasize the difference between DoD’s linear acquisition processes and successful commercial technology programs that rely on cross-functional teams and continual user feedback during design, development, fielding, and sustainment.

11 DoDI 5000.89 requires a test strategy when using the software acquisition pathway, and notes that this pathway “focuses on modern iterative software development techniques such as agile, lean, and development security operations, which promise faster delivery of working code to the user. The goal of this software acquisition pathway is to achieve continuous integration and continuous delivery to the maximum extent possible” (p. 24).



Members of Project Maven “translated” esoteric T&E metrics into terms that were most relevant to operational end-users. Because formal requirements had not been established for AI model performance, once the Maven team had completed data quality assurance, T&E on each model, integration testing in the Maven Integration Lab, and live-fly testing, user acceptance of each trained model and follow-on updates to those fielded models was based primarily on an agreement between the Maven team and operational users that the models had demonstrated adequate performance under operational conditions (as compared to the baseline performance achieved with existing, non-AI systems). Once a minimum viable product (MVP) model was fielded, user feedback was instrumental in refining model performance through continuous integration and continuous delivery (CI/CD). This entire process, which was considerably less rigid than the T&E of major acquisition program hardware systems, underscored the importance of defining future T&E requirements for all AI capabilities and AI-enabled platforms, sensors, and tools in ways that reflect consensus between developers and end-users at every stage of the AI life cycle. The JAIC T&E division (now under the OSD CDAO) refined Maven’s processes, procedures, and practices and is publishing CDAO AI T&E playbooks and providing AI T&E frameworks to OSD DOT&E that the DAF should consider adopting.12
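The acceptance logic described above can be expressed as a simple promotion gate in a CI/CD pipeline. The sketch below is a hypothetical Python illustration, not Maven’s actual tooling or metrics: a candidate model update is promoted only if it meets the performance floor agreed with operational users and does not regress materially against the currently fielded baseline. The metric names, model identifiers, and thresholds are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class EvalResult:
    model_id: str
    precision: float   # fraction of reported detections that are correct
    recall: float      # fraction of true objects that are detected

def promote(candidate: EvalResult, fielded: EvalResult,
            min_precision=0.80, min_recall=0.75, max_regression=0.02) -> bool:
    # Gate 1: the candidate must meet the floor agreed with operational users.
    meets_floor = candidate.precision >= min_precision and candidate.recall >= min_recall
    # Gate 2: the candidate must not regress materially against the fielded model.
    no_regression = (fielded.precision - candidate.precision <= max_regression
                     and fielded.recall - candidate.recall <= max_regression)
    return meets_floor and no_regression

# Hypothetical example: the update clears both gates and can be pushed to the field.
fielded = EvalResult("cv_model_v4", precision=0.84, recall=0.79)
candidate = EvalResult("cv_model_v5", precision=0.86, recall=0.80)
assert promote(candidate, fielded)

In practice the gate would be one automated stage in the CI/CD pipeline, with the floor and regression margins set jointly by developers, testers, and end-users.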

This less constrained approach to AI requirements formulation introduces risks. It creates the potential for overly broad performance specifications and disparities between contract language and end-user requirements. However, such risks can be mitigated substantially through a continuous dialogue between developers (DevSecOps or AIOps/MLOps teams), end-users, designated acquisition officials, and the responsible DAF test organization. Such a dialogue will help developers and testers formulate T&E metrics and performance measures that best match the end-users’ operational needs. While end-user involvement and feedback are valuable during the T&E of all systems, it is especially important during every stage of the AI life cycle due to general unfamiliarity with AI capabilities, as well as AI’s unique self-learning characteristics compared to all other traditional DAF hardware systems and software capabilities.

The National Institute of Standards and Technology’s (NIST’s) AI Risk Management Framework (RMF) lists representative AI actors across the AI life cycle.13

___________________

12 These include frameworks for T&E of AI-enabled systems (AEIS); operational testing of AEIS; human-system integration (HSI); system integration; responsible AI (RAI); and AI assurance.

13 National Institute of Standards and Technology, Department of Commerce, 2023, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1, Washington, DC, https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf. Also, see the accompanying (draft) NIST AI RMF Playbook, available at https://pages.nist.gov/AIRMF.


This list of activities and representative AI actors in each stage of the AI pipeline underscores the role that operators or end-users should play in the AI life cycle, most importantly beginning with plan and design, which, in coordination with testers, domain experts, AI designers, product managers, and others, is intended to lead to the formulation of AI T&E metrics and performance measures. The level and frequency of end-user or operator involvement in this process is another feature that distinguishes AI T&E from most traditional DAF hardware testing practices.

Addressing this foundational “how much?” question should be one of the DAF AI T&E champion’s initial top priorities, guided by discussions with the OSD CDAO, DOT&E, DASD(DT&E), AF CDAO, AFMC DTO, and other relevant DAF and joint AI test organizations and agencies. The answer to this question will always be context dependent, reflecting a combination of myriad factors such as end-user requirements, degree of urgency, technology and human readiness levels (TRLs/HRLs), assessed risks of action and inaction, scope, scale, and differences between an original fielded model and subsequent model version updates. It will also depend on the level of risk that end-users are willing to accept based on their operational imperatives. Yet this requires the test-responsible organization to communicate as transparently as possible to end-users the measured and expected performance capabilities, system limitations, and possible failure modes of AI-enabled systems that users intend to accept for fielding.14

As the NSCAI recommended in its final report, one of the DAF’s critical first steps, led by the AI T&E champion in coordination with the OSD CDAO, OSD DOT&E, DASD(DT&E), and DAF CDAO, should be to establish “a process for writing testable and verifiable AI requirement specifications that characterize realistic operational performance,” and to provide “testing methodologies and metrics that enable evaluation of these requirements—including principles of ethical and responsible AI, trustworthiness, robustness, and adversarial resilience.”15

As noted above, the iterative and interactive dialogue between end-users, testers, and the broader AI community will help operators and testers agree on request for proposal/request for information (RFP/RFI) and contract language, help end-users understand how AI performance will be assessed by testers, and help testers develop appropriate test metrics and performance measures. As noted in the Project Maven case study, other AI T&E best practices include setting aside sufficient representative data for training, validation (or assessment), and test; building T&E harnesses; evaluating fielded models as part of ongoing operational assessments; defining model boundary conditions and assessing AI failure modes; and developing T&E processes for each subsequent update to fielded models through normal CI/CD processes. One of AI’s most distinguishing features is the importance of relying on real- and near-real-time feedback from operational users and ingesting operational data to make rapid iterative improvements in fielded AI models via agile principles and CI/CD processes. This includes integrating test personnel with operational units when feasible. The DAF should consider training a subset of the DAF-wide test cadre to be integrated into operational units to assist with onsite AI T&E.
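As one concrete element of such a test harness, the hedged Python sketch below shows a deterministic, hash-based train/validation/test split keyed to a stable sample identifier, so that the held-out test set stays fixed across retraining cycles and model updates and test data cannot leak into training. The identifier scheme and split fractions are illustrative assumptions, not a prescribed DAF practice.

import hashlib

def split_assignment(sample_id: str, train_frac=0.70, val_frac=0.15) -> str:
    # Map a stable identifier to a pseudo-uniform value in [0, 1) via SHA-256,
    # so the same sample always lands in the same partition on every run.
    digest = hashlib.sha256(sample_id.encode("utf-8")).hexdigest()
    fraction = int(digest[:8], 16) / 0x100000000
    if fraction < train_frac:
        return "train"
    if fraction < train_frac + val_frac:
        return "val"
    return "test"

# Example: assignments are reproducible across retraining cycles.
for sid in ("sortie_0413_frame_000120", "sortie_0413_frame_000121"):
    print(sid, split_assignment(sid))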

___________________

14 See, for example, M.A. Flournoy, A. Haines, and G. Chefitz, 2020, Building Trust Through Testing: Adapting DOD’s Test & Evaluation, Validation & Verification (TEVV) Enterprise for Machine Learning Systems, Including Deep Learning Systems, Washington, DC: Center for Security and Emerging Technology (CSET), https://cset.georgetown.edu/wp-content/uploads/Building-Trust-Through-Testing.pdf.

15 NSCAI, 2021, National Security Commission on Artificial Intelligence Final Report, Arlington, VA, https://www.nscai.gov/wp-content/uploads/2021/03/Full-Report-Digital-1.pdf, p. 384.



When the committee refers to requirements, it includes the need for DAF-wide investments in enabling capabilities that support AI T&E at enterprise scale. Because of AI’s unique characteristics and the uncertainties associated with AI performance in operational environments, the committee recommends that the DAF prioritize investments for digital modernization of the DAF test enterprise and for implementing an enterprise-level T&E architecture, enabled to the maximum extent possible by the OSD CDAO,16 OSD DOT&E, OSD DASD(DT&E), the test resource management center (TRMC), and DAF CDAO. This should include major and near-term investments in modern AI stacks across AFTC, AFOTEC, and USAFWC (to include access to enterprise cloud-as-a-service and platform-as-a-service [PaaS] capabilities); modeling and simulation; the Virtual Test and Training Center (VTTC) at Nellis AFB; the joint simulation environment (JSE);17 the Air Force Digital Test Environment; the 96th Operations Group’s new initiative to establish a digital synthetic version of the air and ground ranges in and around Eglin AFB; digital twins;18 and live-virtual-constructive (LVC) integration.

___________________

16 Through its National AI T&E Infrastructure Capability (NAITIC) study, OSD CDAO is coordinating with TRMC, DTE&A, and DOT&E to answer the following basic question: is DoD properly resourced to adequately test and evaluate AI-enabled capabilities? This study is designed to systematically explore supply and demand for T&E of AI capabilities and identify gaps in DoD infrastructure. The study’s primary conclusion is that there is no evidence-based analysis of DoD AI T&E infrastructure gaps tied to demand (programs with AI capabilities) or supply (extant T&E infrastructure). In the near term, the DAF can take advantage of CDAO’s test harnesses (available through the CDAO “test and evaluation factory”), T&E bulk purchase agreements (BPA), and the red teaming handbook. The CDAO’s joint AI test infrastructure capability (JATIC) is an interoperable set of state-of-the-art software for rigorous AI model and algorithm test and evaluation. The CDAO AI Assurance division also makes available actual test products such as test and evaluation master plans (TEMP), to include one for an autonomous system; red team assessments; algorithmic integrity assessments; and human-system integration type assessments.

17 The joint simulation environment or JSE is a scalable, expandable, high-fidelity, government-owned non-proprietary modeling and simulation environment. While designed originally for testing fifth-generation aircraft in a simulation environment, its use is expanding to fulfill other integrated testing requirements.



As more AI-enabled weapon systems, especially AI-enabled autonomous weapon systems, are fielded across the Air Force and Space Force, there will be tremendous value in providing dedicated T&E “sandbox environments.” Such environments will be vital in supporting T&E for systems in more operationally-realistic settings and in providing more insights into potential AI system limitations and failure modes while also allowing the appropriate assessment of individual system risks and the risks associated with integration into a system-of-systems.

Digital modernization includes building and sustaining data management pipelines (DMP) for all AI projects. Every DAF AI project requires building a project-specific information architecture and establishing processes and procedures to generate training-quality data (TQD) essential to building and testing high-performance AI models. Additionally, the DAF AI T&E champion, in coordination with DAF system program offices (SPOs) and program executive officers (PEOs), should provide standardized contract options to address the need for TQD and machine-readable data, along with options for intellectual property (IP) protections and ownership of data rights and licenses for both commercial vendors and government entities.19 Finally, one of the primary duties of the DAF AI T&E champion would be to formally adopt and promulgate DAF-wide guidance, such as the February 2022 DAF-MIT AI Accelerator Artificial Intelligence Acquisition Guidebook.20
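To illustrate what training-quality-data checks might look like in practice, the following is a minimal, hypothetical Python sketch of a record-level validation pass run before data enter the DMP: every record must be complete and machine-readable, and labels must come from an agreed ontology. The field names and label ontology are invented for illustration and are not drawn from DAF guidance.

REQUIRED_FIELDS = ("image_uri", "label", "sensor", "timestamp")
ALLOWED_LABELS = {"vehicle", "building", "person", "clutter"}  # illustrative ontology

def validate_records(records):
    # Return (index, problem) pairs for a data steward to resolve before the
    # records are admitted to the training/test data pipeline.
    problems = []
    for i, rec in enumerate(records):
        for field in REQUIRED_FIELDS:
            if not rec.get(field):
                problems.append((i, f"missing {field}"))
        label = rec.get("label")
        if label and label not in ALLOWED_LABELS:
            problems.append((i, f"label '{label}' not in agreed ontology"))
    return problems

# Hypothetical example: record 1 is missing its label; record 2 uses an
# out-of-ontology label and would be rejected or re-adjudicated.
sample = [
    {"image_uri": "img_0001.png", "label": "vehicle", "sensor": "EO", "timestamp": "2023-01-01T00:00:00Z"},
    {"image_uri": "img_0002.png", "label": "", "sensor": "EO", "timestamp": "2023-01-01T00:00:05Z"},
    {"image_uri": "img_0003.png", "label": "tank", "sensor": "IR", "timestamp": "2023-01-01T00:00:10Z"},
]
print(validate_records(sample))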

As noted previously, the committee expects that DAF leaders will substantially underestimate the level of investment required to implement digital modernization of the DAF test enterprise and establish modern AI data management best practices. Therefore, in coordination with OSD CDAO, the committee recommends that the DAF immediately initiate a comprehensive analysis of the resources required to carry out digital modernization across the Air and Space Forces and resource those requirements appropriately in future DAF budgets.

___________________

18 While there are many definitions, a digital twin is generally defined as a digital representation of a physical object or system that can be used to simulate its real-world behavior and characteristics. Cortez et al. (2022) define a digital twin as a “digital representation of a Single Board Computer (SBC) and/or components representing a functionally correct, predictable and reproducible representation of that board or system at the appropriate level of fidelity to perform software verification, performance analysis and software validation tasks.” N.F. Cortez, E. Williams, A. House, and J. Ramirez, 2022, “Virtualization: Unlocking Software Modularity of Embedded Systems,” 2022 DoD Weapon Systems Software Summit, Orlando, FL: Orange County Convention Center, December 13, https://repo1.dso.mil/dsawg-devsecops/team-8/team8_artifacts/-/blob/master/Virtualization_-_Unlocking_Software_Modularity_of_Embedded_Systems_v2.pdf.

19 See, for example, A. Bowne and R. Holte, 2022, “Acquiring Machine-Readable Data for an AI-Ready Department of the Air Force,” The JAG Reporter, November 29, https://www.jagreporter.af.mil/Post/Article-View-Post/Article/3216144/acquiring-machine-readable-data. In addition to describing the importance of TQD and machine-readable data, the authors also address IP and data rights as part of the contracting and acquisition process for AI projects. See also Department of Defense, 2020, DoD Data Strategy, September 20, Washington, DC, https://media.defense.gov/2020/Oct/08/2002514180/-1/-1/0/DoD-Data-Strategy.pdf.

20 Department of the Air Force, 2022, Artificial Intelligence Acquisition Guidebook, Cambridge, MA: MIT, https://aia.mit.edu/wp-content/uploads/2022/02/AI-Acquisition-Guidebook_CAO-14-Feb-2022.pdf.



Over the longer term, when feasible and it makes sense operationally, the DAF should strive to integrate AI into programs of record, via the DAF’s SPOs, program management offices (PMOs), and PEOs, rather than “bolting on” AI to a system after fielding, as is the case today.21 In these cases, AI T&E can be integrated into the host weapon system test and evaluation master plan (TEMP). However, DAF responsible test organizations should be wary of allowing AI T&E to be “held hostage” when there are excessive delays in the parent weapon system test schedule during DT, OT, IOT&E, FOT&E, or live fire test and evaluation (LFT&E). One speaker provided an example of a delay in flight testing that caused an undue delay in the planned rapid T&E of an AI capability integrated into the system under test. Another speaker cited an example of overly-restrictive conditions directed by a program of record owner on the ability to update an integrated AI capability hosted on a hardware platform. To mitigate such problems, large portions of T&E can be accomplished on AI capabilities before they must be tested as part of a fully integrated hardware-software weapon system. This departure from established hardware test practices suggests the need for a DAF-wide test enterprise cultural shift, which in turn depends on providing more education and training on AI T&E and agile principles. The committee addresses this in more detail in the following Culture Change and Workforce Development section.

Recommendation 4-3: To the maximum extent possible and where it makes sense operationally, the Department of the Air Force (DAF) should integrate artificial intelligence (AI) requirements into programs of record, via the DAF’s system program offices and program executive officers, and integrate AI testing and evaluation (T&E) into the host weapon system T&E master plan.

Even with the rapid development of new AI capabilities and the maturation of earlier AI-enabled systems that provide opportunities for rapid updates to fielded models, many AI systems today remain brittle. Apart from the difficulties of fielding AI-enabled capabilities that perform as well in the operational environment as they do on the laboratory bench, AI will be subject to corruption and adversarial attacks in the form of model or algorithm denial and deception, data poisoning, evasion attacks, and cyberattacks, among others. Adversarial attacks will occur at the model level and system level, during operational deployment, and throughout the entire AI life cycle and data management pipeline. As such, the DAF must establish dedicated, independent AI red teams that are considered fully integrated elements of AI T&E. These teams can help develop and update defenses against adversarial attacks while also supporting the development of offensive adversarial attack techniques—much in the same way that “red air” has played an indispensable role in improving the effectiveness of the DAF over the past 40 years, or in how cyber “white hat” and red teams have been honing the skills of DAF cyber defenders and attackers over the past decade.

___________________

21 One notable exception is the Air Force Ground Based Strategic Deterrent “Sentinel” program, which has incorporated digital modernization principles, to include the use of digital twins, since program inception.



Red teams represent a critical component of AI test design and the overarching requirements process. These teams must be capable of emulating current and future peer competitor capabilities and performance and should be integrated into the entire AI life cycle. Furthermore, the committee underscores the importance of not viewing AI red teams as entities that are completely separate from the AI T&E enterprise.22 Instead, they should be integral to AI T&E, although independent, and focused on operational performance and mission resilience in the face of known and unknown—but expected—adversarial attacks, beginning with the presumption of attack at every stage of the AI life cycle, including cyberattacks, data manipulation, and data corruption and poisoning (as discussed in the next chapter). This includes the importance of instrumenting fielded systems to inform end-users of a potential adversarial attack or unexpected degradation in model performance (which may indicate an adversarial attack). Similar to OSD DOT&E’s use of cyber red teams, the committee recommends that DAF AI red teams fall under the direction of the DAF AI T&E champion (or designated AI T&E lead).
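As a concrete illustration of instrumenting a fielded model, the hedged Python sketch below maintains a rolling window of an operational quality score (for example, operator-confirmed detection rate or mean model confidence) and raises an alert when the windowed score falls a set amount below the accepted baseline; such an alert may indicate environmental drift, a sensor fault, or an adversarial attack, and would trigger red team and T&E follow-up. The metric, window size, and threshold are illustrative assumptions.

from collections import deque

class DegradationMonitor:
    # Rolling monitor for a fielded model; alerts on sustained performance drops.
    def __init__(self, baseline: float, window: int = 200, max_drop: float = 0.10):
        self.baseline = baseline      # accepted performance at fielding
        self.scores = deque(maxlen=window)
        self.max_drop = max_drop      # degradation margin agreed with end-users

    def update(self, score: float) -> bool:
        # Returns True when the windowed mean has degraded beyond the margin.
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False              # not enough evidence yet
        windowed_mean = sum(self.scores) / len(self.scores)
        return (self.baseline - windowed_mean) > self.max_drop

# Hypothetical usage: feed one score per inference batch or mission segment.
monitor = DegradationMonitor(baseline=0.85)
alerts = [monitor.update(s) for s in [0.84] * 150 + [0.60] * 100]
print(any(alerts))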

Establishing a DAF activity focused on AI-based systems red-teaming would provide trust and justified confidence in the face of potential adversarial attacks that present unique challenges for which the DAF is currently unprepared. To ensure that AI-enabled systems are resilient during development, training, deployment, and retraining or updating, the committee recommends the DAF develop T&E approaches integrated with red-team findings that reflect the range of adversarial activity anticipated during all phases of the AI life cycle.

Recommendation 4-4: The Department of the Air Force should establish an activity focused on robust artificial intelligence–based systems red-teaming, implement testing against threats the red-teaming uncovers, and coordinate its investments to explicitly address the findings of red-team activities and to augment research in the private sector.

___________________

22 OSD DOT&E has relied on DOT&E-sponsored and service-led cyber red teams for the past several years. See for example, DOT&E, 2022, “Cyber Assessment Program,” FY 2021 Annual Report, https://www.dote.osd.mil/Portals/97/pub/reports/FY2021/other/2021cap.pdf?ver=597qqovFSFg_PajZvaLu_w%3D%3D.


Finally, the DAF AI T&E champion must address how to respond to requests for changes to fielded AI models beyond the process described earlier, which accounts for regular, periodic updates through CI/CD processes. For all designated DoD-wide weapon systems, existing urgent operational needs (UON)/joint urgent operational needs (JUON)/joint emergent operational needs (JEON) processes are used for capability requirements identified as impacting ongoing or anticipated contingency operations. For example, for Air Force aircraft and related systems, the normal peacetime change process begins with unit-level requests (such as an operational change request [OCR], or Form 1067, which is used to document the submission, review, and approval of requirements for modifications). For fielded electronic warfare (EW) systems, units can seek emergency reprogramming updates through the EW integrated reprogramming (EWIR) process. For cyber systems, the AFCYBER incident response plan can trigger requirements for changes to fielded capabilities. DAF leaders should consider the advantages and limitations of all these different processes—as well as those in other DoD organizations and private sector companies—when establishing new processes and procedures that govern requests for urgent updates to fielded AI models. These processes and procedures must account for data requirements, model retraining, and the extent of additional T&E required.23

Data Management Requirements

Despite the focus on digital transformation and data over the past 5 years, the DAF is not yet an AI-ready force. The DAF does not yet treat its huge capacity for data collection in its internal business operations and its external missions in ways optimized for AI-based processing and exploitation. With few exceptions, data are not treated as a “first-class citizen”: they are not sufficiently tracked, managed, curated, protected, or stored in formats that make them readily accessible to AI developers, testers, and AI models. The DAF has not established policies and practices for building and sustaining the data-management pipelines crucial to modern AI development. The DAF does not have the modeling and simulation architectures, synthetic environments, digital twins, or computational power needed to support developing, testing, and sustaining advanced AI-enabled systems. These deficiencies, if not redressed, will adversely affect all aspects of AI-based systems development, including T&E activities.

___________________

23 As noted by the 96 OG/CC, because Eglin AFB is a designated Major Range and Test Facility Base (MRTFB) and is funded through a “pay to play” model (as directed by the NDAA), DAF leaders must address the disconnect between the timelines inherent in this type of funding model, and the certainty of needing immediate funding for high-priority emerging AI T&E requirements. The DAF AI T&E champion will also need to assess the impacts of traditional contractually mandated response timelines when responding to urgent and emerging AI T&E requirements.



Building off the 2020 DoD Data Strategy, the DAF should update its data vision and strategy to explicitly recognize data as a “first-class citizen.” This strategy and accompanying implementation plan should include policies and processes to track, manage, curate, protect, and store data in ways optimized for AI developers and testers and that account for possible sources of bias in data. The DAF needs to provide guidance on building and sustaining DMPs, to include highlighting government and private sector best practices for collecting and generating AI-ready data. Data at all levels of classification should be stored under the purview of a zero-trust network architecture, particularly accounting for data privacy when systems are trained on sensitive data.

The committee recommends storing and protecting data at all levels of classification within the purview of a zero-trust network architecture and accounting for data privacy when systems are trained on sensitive data.
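A minimal, illustrative Python sketch of one such data-management practice, a dataset manifest with per-file digests and basic provenance metadata, is shown below; it lets developers and testers detect files that have been corrupted or silently substituted (one possible sign of poisoning) before training or test. The directory layout and metadata fields are assumptions for illustration, not DAF policy.

import hashlib
import json
import pathlib

def build_manifest(data_dir: str, classification: str = "UNCLASSIFIED") -> dict:
    # Record a SHA-256 digest and size for every file under the dataset root.
    entries = []
    for path in sorted(pathlib.Path(data_dir).rglob("*")):
        if path.is_file():
            entries.append({
                "path": str(path),
                "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
                "bytes": path.stat().st_size,
            })
    return {"classification": classification, "files": entries}

def verify_manifest(manifest: dict) -> list:
    # Return paths whose current contents no longer match the recorded digest.
    mismatches = []
    for entry in manifest["files"]:
        p = pathlib.Path(entry["path"])
        if not p.is_file() or hashlib.sha256(p.read_bytes()).hexdigest() != entry["sha256"]:
            mismatches.append(entry["path"])
    return mismatches

# Hypothetical usage: store the manifest with the dataset and re-verify before
# every training run and every T&E event.
# manifest = build_manifest("datasets/eo_chips_v1")
# pathlib.Path("eo_chips_v1.manifest.json").write_text(json.dumps(manifest, indent=2))
# assert not verify_manifest(manifest)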

4.4 CULTURE CHANGE AND WORKFORCE DEVELOPMENT

The concept of culture is much easier to experience viscerally than it is to define or even describe adequately. Yet culture is very real. It materially affects an organization or community’s health and long-term performance. In general, culture refers to a set of shared behaviors, beliefs, and values. It is formed over time, resulting from the combined actions and words of all the people within an organization or community. While an organization or community’s leaders play a paramount role in establishing a particular culture through the promulgation of their vision, mission, and value statements; their leadership philosophy and style; the way they treat members of the organization; the norms they establish and enforce; and how they incentivize good and correct bad behaviors, organizational culture can only be formed, sustained, or changed by the collective behaviors of the entire organization or community over time.



While establishing and sustaining a particular culture is difficult, it is even harder to change an ingrained culture formed over many decades that is viewed as unique, elite, and highly successful. Those three qualities describe the culture of today’s DAF test enterprise.

It is hard to argue with past successes. While the committee cannot offer a “recipe” for culture change, it nonetheless believes that culture changes are necessary to ensure AI T&E’s best practices, processes, and procedures are adopted as rapidly as possible across the DAF. As noted throughout this report, despite many commonalities between traditional T&E and AI T&E, there are also notable differences. In particular, these include the lack of a clear delineation between DT and OT for AI capabilities; the importance of and reliance on agile principles and adaptive T&E principles (AIOps, MLOps, or DevSecOps) instead of waterfall development for AI systems; the centrality of data and high-end computing; the potential for a continuous data-based self-learning capability; the importance and challenges of mission- and domain-specific adaptation for AI-enabled systems; probabilistic or statistically predictable behavior rather than deterministic results; the effects and risks of dedicated adversarial attacks against AI models, at every phase from initial algorithm training through model deployment and sustainment; the desire for AI explainability and auditability; and the need for continuous integration and continuous delivery (CI and CD) for fielded AI-enabled systems.

The committee asserts that the magnitude of these differences warrants developing a new culture, one that combines the best of the extant test culture with a new and more risk-tolerant, agile, and adaptive mindset and approach to AI T&E. This sort of culture change will be instrumental in accelerating the adoption and integration of AI across the DAF at speed and at scale.

There are inherent dangers in rushing to change the legacy DAF test culture. Attempting to drive systemic changes across the test community without fully understanding the nature and magnitude of the change required or failing to communicate the rationale for change throughout the entire community can cause irreversible harm to the existing culture while simultaneously preventing leaders and organizations from forging and sustaining a culture that can endure for the foreseeable future. For these reasons, it is critical to identify specific aspects of the DAF test enterprise culture that need to be changed and why. Likewise, it is equally important to understand what elements of the existing DAF test culture should be preserved and how. These are not trivial steps. They will require active participation and buy-in from stakeholders and experts across the DAF test enterprise. Initial problem-framing must also include the participation of experts in AI and other emerging technologies from across the DAF, the federal government, and industry and academia—especially those with extensive AI and software T&E experience, to include recent experiences with leading-edge T&E techniques and adversarial attacks. In essence, DAF leaders should seek to maintain the “best of both worlds”: combining elements of today’s test culture with new elements that the test community agrees will most likely lead to T&E success in a future environment characterized by software-defined warfare.



Culture change begins at the top. Changing any culture depends on setting and adhering to a coherent vision that aligns strategies, actions, incentives, and metrics. The committee recommends that DAF leaders communicate immediately to the Air and Space Forces both the importance of AI T&E and their commitment to establishing a culture unique to AI T&E through the right combination of people, processes, and technology. At the same time, they should emphasize the value of preserving the successful elements of the current DAF test community culture. The designated DAF AI T&E champion should be equally committed to long-term culture change and should be responsible for recommending changes to DAF leaders that are designed to help forge a new AI T&E culture. The champion should also be accountable for following through on the decisions of DAF leaders.

Workforce development is a critical component of the DAF-wide plan to introduce and sustain new AI T&E capabilities and culture. In broad terms, workforce development comprises training, education, certification, and talent management. Because AI remains relatively new, these elements will, of necessity, include both general and test-specific AI training, education, and certification. Similarly, current DAF initiatives and programs that provide AI education and training—led primarily at present by the Department of the Air Force-MIT AI Accelerator (AIA) in coordination with Air University and OSD CDAO—should ensure that all levels of personnel have the appropriate training, from general officers and senior civilian executives to entry-level personnel.24 CDAO now also has AI education initiatives with the Johns Hopkins University Applied Physics Laboratory (JHUAPL) and the Naval Postgraduate School (NPS)/Stanford. This includes establishing requirements for continuing education and training (CET) on AI and AI T&E-specific topics. It will be equally important for the DAF AI T&E champion to advocate for centralized career-long tracking and management of personnel with specific AI and AI T&E skills, similar to other DAF efforts to manage myriad career fields (appropriate analogies include the cyber, space, and intelligence career fields, which recognize baseline training and certification along with additional identification of specialized training and certifications for specific positions held throughout a career).

The committee recommends that core AI T&E training fall under the AFTC, as opposed to general AI training, which can be accomplished by various DoD organizations. Since few DAF organizations and agencies presently have the requisite level of AI and T&E expertise, the committee recommends that the DAF rely on university-affiliated research centers (UARCs) and federally funded research and development centers (FFRDCs) to run AI T&E training, under the oversight of the DAF AI T&E champion and supported by AFOTEC, the USAFWC, the Air Force Institute of Technology (AFIT), AFRL, and the AIA.25 Furthermore, the AFTC AI T&E curriculum should be developed by personnel with substantial AI and AI T&E experience, not only from within the DAF but also, as appropriate, from industry and academia. The committee expects that the test community will achieve better results this way than by relying primarily on retraining AFTC, AFOTEC, or USAFWC test personnel on AI principles and AI T&E processes, practices, and procedures.26

___________________

24 For example, the DAF-MIT AI Accelerator and the MIT Sloan School of Management host a 3-day AI for National Security Leaders (AI4NSL) education program in Cambridge, Massachusetts.

The committee recommends that the DAF assess the utility of the law school analogy for building a cadre of AI T&E personnel across the test enterprise. Just as all lawyers receive a common core education on the law followed by extensive additional, specialized training for their planned area of practice (tort law, criminal law, contract law, and so on), DAF AI T&E personnel can complete a common core test curriculum at the AFTC, with AI T&E-specific training (and training on other emerging technologies) provided either within the AFTC or at other designated DAF or joint organizations, such as the DAF AIA, AFIT, AFRL, or the Defense Acquisition University (DAU). The importance of CET also has an analogy in the legal profession: lawyers are typically required to earn a set number of continuing education units (CEUs) each year to remain in good standing. The committee suggests that the importance of CET is even greater for AI, considering the exponential rate of technological change.

As noted earlier in the summary, there must be sufficient flexibility at the operational and tactical levels to accommodate agile and CI/CD principles and continuous T&E. This may require deliberate placement of AI T&E experts within operational and training units outside the traditional DAF test community. Some of these people may already be test-certified (similar to how test pilots continually rotate through operational and training squadrons throughout their careers) and may only require AI T&E “top-off” training. Others may possess useful skills (a computer science background or previous AI experience, for example) but have not been trained at the AFTC and thus should receive tailored AI T&E training aligned to their unit responsibilities.27

___________________

25 Similar, for example, to OSD DOT&E’s use of IDA to provide analytic support to DoD’s T&E community.

26 This expectation accounts for the substantive differences between traditional T&E of hardware systems and AI T&E. The committee acknowledges the potential utility of a hybrid approach that takes advantage of the expertise of both highly experienced “traditional” test personnel and people with extensive experience in the development, testing, fielding, and sustainment of AI-enabled systems.

Whenever feasible, the DAF should take advantage of existing AI-related education and training initiatives. For instance, in response to congressional direction, the JAIC (now the CDAO) developed the 2020 Department of Defense Artificial Intelligence Education Strategy.28 In crafting the AI education strategy and implementation plan, the JAIC segmented the entire DoD workforce into six AI archetypes: groups of personnel with similar AI education and training needs.29 The committee recommends that the DAF continue to use these same archetypes in developing AI and AI T&E-specific training and education. The JAIC initiated an AI education pilot program in October 2020, and the CDAO, in coordination with the DAF AIA, now offers a variety of AI training programs and courses for personnel across DoD. The AIA has compiled a list of AI educational resources for DoD personnel, which can be accessed with a common access card (CAC).30 In addition, the 96th Operations Group Commander briefed the committee that the 96th is developing AI T&E educational programs for the test community that address the implications of lethal autonomous weapon systems (LAWS), human factors, and human-systems integration.

The committee also recommends that the DAF AI T&E champion consider using the DAU’s approach to modernizing the DoD T&E acquisition workforce as a guidepost for developing DAF-wide AI T&E education, training, and certification. DAU is pivoting from a “one-size-fits-all” certification framework to a component- and workforce-centric, tailorable, continuous learning construct.31 The DoD acquisition workforce T&E functional area includes members working in developmental test and evaluation (DT&E), the TRMC, test ranges, and operational test and evaluation (OT&E) throughout all phases of the acquisition life cycle. This DAU initiative focuses on personnel development, streamlining functional areas, reforming the certification framework, modernizing talent management, and equipping acquisition professionals with the tools needed in the digital age. It includes both foundational (within 3 years of position assignment) and practitioner (within 5 years of position assignment) categories.

___________________

27 The committee suggests that the AI T&E champion, in coordination with the AFTC, AFOTEC, USAFWC, DAF CDAO, AFMC Digital Transformation Office (DTO), and DAF Chief Experience Officer (CXO), assess the value of placing “digital natives” at the unit level, analogous to the practice of placing unit intelligence officers within DAF squadrons.

28 Section 256 of the National Defense Authorization Act (NDAA) for Fiscal Year 2020 directed the Secretary of Defense to “develop a strategy for educating service members in relevant occupational fields on matters related to artificial intelligence” and to develop an implementation plan. (DoD Joint AI Center, 2020, Department of Defense Artificial Intelligence Education Strategy, Washington, DC, https://www.ai.mil/docs/2020_DoD_AI_Training_and_Education_Strategy_and_Infographic_10_27_20.pdf.) See Chief Digital and Artificial Intelligence Office, 2023, “Education & Training,” https://www.ai.mil/education_training.html, for descriptions of the CDAO’s AI training programs.

29 As detailed on p. 7 of the DoD AI Education Strategy, the six archetypes are Lead AI, Drive AI, Create AI, Employ AI, Facilitate AI, and Embed AI. For detailed descriptions of each archetype, see Appendixes B–G of the DoD AI Education Strategy.

30 C. Del Aguila, 2022, “AI Accelerator Focuses on Education,” Air Force Materiel Command, https://www.afmc.af.mil/News/Article-Display/Article/3013236/ai-accelerator-focuses-on-education.

31 S. Possehl, 2022, “Test and Evaluation: The Change Is Here Today,” Defense Acquisition University, February 1, https://www.dau.edu/library/defense-atl/blog/Test-and-Evaluation-change-today.

Moreover, it includes both T&E certification training requirements (basic requirements for working in a designated T&E acquisition position) and T&E credential development (additional training that provides job-specific, specialty, and point-of-need training for mid- and advanced-career jobs and opportunities). The initial set of training credentials includes, among others, T&E of AI, T&E of autonomous systems, evaluating data, T&E of software, and digital engineering (an existing DAU credential). Credentials are intended to be flexible for point-of-need applications and may be required by senior leaders, functional leaders, supervisors, managers, and others.

Finally, for more advanced AI T&E education and training, the committee suggests that the DAF AI T&E champion review programs offered by the DAU, AFIT, and Air Force Materiel Command (AFMC). For example, AFMC’s Air Force Acquisition Instructor Course (AQIC), viewed as a “Weapons School for the acquisition career field,” includes an entire section on traditional T&E and another on emerging technologies. The committee expects that AFMC and AQIC will be receptive to providing more advanced education and training on AI T&E based on DAF, AFTC, AFOTEC, and USAFWC needs.

One of the most important first steps is to survey the entire DAF workforce to determine, as accurately as possible, the current baseline of AI and AI T&E skills that exists in the DAF today. The committee heard the resounding message from several speakers that such a baseline does not exist, either for general AI skills or, more important for this report, for AI T&E experience. The DAF AI T&E champion should coordinate with the Air Force Personnel Center (AFPC) and other organizations, such as Air University and the DAF AIA, to develop and administer this DAF-wide survey. The DAF AI T&E champion, in coordination with the DAF CDAO, 96th Operations Group, AFPC, Air University, and the AIA, should consider taking the same approach used in developing the Air Force Computer Language Self-Assessment (CLSA) program in 2019. The CLSA, administered by Air University, allows DAF active duty, reserve, and civilian personnel to assess their knowledge and skills in various computer programming languages.32 Modifying the CLSA to allow personnel to identify their AI and AI test-specific skills, along with any formal AI training courses and certifications, while not perfect, is the fastest way to develop a DAF-wide baseline of personnel with AI and AI test-specific credentials.
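To illustrate the kind of information such a modified self-assessment could capture, the following minimal sketch (in Python) shows one hypothetical way to represent an individual response and roll responses up into a DAF-wide baseline. The field names, proficiency scale, and roll-up logic are purely illustrative assumptions; they are not drawn from the CLSA or any existing DAF system.

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import Dict, List


    class Proficiency(Enum):
        # Illustrative self-reported scale; any real instrument's scale may differ.
        NONE = 0
        FOUNDATIONAL = 1
        PRACTITIONER = 2
        EXPERT = 3


    @dataclass
    class AISkillsRecord:
        # Hypothetical survey record pairing general AI skills with AI T&E-specific experience.
        member_id: str           # anonymized personnel identifier
        afsc: str                # Air Force Specialty Code
        general_ai: Proficiency  # e.g., machine learning and data science fundamentals
        ai_te: Proficiency       # AI T&E-specific experience (e.g., DT&E/OT&E of AI-enabled systems)
        formal_courses: List[str] = field(default_factory=list)  # completed AI or AI T&E courses
        certifications: List[str] = field(default_factory=list)  # e.g., a DAU T&E-of-AI credential
        interested_in_test_assignment: bool = False  # gauges interest in joining the test community


    def baseline_summary(records: List[AISkillsRecord]) -> Dict[str, int]:
        # Roll individual responses up into a simple count of personnel by AI T&E proficiency tier.
        summary = {tier.name: 0 for tier in Proficiency}
        for record in records:
            summary[record.ai_te.name] += 1
        return summary

Even a simple structure along these lines would allow AFPC and Air University to aggregate responses by specialty code or proficiency tier when establishing the baseline described above.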

Once a baseline of personnel with AI and AI T&E experience is established, the DAF AI T&E champion should coordinate with the applicable organizations and agencies to develop a tiered approach to AI and AI T&E-specific education, training, and certification. This includes modifying existing programs to reflect the needs of the test enterprise. Currently, Air University and the AIA employ a useful approach for DAF-wide AI education and training: a three-tiered system in which the first tier provides basic training on digital skills through online courses offered by Digital University;33 the second tier focuses on digital skills for basic practitioners and mid-level managers (similar to what exists today for personnel in the cyber field); and the third tier comprises in-depth training, up to and including the designation of expert status. The DAF should consider using this approach for AI T&E training.

The DAF will be unable to build an AI T&E workforce as rapidly as needed to meet expected demands over the next 5 years. In the near term, however, the DAF AI T&E champion, supported by DAF senior leaders, should use the survey results described above to coordinate across the entire DAF to help rebalance the test force by shifting people with needed expertise into the test enterprise. At the same time, DAF test leaders should solicit volunteers from within the test community to be trained specifically on AI T&E. Part of this process includes, with the support of AFPC, formally designating people who have certain AI and AI T&E skills with Air Force Specialty Codes (AFSC) and special experience identifiers (SEI), similar to how various career fields, including the Air Force test community, identify special skill sets today. Once the DAF embarks on this path, it will be equally important to continue to track these skills throughout a person’s career. Given the dearth of AI T&E expertise in the DAF today, the Air Force and Space Force can ill afford to place personnel with these skills in positions unrelated to AI and AI T&E (with normal exceptions granted for career development at more senior levels in the officer, enlisted, and civilian ranks).

___________________

32 The CLSA is a self-paced, online program comprising a series of tests and exercises designed to evaluate an individual’s knowledge of programming concepts and techniques. Such a survey could also be used to gauge interest in entering the test community as an AI and AI T&E specialist.

33 Digital University is a joint venture of the Air Force and Space Force and is available to members of DoD. It provides access to Silicon Valley-accredited technology training and fosters a community of learners. It includes coding, data science, and product management training.

Recruiting and retaining AI expertise remains one of DoD’s biggest challenges. While this is a multifaceted problem with no single solution, the DAF should take advantage of numerous extant DoD-wide initiatives to find, recruit, and retain the nation’s best AI talent. Likewise, the DAF can take advantage of lessons from the standup of U.S. Cyber Command to ensure that military personnel, once trained, are tracked throughout their careers (as noted above) and, to the maximum extent feasible, retained in AI and AI T&E-related positions. Other creative ideas could include hiring contractors to work within DAF T&E facilities as AI T&E subject matter experts (SMEs); offering scholarship funds to undergraduate or graduate AI (or related) majors, with the caveat that the individual would serve for a designated period after graduation;34 reviewing the Science, Mathematics, and Research for Transformation (SMART) Scholarship-for-Service Program to ensure the appropriate emphasis on soliciting undergraduates for AI T&E; and reviewing the DAF’s programs for sponsoring graduate-level AI T&E work for military and civilian personnel serving in AI-related positions.

The DAF should also take advantage of Section 605 of the 2019 NDAA to help jump-start building an experienced AI T&E workforce.35 This section allows accelerated temporary promotion opportunities for officers with skills in areas designated as having a critical shortage of personnel; it would apply as long as the Secretary of the Air Force designates AI and AI T&E as such areas.

Recommendation 4-7: The Department of the Air Force (DAF) should determine the current baseline of artificial intelligence (AI) and AI test and evaluation (T&E) skills across the DAF, develop and maintain a tiered approach to AI and AI T&E-specific education and training, rebalance the test force by shifting people with needed expertise into the test enterprise, and consider placing personnel with AI T&E expertise into operational units.

___________________

34 The National Security Commission on Artificial Intelligence final report includes several recommendations along these lines, including establishing a new digital service academy and a civilian national reserve to grow tech talent. See NSCAI, 2021, National Security Commission on Artificial Intelligence Final Report, Arlington, VA, https://www.nscai.gov/wp-content/uploads/2021/03/Full-Report-Digital-1.pdf. The FY2023 National Defense Authorization Act contains provisions authorizing DoD to establish a cyber and digital service academy. As proposed, the academy would provide scholarships for up to 5 years in exchange for equivalent years of service in a civilian DoD position focused on digital technology and cybersecurity. Over time, the committee expects that more computer science and AI degree-granting programs will increase the emphasis on AI TEVV, perhaps even including TEVV as a specific subfield of study.

35 U.S. Congress, 2018, “John S. McCain National Defense Authorization Act for Fiscal Year 2019,” H.R. 5515, 115th Congress (2017–2018), https://www.congress.gov/bill/115th-congress/house-bill/5515.

4.5 SUMMARY OF IMPLICATIONS OF FUTURE AI FOR DAF T&E

Even as the DAF addresses its current needs and opportunities, it must evaluate emerging AI trends and their likely implications for T&E. Based on trends visible today, the committee has identified areas that will likely have significant implications for DAF AI-based systems and the T&E of those future systems. However, given the pace of AI progress, it is difficult to predict with precision which AI advances will be most impactful for Air Force applications. Therefore, the committee recommends that the DAF pursue a strategy that puts procedures and mechanisms in place to continually track emerging AI trends and investigate their T&E implications.

4.6 RECOMMENDATION TIMELINES

This chapter makes numerous recommendations about actions the DAF should take concerning AI T&E. While each recommendation is important, the associated time horizons vary greatly. Therefore, for ease of prioritization, this section sorts the recommendations into three groups: those that could be addressed immediately, those in the mid-term (3–5 years), and those over the long term (more than 5 years). These time frames are neither hard delineations nor meant to be definitive; they may prove to be overly conservative or overly aggressive.

Action on several recommendations can be taken immediately. Appointing a DAF AI T&E champion (Recommendation 4-1), placing core AI T&E training under the AFTC (bullet 3 of Recommendation 4-6), and committing to establishing independent red teams (Recommendation 4-4) can all be implemented quickly.

In the 3- to 5-year range, many more recommendations can be implemented. These include adopting an AIOps and MLOps approach for AI-enabled capabilities (Recommendation 4-2) and integrating AI requirements into the program of record (Recommendation 4-3). This would also be the time frame in which the DAF-wide vision and strategy for data would be updated and promulgated (Recommendation 4-5) and the AI education parts of Recommendation 4-6 would be implemented. Once the current baseline of AI and AI T&E skills across the DAF has been established, this would also be the time frame in which the DAF should develop a tiered approach to AI and AI T&E education and rebalance the test force by shifting people with needed expertise into the test enterprise (Recommendation 4-7).

Beyond a 5-year window, coordinating investments to explicitly address the findings of red-team activities (Recommendation 4-4) and inculcating an AI T&E culture (Recommendation 4-6) will be key.

Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 79
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 80
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 81
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 82
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 83
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 84
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 85
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 86
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 87
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 88
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 89
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 90
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 91
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 92
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 93
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 94
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 95
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 96
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 97
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 98
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 99
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 100
Suggested Citation:"4 Evolution of Test and Evaluation in Future AI-Based DAF Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 101
Next: 5 AI Technical Risks Under Operational Conditions »