Suggested Citation:"3 Test and Evaluation of DAF AI-Enabled Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.

3

Test and Evaluation of DAF AI-Enabled Systems

The previous two chapters summarized the history of artificial intelligence (AI) and ongoing Department of the Air Force (DAF) AI projects, and defined key AI and AI test and evaluation (T&E)-related terms. This chapter begins with a synopsis of the air force’s historical approach to traditional flight T&E. Section 3.2 addresses the importance of DevSecOps and artificial intelligence operations (AIOps)/machine learning operations (MLOps) to the design, development, testing, fielding, and sustainment of national security and commercial sector AI-enabled systems. Section 3.3 discusses OSD and DAF T&E policies for AI-enabled systems, noting, as applicable, where gaps remain in the formulation of AI T&E-specific policies. The speed of AI advances in the commercial sector over the past decade has been matched by the design and deployment of T&E methodologies for AI-enabled commercial systems (autonomous vehicles, large language models, chatbots, recommendation engines, and so on), although DoD systems are more complex, more consequential, and subject to more regulation than commercial systems. Section 3.4 presents a detailed discussion of these developments. In Section 3.5, the committee examines the core concepts of trust, justified confidence, AI assurance, and trustworthiness, and how together they play an instrumental role in gaining end-user buy-in for fielded AI-enabled systems. Finally, Section 3.6 closes the chapter with a consideration of the critical importance of risk management throughout the entire AI life cycle, including risk awareness, analysis, acceptance, accountability, and responsibility.


3.1 HISTORICAL APPROACH TO AIR FORCE TEST AND EVALUATION

The Air Force Test Center (AFTC) was established in 1951 to consolidate aircraft, missile, and other systems’ testing and evaluation functions under a single organization; to standardize and streamline test processes; to ensure consistency in T&E practices; to deal with the rapid growth in the numbers and types of air force aircraft entering fielding; and to reduce unacceptable aircraft mishap rates. Today, the AFTC conducts developmental and follow-on T&E of manned and unmanned aircraft and related avionics, flight control, munitions, and weapon systems. The AFTC comprises the Arnold Engineering Development Complex (AEDC) at Arnold AFB, the 96th Test Wing (TW) at Eglin AFB, the 412th TW at Edwards AFB, and the Test Pilot School (TPS) at Edwards AFB. The 96th TW is the T&E center for air-delivered weapons, navigation, and guidance systems; command and control systems; and AF Special Operations Command systems. It is the principal AF organization for command, control, communications, computers, intelligence, surveillance, and reconnaissance (C4ISR) developmental testing, often in coordination with the Air Combat Command’s 505th Command and Control Wing (a subordinate unit of the U.S. Air Force Warfare Center). The 412th TW plans, conducts, analyzes, and reports on all flight and ground testing of aircraft, weapons systems, software, components, and modeling and simulation (M&S). The 412th TW flies an average of 90 aircraft and performs over 7,400 missions (over 1,900 of them test missions) annually. The USAF TPS at Edwards AFB trains pilots, navigators, and engineers to conduct flight tests.1

Understanding the historical context of AF T&E is important to conceptualizing the changes needed to test and evaluate AI and autonomous systems effectively and efficiently. The AF test process has always focused on data collection, while evaluation emphasizes data analysis and the comparison of expected to actual performance to support decision-making. T&E is accomplished through a T&E master plan (TEMP), which contains thresholds and objectives, evaluation criteria, and milestone decision points. The TEMP is developed by the designated program management office (PMO). Traditionally, AF T&E has been divided into two primary components: developmental (DT&E) and operational (OT&E). At the basic level, DT&E centers on safety-of-flight concerns, while OT&E focuses on tactics and operating concepts. DT&E is conducted throughout the acquisition process to assist in engineering design and development and to verify that technical performance specifications are achieved. It includes the T&E of components, subsystems, hardware and software integration, and production qualification testing. DT&E examines the system’s compliance with contractual requirements and the ability to achieve key performance parameters (KPP) and key system attributes (KSA).

___________________

1 Air Force Test Center, 2021, “Fact Sheet: Air Force Materiel Command,” https://www.aftc.af.mil/About-Us/Fact-Sheets/Article/2382275/air-force-materiel-command.

OT&E, on the other hand, measures the overall ability of a system to accomplish a mission when used by representative personnel in the environment planned for the operational employment of the system. It encompasses independent evaluations and operational assessments, including the system’s ability to satisfy KPPs and KSAs. OT&E is conducted under realistic operational conditions, as close as possible to those expected in combat operations. The objective of OT&E is to determine a system’s operational effectiveness, operational suitability, survivability, and lethality for combat. It is, in essence, a mission capability assessment.

For aircraft and aircraft systems, DT&E and OT&E have traditionally been treated as two distinct phases of T&E that do not overlap. If a system under test fails DT&E, the engineering design and development issues must be addressed before testing resumes. Once a system passes DT&E, it transitions to OT&E. If it fails OT&E, it reverts to DT&E to re-evaluate its technical performance specifications and its compliance with contractual requirements. Once a system passes OT&E, it is cleared for operational fielding. After initial fielding, it will be declared to have achieved initial operating capability (IOC), a formal milestone noting that an operational (non-test) unit can employ the system effectively. Once IOC is declared, the system may require further development and testing to achieve its full capabilities. Once that occurs, the system will be declared to have achieved full operational capability (FOC). The FOC milestone is reached when a system has demonstrated that it can perform all its intended missions and functions in various operational environments and is fully integrated into the overall operational structure; that is, the operational unit can employ and maintain the system. It is not unusual for FOC to be declared several years after the IOC milestone, especially for more complex weapon systems. The FOC milestone represents the completion of a system’s T&E and development efforts.

As discussed in more detail in the following sections, the extant T&E processes that have served most DAF weapon systems well over the past 70 years, with their clear delineation between DT&E and OT&E, were not designed for the T&E of AI implementations and software, and consequently fall short when applied to them.

3.2 AI AND DEVSECOPS/AIOPS IN THE DAF AND COMMERCIAL SECTOR

The DAF has been transitioning from waterfall to agile development methodology, albeit at its traditional pace. The transition initially encompassed only basic development and deployment processes but has expanded to incorporate security evaluations earlier in the development process (DevSecOps). The migration from waterfall to DevSecOps-based processes is largely driven by software’s increasingly wide and deep footprint in the complex systems the DAF deploys. While modern software has been a large catalyst for this evolution, the development and deployment of AI capabilities will be a true forcing function. AI will introduce speed into decision systems through the automation of traditionally human-driven tasks and the ability to process previously unmanageable volumes of data. Additionally, the AI life cycle is inherently iterative and requires infrastructure to enable the continuous maintenance and improvement of deployed models. This increase in pace will place additional stress on traditional development and evaluation infrastructure.

It is a certainty that deployed AI models will encounter operational conditions not represented in the original training corpus and will behave in unanticipated ways. A simple example of unanticipated behavior is an AI model labeling an object in an image incorrectly because it never saw that object in training. This simple example serves for illustration but extends easily to higher-risk, higher-consequence scenarios: there are significant implications when an AI-enabled system mislabels an object that subsequently informs a high-risk, high-consequence targeting decision. Because encountering unknown scenarios is guaranteed, adopting agile development improves the T&E processes for AI-enabled systems.
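One common mitigation for the mislabeling risk described above is to design the system to flag, rather than silently act on, inputs it is not confident about. The following is a minimal illustrative sketch, not drawn from any DAF system; the function name, threshold value, and score format are all hypothetical, and real systems would use calibrated uncertainty estimates rather than raw scores.

```python
# Sketch: route low-confidence predictions to human review instead of
# silently mislabeling objects the model never saw in training.
# All names and thresholds are illustrative, not from any fielded system.

def classify_with_abstention(scores, threshold=0.75):
    """scores: dict mapping label -> softmax-style confidence in [0, 1].
    Returns the top label, or 'REVIEW' if confidence is below threshold."""
    label, confidence = max(scores.items(), key=lambda kv: kv[1])
    if confidence < threshold:
        return "REVIEW"  # route to a human analyst; do not act on it
    return label

# A familiar object yields high confidence; an unseen object spreads
# probability mass thinly and is routed for review.
print(classify_with_abstention({"truck": 0.92, "car": 0.08}))                 # truck
print(classify_with_abstention({"truck": 0.40, "car": 0.35, "tent": 0.25}))   # REVIEW
```

The abstention threshold itself becomes a T&E artifact: it must be tuned and re-verified against operationally representative data whenever the model is retrained.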

AI implementations are developed cyclically, an approach often referred to as AIOps or MLOps, and require continuous training, evaluation, and retraining as operational conditions change. Figure 3-1 shows a generic architecture and the cyclical feedback required to enable AI deployment at the edge. No organization can manage this production cycle and develop high-performing AI systems without using agile development methodologies that integrate T&E across the AI life cycle. Deployed models require “maintenance” that addresses shifts in operational conditions not represented in training data. This architecture is not a substitute for the safety systems and processes surrounding deployed AI systems; however, AI cannot be deployed safely and effectively without this iterative approach. For AI-enabled systems, the DAF is currently not prepared for this level of continuous integration and continuous deployment or delivery.

FIGURE 3-1 A generic architecture and the cyclical feedback required to enable AI deployment at the edge. SOURCE: Courtesy of NVIDIA.

FIGURE 3-2 Connections between the development, testing, and deployment of the AI capabilities required to deploy an autonomous vehicle. SOURCES: Courtesy of NVIDIA; DGX image courtesy of NVIDIA and Oak Ridge National Laboratory, Department of Energy.

For example, Figure 3-2 illustrates the connections between the development, testing, and deployment of the AI capabilities required to deploy an autonomous vehicle. Of significance in this example is not only the scale of the infrastructure and tooling needed to create the original models but also the supporting fleet of cars that continually collects more operational data and refines the deployed models. Some autonomous vehicle systems selectively record data correlated with an AI-driver disagreement, both to reduce the volume of data to curate and to improve model performance. In addition, model creation and refinement are supported by a robust simulation architecture for handling edge cases and domain shifts known a priori, as well as those observed operationally. This simulation environment supports both the creation of synthetic data and hardware-in-the-loop training.
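The selective-recording pattern described above reduces, at its core, to a filter that keeps only the frames where the AI’s proposed action diverges from the human operator’s action. The sketch below is purely illustrative; the function name, tuple format, and tolerance value are assumptions, not taken from any vendor’s pipeline.

```python
# Sketch of selective data recording: keep only frames exhibiting an
# AI-driver disagreement, which are the most valuable for curation and
# retraining. All names and the tolerance value are illustrative.

def frames_to_curate(frames, tolerance=0.1):
    """Keep frames where the AI's proposed steering angle diverges from
    the human driver's actual angle by more than `tolerance`.
    Each frame is (frame_id, human_angle, ai_angle); angles in radians."""
    return [fid for fid, human, ai in frames if abs(human - ai) > tolerance]

log = [(1, 0.00, 0.02),    # agreement: discard
       (2, 0.30, 0.05),    # disagreement: record for labeling/retraining
       (3, -0.10, -0.12)]  # agreement: discard
print(frames_to_curate(log))  # [2]
```

The same disagreement-driven filter generalizes to DAF use cases, for example recording sensor frames where an AI cue and an operator’s targeting decision diverge.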

Key components of this architecture include:

  • Trained labelers: Labelers are trained on tooling and the data they are labeling.
  • Continuous monitoring, retraining, and redeployment of AI models: Model performance is constantly monitored. Models are regularly retrained and redeployed.
  • Instrumented deployment platforms to capture ML-ready data: Both the deployed models and the data streams they consume must be instrumented to capture the behavior deviation and the observations that manifested the performance shift.
  • Synthetic data engines and supporting digital twins: Enable faster incorporation of emergent threats, observed domain shifts, or previously unknown edge cases. These components must be built for the appropriate domains and modalities.2

These components have distinct implications for traditional T&E processes in the DAF. Methodology and infrastructure are required to detect when model behavior has deviated from expected performance during operations, to retrain the model with the new, associated observations, and then to evaluate how the new model performs under both previous and newly observed conditions. Integral to the retraining of models in operation is the ability to retrieve these observations. For the DAF, this implies that platforms adopting AI-enabled systems or components require the capability to record ML-ready data from their sensors and associated actuators and then send that data back to a training environment in an easily consumable form. These requirements have significant impacts beyond test processes. They must be accounted for in platform and sensor operational requirements, up to and including at the PMO or system program office level.
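Detecting when deployed behavior has deviated from expected performance typically reduces to comparing the distribution of a monitored signal (such as model confidence) in operations against its training-time baseline. The sketch below uses a simple mean-shift check; the function name, data, and z-score threshold are illustrative assumptions, and production monitors would use formal statistical tests (e.g., population stability index or Kolmogorov-Smirnov).

```python
# Sketch of drift detection: flag when the operational distribution of a
# monitored feature departs from its training baseline, which would
# trigger the retraining pipeline. Names and thresholds are illustrative.
import statistics

def drift_detected(baseline, operational, z_threshold=3.0):
    """Flag drift when the operational mean departs from the training
    baseline mean by more than z_threshold standard errors."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    se = sigma / (len(operational) ** 0.5)
    z = abs(statistics.mean(operational) - mu) / se
    return z > z_threshold

train_conf = [0.90, 0.88, 0.92, 0.91, 0.89, 0.90, 0.93, 0.87]
field_conf = [0.60, 0.58, 0.65, 0.55, 0.62, 0.59, 0.61, 0.63]  # confidence collapse
print(drift_detected(train_conf, field_conf))  # True -> trigger retraining
```

A check like this is only the detection step; the retrieval of the associated observations and the re-evaluation of the retrained model under both old and new conditions remain the harder T&E problems.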

Similarly, synthetic data engines and digital twins are key to supplementing datasets with training examples for situations where real data are insufficient or too difficult to collect. Synthetic data engines and digital twins must be relevant, adaptable, and treated as part of the AI life cycle. For the DAF, this means sensor models and situational constructions of interest must be represented in modern modeling and simulation environments that can keep pace with the cadence required to maintain a collection of supporting AI models.
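The augmentation pattern described above can be sketched as a parameterized generator that produces labeled variants around a nominal signature for an underrepresented condition, which are then mixed with the scarce real examples. Everything here is a hypothetical stand-in: the function name, the “night_vehicle” condition, and the three-element feature vectors are illustrative only, whereas a real synthetic data engine renders physically based sensor outputs.

```python
# Sketch of synthetic data augmentation for a condition with scarce
# real data. All names, features, and parameters are illustrative.
import random

def synthesize(label, n, noise=0.05, seed=0):
    """Generate n labeled synthetic feature vectors around a nominal
    signature for an underrepresented condition (e.g., a sensor model
    rendered at night)."""
    rng = random.Random(seed)  # fixed seed for reproducible test data
    nominal = {"night_vehicle": [0.2, 0.7, 0.4]}[label]
    return [([x + rng.gauss(0, noise) for x in nominal], label)
            for _ in range(n)]

real_data = [([0.21, 0.69, 0.41], "night_vehicle")]        # scarce field data
augmented = real_data + synthesize("night_vehicle", n=50)  # 51 training examples
print(len(augmented))  # 51
```

For T&E purposes, the provenance of each example (real versus synthetic, and which generator version produced it) must be tracked so that evaluation sets can exclude synthetic data generated by the same engine under test.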

There are MLOps solutions on the commercial market today that facilitate this life cycle. The solutions are varied and support myriad deployment scenarios. In combination with commercial T&E vendors, some of these solutions can supply an organization with the infrastructure and processes required to maintain an enterprise deployment of AI-enabled systems. Larger institutions that were early adopters of AI integration have developed large internal systems and tooling to support their requirements. However, the DAF is not one of the early adopters, nor is it feasible for the DAF to be a direct consumer of unmodified commercial solutions. There are fundamental requirements, rooted in the operational requirements and constraints of DAF systems, that demand a different approach. Figure 3-3 highlights the areas where current commercial architectures leave gaps in meeting the DAF’s requirements. These gaps represent the areas where the DAF needs to invest in modifying commercial solutions to meet service needs.

___________________

2 For a private industry example of the unprecedented use of a synthetic environment to virtually “build” and evaluate an entire automobile factory and production line 2 years before physical production begins, see BMW Group, 2023, “BMW Group at NVIDIA GTC: Virtual Production Under Way in Future Plant Debrecen,” PressClub Global, March 21, https://www.press.bmwgroup.com/global/article/detail/T0411467EN/bmw-group-at-NVIDIA-gtc:-virtual-production-under-way-in-future-plant-debrecen?language=en.

Real-time operational testing implies the need to continuously maintain and test AI-enabled solutions once they are operational. This represents a fundamental departure from the traditional waterfall approach that has characterized historical DAF T&E efforts, and it is a change necessary to handle domain shifts and edge cases. Commercial solutions will certainly incorporate methodologies for monitoring and retraining models, but it is unlikely that, in the foreseeable future, they will incorporate processes that capture the complex system integration and risk frameworks that apply to DAF systems, especially safety-critical systems. The DAF should invest in synthetic data engines, live, virtual, constructive (LVC) environments, data repositories, and support for digital twins representative of its modalities and platforms of interest to facilitate rapid model retraining and maintenance. Data standards must be extended to the platforms to support this retraining and to enable fast capture of AI-ready data around model failure events.

FIGURE 3-3 Areas where current commercial architectures leave gaps in meeting the DAF’s requirements. SOURCE: Courtesy of NVIDIA.

Many commercial MLOps solutions assume constant, high-bandwidth connectivity to the AI-enabled systems they support, with many of their deployment patterns dependent on commercial cloud infrastructure. This assumption breaks down in most DAF operational environments, especially during crises or conflicts. Many forward-deployed organizations will not have the luxury of high-bandwidth data connections back to large MLOps factories to retrain and retest model updates. The decentralized nature of forward-deployed operations likely requires some edge-based computing for model maintenance and testing while in the field, along with trained personnel capable of retraining and retesting models under suboptimal conditions. Model updates produced at any edge node would also need to eventually be transmitted back to a centralized management system, implying a federated learning model. The DAF AI T&E champion should outline and prioritize these requirements and coordinate with commercial providers to adapt available solutions accordingly.
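The federated learning model implied above merges locally retrained weights from disconnected edge nodes at a central node. The sketch below shows the core sample-weighted averaging step in the style of federated averaging (FedAvg); the function name and toy two-weight models are illustrative, and a real implementation must also handle secure transport, versioning, and validation of each node’s update before merging.

```python
# Sketch of the central merge step in federated learning: edge nodes
# retrain locally and ship (sample_count, weights) back when
# connectivity allows. Names and values are illustrative.

def federated_average(edge_updates):
    """Merge model weights retrained at disconnected edge nodes.
    edge_updates: list of (num_local_samples, weight_vector).
    Returns the sample-weighted average of the weight vectors."""
    total = sum(n for n, _ in edge_updates)
    dim = len(edge_updates[0][1])
    return [sum(n * w[i] for n, w in edge_updates) / total
            for i in range(dim)]

# Two forward nodes with different amounts of local data; the node that
# saw more operational samples contributes proportionally more.
merged = federated_average([(100, [0.5, 1.0]), (300, [0.9, 0.6])])
print(merged)  # [0.8, 0.7]
```

Under DAF conditions, the merged model would itself need a regression-test pass before redistribution, since a single node retrained on unrepresentative data can degrade the global model.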

Finding 3-1: The DAF will have training infrastructure requirements similar to those of commercial AI developers to support the development and maintenance of AI-enabled systems. The decentralized nature of DAF operations, however, means this training cannot be supported by standard commercial offerings; the committee knows of no commercial off-the-shelf solution that presently supports these requirements.

Recommendation 3-1: The Department of the Air Force artificial intelligence testing and evaluation champion should outline and prioritize these training infrastructure requirements and coordinate with commercial providers to adapt available solutions accordingly.

Developing and deploying AI-enabled systems implies that there are companion deployment systems designed to receive and run the trained models in operations. These systems comprise the sensors and reasoning systems that leverage the deployed models to extract information or make decisions. Any AI-enabled system that needs to operate at the edge—whether originating from commercial or military sources—will have unique size, weight, and power (SWAP) challenges. The DAF deploys systems and platforms that typically have bespoke security and SWAP requirements. These requirements will likely be constraints on the edge computing architectures that complement and integrate with commercial MLOps solutions. AI-enabled systems require high-performance computing solutions—typically graphics processing units (GPUs) or field-programmable gate arrays (FPGAs)—to run AI models. Deployment configurations vary dramatically across DAF platforms, making the integration of these devices challenging and time-consuming. This fractured and bespoke approach to computing requirements limits the DAF’s ability to drive features into commercial products and results in costly per-platform customizations that are not repeatable or cost-effective.

The DAF should invest in standards that enable the consolidation of computing requirements into fewer modular configurations designed to meet the needs of AI and autonomous systems. Through consolidation, the addressable market for these solutions becomes larger and more feasible for commercial vendors to tackle at a scale that would accelerate time to market and reduce cost. It would also reduce the test footprint by simplifying test configurations for DT&E and OT&E.

3.3 OSD AND DAF T&E POLICIES FOR AI-ENABLED SYSTEMS

The DAF’s current requirements formulation and acquisition processes continue a tradition of directly testing capabilities against functional requirements under expected operational deployment conditions. As noted in previous chapters, the rapid introduction of AI-enabled capabilities across the DAF over the next several years requires an assessment of the applicability of established DAF-wide T&E approaches to AI and a revision of current policies, or the development of new ones, that apply specifically to AI T&E. This is equally true whether AI or ML is integrated into a program of record or added separately after a system has already been fielded. Based on presentations by DAF test enterprise leaders, the committee concludes that the DAF has not yet developed a standard and repeatable process for integrating, testing, and sustaining AI capabilities in DAF major acquisition programs. The few examples the committee knows of, such as Project Maven, consist of capabilities added to major programs (such as the Air Force Distributed Common Ground Station (AF DCGS)) after fielding, outside of the traditional program-of-record acquisition processes. As one speaker commented, “We are not classically trained to do this [type of] T&E.”

Finding 3-2: The DAF has not yet developed a standard and repeatable process for integrating, testing, acquiring, developing, and sustaining AI capabilities.

Much like the advances in DAF-wide T&E for C2, cyber, and ISR systems over the past decade, the committee expects the DAF to make up ground in AI T&E relatively quickly. This assumes that DAF leaders prioritize AI T&E accordingly, applying sufficient resources in funding, infrastructure, policies, and personnel management. Despite its current shortcomings, the DAF is no further behind in AI T&E than most other government organizations and agencies. The DAF can take advantage of the extensive work already carried out by the OSD CDAO in developing AI T&E policies, processes, and frameworks, as well as apply lessons learned from commercial companies that have substantially advanced their internal AI T&E processes over the past several years (the committee includes some examples later in this chapter). This is an opportune time for the DAF to craft an AI T&E vision and commit to a long-range AI T&E strategy and implementation plan that includes specific and measurable objectives and goals.


At the OSD level, myriad instructions, directives, and policies referenced throughout this report exist to guide T&E. However, most of these are not AI-specific, and OSD DOT&E has not yet published DoD-wide formal AI T&E guidance.3 Moreover, as noted elsewhere in the report, there has been limited direction addressing the lack of a clear distinction between developmental test (DT) and operational test (OT), or between initial operational T&E (IOT&E) and follow-on operational T&E (FOT&E), for AI capabilities. This represents a considerable challenge for the department.4

Finding 3-3: OSD DOT&E has not yet published DoD-wide formal AI T&E guidance.

Finding 3-4: There is a lack of clear distinction between DT and OT phases for AI capabilities.

Conclusion 3-1: A lack of formal AI development and T&E guidance represents a considerable challenge for the DAF as AI-based systems emerge.

As noted in the Project Maven case study, the JAIC T&E division refined Maven’s T&E processes, procedures, and practices, and under its new organizational structure, the OSD CDAO is publishing AI T&E playbooks and providing AI T&E frameworks to OSD DOT&E. These include frameworks for testing AI-enabled systems, human-system integration (HSI), operational test, and operationalizing responsible AI. Also, as noted elsewhere, the 96th Operations Group is developing AI T&E academic materials and curricula, and the DAF-MIT AIA is developing an AI T&E Guidebook (which will not be official policy). Finally, in 2020 the AFTC’s 412th Test Wing published Test and Evaluation of Autonomy for Air Platforms, a technical information handbook.5 While it deliberately does not address AI-enabled autonomous systems, it could be modified to address the T&E of AI-enabled autonomous systems and promulgated DAF-wide.

___________________

3 The OSD director of operational test and evaluation (DOT&E) is the principal staff assistant and senior advisor to the secretary of defense on operational test and evaluation in DoD, and is independent in that the director reports assessments not only to the secretary of defense but also directly to Congress. The DOT&E mission is to issue DoD OT&E policy and procedures; review and analyze the results of OT&E conducted for each major DoD acquisition program; provide independent assessments to the secretary of defense, the under secretary of defense for acquisition and sustainment (USD(A&S)), and Congress; make budgetary and financial recommendations to the secretary regarding OT&E; and oversee major DoD acquisition programs to ensure OT&E is adequate to confirm the operational effectiveness and suitability of the defense system in combat use. DOT&E is tasked to assess operational effectiveness, suitability, survivability, and sustainability. The organization currently relies on red teams for the evaluation of DoD cyber capabilities but does not presently manage any AI-specific red teams.

4 While the committee realizes this is an OSD-level concern, it recommends that the DAF AI T&E champion coordinate with OSD (especially OSD CDAO, OSD DOT&E, the DASD(DT&E), and the test resource management center (TRMC)), the joint staff, and the military services to explore organizational solutions that address the lack of clear lines and lanes between AI developmental and operational test and evaluation. Also, as noted in the report summary, the DAF AI T&E champion should assess what broader DAF-wide organizational changes are called for to reflect the differences between AI T&E and T&E for all other air force systems and capabilities.

Until DOT&E and the DAF publish and promulgate formal AI T&E guidance, the committee recommends that the DAF consider adopting the OSD CDAO’s AI T&E playbooks and frameworks. The tri-center should adapt these documents to Air and Space Force AI T&E requirements, modifying them as necessary once OSD DOT&E promulgates official department-wide AI T&E directives, policies, and instructions. It is also worthwhile to integrate the appropriate commercial best practices documented elsewhere in this chapter.

DOT&E is in the early stages of formulating AI T&E guidance. As a DOT&E official6 told the committee, DOT&E recognizes that the department is “not where we need to be . . . with respect to even machine learning, never mind AI.”7 He noted that AI T&E is a young field, with very few, if any, operational use cases across the department, almost no DoD-wide AI T&E best practices,8 and almost no historical military AI T&E studies or reports to fall back on. The official echoed a critical question from this report’s summary, namely, for AI-enabled learning systems, “How much testing is enough?” He also emphasized addressing where, how, and when AI testing is accomplished. He acknowledged not only that DOT&E’s “tried-and-true” test designs of the past were insufficient for fully testing AI-enabled systems, but also that DOT&E did not yet possess comparable tried-and-true test designs or processes for AI. He noted that agile principles (see Section 3.2) were critical in developmental test and postulated that they would be equally important in operational test (while acknowledging that DOT&E had not yet determined exactly what this would entail for AI-enabled systems).9

___________________

5 R.A. Livermore and A.W. Leonard, 2020, Test and Evaluation of Autonomy for Air Platforms, Edwards, CA: 412 Test Wing, Edwards Air Force Base, https://apps.dtic.mil/sti/pdfs/AD1105535.pdf.

6 M. Crosswait, 2023, presentation to the committee, September 28, Washington, DC, National Academies of Sciences, Engineering, and Medicine.

7 The same official urged the committee to make available to the entire department this report’s findings and recommendations, with the goal of accelerating the development and promulgation of AI T&E best practices DoD-wide.

8 With the exception of the OSD/CDAO’s development of AI T&E playbooks and frameworks, which CDAO has provided to DOT&E. The committee expects that DOT&E will publish its own frameworks, modeled after CDAO’s products, after concluding its ongoing comprehensive review of all DOT&E test guidance.

In addition to underscoring the importance of developing an AI T&E culture and supporting the development of a more operationally relevant AI T&E risk management framework (RMF), DOT&E is analyzing how to test AI-enabled systems for unexpected outcomes (to include testing boundary conditions and system behavior under varying conditions);10 how training, validation, and test data should be selected and evaluated as part of the overall AI T&E process11 (to include assessing security vulnerabilities and susceptibility to adversarial attack); how to account for the black-box nature of AI models; how to evaluate user trust and justified confidence in AI-enabled systems (under both expected and unanticipated operational conditions); and how to assess the ability of AI models to adapt to different missions and in different domains. Also, similar to DOT&E’s extensive use of cyber red teams during operational test and evaluation (to include when integrated into combatant command exercises), it intends to evaluate the effects of adversarial attacks on AI-enabled systems, especially those systems designated as mission- and safety-critical.12

For DoD systems performing mission- or safety-critical missions, especially those capable of generating lethal effects or that can lead directly to generating lethal effects, the committee agrees with DOT&E’s recommendation that any substantive update to an AI capability must be operationally tested before the new version is fielded. This type of “mini-OT” can be accomplished in the future by testing the updated capability in a digital twin or an equivalent modeling and simulation architecture under operationally realistic conditions (simulated or actual). Additionally, some processing architecture is required for system testing, including data acquisition, cleaning, and labeling. The goal is to test updated capabilities as rapidly as possible (to include validation, verification, and accreditation) based on operational requirements, test conditions,

___________________

9 The official emphasized that for non-AI-enabled programs under his purview, adherence to agile principles has led to high-performance fielded systems, especially several Missile Defense Agency (MDA) missile defense projects. He suggested that a large part of the success derives from the principle, described in the Requirements section of this report, of early and frequent interaction between users, developers, program managers, and testers throughout the entire life cycle of a program.

10 The DOT&E official explained that one of the organization’s primary objectives is to ensure that, for AI-enabled systems in DoD, the “intolerable outcomes” do not occur (while acknowledging that the definition of intolerable outcomes had to be determined for each AI-enabled system and integrated system-of-systems).

11 The DOT&E official mentioned the possibility of maintaining the equivalent of a “war reserve mode” (WRM) of AI training data that could be used to continue to develop and sustain AI models in the aftermath of adversarial attacks against existing datasets or fielded models.

12 OSD DOT&E has relied on DOT&E-sponsored and service-led cyber red teams for the past several years. See, for example, DOT&E, 2022, “Cyber Assessment Program,” FY 2021 Annual Report, https://www.dote.osd.mil/Portals/97/pub/reports/FY2021/other/2021cap.pdf.


and the use of test personnel with experience testing the original fielded model and system.13 Understanding the limitations imposed by data drift, domain adaptation, and AI model boundary conditions (see Section 4.3) will help to increase confidence that a certified system will operate as expected once deployed. Based on lessons from Project Maven and the JAIC, the committee expects that the extent of T&E required for each subsequent AI model version will depend on the scope of the changes included in each update (this is also the state of industry best practice; see Section 3.5). In most cases, later versions of fielded models will require a shorter T&E process than earlier updates did. Consistent with industry best practices (see Section 3.5), updates to AI-enabled mission- or safety-critical systems require full transparency among testers, developers, and end-users to ensure all stakeholders have a common understanding of how much additional T&E is required and acceptable before fielding each update.

In summary, OSD DOT&E has provided an initial roadmap for how to redesign T&E for DoD AI-enabled systems to reflect the substantial differences between the T&E of traditional DoD systems and the T&E of AI capabilities. It does not, however, currently have the resources or the expertise, nor is the necessary foundational knowledge available, to make the changes needed to move beyond vision to immediate DoD-wide implementation. As DOT&E develops further guidance in the form of official policies, directives, instructions, templates, and frameworks, the committee recommends that in the near term the DAF continue to work closely with DOT&E, the Deputy Assistant Secretary of Defense for Developmental Test and Evaluation (DASD(DT&E)), and the CDAO AI T&E community of interest while adopting or adapting T&E best practices from across the government (for example, OSD CDAO’s AI T&E playbooks and frameworks), the private sector, and academia. The committee recommends that the DAF AI T&E champion focus on new test designs for AI-enabled systems that incorporate the core systems engineering principles of non-AI-enabled systems14 while adding new elements that reflect the best AI T&E practices from academia, commercial industry, and other government test organizations.

___________________

13 As noted elsewhere in the report, the Missile Defense Agency (MDA) makes extensive use of modeling and simulation (M&S) during missile defense system design and testing, to include integrating actual hardware and M&S as part of an overall design, development, and testing architecture (through the command and control, battle management, and communications [C2BMC] program).

14 To include, for example, principles that straddle the traditional and AI T&E worlds, such as MIL-STD-882F, the forthcoming replacement for MIL-STD-882E (DoD Standard Practice: System Safety, May 11, 2012). This system safety standard practice is a key element of systems engineering (SE) that provides a standard, generic method for the identification, classification, and control of hazards. The revised document will include a section on AI and ML, to include the AI criticality index (AICI), which will be used to determine the level of rigor (LOR) of software assurance activities to be imposed on the software. Department of Defense, 2012, DoD Standard Practice: System Safety, MIL-STD-882F, Washington, DC, https://cdn.ymaws.com/system-safety.org/resource/resmgr/documents/Draft_MIL-STD-882F.pdf.


The introduction of AI-enabled capabilities into the Air Force and Space Force has been limited and has proceeded slowly. The DAF has not addressed the pervasive implications of AI throughout the DAF or how T&E has to be integrated throughout the entire AI life cycle, from design through sustainment. The DAF has not yet committed to making immediate, sustained investments in AI governance, workforce development, AI research and development, AI development and T&E infrastructure, AI standards and practices, and targeted experimentation. The DAF has not developed the digital infrastructure needed to support AI development and T&E, and the requisite investments have not been programmed into the DAF budget. The lack of a designated AI T&E champion at the senior executive or general officer level, with commensurate SECAF-delegated authorities and resources at their disposal, has contributed to the low priority accorded to AI T&E across the DAF.

To ensure that the future AI-enabled Air Force and Space Force remain the most capable, responsible, and safe defense forces in the world, the committee recommends that DAF leaders prioritize AI development and T&E and address the implications across the entire DAF, including committing the necessary level of resources, both people and funding. As a key initial step, the DAF should update its AI T&E vision and commit to a long-range AI T&E strategy and implementation plan that includes specific and measurable objectives and goals. The DAF, in coordination with OSD CDAO, should update its analysis of the resources required for digital modernization across the Air and Space Forces to reflect AI T&E-specific requirements, and sustain those resources in future DAF budgets.15 The DAF should leverage investments from OSD CDAO, OSD DOT&E, and OSD DASD(DT&E) and make or sustain AI-specific modernization investments in the Test Resource Management Center (TRMC),16 DAF CDAO, and Air Force Materiel Command’s (AFMC’s) Digital Transformation Office (DTO). It should also work closely with TRMC to identify AI T&E needs that will be addressed with TRMC funding, and use DAF AI-specific modernization investments to address AI T&E gaps not being pursued by TRMC. These investments should include major and near-term investments in modern AI stacks across AFTC, the Air Force Operational Test and Evaluation Center (AFOTEC), and the United States Air Force Warfare Center (USAFWC) (to include access to enterprise cloud-as-a-service and platform-as-a-service [PaaS] capabilities); modeling and simulation; the Virtual Test and Training Center (VTTC) at

___________________

15 See, for example, National Academies of Sciences, Engineering, and Medicine, 2022, Digital Strategy for the Department of the Air Force: Proceedings of a Workshop Series, Washington, DC: The National Academies Press, https://doi.org/10.17226/26531.

16 TRMC has several efforts under way to develop tools for testing AI. TRMC’s T&E and S&T program, a 6.3 advanced technology development effort, has 10 test technology areas (TTAs). Autonomy and AI test (AAIT) is one of the TTAs. DAF T&E representatives participate in the AAIT working group (WG). AFRL and Edwards AFB have been the two USAF organizations represented in the AAIT WG.


Nellis AFB and the joint simulation environment (JSE); digital synthetic range environments at Edwards AFB and Eglin AFB; digital twins; and live-virtual-constructive (LVC) integration. The DAF AI T&E champion should work closely with the DAF’s representatives on the TRMC AAIT (Autonomy and Artificial Intelligence Test) WG to identify AI T&E projects for TRMC’s T&E and S&T program, while the DAF should also increase its representation on the AAIT WG.

Recommendation 3-2: The Department of the Air Force (DAF) leadership should prioritize artificial intelligence (AI) testing and evaluation (T&E) across the DAF with an emphasis on a radical shift to the continuous, rigorous technical integration required for holistic T&E of AI-enabled systems across the design, development, deployment, and sustainment life cycle.

3.4 AI T&E IN THE COMMERCIAL SECTOR

The committee was briefed by representatives from current defense industrial base companies actively developing T&E capabilities for the department,17 the autonomous vehicle safety group at NVIDIA, and an ISO working group developing a consensus report on functional safety for AI-enabled systems. It is important to note that while commercial industry is more sophisticated than the DAF in implementing and scaling up T&E for large-scale AI deployments, this is still very much a field under development.

The ISO/IEC (International Organization for Standardization/International Electrotechnical Commission) TR 5469 working group18 is currently drafting a consensus report with representative members across several stakeholder industries, including avionics, robotics, healthcare, and autonomous vehicles. Although the report is still in draft form and therefore subject to change from the briefed version, it has the potential to provide a well-informed framework for thinking about risk, mitigation, and verification and validation (V&V) for AI-enabled systems. The main goal of the TR 5469 report is “to enable the developer of safety-related systems to appropriately apply AI technologies as part of safety functions by fostering awareness of the properties, functional safety risk factors, available functional safety methods, and potential constraints of AI technologies.” Many of the main points proposed in the draft align with what the committee found from the commercial sector, summarized in Table 3-1. Therefore, the committee recommends that the DAF track the progress of this report through the publication process and leverage it as a starting point for adapting its T&E processes for AI-enabled systems.

___________________

17 For example, Morse Corporation and Calypso AI.

18 This is a working group under the auspices of the International Organization for Standardization tasked with establishing standards on functional safety and AI systems.


Recommendation 3-3: The Department of the Air Force should track the progress of the International Organization for Standardization/International Electrotechnical Commission TR 5469 working group report through the publication process and leverage it as a starting point for adapting their testing and evaluation processes for artificial intelligence–enabled systems.

Work presented by industry partners executing DoD T&E requirements described a rich set of T&E tooling that has been iterated on through various pilot AI efforts and informed by engagements with non-DoD commercial customers. These techniques have begun to either be packaged in, or at least inform, various government off-the-shelf (GOTS) developer libraries being released to the broader community. When integrated into larger systems, these developer kits codify the pilot projects’ best practices for statistical analysis and application programming interface (API) designs. Specifically, through work on Project Maven, one developer was able to design and implement T&E systems that achieved significant reductions in model evaluation time for model vendors (from months to hours). This evaluation system enables fast iteration of model development against withheld test datasets for model comparison. While these techniques have become more sophisticated over time, they still only codify specific mathematical approaches for validating models’ accuracy in isolation. To date, the committee found these contributions are heavily biased toward computer vision perception algorithms and have yet to extend their capabilities to fully address system-level T&E and the impact integration has on system-wide verification and validation.
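The evaluation pattern described above, scoring every candidate model against the same withheld test dataset so vendors can be compared side by side, can be sketched in a few lines. The models, data, and accuracy metric below are illustrative placeholders, not the actual Project Maven tooling.

```python
# Hedged sketch: compare candidate models against a withheld test set.
# The vendors, data, and metric are toy stand-ins for illustration only.

def accuracy(model, examples):
    """Fraction of withheld examples the model labels correctly."""
    correct = sum(1 for x, label in examples if model(x) == label)
    return correct / len(examples)

def rank_models(models, withheld_test_set):
    """Score every candidate on the same withheld data; best first."""
    scores = {name: accuracy(m, withheld_test_set) for name, m in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy withheld dataset and two hypothetical vendor models.
withheld = [(3, "low"), (4, "low"), (6, "high"), (9, "high")]
models = {
    "vendor_a": lambda x: "high" if x > 5 else "low",
    "vendor_b": lambda x: "high" if x > 7 else "low",
}
print(rank_models(models, withheld))  # [('vendor_a', 1.0), ('vendor_b', 0.75)]
```

Because the withheld data are fixed, each new model version can be scored in minutes rather than requiring a bespoke evaluation campaign, which is the iteration speedup the text describes.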

Finding 3-5: DAF AI contributions to date have been focused on computer vision perception and natural language processing algorithms and have yet to extend to fully address system-level T&E.

Autonomous vehicle development was selected as a case study for the committee to investigate because of its similarity to some of the autonomy goals of Air Force programs. It is a modern example of AI being integrated into a safety-critical system that requires complex system-level integration. Commercial industry is increasingly investing in the technology fundamental to making autonomous vehicles a reality for consumers and is participating in the creation of standards that govern their deployment. A presentation by a representative of NVIDIA’s autonomous vehicle safety team gave an overview of the company’s system-wide approach to managing the T&E of developed AI models within an extension of a systems engineering risk modeling framework (shown in Figure 3-4).

Within this framework, the development of the AI-enabled system begins with defining the product specification (e.g., what does the system need to do?). The product specification drives the risk model creation that, in turn, generates the

FIGURE 3-4 NVIDIA’s system-wide approach to managing the T&E of AI models. SOURCE: Courtesy of NVIDIA.

functional requirements to achieve the goals of the system. The product specifications and the risk model are continuously updated through cyclical review. Two cornerstone concepts of the architecture were the assertions that AI implementations will always have a failure mode and that there are no known formal methods for demonstrating the “correctness” of AI. To manage the risks associated with these assertions, NVIDIA implements a methodology for decomposing and reducing requirements into the minimal components required to make validation and verification tenable. While the decomposition of requirements into fundamental components simplifies testing, it has a limitation with deep neural networks (DNNs). Multi-model DNNs in isolation can induce common cause failures (CCFs), in which multiple failures occur due to the same cause and become impossible to capture in testing (see Figure 3-5). Because there are no known technologies for analyzing the CCFs of DNNs, the failure rate of DNNs is hard to quantify, even in the presence of diverse inputs. An alternative design pattern pairs DNNs with rule-based software blocks and empowers an arbiter module to select the best decision given the risk (see Figure 3-6). Through analysis and due diligence, a software arbiter “inherits” the argumentation so that it achieves a failure rate of 10⁻²ⁿ.
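The arbiter pattern described above can be sketched as follows. The channels, thresholds, and arbitration policy are hypothetical illustrations of the design pattern, not NVIDIA's implementation, and the failure-rate argument in the comment holds only under the stated independence assumption.

```python
# Hedged sketch of the DNN-plus-rule-based arbiter pattern described in
# the text. Channel logic and numbers are illustrative placeholders.

def dnn_channel(distance_m):
    """Stand-in for a learned channel: proposed speed in m/s."""
    return min(30.0, distance_m)  # hypothetical learned policy

def rule_channel(distance_m):
    """Independently derived rule-based bound on safe speed."""
    return 0.5 * distance_m  # hypothetical analytic safety rule

def arbiter(distance_m):
    """Accept the DNN proposal only when the diverse rule-based channel
    agrees; otherwise fall back to the conservative rule output. If the
    two channels are sufficiently free of common cause failure, the
    arbitrated output fails only when both fail, so per-channel rates
    near 10^-n argue toward an overall rate near 10^-2n."""
    proposal, bound = dnn_channel(distance_m), rule_channel(distance_m)
    return proposal if proposal <= bound else bound

print(arbiter(100.0))  # DNN proposes 30.0, rule allows 50.0 -> 30.0
print(arbiter(20.0))   # DNN proposes 20.0, rule allows 10.0 -> 10.0
```

The key design choice is diversity: the rule-based channel must not share training data, sensors, or failure causes with the DNN, or the multiplication of failure rates in the comment no longer applies.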

To safely and effectively integrate AI capabilities into safety-critical system processes, continual test and refinement approaches must be implemented to manage against accepted and residual risks. Validation and verification are accomplished in both complex simulation environments and real-world test fleet deployments. Both capabilities feed refinements back into product specification. Additionally, the processes and tools that manage and implement the T&E must themselves be

FIGURE 3-5 Common cause failures (CCFs) induced by multi-model DNNs in isolation that become impossible to capture in testing. Failure rates are represented as values of 10⁻ⁿ. Because CCF modes exist in a solution such as DNN fusion, the overall failure rate cannot be represented as the product of the two incoming failure rates. SOURCE: Courtesy of NVIDIA.
FIGURE 3-6 An alternative design pattern that pairs DNNs with rule-based software blocks. With analysis and due diligence (diversity and independence, and sufficient freedom from common cause failure), the arbiter “inherits” the argumentation that it achieves a failure rate of 10⁻²ⁿ. SOURCE: Courtesy of NVIDIA.
FIGURE 3-7 NVIDIA’s “AI factory.” SOURCE: Courtesy of NVIDIA.

secure and safe. To that end, NVIDIA has built suites of cloud-native toolchains that meet the scale and latency requirements of the iterative cycle described above. Each step of the process, shown in Figure 3-7, is analyzed to identify and eliminate errors that could lead to safety-critical DNN results. Every software tool is evaluated for safety-critical bugs and user errors. NVIDIA asserted that it treats cloud-based DNN generation as a manufacturing process and views the infrastructure as an “AI factory.”

It is equally critical to point out that the same V&V rigor applied to the creation and testing of the AI models themselves must be extended to the data used to create the models, further emphasizing the critical importance of data within the AI life cycle. Two main questions are asked about all data used to train AI models: Is the sample being considered sound? And: Is it complete? Sound data implies the sample is valid and a true member of the input space for a model. A dataset is complete when one can say that all samples that can affect safety have been identified. Demonstrating the soundness of data can be addressed with a few approaches of varying “levels of difficulty” (e.g., simulation, replay of collected data, replay of augmented data, labeling of ground truth, and A/B testing). Demonstrating the completeness of data remains a significant challenge in the AI field and is currently managed through detailed limitation analysis of operational design domains (ODDs, or “the scenarios”) and of test escapes via the test fleet or deployed fleet. Further discussion of these significant challenges can be found in Section 3.6.
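The two dataset questions above can be made concrete with a minimal sketch. The validity predicate and the ODD scenario tags below are hypothetical stand-ins for the detailed limitation analysis the text describes, not an actual validation suite.

```python
# Hedged sketch of the two dataset questions posed in the text.
# The validity bound and ODD scenario names are hypothetical.

def is_sound(sample, valid_input_space):
    """Soundness: is the sample a true member of the model's input space?"""
    return valid_input_space(sample)

def missing_scenarios(dataset, odd_scenarios):
    """Completeness check against an operational design domain (ODD):
    which safety-relevant scenarios have no sample at all?"""
    covered = {s["scenario"] for s in dataset}
    return sorted(odd_scenarios - covered)

valid = lambda s: 0.0 <= s["speed_mps"] <= 70.0  # hypothetical bound
data = [
    {"scenario": "clear_day", "speed_mps": 25.0},
    {"scenario": "night_rain", "speed_mps": 18.0},
]
odd = {"clear_day", "night_rain", "fog", "glare"}

print(all(is_sound(s, valid) for s in data))  # True: every sample is valid
print(missing_scenarios(data, odd))           # ['fog', 'glare']
```

Even this toy check shows why completeness is the harder question: soundness is a per-sample test, while completeness requires an enumeration of the ODD that may itself be incomplete.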

As shown in Figure 3-8, there are several valid test methodologies that can be leveraged within a T&E framework for autonomous vehicles. Each methodology has its place, but also presents unique challenges. The methodologies are the following:

  • Replay of the collected data: Sensor and meta-data collected in (large) field data campaigns are replayed as input to the system under test (SUT).
FIGURE 3-8 Comparison of test methodologies for autonomous vehicle systems. SOURCE: Courtesy of NVIDIA.
  • Replay of augmented data: Collected data are augmented with 3D modeling to create input data that would otherwise be very difficult to collect on public roads.
  • Simulation: Simulation of all input to the ego-vehicle (which contains the sensors) and closed-loop response to all output from the ego-vehicle.
  • Track and road testing: System-level behavior testing on track or public roads.
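The replay methodology in the list above can be sketched as a minimal harness that feeds logged sensor frames to the system under test and flags disagreements with recorded ground truth. The log format and the SUT below are hypothetical placeholders.

```python
# Hedged sketch of replay-based testing: logged field data drive the
# system under test (SUT), and mismatches against recorded ground truth
# are reported as test escapes. Log format and SUT are hypothetical.

def replay(sut, logged_frames):
    """Drive the SUT with collected field data; return mismatched frames
    as (timestamp, sut_output, ground_truth) tuples."""
    escapes = []
    for frame in logged_frames:
        output = sut(frame["sensor"])
        if output != frame["ground_truth"]:
            escapes.append((frame["t"], output, frame["ground_truth"]))
    return escapes

# Toy log: a brake-decision SUT that should brake when an obstacle is near.
log = [
    {"t": 0.0, "sensor": {"obstacle_m": 80.0}, "ground_truth": "cruise"},
    {"t": 0.1, "sensor": {"obstacle_m": 12.0}, "ground_truth": "brake"},
    {"t": 0.2, "sensor": {"obstacle_m": 5.0},  "ground_truth": "brake"},
]
sut = lambda s: "brake" if s["obstacle_m"] < 10.0 else "cruise"

print(replay(sut, log))  # [(0.1, 'cruise', 'brake')]
```

Replay is open loop (the SUT's output cannot change what the sensors saw), which is the key limitation that drives the need for the simulation and track-testing methodologies listed alongside it.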

3.5 CONTRAST OF COMMERCIAL AND DOD APPROACHES TO AI T&E

Large structural and organizational limitations within the DAF T&E ecosystem will affect the DAF’s ability to meet the T&E requirements needed to operationalize AI implementations. A major source of these challenges is the fundamental difference between traditional waterfall approaches and the cyclical nature of the actions required to support the AI life cycle. To highlight these limitations, it is worthwhile to walk through a conceptual example of how an AI capability would progress from development through deployment using the systems currently in place for T&E within the DAF; the example makes clear where the currently accepted DAF T&E approach will not support the AI life cycle. For simplification purposes, one can make several assumptions:

  • All the integration requirements for a deployment platform are satisfied.
  • All data collection and labeling requirements are satisfied.
  • Reasonable requirements can be constructed that describe developmental and operational requirements.
  • The developmental test community has the infrastructure needed to verify the delivered capability meets those requirements.
  • The operational test community can iteratively test the capability and has the AI infrastructure in place to retrain as needed or can reach back to the contractor to facilitate the modifications.

With all these assumptions in place, the current process would produce a capability that needs to be handed off to the operational test community. At some point in the process, the test community will certify this new AI capability, and it will be handed over to an operational unit to employ and maintain. This part of the deployment process essentially amounts to the “operations and maintenance” (O&M) of the AI model, yet these operational units have no capacity, requirements, or infrastructure to monitor, retrain, or re-certify models as the AI life cycle demands. Furthermore, there are no personnel in these units whose training would enable them to facilitate this type of O&M. The current processes fail to meet the AI life-cycle requirements.

The gaps become more obvious when contrasting the DAF’s current approach to AI T&E against the approaches successful AI-ready commercial organizations employ. Table 3-1 presents what the committee observed as the major differences between commercial approaches and the DAF’s current approach; it is not intended to be comprehensive. An AI-ready organization in this context means an organization that can safely, reliably, and continuously create and deploy AI-enabled systems into operational environments.

TABLE 3-1 Comparison of AI T&E Approaches Between Commercial Industry and the DAF

Commercial Approach

  • Significant up-front investments in data centralization, processing capability, and tooling for making data accessible, discoverable, and organized. Data are easily formed into datasets for training purposes.
  • Treats the creation of AI implementations as a manufacturing process and assumes secure, scalable infrastructure to support the continuous development and test of AI components.
  • Employs methodical, use-case-based development of AI requirements that focuses on AI integration, not bolt-on design patterns.
  • Decomposes requirements based on test and evaluation requirements.
  • Continuous development and monitoring are supported by an operational deployment fleet and large-scale simulation environments.

DoD/DAF Approach

  • Large-scale investments in promoting data as a first-class citizen by improving accessibility and discoverability via data feeds and APIs, but lacking rigor or tooling around dataset creation, tracking, and improvement.
  • AI infrastructure investments (compute, AIOps services, and data management services) are ad hoc and lack consistency.
  • AI requirements are functional and are developed without considering test and evaluation.
  • Simulation and digital twin capabilities are ad hoc and not scalable.
  • The development-to-deployment process is not well aligned with the AI life cycle.

3.6 TRUST, JUSTIFIED CONFIDENCE, AI ASSURANCE, TRUSTWORTHINESS, AND BUY-IN

Trust has been at the heart of the relationship between the operational community and the air force test community for decades. When operational units accept a new aircraft, new hardware that is integrated into an aircraft, or new or updated embedded aircraft software, they start from a position of explicit trust. It is a level of trust earned over the past 70 years by working with an air force test community characterized by its credibility, expertise, professionalism, discipline, and track record. Operational buy-in is also gained through the air force test community’s well-understood standardized sequence of flight testing: DT, OT, IOT&E, live-fire test and evaluation (LFT&E), and follow-on testing. And when a fielded system fails for any reason, operational crews trust that the test community will identify and fix the problem before returning the system to the field.

Once the test community approves a system to be fielded, line organizations rely on testing results (to include explicit warnings and cautions about performance envelopes), academic instruction, simulators, and flights to gain confidence in the system’s performance. Academic training focuses on normal performance parameters, expected critical failure modes, and how to respond to cockpit indications of degraded system performance. Even for highly complex integrated systems such as an aircraft terrain-following radar, crews adapt relatively quickly through dedicated training—academics, simulators, and flights—and trust and confidence in the original equipment manufacturer and air force test enterprise software and hardware testing processes.

The air force test community’s reputation and track record have been instrumental in allowing end-users to gain and maintain deep confidence in traditional aircraft and other hardware systems. With hardware, trust is typically perceived as a binary yes-or-no concept. AI, however, is fundamentally different. Existing T&E procedures and standards do not work well for nascent and immature software capabilities, especially the black-box, self-learning, adaptive, data-centric nature of AI. Furthermore, it is hard to gain buy-in for AI-enabled capabilities when the DAF test community has not yet established the same kind of testing policies, processes, and procedures that have guided flight testing for the past 70 years. This lack of an established baseline for AI T&E makes it difficult to establish the same level of trust between the testing and operational communities that has been instrumental in fielding traditional hardware systems.

The general concept of justified confidence has gained traction in the AI community over the past several years. This term recognizes the challenges with using the concept of trust that has worked for other legacy hardware systems. It refers to the level of certainty or reliability achieved through direct evidence collected during design and operational test events that can be assigned to the outputs or


decisions made by AI-enabled systems. It is a term that describes how well a system can be expected to justify its decisions or predictions, considering the data used for training, the algorithms used, and any potential biases or limitations in the system. Justified confidence helps to provide evidence, transparency, and accountability in AI-enabled systems and helps establish trust in their output.

Instead of looking at trust in AI-enabled systems as a binary concept, users of AI-enabled systems will seek to gain justified confidence in a system over time. Justified confidence will also have different meanings at different levels. For example, the test community will establish internal conditions determining when an AI-enabled system can be released to the field. At the operational level, users will be less interested in tests performed in controlled or curated environments than in whether the system performs as expected under operational conditions and what could happen if the system degrades or fails. At the policy level, for higher-consequence, higher-risk systems such as AI-enabled weapons, decision-makers will seek sufficient confidence in a system before approving operational deployment, measured in terms such as expected behavior, boundary conditions, potential failure modes, and possible consequences or consequence sets. Calibrating confidence in any AI-enabled system will be continuous and cumulative for end-users; it will never end. A user's confidence in a smart system—and it is important to distinguish between "trust" and functional acceptance—will depend on context: the nature of the task, the complexity of the question to be answered, the system's previous performance record, the user's familiarity with the system, and so on, and it may vary over time. In general, continued high performance in lower-risk, lower-consequence tasks will give users more confidence when facing higher-risk, higher-consequence tasks. Until users gain more experience teaming with smart machines, they will face the dilemma of placing too much or too little confidence in AI-enabled systems. Justified confidence applies any time a human and machine interact—not just when they are working together as a team.
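The idea that justified confidence accumulates from direct test evidence, and that different consequence tiers demand different levels of evidence before release, can be made concrete with a simple statistical sketch. The code below is illustrative only: the tier names and thresholds are hypothetical assumptions, not drawn from DAF policy. It computes a Wilson lower confidence bound on an observed success rate, which rises as test evidence accumulates even when the raw success rate stays the same:

```python
import math

def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a binomial success rate.

    More trials tighten the bound, mirroring how justified confidence
    accumulates from repeated test events rather than a single pass/fail.
    """
    if trials == 0:
        return 0.0
    p = successes / trials
    denom = 1 + z**2 / trials
    center = p + z**2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (center - margin) / denom

# Hypothetical per-tier release thresholds: higher-consequence tasks demand
# a higher evidence-backed lower bound before fielding.
TIER_THRESHOLDS = {"low_consequence": 0.80, "high_consequence": 0.99}

def justified_confidence_met(successes: int, trials: int, tier: str) -> bool:
    return wilson_lower_bound(successes, trials) >= TIER_THRESHOLDS[tier]
```

Under these assumed thresholds, 96 successes in 100 trials would clear a low-consequence gate but not a high-consequence one, and 960 successes in 1,000 trials yields a strictly higher bound than 96 in 100 despite the identical success rate.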

When referring to AI-enabled systems, justified confidence is increasingly joined with the concepts of assured systems and trustworthiness. David Tate of IDA provides a framework for determining whether a system is assured and defining whether an AI system is trustworthy. He proposes that a system is assured "when the relevant authorities have sufficient justified confidence in the trustworthiness of the system to authorize its employment in specified contexts."19 He also defines a system to be trustworthy to the extent that "(1) when employed correctly, it will dependably do well what it is intended to do; (2) when employed correctly, it will dependably not do undesirable things; (3) when paired with humans it is intended to work with, it will dependably be employed correctly."20 The committee agrees with his assertion that "the purpose of T&E becomes clear: it is the activity that produces the evidence that completes the needed assurance arguments."21 The committee recommends that the DAF adopt this framework as part of its AI T&E practices.

___________________

19 Institute for Defense Analyses (IDA), 2021, "Trust, Trustworthiness, and Assurance of AI and Autonomy," Alexandria, VA, https://apps.dtic.mil/sti/trecms/pdf/AD1150274.pdf. Tate also argues that three key features determine the level of assurance: whose trust is needed (i.e., a regulating authority); the level of confidence required (given potential benefits and risks); and the (context-dependent) level of confidence justified by the available evidence (p. 5).

AI assurance is another term that, along with justified confidence and trustworthiness, replaces the binary concept of trust for AI-enabled systems. It refers to the process of evaluating, monitoring, and ensuring the reliability, effectiveness, robustness, and safety of AI systems. AI assurance comprises a set of practices and methodologies for assessing the quality of AI models and systems, including verifying their accuracy and performance, detecting and mitigating potential biases, and evaluating their ethical and societal implications. The goal of AI assurance is to provide confidence in the decision-making processes of AI systems and to promote the responsible and trustworthy deployment of AI technologies. For DoD, AI assurance combines AI T&E and the tenets of responsible AI (RAI).22
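Several of these assurance practices, such as verifying accuracy and detecting potential biases, can be expressed as concrete, repeatable checks. The toy sketch below audits model accuracy by subgroup; the group labels, record format, and the scenario itself are illustrative assumptions, not a prescribed assurance method:

```python
def subgroup_accuracy(records):
    """records: iterable of (group, correct) pairs from an evaluation set.
    Returns accuracy per group, a basic input to bias detection."""
    totals, hits = {}, {}
    for group, correct in records:
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + int(correct)
    return {g: hits[g] / totals[g] for g in totals}

def max_disparity(records):
    """Largest accuracy gap between any two groups in the evaluation set."""
    accuracy = subgroup_accuracy(records)
    return max(accuracy.values()) - min(accuracy.values())

# Hypothetical evaluation records: (operating condition, prediction correct?)
records = ([("day", True)] * 90 + [("day", False)] * 10
           + [("night", True)] * 60 + [("night", False)] * 40)
# The model is notably weaker at night; assurance T&E should surface a
# disparity like this before fielding, not after.
```

A real assurance pipeline would run such checks across many attributes and conditions and feed the findings into the documentation and mitigation steps described above.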

RAI helps promote the safe, lawful, and ethical use of AI. AI T&E should be designed to test system performance across the RAI attributes of fairness, interpretability, reliability, and robustness. The NSCAI final report includes a detailed framework to guide the responsible development and fielding of AI implementations, including key considerations for policymakers and technical practitioners across the entire AI life cycle.23 The DAF should consider using this framework and the NIST AI RMF24 in establishing AI assurance best practices. The committee concluded that the DAF does not need to sacrifice speed to ensure adherence to the principles of RAI: it is possible to move at the speed of operational relevance while accounting for the importance of fielding AI implementations that are reliable, safe, lawful, and ethical.25

___________________

20 IDA, 2021, p. iii.

21 IDA, 2021, p. 9.

22 Department of Defense, 2022, Responsible Artificial Intelligence Strategy and Implementation Pathway, Washington, DC, https://www.ai.mil/docs/RAI_Strategy_and_Implementation_Pathway_6-21-22.pdf.

23 National Security Commission on Artificial Intelligence, 2021, The National Security Commission on Artificial Intelligence Final Report, Arlington, VA, https://www.nscai.gov/wp-content/uploads/2021/03/Full-Report-Digital-1.pdf, p. 384.

24 National Institute of Standards and Technology, Department of Commerce, 2023, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1, Washington, DC, https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf. See also R. Elluru, C. Howell, and M. Garris, 2023, National Security Addition to the National Institute of Standards and Technology Artificial Intelligence Risk Management Framework Playbook (NIST AI RMF), Special Competitive Studies Project, https://www.scsp.ai/wp-content/uploads/2023/04/National-Security-Addition-to-NIST-AI-RFM.docx-1.pdf.

The NIST AI RMF concludes that the safe operation of AI systems is improved through the following:26

  • Clear information to deployers on the responsible use of the system
  • Responsible decision-making by deployers and end-users
  • Explanations and documentation of risks based on empirical evidence of incidents

The DAF should work with OSD CDAO to adopt a definition of AI assurance. One definition to consider is “a process that is applied at all stages of the AI engineering life cycle ensuring that any intelligent system is producing outcomes that are valid, verified, data-driven, trustworthy, and explainable to a layman, ethical in the context of its deployment, unbiased in its learning, and fair to its users.”27 The committee also recommends that the DAF adopt and promulgate DoD’s RAI principles and implementation plan.

Recommendation 3-4: The Department of the Air Force should adopt a definition of artificial intelligence (AI) assurance in collaboration with Office of the Secretary of Defense Chief Digital and AI Office. This definition should consider whether the system is trustworthy and appropriately explainable; ethical in the context of its deployment, with characterizable biases in context, algorithms, and datasets; and fair to its users.

Until AI is fielded widely across the DAF, the air and space force test communities gain DAF-wide agreement on AI TEVV definitions, and the test community establishes DAF-wide AI testing policies, processes, and procedures, the committee recommends that the DAF—through the AI T&E champion—codify the concepts of justified confidence, trustworthiness, and AI assurance for all AI-enabled systems. The committee expects that operational buy-in of AI-enabled systems will be neither instantaneous nor permanent. Instead, the test community and end-users will have to work closely together over the next several years in an iterative process to gain a better understanding of AI TEVV, gather more insights on AI-enabled system performance under all conditions, and establish AI testing roles, responsibilities, and authorities at all levels across the DAF, to include at the unit level.

___________________

25 See, for example, M. Ekelhof, 2022, "Responsible AI Symposium—Translating AI Ethical Principles into Practice: The U.S. DoD Approach to Responsible AI," West Point: The Lieber Institute, November 23, https://lieber.westpoint.edu/translating-ai-ethical-principles-into-practice-us-dod-approach.

26 National Institute of Standards and Technology, Department of Commerce, 2023, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1, Washington, DC, https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf. See also R. Elluru, C. Howell, and M. Garris, 2023, National Security Addition to the National Institute of Standards and Technology Artificial Intelligence Risk Management Framework Playbook (NIST AI RMF), Special Competitive Studies Project, https://www.scsp.ai/wp-content/uploads/2023/04/National-Security-Addition-to-NIST-AI-RFM.docx-1.pdf.

27 This definition of AI assurance was proposed by F.A. Batarseh, L. Freeman, and C.-H. Huang, 2021, "A Survey on Artificial Intelligence Assurance," Journal of Big Data 8(60), https://doi.org/10.1186/s40537-021-00445-7.

3.7 RISK-BASED APPROACH TO AI T&E

The formulation of T&E requirements across the AI life cycle is linked inextricably to the concept of risk management; one cannot be considered in isolation from the other. In this section, the committee considers operationally oriented risks pertaining to the integration of AI capabilities into DAF systems and the fielding decisions associated with those systems. In Chapter 5, the committee examines a broader and more detailed set of technical risks, particularly corruption and adversarial attacks, throughout the AI life cycle.

As with the T&E of all other DAF systems, risk management will play a vital role in testing AI-enabled systems. Risks are increasing as AI moves beyond specific-purpose systems to more general-purpose AI systems that are expected to become vastly more capable in different operational settings and across multiple domains.

Risks will also increase significantly as different AI-enabled systems are integrated into and begin to interact across system-of-systems architectures in complex, highly dynamic multi-domain environments and demonstrate online learning and even emergent behavior.28 Therefore, the DAF should incorporate an AI risk management framework (RMF), such as the National Institute of Standards and Technology (NIST) AI RMF,29 in all AI-related design, development, fielding, and sustainment. Any AI RMF includes assessing and understanding the potential risks of fielding AI-enabled systems based on different levels of dedicated T&E, communicating risks to decision-makers and end-users, and determining responsibility and accountability for system failure or unanticipated performance problems.

___________________

28 See, for example, J. Harvey, 2018, "The Blessing and Curse of Emergence in Swarm Intelligence Systems," Chapter 6 in Foundations of Trusted Autonomy: Studies in Systems, Decision and Control, H.A. Abbass, ed., Vol. 117, https://doi.org/10.1007/978-3-319-64816-3_6. Harvey defines emergence as behavior "at the global level that was not programmed in at the individual level and cannot be readily explained based on behaviour at the individual level," p. 117.

29 National Institute of Standards and Technology, Department of Commerce, 2023, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1, Washington, DC, https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf. "The AI RMF refers to an AI system as an engineered or machine-based system that can, for a given set of objectives, generate outputs such as predictions, recommendations, or decisions influencing real or virtual environments" (p. 1). The NIST AI RMF defines trustworthy AI as AI that "is valid and reliable, safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy enhanced, and fair with their harmful biases managed" (pp. 2–3). See also R. Elluru, C. Howell, and M. Garris, 2023, National Security Addition to the National Institute of Standards and Technology Artificial Intelligence Risk Management Framework Playbook (NIST AI RMF), Special Competitive Studies Project, https://www.scsp.ai/wp-content/uploads/2023/04/National-Security-Addition-to-NIST-AI-RFM.docx-1.pdf.

The NIST AI RMF states that "AI risk management offers a path to minimize potential negative impacts of AI systems, such as threats to civil liberties and rights while providing opportunities to maximize positive impacts. Furthermore, addressing, documenting, and managing AI risks and potential negative impacts can lead to more trustworthy AI systems."30 It also notes that risk management "should be continuous, timely, and performed throughout the AI system life-cycle dimensions." Since this study is directed primarily toward AI T&E under operational conditions, the committee does not address the kinds of broad societal-level risks described in the NIST AI RMF. The committee recommends, however, that the DAF adopt the NIST AI RMF Core, comprising the four major functions of Govern, Map, Measure, and Manage.31

Major risk factors commonly associated with the design and operation of AI-enabled systems include a potential drop in performance due to domain shift (discussed in Section 3.2), vulnerability to adversarial attacks (discussed in Chapter 5), perceptions of bias, privacy concerns, and a lack of explainability. T&E protocols should therefore assess the impact of each of these factors on the operational viability of AI-enabled systems and take the needed corrective measures.
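Of these factors, domain shift is particularly amenable to automated monitoring during T&E. The sketch below is a simplified illustration, not a fielded DAF protocol: it compares the operational distribution of a scalar model input feature against its training distribution using a two-sample Kolmogorov-Smirnov statistic, and a large statistic flags a potential shift warranting retest. The threshold value is a hypothetical assumption:

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between the
    empirical CDFs of two scalar samples. Near 0 means similar distributions;
    near 1 means a pronounced shift."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )

# Hypothetical alert threshold; a real protocol would calibrate this against
# an acceptable false-alarm rate for the feature being monitored.
SHIFT_THRESHOLD = 0.2

def domain_shift_suspected(train_feature, operational_feature):
    return ks_statistic(train_feature, operational_feature) > SHIFT_THRESHOLD
```

A fielded monitor would track many features (and model confidence scores) rather than one, but the principle is the same: quantify how far operational inputs have drifted from what the model was tested against.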

Every AI capability, like every hardware system, introduces operational risks. AI shares the combination of safety and security risks common to all other extant hardware and software systems.32 Applying the NIST AI RMF categories can usefully decompose some of the risks inherent in AI-enabled systems.33 The DAF T&E enterprise has a distinguished performance record of assessing and mitigating the risks inherent in hardware systems, especially flight and space weapon systems. As a form of self-learning software, however, AI presents novel sources of risk in the operational environment that are not presently well understood by the DAF test community, owing to a lack of familiarity with how AI systems operate, a lack of operational experience with AI-enabled capabilities, and the inherent characteristics of advanced AI models. This problem will become especially acute when AI is integrated into a system-of-systems or network-of-networks architecture, leading to unknown or unanticipated cumulative and aggregate risks.

___________________

30 National Institute of Standards and Technology, Department of Commerce, 2023, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1, Washington, DC, https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf.

31 The NIST AI RMF describes these major functions as follows: "Govern: A culture of risk management is cultivated and present; Map: context is recognized and risks related to context are identified; Measure: Identified risks are assessed, analyzed, or tracked; Manage: Risks are prioritized and acted upon based on a projected impact."

32 MIL-STD-882F, "DoD Standard Practice: System Safety," will replace the May 11, 2012, version (MIL-STD-882E) and will include a section on AI and ML. It will also include the AI criticality index (AICI), which will be used to determine the level of rigor (LOR) of software safety assurance activities to be imposed on the software. See Department of Defense, 2012, Department of Defense Standard Practice: System Safety, MIL-STD-882F, Washington, DC, https://cdn.ymaws.com/system-safety.org/resource/resmgr/documents/Draft_MIL-STD-882F.pdf. For an insightful examination of an integrated approach to safety and security, see, for example, W. Young and N.G. Leveson, 2014, "An Integrated Approach to Safety and Security Based on Systems Theory," Communications of the ACM 57(2):31–35.

33 For organizations currently using the NIST Risk Management Framework to help manage risk, it is important to realize that the NIST AI RMF requires a different set of expertise and would likely require a separate organization to perform the AI risk analysis.

Potential risks must be considered at every stage of the AI life cycle, beginning with the formulation of AI capability requirements and associated T&E metrics and performance measures and continuing through operational fielding and sustainment via CI/CD processes.34 AI-enabled capabilities should be fielded using a "measured risk" approach (see Section 4.3) as rapidly as operational requirements dictate, while taking steps to prevent the emergence of unnecessary risks resulting from fielding capabilities that are immature, insufficiently tested, unproven, or unsafe. As one speaker argued, in some cases the performance of an AI-enabled capability may be so compelling that leaders will have to make a risk-based decision to field it even in the absence of full trust or a completely explainable system.

The committee acknowledges the challenges inherent in finding and maintaining the right balance between speed-to-field and the rigors of comprehensive T&E. As opposed to processes used for traditional hardware fielding decisions, DAF leaders should embrace the concept of "field to learn," putting capabilities in the hands of users after sufficiently rigorous "back bench" T&E by a certified AI T&E team and incorporating end-user feedback to make iterative improvements to fielded systems via accepted CI/CD processes (with a commensurate amount of T&E for all model updates).35 Until the DAF test, program office, and operational communities gain more experience developing, testing, and fielding AI-enabled systems, the committee recommends biasing toward a more cautious—but not inherently lethargic—approach to ensure sufficient testing before any AI technology is fielded. Precaution should guide the DAF but not unduly constrain it from introducing a new product or process whose ultimate effects are disputed or unknown.
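Commensurate T&E for every model update implies an automated gate in the CI/CD pipeline: a candidate model must not regress against the fielded baseline on a held-out operational test set before release. A minimal sketch of such a gate follows; the metric names, values, and tolerance are illustrative assumptions, not DAF-prescribed criteria:

```python
def release_gate(fielded_metrics, candidate_metrics, tolerance=0.01):
    """Return (approved, findings): approve a candidate model only if no
    tracked metric drops more than `tolerance` below the fielded baseline."""
    findings = []
    for metric, baseline in fielded_metrics.items():
        candidate = candidate_metrics.get(metric)
        if candidate is None:
            findings.append(f"{metric}: missing from candidate evaluation")
        elif candidate < baseline - tolerance:
            findings.append(f"{metric}: regressed {baseline:.3f} -> {candidate:.3f}")
    return len(findings) == 0, findings

# Hypothetical metrics measured on a held-out operational test set.
fielded = {"accuracy": 0.91, "recall_rare_class": 0.84}
candidate = {"accuracy": 0.93, "recall_rare_class": 0.78}
approved, findings = release_gate(fielded, candidate)
# A higher headline accuracy does not clear the gate when a critical
# metric (here, recall on a rare class) has regressed.
```

The design point is that the gate checks every tracked metric, not an aggregate score, so an update that trades rare-class performance for average performance is caught and routed back for additional T&E rather than silently fielded.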

One speaker noted that AI model complexity is currently doubling every 2 months—a staggering rate of change. Unfortunately, the committee expects that the DAF T&E enterprise, as presently structured, is not capable of adapting to this rapid evolution.

___________________

34 As one example, the Chief Architect from a commercial company briefed the committee on their use of traditional safety engineering V-models that had been adapted to reflect the entire AI life cycle, up to and including the impact of data feedback loops and CI/CD on overall system safety.

35 S. Moore, 2023, “Right Hands, Right Place: Why We Must Push Military Technology Experimentation to the Edge,” Defense One, January 19, https://www.defenseone.com/ideas/2023/01/right-hands-right-place-why-we-must-push-military-technology-experimentation-edge/382000.


Operational risks will increase as AI implementations (see Section 1.3) expand beyond narrow, single-task, and single-domain computer vision and natural language processing (NLP) capabilities to more advanced AI, such as reinforcement learning (RL); reinforcement learning from human feedback (RLHF); transfer learning (TL); semi-supervised, self-supervised, and unsupervised learning; and foundation models and generative AI that will be vastly more capable in different operational settings and across multiple domains. Risks will also increase significantly as different AI-enabled systems are integrated into and begin to interact across system-of-systems architectures and demonstrate emergent behavior.36 Therefore, as discussed above, the DAF should incorporate an AI risk management framework in all AI-related design, development, fielding, and sustainment; the committee recommends incorporating key elements of the NIST AI RMF, the Special Competitive Studies Project (SCSP) National Security Addition to the NIST AI RMF Playbook,37 and ISO/IEC standards and frameworks,38 along with any DAF-specific additions. Any AI RMF includes assessing the potential risks of fielding AI-enabled systems based on different levels of dedicated T&E, communicating risks to decision-makers and end-users, and determining responsibility and accountability for system failure or unanticipated performance problems. Risk assessments should also address the risks presented by user unfamiliarity with AI-enabled systems (risks expected to decrease but not disappear with increasing user familiarity with such systems).

___________________

36 See, for example, Richard Danzig's 2018 monograph, "Technology Roulette." Danzig offers a compelling caution that "Experience with nuclear weapons, aviation, and digital information systems should inform discussion about current efforts to control artificial intelligence (AI), synthetic biology, and autonomous systems. In this light, the most reasonable expectation is that the introduction of complex, opaque, novel, and interactive technologies will produce accidents, emergent effects, and sabotage. In sum, on a number of occasions and in a number of ways, the American national security establishment will lose control of what it creates" and that "twenty-first century technologies are global not just in their distribution, but also in their consequences." R. Danzig, 2018, "Technology Roulette: Managing Loss of Control as Many Militaries Pursue Technological Superiority," Washington, DC: Center for a New American Security, https://s3.us-east-1.amazonaws.com/files.cnas.org/hero/documents/CNASReport-Technology-Roulette-DoSproof2v2.pdf?mtime=20180628072101&focal=none.

37 R. Elluru, C. Howell, and M. Garris, 2023, National Security Addition to the National Institute of Standards and Technology Artificial Intelligence Risk Management Framework Playbook (NIST AI RMF), Special Competitive Studies Project, https://www.scsp.ai/wp-content/uploads/2023/04/National-Security-Addition-to-NIST-AI-RFM.docx-1.pdf.

38 See, for example, ISO/IEC SC 42. SC 42 is a joint committee between the IEC and ISO. It serves as the focus and proponent for the ISO/IEC joint technical committee (JTC 1) international standardization program on AI and provides guidance to JTC, IEC, and ISO committees developing AI applications. Draft ISO/IEC TR 5469, “Functional Safety and AI Systems,” is expected to be published in 2023. Also, see, for example, SAE AS 6983, “Process Standard for Development and Certification/Approval of Aeronautical Safety-Related Products Implementing AI.”


For all AI-enabled capabilities, the DAF should clearly distinguish between mission- and safety-critical systems and all other AI-enabled systems. Mission- and safety-critical systems demand a much higher level of rigor and scrutiny throughout the entire T&E process, from design and development through sustainment under operational conditions. This includes an examination of reliability, repeatability, predictability, directability, safety, and security. When individual AI-enabled systems are integrated into network-centric architectures, this analysis requires both individual platform-centric assessments and aggregated assessments.39 As noted earlier in this chapter, the committee heard examples from the private sector of an integrated, iterative, and comprehensive approach to AI T&E for safety-critical systems such as autonomous vehicles. Autonomous vehicles are a good example of a complex system employed in a safety-critical operation requiring perception, decision-making, and other autonomous characteristics.

In summary, when fielding AI-enabled capabilities under operational conditions, DAF end-users, program offices, DevSecOps/AIOps teams, testers, and leaders must use a tailored AI RMF to address a series of risk-related questions at each stage of the AI life cycle.40 These include, though are not limited to, the following:

  • What are the risks at each stage of the AI life cycle, including, when AI systems are fielded, the potential risk to mission and risk to force?
  • How are those risks determined and measured, including red teams' roles and responsibilities in assessing adversarial attacks against AI models?
  • Who assesses each risk?
  • How are risks briefed to decision-makers at each level, and who has the authority to accept each risk or, if the risk is deemed unacceptable, to pause further development or fielding?
  • What are the risks of catastrophic failure, either in isolation or when integrated across multiple architectures (i.e., worst-case failure modes)?
  • How are risks managed and mitigated where necessary, including adjusting AI T&E requirements as needed?
  • Finally, who is held responsible and accountable for system failure?
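Questions like these lend themselves to a structured risk register so that answers, owners, and acceptance decisions are recorded at each life-cycle stage. The sketch below is a minimal illustration loosely aligned with the NIST AI RMF Govern/Map/Measure/Manage functions; the field names, stages, and scoring scheme are illustrative assumptions, not an official schema:

```python
from dataclasses import dataclass

@dataclass
class RiskEntry:
    stage: str             # AI life-cycle stage, e.g., "development", "fielding"
    description: str       # Map: what the risk is and where it arises
    severity: int          # Measure: 1 (negligible) to 5 (catastrophic)
    likelihood: int        # Measure: 1 (rare) to 5 (near-certain)
    owner: str             # Govern: who assesses and briefs this risk
    mitigation: str = ""   # Manage: action taken or planned
    accepted_by: str = ""  # Manage: authority that accepted the residual risk

    def score(self) -> int:
        return self.severity * self.likelihood

def unaccepted_high_risks(register, threshold=12):
    """High-scoring risks with no recorded acceptance authority: candidates
    for pausing further development or fielding until someone accountable
    either mitigates or formally accepts them."""
    return [r for r in register if r.score() >= threshold and not r.accepted_by]
```

The point of the structure is traceability: every risk carries an owner, a measurement, and an explicit acceptance record, so responsibility and accountability for a failure can be traced to a decision rather than an omission.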

Recommendation 3-5: The Department of the Air Force should develop standardized artificial intelligence (AI) testing and evaluation protocols to assess the impact of major AI-related risk factors.

___________________

39 Aggregated risk assessments of complex network-centric architectures should be completed by a multidisciplinary group that has broader visibility into all the components of the network and system-of-systems architecture.

40 See, for example, Appendixes A–C for a proposed comprehensive AI RMF for the U.S. Intelligence Community, in C.R. Stone, 2021, "The Integration of Artificial Intelligence in the Intelligence Community: Necessary Steps to Scale Efforts and Speed Progress," Joint PIJIP/TLS Research Paper Series, 73, https://digitalcommons.wcl.american.edu/research/73.

Suggested Citation:"3 Test and Evaluation of DAF AI-Enabled Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 49
Suggested Citation:"3 Test and Evaluation of DAF AI-Enabled Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 50
Suggested Citation:"3 Test and Evaluation of DAF AI-Enabled Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 51
Suggested Citation:"3 Test and Evaluation of DAF AI-Enabled Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 52
Suggested Citation:"3 Test and Evaluation of DAF AI-Enabled Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 53
Suggested Citation:"3 Test and Evaluation of DAF AI-Enabled Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
×
Page 54
Suggested Citation:"3 Test and Evaluation of DAF AI-Enabled Systems." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.
The Department of the Air Force (DAF) is in the early stages of incorporating modern artificial intelligence (AI) technologies into its systems and operations. The integration of AI-enabled capabilities across the DAF will accelerate over the next few years.

At the request of the DAF Air and Space Forces, this report examines the Air Force Test Center's technical capabilities and capacity to conduct rigorous and objective tests, evaluations, and assessments of AI-enabled systems under operational conditions and against realistic threats. This report explores both the opportunities and challenges inherent in integrating AI at speed and at scale across the DAF.
