
Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force (2023)

Chapter: 6 Emerging AI Technologies and Future T&E Implications

Suggested Citation:"6 Emerging AI Technologies and Future T&E Implications." National Academies of Sciences, Engineering, and Medicine. 2023. Test and Evaluation Challenges in Artificial Intelligence-Enabled Systems for the Department of the Air Force. Washington, DC: The National Academies Press. doi: 10.17226/27092.

6

Emerging AI Technologies and Future T&E Implications

New and promising artificial intelligence (AI) techniques and capabilities are on the horizon. Even as the Department of the Air Force (DAF) addresses its current needs and opportunities, it must evaluate these emerging AI trends and their likely implications for test and evaluation (T&E). The committee was tasked with recommending “promising areas of science and technology that may lead to improved detection and mitigation of AI corruption” (see Appendix A). Although it is difficult to predict with precision which AI advances will be most impactful for Air Force applications, five areas seem particularly salient:

  • Trustworthy AI
  • Foundation Models
  • Informed Machine Learning Models
  • AI-Based Data Generators
  • AI Gaming for Complex Decision-Making

Each of these areas has implications for future Air Force T&E practices and infrastructure needs, as discussed below.

Recommendation 6-1: The Department of the Air Force should focus on the following promising areas of science and technology that may lead to improved detection and mitigation of artificial intelligence (AI) corruption: trustworthy AI, foundation models, informed machine learning, AI-based data generators, AI gaming for complex decision-making, and a foundational understanding of AI.


6.1 TRUSTWORTHY AI

The pressure to reap the benefits of AI technology has encouraged private industry to market AI-based products even though there are no tight theoretical bounds on their performance and robustness. The risk of failure is tolerated because the consequences are acceptable.1 Commercial AI is generally hardened through continual testing and rapid incremental refinement, usually guided by extensive user feedback. The DAF should employ this approach whenever possible; nevertheless, it is difficult to confidently engineer robustness and performance into a system when the performance foundations are poorly understood. As military services seek to apply and deploy AI under dynamic and high-risk operational conditions, the need for AI robustness, survivability, resilience, safety, fairness, explainability, ethics, and theoretical performance bounds becomes crucial.

There are several barriers to trustworthy AI. First, current machine learning performance theory lags behind the practical application of AI. For instance, existing theory cannot reliably predict how a neural network architecture will affect performance or how well a learned model will perform in new environments or under new operating conditions. This situation presents a fundamental risk to the trustworthiness of AI and challenges the use of AI in military weapon systems and other safety-critical applications.

A second barrier is a dearth of rigorous testing mechanisms. Testing systems in controlled environments yields overly optimistic evaluations of an AI system’s performance, while testing “in the wild” may pose significant risks to bystanders; this issue has been observed in catastrophic failures of autonomous vehicles.

A third barrier is limitations in training data. For instance, large language models have made significant strides in English, but their extension to languages with far less online content from which to scrape training data will be challenging and face inherent limitations. Furthermore, biases in training data can harm some stakeholders, as evidenced by Google’s and Amazon’s AI recruiting tools being biased against women2 and facial recognition systems not accurately recognizing Black people, partly because those systems had limited training samples from certain subpopulations. Similar challenges will have high-stakes consequences in DAF deployments in various communities, cultures, and environments.
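The subpopulation bias described above can be surfaced with a simple audit metric. The sketch below is a minimal illustration (the function name and the 10-sample data in the usage are hypothetical), not a complete fairness analysis: it computes accuracy per subpopulation and the worst-case gap between groups.

```python
from collections import defaultdict

def accuracy_by_group(preds, labels, groups):
    """Compute accuracy for each subpopulation and the worst-case gap.

    A large gap between the best- and worst-served groups is one signal
    that some subpopulations were under-represented in the training data.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        totals[g] += 1
        hits[g] += int(p == y)
    acc = {g: hits[g] / totals[g] for g in totals}
    gap = max(acc.values()) - min(acc.values())
    return acc, gap
```

A T&E pipeline would run this over every subpopulation of interest and flag any gap exceeding a pre-agreed threshold before fielding.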

Trustworthy AI depends on reliable human-AI interactions. Humans must be able to see an AI’s prediction, assess the AI’s confidence in that prediction, and characterize the AI’s basis for it. Without interpretability and uncertainty quantification, trust in AI will remain limited.

___________________

1 Of course, this observation does not apply to the use of AI in safety-critical commercial systems such as industrial robotics or self-driving cars. Indeed, the T&E approaches and requirements for trustworthy components are similar to those faced by the DAF.

2 J. Dastin, 2018, “Amazon Scraps Secret AI Recruiting Tool that Showed Bias Against Women,” Reuters, https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G.

Mechanisms for adapting to distribution drift are also essential to trustworthy AI; they are needed to account for shifting environmental conditions and for imbalances in training data (e.g., different fractions of samples for different subpopulations at test time than at training time).
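One standard corrective for the subpopulation imbalance just described is importance weighting: each training sample is reweighted by the ratio of its group's test-time frequency to its training-time frequency. A minimal sketch, assuming group labels are known for both sets (function and variable names are illustrative):

```python
from collections import Counter

def importance_weights(train_groups, test_groups):
    """Per-sample weights w(g) = p_test(g) / p_train(g) that make the
    weighted training distribution over groups match the test distribution."""
    n_train, n_test = len(train_groups), len(test_groups)
    p_train = Counter(train_groups)
    p_test = Counter(test_groups)
    return [
        (p_test[g] / n_test) / (p_train[g] / n_train)
        for g in train_groups
    ]
```

Training a model on these weighted samples approximates training on data drawn from the test-time group mixture, without collecting new data.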

AI components are often integrated into a larger system, so typical metrics used to assess an AI component’s performance in isolation may inaccurately reflect its effect on the overall system performance.

Finally, trustworthy AI systems must be robust in the face of training-time and inference-time attacks. Training-time attacks include data-poisoning attacks and backdoors, while inference-time attacks make small changes to test samples to induce significant changes in the AI output; these include both white-box attacks, which depend on knowledge of the AI model’s inner workings, and black-box attacks, which pose risks even when details of the AI model are hidden.
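The inference-time threat can be made concrete with a toy black-box evasion. The sketch below is a deliberately simplified coordinate-wise search, not any specific published attack; it queries only the model's confidence score, and real attacks additionally bound the total perturbation.

```python
def greedy_blackbox_attack(x, score, epsilon=0.1, steps=5):
    """Perturb each input coordinate by +/-epsilon in whichever direction
    lowers the model's confidence score, using only model queries
    (no access to the model's internals or gradients).

    Real attacks also constrain the total perturbation so the adversarial
    sample stays visually or statistically close to the original.
    """
    x = list(x)  # work on a copy; leave the caller's input untouched
    for _ in range(steps):
        changed = False
        for i in range(len(x)):
            base = score(x)
            for delta in (epsilon, -epsilon):
                x[i] += delta
                if score(x) < base:   # keep the perturbation if it helps
                    changed = True
                    break
                x[i] -= delta         # otherwise revert
        if not changed:
            break
    return x
```

Even this crude query-only search degrades an undefended scoring function, which is why adversarial T&E must probe models as black boxes as well as white boxes.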

DoD has recognized the need to improve the trustworthiness of AI. How AI will interact with the warfighter to improve trustworthiness is becoming a central concern as DoD seeks to adopt AI technologies. Human-AI interaction models, as they exist now, do not account for the dynamic and stressful situations in which warfighters find themselves. Thus, in its February 2022 memo, “Technology Vision for an Era of Competition,” the Office of the Under Secretary of Defense for Research and Engineering (OUSD(R&E)) identified “Trusted AI and Autonomy” as one of 14 critical technology areas and noted that “[t]rusted AI with trusted autonomous systems are imperative to dominate future conflicts.”

Furthermore, the June 2022 DoD report U.S. DoD Responsible AI Strategy and Implementation Pathway stated, “[t]o ensure that our citizens, warfighters, and leaders can trust the outputs of DoD AI capabilities, DoD must demonstrate that our military’s steadfast commitment to lawful and ethical behavior applies when designing, developing, testing, procuring, deploying, and using AI.”

The basic issue is whether a warfighter will trust their life to an AI-based system. These DoD concerns, combined with heightened public concern, have encouraged intensified research and development in trustworthy AI technologies. As a result, we can expect both near- and longer-term progress that will benefit AI-based DAF systems in general and DAF AI T&E specifically.

Finding 6-1: Existing approaches for designing trustworthy AI-enabled systems do not take into account the role of humans who interact with the AI-enabled systems.

Implications of Advances in Trustworthy AI to DAF T&E

While many challenges will undoubtedly persist into the foreseeable future, the continued focus from both private industry and the U.S. government will improve understanding of the theoretical foundations and will lead to the creation of more trustworthy AI components. Furthermore, these advances will lend greater clarity to the range of appropriate AI applications and will extend that range by developing new ML approaches and improved architectures.

In short, trustworthy AI will enable higher-quality and more dependable AI components. Additionally, advances expected in the next few years will permit AI adopters to insist that ML models have their uncertainty comprehensively measured or analytically bounded. This, too, will greatly benefit T&E activities. Ultimately, an ML component should be able to incorporate quantifiable uncertainty as part of its output. Thus, systems based on these components will already have a good test base from which to proceed to system-level testing, which will not only aid T&E but also help guarantee robust and resilient operation and promote user trust.

To reap the benefits of trustworthy AI components, the DAF must adopt new system engineering and T&E practices that explicitly incorporate requirements for trustworthy AI. The DAF will need acquisition approaches that recognize the state of the art in AI trustworthiness, placing realistic but aggressive requirements on AI components. These new practices must be codified in a set of standards and supported by appropriate tools and infrastructure. Developmental testbeds will be needed to explicitly measure AI robustness, resilience, safety, and other trustworthiness attributes. The Air Force will need T&E processes, canonical test datasets, and infrastructure to perform T&E of these higher-quality AI components, including the means to efficiently test performance against out-of-distribution, dynamic, and unexpected operational conditions. AI data generators will likely play a key role. In addition, adversarial T&E processes similar to those emerging in the cyber domain will be important to probe and redress vulnerabilities and deficiencies.

Recommendation 6-2: The Department of the Air Force should invest in developing and testing trustworthy artificial intelligence (AI)-enabled systems. Warfighters are trained to work with reliable hardware- and software-based advanced weapon systems. Such trust and justified confidence must also be developed with AI-enabled systems.

Trustworthy AI components will be enablers for safety-critical systems such as weapon systems and semi- or fully autonomous vehicles. However, not all trustworthy AI components must necessarily be fail-safe. As with other complex systems, some failure modes will be acceptable given the operational context of the model. The goal in engineering a trustworthy AI component will be to make its performance significantly more interpretable and predictable than that of currently available AI models while tailoring it to its intended application. For example, recommender systems can be more tolerant of errors than autonomous control systems and thus have different AI trustworthiness requirements. There will be a natural trade-off between the time and expense required to build a given level of trustworthiness and the intended application of the model.

Uncertainty quantification is an essential ingredient of AI-enabled DAF systems. Two currently common paradigms are Bayesian estimation and conformal inference.3 Bayesian statistics have a long and rich history, and methods in common use by the DAF, such as Kalman filtering, are grounded in Bayesian methodology. However, some priors, such as priors on the weights of neural networks, can be difficult to interpret or validate. Conformal inference uses carefully selected quantiles of training data to quantify the uncertainty in predictions without any distributional assumptions on the data and with minimal assumptions on the ML algorithm. This framework has high potential for facilitating T&E. A related challenge is communicating uncertainty measurements to human decision-makers. If the prediction is a scalar value (e.g., the predicted amount of precipitation next week), the DAF will have excellent tools at its disposal; but when ML systems yield high-dimensional outputs, such as images, visualizing or communicating uncertainty is a persistent challenge.4 The field is moving so quickly, however, that by the time this report is publicly released, several new relevant examples will likely have been developed.
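The conformal idea can be sketched in a few lines. The snippet below is a minimal split conformal procedure over held-out calibration residuals (the linear toy predictor in the usage is purely illustrative); it wraps any point predictor in a distribution-free interval.

```python
import math
import random

def split_conformal_interval(cal_x, cal_y, predict, x_new, alpha=0.1):
    """Split conformal prediction: calibrate an interval around
    predict(x_new) using held-out residuals.

    With n exchangeable calibration points, the returned interval covers
    the true value with probability at least 1 - alpha.
    """
    # Nonconformity scores: absolute residuals on the calibration set.
    scores = sorted(abs(y - predict(x)) for x, y in zip(cal_x, cal_y))
    n = len(scores)
    # Conservative finite-sample quantile index.
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    q = scores[k]
    center = predict(x_new)
    return center - q, center + q
```

Because the guarantee needs no distributional assumptions on the data and treats the predictor as a black box, the same wrapper applies to any ML component under test.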

6.2 FOUNDATION MODELS

Foundation models (FMs)5 are deep learning models that have emerged in the past 5 years, initially for language-processing applications, where they are called large language models (LLMs) (exemplified in Figure 6-1). However, FMs have recently been applied to visual, multimodal, and multitask applications. These models are extremely large deep neural networks trained on immense datasets, with some models exceeding 100 billion learned parameters. FMs employ self-supervised learning (SSL), in which the model is presented with (x′, x) pairs, where x′ is an edited version of x from which some constituents have been excised. The model is trained to predict the excised constituents, with the full contents of each x supplying the training signal for the loss function. Typical edits for image- and video-based SSL include recoloring, rearranging the sections of an image or the frames of a video, and other geometric transformations. One of the main advantages of SSL is that the costly process of labeling training data is avoided, which can greatly simplify data curation for both training and testing.
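The (x′, x) construction can be sketched for the text case. The snippet below is illustrative only (real FMs mask at vastly larger scale and learn deep representations); it builds a masked sequence and the reconstruction targets that supply the training signal.

```python
import random

MASK = "<mask>"

def make_ssl_pair(tokens, mask_frac=0.15, rng=random):
    """Return (x_prime, targets): x_prime is the token sequence with some
    constituents excised (replaced by MASK); targets maps each masked
    position to the true token the model must learn to predict."""
    n_mask = max(1, int(len(tokens) * mask_frac))
    positions = rng.sample(range(len(tokens)), n_mask)
    x_prime = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = x_prime[pos]
        x_prime[pos] = MASK
    return x_prime, targets
```

No human labeling is needed: the original sequence itself provides the supervision, which is what makes the approach scale to internet-sized corpora.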

___________________

3 J. Lei, M. G’Sell, A. Rinaldo, R.J. Tibshirani, and L. Wasserman, 2018, “Distribution-Free Predictive Inference for Regression,” Journal of the American Statistical Association 113(523):1094–1111.

4 A. Angelopoulos, S. Bates, J. Malik, and M.I. Jordan, 2020, “Uncertainty Sets for Image Classifiers Using Conformal Prediction,” arXiv:2009.14193.

5 M. Casey, 2023, “Foundation Models 101: A Guide with Essential FAQs,” Snorkel AI, March 1, https://snorkel.ai/foundation-models.

FIGURE 6-1 The growth in the size of Large Language Models (LLMs). Due to computational requirements, it is unlikely that the exponential rate shown can continue indefinitely, but if the trend plateaus near its current size, only a small set of organizations will be able to develop future LLMs. SOURCE: Courtesy of NVIDIA.

Today, FMs represent the state of the art for natural language processing (NLP) tasks and consistently outperform the previous leaders: recurrent neural networks (RNNs) and long short-term memory (LSTM) models.

Candidate applications for early Air Force adoption include language translation, communications denoising, language and speaker identification, human-machine interfaces that use DAF-specific nomenclature and idioms, recommender systems for training and intelligence analyses, and data summarization for intelligence reporting. As computer vision (CV)-based FMs become mainstream, the Air Force can leverage them in missions that involve large amounts of intelligence, surveillance, and reconnaissance (ISR) data, where FMs will perform state-of-the-art detection, classification, identification, and tracking tasks. Ultimately, multi-modal and multi-task FMs will help fuse data from multiple sources and will help analysts, pilots, and commanders perform complex and time-critical tasks.


Implications of Foundation Models for DAF T&E

The Air Force may elect to build its own FMs or procure pre-trained FMs and adapt them to Air Force applications. In either case, it must address the testing challenges that accompany these huge and complex models. For example, methods are still being developed to test FMs, and what methods exist are highly human-intensive. The allure of FMs is that while initial training and T&E require huge datasets and computing resources, once the base FM has been trained and tested, it can be readily adapted to a broad suite of downstream applications. As a result, the amount of adaptation development and T&E required for each application is less than would be required if the application were created from scratch without the FM as a base model. Moreover, as improvements are made to the base FM, these newer versions can be re-integrated readily into the applications, thereby efficiently propagating improvements across the entire suite.
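The adaptation pattern just described, a frozen base model feeding a small trainable head, can be sketched as follows. The `base_features` function stands in for the pre-trained FM and is purely illustrative, as are the toy logistic head and its training loop.

```python
import math

def adapt_head(base_features, examples, lr=0.5, epochs=200):
    """Train only a small logistic-regression head on top of frozen
    base-model features; the base model itself is never updated."""
    dim = len(base_features(examples[0][0]))
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in examples:
            f = base_features(x)                       # frozen base model
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y                                  # log-loss gradient
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b

def head_predict(base_features, w, b, x):
    """Downstream prediction: frozen features plus the adapted head."""
    z = sum(wi * fi for wi, fi in zip(w, base_features(x))) + b
    return 1.0 / (1.0 + math.exp(-z))
```

Because only the head's few parameters are trained, each downstream adaptation needs far less data and compute than training from scratch, which is precisely why T&E effort per application shrinks, and why base-model defects propagate to every adaptation.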

Unfortunately, today’s FMs are huge “black-box” components, with hundreds of billions of learning parameters, that lack transparency, explainability, and interpretability. Failure isolation during T&E can be a major challenge. For example, if an adapted FM has a failure mode, it may be unclear whether the failure is due to the base FM, the adaptation, or the interaction between the two. Assigning accountability and correcting failures may be difficult, especially when the failures arise from the complex and subtle interplay between these components.

There are other issues as well. For example, FMs will likely propagate their failure modes and biases to their adaptations. Thus, if the DAF has used one base FM for many applications, they may all exhibit the same base FM vulnerabilities. Furthermore, while FM adaptations can perform extremely well, performance under transfer to new environments or under continual learning in evolving environments can be suboptimal compared to dedicated models.

The DAF may consider using commercial FMs and adapting them to Air Force applications. For instance, an FM trained on images may provide “off-the-shelf” image feature representations that could be used to train an Air Force EO image classifier. This framework is tantalizing in terms of the relatively fast development time and small computational resources required for training. However, pre-trained FMs may also present significant security risks. In particular, commercial FMs are generally trained on massive collections of uncurated data scraped from the internet. This means that an adversary may post online images or other data explicitly designed to poison the FM for a particular task, knowing they will be scraped into its training set. For instance, an adversary might upload a series of images of jets designed to shift how images of jets are represented by FMs and thereby affect downstream classifiers. Such attacks are almost impossible to detect, and accounting for this possibility is essential for accurate T&E.


Large FMs and data generators (discussed in Section 6.4) will require massive computational resources for training and T&E. Therefore, the DAF must consider strategies for access to supercomputing-class computers. One possibility is to partner with the Department of Energy national laboratories, such as Sandia National Laboratories or Oak Ridge National Laboratory. Another is to upgrade the DoD high-performance computing capability to handle these demanding AI workloads. Leasing capability from a major cloud provider should also be investigated. In any case, the solution must be readily accessible to both AI developers and T&E professionals and must protect data and AI software at multiple security levels.

Finding 6-2: Large language FMs exhibit a behavior termed “hallucination,” in which the model output is either nonsensical or inconsistent with the provided input or context. The effects of hallucination are task-dependent, and there are as yet no metrics to assess the impact of large FMs on the various downstream applications to which they have been applied.

Finding 6-3: Several large FMs are available for a single modality, language being the most dominant. DAF tasks, however, may involve multi-modal sensing and inference, and SSL-based large models for multi-modal paired or unpaired data are only now becoming available.

6.3 INFORMED MACHINE LEARNING MODELS

Although foundation and other data-driven deep neural network (DNN) models have become the mainstay of machine learning applications, newer approaches to deep learning are emerging that seek to explicitly incorporate more application-domain knowledge into the learning process.6 The committee refers to these approaches collectively as informed machine learning (IML).7 IML models seek variously to incorporate knowledge in the form of algebraic equations, differential equations, simulation results, spatial invariances, logic rules, knowledge graphs, probabilistic relations, and human feedback into the learning process or the model architecture.

IML approaches can improve model performance, generalizability across targeted domains, robustness, interpretability, and explainability. Fundamentally, IML models aim to improve the utility and trustworthiness of deep learning. Compared to FMs and other deep learning models, IML models can be relatively small and trained using fewer data samples. Furthermore, when an IML model incorporates general laws or constraints (of physics or geometry, for example), it can offer an improved ability to handle non-stationary environments and to generalize beyond the scope of its training set.

___________________

6 Conventional deep learning, of course, also integrates knowledge into its learning processes, through labeled data, feature engineering, and by exploiting invariances or equivariances (in convolutional neural networks, for example); but the IML techniques seek to integrate more knowledge and do so in a principled manner that does not depend on the data itself but, rather, on the domain whence the data derives.

7 L.V. Rueden, S. Mayer, K. Beckh, et al., 2021, “Informed Machine Learning—A Taxonomy and Survey of Integrating Prior Knowledge into Learning Systems,” IEEE Transactions on Knowledge and Data Engineering 35(1):614–633.
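Incorporating a physical law can be illustrated with a toy regression. The sketch below (illustrative parameter names; a hand-rolled gradient descent, not any specific IML framework) fits a trajectory y(t) = a·t² + v·t while penalizing deviation of the learned coefficient a from the known free-fall value −g/2.

```python
def informed_fit(data, g=9.81, lam=10.0, lr=0.01, epochs=3000):
    """Fit y(t) = a*t^2 + v*t by gradient descent on a combined loss:
    mean squared data error plus lam * (a + g/2)^2, so prior physics
    knowledge (free-fall acceleration) constrains the learned model."""
    a, v = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_a = grad_v = 0.0
        for t, y in data:
            err = a * t * t + v * t - y              # data-fit residual
            grad_a += 2.0 * err * t * t
            grad_v += 2.0 * err * t
        grad_a = grad_a / n + 2.0 * lam * (a + g / 2.0)  # physics penalty
        a -= lr * grad_a
        v -= lr * grad_v / n
    return a, v
```

The weight `lam` sets how strongly prior knowledge constrains the fit; with sparse or noisy data, the physics term keeps the model from drifting to physically implausible parameters.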

The DAF will find numerous uses for IML models, especially in physics-based applications such as radar, sonar, and EO/IR processing, and where models must be embedded in size-, weight-, and power-constrained platforms. IML will also apply to applied science research, such as the discovery of new materials for hypersonic systems or the assessment of aircraft designs under various operational conditions.8

Implications of Informed Machine Learning Models for DAF T&E

IML models represent an emerging area in machine learning, with many applications and research directions. The DAF needs to assess the T&E needs of these models in the context of relevant applications. Notwithstanding the nascent nature of these models, the principled incorporation of knowledge into machine learning will likely reduce the required test space and better characterize test coverage for DAF applications; this will allow more efficient testing at both the component and system levels. These models may also be more amenable to analytical verification processes based on, for example, the physical constraints programmed into them. The reduced size of these models and their ability to leverage and focus on causally related environmental features will contribute to model explainability and interpretability, thereby facilitating failure analysis and improving trustworthiness.

There will also be challenges to overcome for effective T&E. IML models derive their power from the integration of prior knowledge about the application domain, but this human-directed incorporation of knowledge may embed unconscious biases or unintended limitations in the models. Also, the development of IML models requires close collaboration between domain experts and machine learning experts; thus, T&E teams and processes must be multidisciplinary to properly implement efficient test approaches and interpret test results. Furthermore, physics-based information incorporated into ML systems may be an approximation of the true physics or may change over time or across instances. For example, an ML model may be trained for one radar sensor and work well in that context but yield poor results when used with a different sensor. Accounting for shifts in the physical knowledge between the training and testing phases is critical; while methods such as model adaptation can help overcome this challenge, such considerations are a vital component of T&E for IML systems. Finally, adversarial robustness may manifest differently in IML systems than in their more generic counterparts. While the side information embedded in IML systems may help reduce opportunities for data poisoning, for instance, it may also mean that new methods are necessary for identifying, counteracting, or safeguarding against poisoning attacks.

___________________

8 G.E. Karniadakis, I.G. Kevrekidis, L. Lu, et al., 2021, “Physics-Informed Machine Learning,” Nature Reviews Physics 3:422–440, https://doi.org/10.1038/s42254-021-00314-5.

6.4 AI-BASED DATA GENERATORS

AI-based data generation is an active and rapidly advancing area in machine learning research and development, with many novel AI techniques appearing in the past 10 years. In the visual domain, for example, generators range from generative adversarial networks (GANs)9 to variational autoencoders,10 autoregressive models,11 normalizing-flow techniques,12 and denoising diffusion models.13 Neural radiance field (NeRF) models have recently emerged that can generate multi-view 3D volumetric images from multiple 2D images. NeRFs are a type of informed machine learning (covered in the previous section) that combine neural networks and traditional geometry-based rendering techniques. In the text domain, generators include transformer-based architectures such as GPT-3 and ChatGPT.14

Data generators can create realistic augmented reality and virtual reality simulations; they can fill in missing data, extrapolate or predict from existing datasets, and realistically (or otherwise) perturb existing datasets. In short, they can simulate an existing reality or create fake but realistic variants of it. This is demonstrated in Figure 6-2, which shows photorealistic faces generated using a denoising diffusion model. Data generators can create realistic images of all sorts and sizes that are often hard for humans to detect as fabrications. They can create photorealistic faces (and other objects or gestures) and can morph one face (or object) into another or into a different aspect view of the same face (or object). They can extend beyond images to create realistic videos and audio. FMs, discussed earlier, can be used as text generators, creating realistic sentences, full paragraphs, and even essays that could plausibly come from intelligent (or otherwise) humans. By combining image generators and text generators, text-to-image and image-to-text transcription become possible. The DALL-E system is a modern example of the power of text-to-image generation, and ChatGPT is a modern example of text generation in response to prompts.

___________________

9 Google Machine Learning Education, 2022, “Generative Adversarial Network,” updated July 18, https://developers.google.com/machine-learning/gan.

10 J. Rocca and B. Rocca, 2019, “Understanding Variational Autoencoders (VAEs),” Towards Data Science, September 23, https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73.

11 “Guide to Autoregressive Models,” Turing, https://www.turing.com/kb/guide-to-autoregressive-models, accessed April 25, 2023.

12 A. Omray, 2021, “Introduction to Normalizing Flows,” Towards Data Science, https://towardsdatascience.com/introduction-to-normalizing-flows-d002af262a4b.

13 J. Ho, A. Jain, and P. Abbeel, 2020, “Denoising Diffusion Probabilistic Models,” University of California, Berkeley.

14 T. Brown, B. Mann, N. Ryder, et al., 2020, “Language Models Are Few-Shot Learners,” Advances in Neural Information Processing Systems 33:1877–1901.

FIGURE 6-2 Photorealistic faces generated using a denoising diffusion model. SOURCE: Courtesy of University of California, Berkeley.

Potential DAF uses of AI-based data generators are extensive. These include creating training scenarios for combat games or pilot training, generating data for influence operations, and training machine learning algorithms and autonomous systems for operation in simulations of denied or contested environments. There are also numerous applications to ISR in data extrapolation, smoothing, or interpolation. Today, for example, an AI-based Global Synthetic Weather Radar (GSWR) system has been prototyped for the DAF. The GSWR uses AI data generation techniques that integrate multiple data sources to predict how weather radar returns would appear in regions where they are absent.


Finding 6-4: Physics-based and other knowledge-informed models have the potential to increase the robustness and computational efficiency of data-driven methods. These models can also help enforce physics- or knowledge-based performance boundaries, which can increase the efficiency of the T&E process. However, to successfully deploy such models, the DAF must ensure that the parameters and assumptions on which they are based actually hold during operations, which requires additional attention to operational T&E.

Implications of AI-Based Data Generators to DAF T&E

Data generators will likely play a significant role in future DAF T&E activities. For example, they offer the capability to automate and accelerate the exploration of large test spaces using simulation; they can extrapolate from real data to generate unusual or special test datasets; they can be combined with live data and hardware-in-the-loop to support integration testing; and they can help evaluate concepts of operation and human-machine interactions.

However, the effective use of data generators for T&E will require rigorous T&E of the generators themselves. This need, in turn, calls for standardized evaluation and test metrics that probe the generators' vulnerabilities and limitations. Indeed, generated data may appear valid when it is, in fact, erroneous. For example, GANs are quite capable of generating “fake” images, but the fidelity of the fakes may be crucial in certain T&E activities, such as evaluating the robustness of an AI-based system in new environments. The quintessential question that T&E needs to answer is: Does the generated data properly represent the important aspects of its domain and intended use? More specifically, a generative model can be considered a tool for drawing samples from an estimate of the probability density underlying the training data, where that density is represented using a neural network. Generated images may look realistic, but testing procedures must ensure that all modes of the distribution are accurately captured and that rare but mission-critical events or samples are not ignored by the generative model.
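The mode-coverage concern can be made concrete with a toy check. The sketch below is illustrative only (all distributions, sizes, and thresholds are invented): it draws “real” data from a bimodal distribution, compares a faithful generator against one that silently drops the rare mode, and scores each by the fraction of real samples that lie near at least one generated sample.

```python
import bisect
import random

random.seed(0)

def sample_real(n):
    # Bimodal "truth": a common mode at 0.0 plus a rare, mission-critical mode at 6.0.
    return [random.gauss(0.0 if random.random() < 0.9 else 6.0, 0.5) for _ in range(n)]

def coverage(real, fake, eps=0.5):
    """Fraction of real samples lying within eps of at least one generated sample."""
    fake = sorted(fake)
    def covered(x):
        i = bisect.bisect_left(fake, x)
        neighbors = fake[max(0, i - 1):i + 1]  # nearest candidates on either side
        return any(abs(x - f) <= eps for f in neighbors)
    return sum(covered(x) for x in real) / len(real)

real = sample_real(1000)
faithful = sample_real(1000)                                  # reproduces both modes
mode_dropper = [random.gauss(0.0, 0.5) for _ in range(1000)]  # drops the rare mode

print(round(coverage(real, faithful), 2))      # close to 1.0
print(round(coverage(real, mode_dropper), 2))  # roughly 0.9: the rare mode goes untested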

The cost and effort to produce sufficiently realistic and useful data must be weighed against the cost and effort of other approaches, such as operational testing and analytical methods. Simulation testbeds that leverage data generators may be expensive to build initially, but their ability to test many situations rapidly could readily amortize the initial investment and lead to more cost-effective T&E overall.

Recommendation 6-3: The Department of the Air Force should assess the capabilities of data generators to enhance testing and evaluation in the context of relevant applications.

Data generators can exhibit significant biases. This phenomenon has been well documented in the context of racial and gender biases, but in DAF settings the biases may be unpredictable, and the DAF lacks tools for detecting unanticipated biases. Furthermore, data generators typically focus on producing “typical” samples from the training distribution, whereas in some settings the DAF may have a stronger interest in extreme or anomalous events. Therefore, DAF T&E must consider how data generators may affect our ability to understand systems in atypical operating conditions.
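One way to see the “typical samples” problem is a toy tail-rate comparison: a generator matched only to the bulk statistics of heavy-tailed data (here, just its mean and standard deviation) reproduces routine behavior while badly under-representing extreme events. All distributions and thresholds below are invented for illustration.

```python
import random
import statistics

random.seed(1)

# Hypothetical "real" data: mostly routine readings plus occasional extreme events.
real = [random.gauss(0, 1) if random.random() < 0.95 else random.gauss(0, 8)
        for _ in range(5000)]

# A "generator" fit only to bulk statistics: it matches the mean and standard
# deviation of the data but not the mechanism that produces the extremes.
mu, sigma = statistics.fmean(real), statistics.pstdev(real)
generated = [random.gauss(mu, sigma) for _ in range(5000)]

def tail_rate(xs, threshold=7.0):
    # Empirical frequency of extreme events beyond the threshold.
    return sum(abs(x) > threshold for x in xs) / len(xs)

print(tail_rate(real), tail_rate(generated))  # real data shows far more extremes
```

A T&E regime that relied on this generator for stress testing would see almost none of the extreme events that the operational environment actually produces.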

In addition, training state-of-the-art generative models can incur significant costs. For instance, it has been reported that training GPT-3 on 0.5 trillion words cost $4.6 million and generated 500 metric tons of carbon dioxide. It is unclear how the trend toward larger models will evolve, and it is likewise unknown what scale generative models for various DAF applications will need to reach. Generative models for computer-vision applications such as image generation are generally orders of magnitude smaller than GPT-3-scale generative language models, but training them can nevertheless require substantial computing resources. Future applications that combine language and computer-vision models (multi-modal generative models) will undoubtedly pose still greater computational challenges. In summary, generative AI model development and training costs may affect the use of such models in varied DAF contexts and limit the DAF's ability to address problems with generative models uncovered in T&E.

6.5 AI GAMING FOR COMPLEX DECISION-MAKING

Recent AI gaming technology, such as AlphaZero (Go, chess, and shogi) and Pluribus (poker), has demonstrated superhuman capabilities in extremely complex, albeit constrained, adversarial decision-making contexts. Reinforcement learning combined with deep learning is at the heart of these technologies, and very large computational resources are typically required. These systems are often bootstrapped with labeled training sets and then further trained through self-play. Recent models (e.g., AlphaZero) use self-play exclusively and require no labeled training data.
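The self-play recipe can be sketched at toy scale. The example below is a minimal illustration, not AlphaZero's algorithm: tabular Monte Carlo self-play with epsilon-greedy exploration on a tiny Nim game (10 stones, remove 1 to 3 per turn, whoever takes the last stone wins). With no labeled data, both sides share one value table and the learner recovers the known optimal strategy of leaving the opponent a multiple of four stones.

```python
import random
from collections import defaultdict

random.seed(0)
N, MAX_TAKE = 10, 3              # Nim variant: 10 stones, remove 1-3, last stone wins
Q = defaultdict(float)           # shared action-value table for both self-play sides
visits = defaultdict(int)

def actions(stones):
    return range(1, min(MAX_TAKE, stones) + 1)

def choose(stones, eps):
    # Epsilon-greedy: explore occasionally, otherwise take the best-known action.
    if random.random() < eps:
        return random.choice(list(actions(stones)))
    return max(actions(stones), key=lambda a: Q[(stones, a)])

def self_play(episodes=30000, eps=0.2):
    for _ in range(episodes):
        stones, history = N, []
        while stones > 0:
            a = choose(stones, eps)
            history.append((stones, a))
            stones -= a
        outcome = 1.0            # +1 for the side that took the last stone
        for s, a in reversed(history):
            visits[(s, a)] += 1
            Q[(s, a)] += (outcome - Q[(s, a)]) / visits[(s, a)]  # running average
            outcome = -outcome   # alternate perspective between the two sides

self_play()
# Optimal play from s stones (s % 4 != 0) is to take s % 4, leaving a multiple of 4.
print({s: choose(s, 0.0) for s in (5, 6, 7, 10)})
```

The same loop structure (play against yourself, credit moves by eventual outcome) underlies the large-scale systems, which replace the lookup table with a deep network and add search.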

AI board game systems have developed strategies of play that surpass those that humans have developed across centuries of over-the-board play. AlphaStar is even more sophisticated, with the ability to play StarCraft II at the grandmaster level. StarCraft is more challenging than typical board games, as shown in Table 6-1. AI researchers continue to develop more sophisticated AI gaming and decision-making capabilities, aiming to achieve superhuman-level decision-making in demanding and realistic situations.

TABLE 6-1 A Comparison of the Challenges Presented by Games Such as Chess and Go Versus the Challenges Presented by Wargames Such as StarCraft

“Simple” Board Games (e.g., Chess and Go)  | StarCraft-Like Environments
-------------------------------------------|-------------------------------------
Huge state space of possible moves         | Huge state space of possible moves
Fully observable                           | Partially observable
Two players                                | Multiple agents and types of agents
Turn-taking                                | Simultaneous movement
Deterministic                              | Stochastic observations and effects
Few rules, some context coupling           | Many rules, often context-dependent
Non–real time                              | Real time

As AI gaming technology continues to increase in sophistication, it will become an important technology for augmenting complex decision-making in Air Force autonomous systems, robotics, command and control, logistics, planning, and scheduling applications. In most circumstances, human-AI teaming will be a crucial element of success (including reinforcement learning with human feedback, or RLHF). In other circumstances, the gaming technology may need to operate completely autonomously for periods of time, such as in electronic warfare (EW) or cyber applications where superhuman response times are required. Coordination of multiple assets at large scale is another example where AI gaming technology may excel. For example, research today on using deep reinforcement learning to coordinate multiple drones in real time15 may translate to new swarm warfighting capabilities in the future. Based on recent successes as well as their future promise, the DAF should stay abreast of the latest advances in AI-enabled gaming technologies and explore how these capabilities might help enhance DAF missions. At the same time, the DAF AI T&E champion should ensure that such systems undergo the same type and level of T&E as any other AI-enabled weapon system.

Implications of AI Gaming for Complex Decision-Making to DAF T&E

Future advances in AI gaming and its foundational deep reinforcement learning (DRL) techniques will enable the Air Force to build systems that are more capable than ever before and that involve AI in more sophisticated and complex ways. This increased system complexity will pose new challenges for T&E. The teaming relationship between the human and AI elements will also likely be much more interrelated and complex. Thus, many tests will need to assess human-AI interactions and overall teaming effectiveness and will require more intricate user participation. This is typical of operational testing today, but the key point is that the T&E process will need to engage the user continuously, from the early stages of development through operation of the system. Indeed, one important way to address this challenge is for the Air Force to adopt the agile and continuous testing approaches that commercial industry currently uses for its complex AI-based systems.

___________________

15 A.T. Azar, A. Koubaa, N. Mohamed, et al., 2021, “Drone Deep Reinforcement Learning: A Review,” Electronics 10(9):999, https://doi.org/10.3390/electronics10090999.

In cases where the AI agent acts autonomously or generates complex decisions that exceed human capability, systems can fail in non-intuitive and potentially catastrophic ways. Thus, the Air Force should require that explainability and interpretability be key engineering goals, not just for individual AI components but for the entire system. Additional fail-safes and data-logging capabilities will need to be built into such systems. Safeguarding systems that provide appropriate performance guarantees can help narrow the test space of the overall system. Lessons can be learned from private-industry efforts to build autonomous automobiles, where continual testing, ghost AI hosting, early user involvement, human-AI teaming, and other techniques are being pioneered.

Notably, researchers are making important advances in safety-critical reinforcement learning. For example, control barrier functions have been shown to provide control-theoretic guarantees for obstacle avoidance. Hamilton-Jacobi reachability16 provides an exact formulation of the states that may lead to failure and can be used to formulate optimal safety control policies. While these techniques have difficulty generalizing and scaling, machine learning approaches are emerging that use them offline to learn approximate but effective safety control policies; computationally efficient, approximate but guaranteed safety-critical control is then applied online.17 This is a very active area of research, and the committee expects continued progress that will aid DAF autonomous systems T&E.
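The control barrier function idea can be illustrated in a few lines for a one-dimensional system, where enforcing the barrier condition reduces to simple clipping rather than the quadratic program used in practice. All dynamics, gains, and limits below are invented for illustration: the plant is a single integrator x[k+1] = x[k] + u[k], the safe set is x <= 5 with barrier h(x) = 5 - x, and requiring h(x + u) >= (1 - gamma) * h(x) yields the constraint u <= gamma * (5 - x).

```python
GAMMA = 0.5   # how quickly the barrier value is allowed to shrink per step
X_MAX = 5.0   # boundary of the safe set

def h(x):
    return X_MAX - x          # barrier function: h >= 0 exactly on the safe set

def safety_filter(x, u_nominal):
    # Largest input that keeps h from shrinking faster than the CBF condition allows;
    # the filter passes the nominal command through unless it would violate safety.
    u_max = GAMMA * h(x)
    return min(u_nominal, u_max)

def nominal_controller(x, target=10.0):
    return 0.8 * (target - x)  # aggressively chases a target OUTSIDE the safe set

x, trajectory = 0.0, []
for _ in range(50):
    u = safety_filter(x, nominal_controller(x))
    x = x + u
    trajectory.append(x)

print(max(trajectory))  # approaches but never exceeds 5.0
```

The appeal for T&E is that the filter's guarantee holds regardless of what the nominal (possibly learned) controller commands, which narrows the behaviors that must be tested.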

Finding 6-5: Recent and anticipated advances in AI gaming technologies will enable the Air Force to build systems that are more capable than ever before and that involve AI in more sophisticated ways, but this increased system complexity will make the teaming relationship between the human and AI elements much more interrelated and complex, thereby placing additional challenges on effective T&E.

6.6 AI FOUNDATIONS

In addition to the core areas discussed above, important research is progressing in foundational and theoretical AI. A foundational understanding of AI is akin to investments in a foundational understanding of medicine, biology, chemistry, and materials science. The DAF must have strong pillars on which to build, test, and evaluate AI systems. Testing and evaluation of AI-enabled systems require understanding the implicit biases and generalization properties of learned models, and when all potential operational scenarios cannot be tested explicitly, theory can provide invaluable insights.

For instance, many modern neural networks are “overparameterized”: the number of parameters learned during training far exceeds the amount of available training data. In these settings, the model can often interpolate the training data exactly, and the architecture of the neural network determines the nature of the learned interpolator; theory may provide insight into the interpolator as a function of architecture. Interpretable machine learning, which is essential to our ability to debug faulty systems, is a further foundational research challenge. Several empirical studies have shown that interpretability may come at the expense of accuracy, but there is no evidence that this is a fundamental or insurmountable challenge.

Foundations are essential to understanding how a learned model will perform under new operating conditions or how a model trained in one setting will perform in a shifted environment. Theory can also inform trustworthiness assessment through the development of new metrics. Privacy and stability guarantees, important safeguards in trustworthy AI, depend on a body of theoretical tools. Model compression is also ripe for theoretical advances and important to Air Force deployment and continual-learning settings with limited power. Finally, theory is essential to the development of new tools for uncertainty quantification that do not rely on assumptions about the distribution underlying the data or the properties of the learning algorithms or models.

___________________

16 S. Herbert, University of California, San Diego, “The Safe Autonomous Systems Lab,” http://sylviaherbert.com/hamilton-jacobi-reachability-analysis, accessed April 27, 2023.
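The overparameterized-interpolation point can be made concrete with a random-features toy model: with many more parameters than data points, the minimum-norm solution fits the training set exactly, and the fixed, randomly chosen feature map plays the role of the architecture in shaping which interpolator is selected. The feature map, sizes, and target function below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy overparameterized regression: 8 data points, 40 random Fourier features.
x_train = np.linspace(0.0, 1.0, 8)
y_train = np.sin(2 * np.pi * x_train)

W = rng.normal(0.0, 6.0, size=40)            # fixed random frequencies: the "architecture"
b = rng.uniform(0.0, 2 * np.pi, size=40)

def phi(x):
    # Random-feature map; its design shapes which of the infinitely many
    # data-fitting solutions the minimum-norm criterion selects.
    return np.cos(np.outer(x, W) + b)

# Among all weight vectors that fit the data exactly, the pseudoinverse picks the
# one with the smallest Euclidean norm (the solution gradient descent from zero
# initialization also converges to for linear least squares).
theta = np.linalg.pinv(phi(x_train)) @ y_train

train_err = np.max(np.abs(phi(x_train) @ theta - y_train))
print(train_err)  # essentially zero: the model interpolates the training data
```

Training error alone says nothing here, because every interpolator achieves zero; understanding how the architecture biases the choice among interpolators, and hence behavior off the training set, is exactly the theoretical question raised above.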
