
Testing, Evaluating, and Assessing Artificial Intelligence–Enabled Systems Under Operational Conditions for the Department of the Air Force

Proceedings of a Workshop—in Brief


On June 28–30, 2022, the National Academies of Sciences, Engineering, and Medicine’s Air Force Studies Board (AFSB) convened a hybrid workshop in support of its consensus study on testing, evaluating, and assessing artificial intelligence (AI)–enabled systems under operational conditions. The goals of the study are as follows:

  1. Evaluate and contrast current testing and assessment methods employed by the Department of the Air Force and in commercial industry.
  2. Consider examples of AI corruption under operational conditions and against malicious cyber-attacks.
  3. Recommend promising areas of science and technology that may lead to improved detection and mitigation of AI corruption.

The information summarized in this Proceedings of a Workshop—in Brief reflects the opinions of individual workshop participants. It should not be viewed as a consensus of the workshop’s participants, the AFSB, or the National Academies. The workshop planning committee heard from a wide range of experts from government, industry, and academia to help inform them about the Air Force Test Center’s (AFTC’s) ability to test, evaluate, and assess AI-enabled systems. The purpose of this workshop was to hear about how the U.S. Air Force (USAF) currently approaches AI testing and evaluation (T&E), industry approaches to testing AI, and challenges to AI testing. Exploration into other topic areas from the statement of task will be done in future data-gathering meetings by the workshop planning committee.

47TH CYBER TEST SQUADRON OVERVIEW

The first speaker was Jacob Martinez (47th Cyberspace Test Squadron [CTS]). Martinez began by giving a brief overview of the 47th CTS, which is part of the AFTC, and its two primary mission areas: providing test environments, including hardware- and software-based cloud environments, and conducting cybersecurity and resiliency activities for the Air Force’s kinetic and non-kinetic weapons. In essence, the 47th CTS examines not only physical capabilities but also software capabilities. Martinez also noted that the 47th CTS is primarily a “fee for service” organization. He explained that the squadron relies on normal, agile, and continuous methods of T&E, with the intent to focus on continuous T&E in the future. He also stated that the 47th CTS is primarily a developmental testing (DT) organization.


The discussion shifted toward Unified Platform (UP), a project that aims to integrate cyber capabilities, systems, infrastructure, and data analytics while allowing cyber operators to conduct numerous tasks across the full spectrum of cyber operations. It is also one of the five elements of the Joint Cyber Warfare Architecture. The 47th CTS worked on this project and looked at several vendors to help support the application of AI/machine learning (ML) in UP. It determined that investments to begin integrating AI/ML into UP are estimated to be anywhere from $75,000 to $255,000 per year in licensing costs alone.

Thomas A. Longstaff (Software Engineering Institute; workshop planning committee co-chair) was curious whether, within UP especially, Martinez’s group is focusing more on the tools and techniques within UP or on what is within the development, security, and operations (DevSecOps) chain on the testing side from the software factory. Martinez responded that the 47th CTS is tied into the DevSecOps pipeline process. Discussion then ensued about ownership and responsibility. Martinez stated that, ultimately, the end user is the one who assumes the risk and takes responsibility. Rama Chellappa (Johns Hopkins University; workshop planning committee member) asked how they currently recruit people who can be a step ahead and fully understand the implications of the system design, AI, and so on. Martinez responded that industry pays individuals with that level of expertise more than he can offer. Instead of using high salaries to entice talent, he suggested using the PALACE Acquire (PAQ) Internship Program.1 Martinez stated, “by embracing and offering training positions and PAQ internship positions, we not only get the latest training from academia, but we also can hold those individuals for 2 or 3 years and invest in them, in their education, and they invest in us by providing us new techniques and capability.” He clarified that this is not official policy but an idea he has proposed. Longstaff asked a final question regarding applying resilience testing to things that may have adaptive behavior. Martinez responded that the 47th CTS does have a mission in which it conducts cyber resilience testing. He also opined that resiliency testing will probably become integrated with future AI/ML requirements as they develop. The only issue is that acquiring and funding technological concepts takes a long time. Martinez has usually seen, within the Department of Defense (DoD), a 2- to 3-year gap before a concept is accepted, funded, and explored.

46TH TEST SQUADRON “KILL CHAIN DEVELOPMENTAL TEST”

Dave Coppler (46th Test Squadron [TS]) talked to the workshop planning committee about the 46th TS, a subordinate unit of the AFTC’s 96th Test Wing, and the importance of DT. He noted the importance of considering all stages of the “kill cycle,” also using the term “kill chain DT.” He gave an overview of the organization’s chain of command and mission statements. One of the squadron’s primary focuses is on the testing of kill chain–relevant systems.

Coppler transitioned to talking about DT and why it is essential. He stated that DT is necessary government work that helps to accelerate acquisition by leveraging unique expertise, facilities, equipment, and capabilities. The 46th TS supports the entire system life cycle to ensure that upgraded systems do not break any of the system’s initial capabilities. The squadron can also upgrade software with new capabilities and ensure that the upgrades work properly. It also provides highly qualified experts with proper clearances to engage customers at any level and provide the necessary support. Coppler also discussed the importance of the test environment that the 46th TS provides for DT.

Coppler fielded questions from the workshop planning committee. Longstaff asked if, within the simulated and emulated systems that the 46th TS is already using, it is considering incorporating more AI system behavior into its simulations (e.g., of the F-15E). Coppler responded that until the F-15Es, F-22s, and F-35s start incorporating AI into their platforms, the TS has no desire to do that. Longstaff followed up by asking if the TS is thinking about doing any automated AI-based behavior within the hardware for testing. Coppler stated that he thinks that is far off in the future.

__________________

1 The PAQ Internship Program is a paid, full-time, 2- to 3-year USAF program for graduates interested in a number of disciplines. More information can be found at the AFCS website, at https://afciviliancareers.com/recentgraduates.


The TS is still in the very early stages of developing the art of the possible. Chellappa asked about annotation and who does it. Coppler responded that the 46th TS does provide the truth data for physical things in real time, but it is not involved when the AI, for example, takes a deeper look at how data are being generated and used.

AI DT FOR COMMAND AND CONTROL

The next speaker, Marshall Kendrick (Air Operations Center Combined Test Force), opened by saying that the 45th TS is just getting started in the AI business. He then talked about the different efforts that the 45th TS is undertaking, many of which are in the big data/algorithm stage. He noted that most of these efforts have the potential to move into full AI/ML capabilities in the future. Last, Kendrick posed two questions that his organization has been tracking for the past few years: how to test AI and how to use AI to test and test better.

Kendrick talked about some of the squadron-level flight programs his organization is involved in, such as Air Ops Command and Control (C2), a space flight program that uses DoD’s Kobayashi Maru C2 software, and other programs. He also discussed the need for real-time data processing, as everything is constantly changing (potential threats, environments, etc.). AI can assist in this effort, particularly with the Advanced Battle Management System (ABMS) and the Joint All-Domain Command and Control (JADC2) vision. Kendrick then talked about ongoing efforts within the 45th TS. Lt. Gen. (Ret.) John N. Shanahan (USAF; workshop planning committee member) asked if the 45th TS would play a role in helping to develop some of the C2 capabilities that the Air Force is working on. Kendrick responded that they could absolutely play a role, particularly on the software side. Kendrick and Shanahan also discussed the operator’s role throughout the test process, the need to identify risk, and who accepts the risk.

Kendrick also discussed other ongoing efforts, such as cloud-based C2. He explained that this effort comes from the Air Force’s Rapid Capabilities Office as part of the ABMS work. They have already built the test data sets and are working directly with developers. Kendrick mentioned that he has people meeting with the developers to ensure that the 45th TS understands the developers’ test methodology and their data. The goal is to see how they can extend the test data and ensure that it covers all of the operational boundaries.

AI ASSURANCE

Jane Pinelis (Office of the DoD Chief Digital and Artificial Intelligence Officer) led the final presentation of the first day. She opened by defining AI assurance. She described AI assurance as the combination of T&E and responsible AI. She explained that the AI assurance process provides arguments and evidence to establish trustworthiness and justified confidence. She defined the goal as providing stakeholders with “justified confidence” that DoD AI-enabled systems meet requirements and support missions through ethical action. Stakeholders include the warfighter, commander, program manager, regulators, taxpayers, and others. She also talked extensively about the existing partnerships that the Chief Digital and Artificial Intelligence Office (CDAO) has and the different support it provides to these stakeholders.

Pinelis moved on to talk about the AI T&E process. The first step, algorithmic testing, is when reserved test data are used against a vendor’s model in a laboratory environment. The model is then tested in four areas: integrity testing, confidence assessment, robustness, and resilience. Integrity testing shows the model’s effectiveness using metrics such as the number of false positives, F1 score, precision and recall, and other data points. Pinelis also talked about a new method for calibrating “model competency,” in which a model trained on a specific data set is assessed in an operational environment. She noted the importance of the model competency step in assessing “domain adaptation,” or the model’s ability to perform in different operational environments at the same level as observed in the bench testing environment. Confidence assessment calculates the distance between a data point and everything on which the model has previously been trained. Pinelis mentioned that this type of test helps with things such as label prioritization. She then talked about robustness—specifically, natural perturbations—and how they are transitioning a tool from the Test Resource Management Center that will help identify edge cases in a test set.


Resilience was the final test area, where they focused specifically on adversarial action, whether it comes through adversarial AI or cyber means. It also measures the system’s ability to diagnose and recover from those attacks.
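As an illustrative sketch (not drawn from the workshop), the following Python snippet shows the kinds of quantities named above: integrity-testing metrics such as precision, recall, and F1 score derived from a confusion matrix, and a simple distance-based confidence assessment that flags operational data points far from anything in the training set. All names, data, and thresholds are hypothetical.

```python
import numpy as np

def integrity_metrics(y_true, y_pred):
    """Precision, recall, and F1 for a binary detection task."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "false_positives": int(fp)}

def confidence_assessment(x_test, x_train):
    """Distance from each test point to its nearest training point.

    Large distances suggest the model is being asked to operate far from
    anything it was trained on (useful for label prioritization).
    """
    dists = np.linalg.norm(x_test[:, None, :] - x_train[None, :, :], axis=-1)
    return dists.min(axis=1)

# Hypothetical usage
rng = np.random.default_rng(0)
x_train = rng.normal(size=(200, 8))          # features seen during training
x_test = rng.normal(loc=0.5, size=(20, 8))   # operational data, slightly shifted
print(integrity_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
print("nearest-training-point distances:", confidence_assessment(x_test, x_train)[:5])
```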

The second step is system integration, Pinelis said. This measures how well a model performs when plugged into a legacy system not intended to interact with AI. The key things that the CDAO looks for are functionality, reliability, interoperability, compatibility, and security.

The third step Pinelis described is human–system integration (HSI). This step involves inserting a human in the loop—that is, testing how a model works once it is mounted to a platform and in use. The CDAO tied the observe, orient, decide, act loop to DoD AI ethical principles to describe the HSI framework. She emphasized that human interactions with machines need to be maximally informative.

The final step is an operational test. Pinelis described this as both the easiest and the most challenging step. It is the toughest because, in her opinion, operationally testing AI-enabled systems, particularly autonomous ones, is very difficult. It is also the easiest because the CDAO always gets to collaborate with somebody when doing it. She then stated that the theory and methods behind operational testing are extraordinarily well developed and established. With AI, things have changed slightly. Tactical testing is an important part of the culture shift that avoids doing one big test at the end of the process and instead focuses on doing smaller but more frequent tests in multiple contexts and environments. There is also a push to evaluate the quality of decision making as a measure of performance. The final point focused on the idea that one cannot test for everything and that test culture needs to shift to becoming more risk accepting rather than risk averse.

Pinelis noted that the CDAO was working with the Office of the Secretary of Defense Director, Operational Test and Evaluation, on various AI T&E products that would be available throughout DoD (including T&E best practices, cloud-native test harnesses, a T&E bulk purchasing agreement, T&E tools, test products, and an AI Red Teaming handbook, among others). She ended her presentation by briefly discussing the different challenges of T&E and responsible AI. Longstaff asked how industry best practices would interact with a newly established T&E factory.2 Pinelis responded that they would absolutely continue to get industry’s tools and host them in the factory. She also stated that they try to keep the CDAO’s tools available to industry for items they build for the CDAO, but they do not share the test data. However, some tools are ones that the CDAO does not want widely advertised, for national security purposes. Discussion took place about how there are lessons to be learned from the private sector’s safety community for using AI in safety systems. Chellappa asked about domain adaptation and how Pinelis’s group will tackle it. Pinelis responded that they will do their best to train the system with the data that they have but that a lot of emphasis should be placed on learning after the system is fielded. She also talked about privacy and how data transformation and governance can be significant in keeping data useful while ensuring that identity is not recoverable. Last, Shanahan asked how the cultural shift away from separate traditional developmental and operational testing is coming along. Pinelis responded that integrative testing had been discussed for a long time but had not yet been implemented. Shanahan also touched on an AI mishap database and whether any thought had been put into that. Pinelis affirmed that they had thought about that and are establishing a database for responsible AI that will be a repository not just for incidents but also for tools and data.

DAY 1: WORKSHOP PLANNING COMMITTEE DISCUSSION

May Casterline (NVIDIA; workshop planning committee co-chair) raised a go-back question to Kendrick on whether the testing rigor that Pinelis described in her presentation was captured in their requirements. Kendrick responded that he has assessed whether rigor has been properly addressed, but that his assessment would probably not be the same as what Pinelis described in her presentation.

__________________

2 A T&E factory is a broad set of tools to empower non-experts in DoD to test a model when it arrives as a black box (i.e., when the model’s inner workings are difficult to understand). K. Foy, 2022, “Graph Exploitation Symposium Emphasizes Responsible Artificial Intelligence,” Massachusetts Institute of Technology, https://www.ll.mit.edu/news/graph-exploitation-symposium-emphasizes-responsible-artificial-intelligence.


Kendrick pointed out that it is difficult for a fee-for-service organization to solve a problem when they need a contract before hiring, tasking, building, and testing are available to address the problem. Shanahan observed that a philosophical question needs answering at the Air Force level, writ large, on establishing “who owns what part” of this difficulty and looking into the requirements process. Another point was that some of the language used, such as F1 and F2 scores, ROC curves, and false positives, is new for many people involved with Air Force T&E. He noted that this is not a typical T&E discussion. He followed on by saying that it sounds like the Air Force would like these terms to become part of the T&E discussion, but wondered how the Air Force builds toward that.

Chellappa and Shanahan discussed how someone would know if a new AI system is performing much better than what is already out there. For some, this was a central question in figuring out “what is good enough?”—something that is still unresolved. Coppler commented that during his time on active duty with the 53rd Wing, they would test “good enough” by measuring against what they already had. Chad Bieber (CDAO) agreed with Coppler and added that there are many ways to be good. Coppler jumped back in and posited that if an AI/ML algorithm does not perform as expected when tested, it may be doing something better than one ever thought possible. That point resonated with Longstaff, who raised his concern that sticking with the old regime of “testing to requirements” may result in the discarding of systems that yield surprisingly better results.

A final discussion ensued regarding the testing of large systems. Longstaff used JADC2 as an example—once one starts incorporating more AI capabilities, the nature of the entire system changes. How does one test an integrated system of that size and scale, one whose behavior changes based on how an adversary changes?

DAY 2: MORNING DISCUSSION

The workshop planning committee opened the day with a recap and discussion of the previous day. There was discussion regarding unknowns, such as the lack of ownership regarding liability and requirements. One workshop planning committee member commented on a contrast in approaches between the CDAO’s office and the test squadrons. Shanahan commented that at the end of the day, the Air Force has to come in at an Air Force level and decide the best way forward—the test center versus the warfare center—regarding roles and responsibilities. Tamara G. Kolda (MathSci.ai; workshop planning committee member) asked if there was a way to audit decisions and collect them as AI systems deploy. Coppler responded that without hooks in the AI algorithms, the test community has no idea how to look into those algorithms and understand what they are doing. Kolda asked if the inputs and outputs of an AI system are logged. Coppler responded that they were. Bieber added that it is not always a given that one can check the inputs and outputs of an AI in a box. The AI might be one component of larger software, which poses a fundamental problem: it cannot be instrumented after the fact because that might change how the software operates.

PRACTICAL GUIDE TO AI TESTING

Bieber spoke briefly about his background as a tester and his previous work at the Joint Artificial Intelligence Center.

Initially, Bieber spoke about metrics and metric development. When developing metrics, one needs to understand what metrics developers are using, understand how program management has defined requirements, and understand how to measure operational success—that is, the importance of soliciting the end user’s assessments of operational performance. He also talked about standards and how everyone makes their own tools and products. Unfortunately, this does not allow much in the way of interchangeability. Bieber also talked about tools and the CDAO’s work establishing a T&E software factory, as well as its vision for developing a “suitcase” test kit, which would allow AI T&E in situ. Bieber explained that such a capability would not only be invaluable in assessing competency (domain adaptation) but would likely also lead to the ability to “tune results” under operational conditions.


He then touched on modeling and simulation (M&S). He stated that M&S is vital to AI T&E. He talked specifically about the common worry, or complaint, regarding the exploding state space of AI systems. Bieber does not think of that as the biggest problem. The unique problem with AI is that we do not understand performance across that space well enough to predict behavior between two tested points, much less outside the area that is tested.

Bieber then presented a scenario regarding a dog-finding uncrewed aerial vehicle (UAV) used by emergency services. Within this scenario, he talked about different metrics and their uses, such as mean average precision (mAP), average precision, recall, and F-scores. Longstaff asked if Bieber could contrast mAP to accuracy. Bieber responded that precision, in the computer vision world, has a smaller, less overloaded definition than accuracy. Casterline and Chellappa discussed the applicability of some of the metrics that Bieber mentioned and commented that they are very computer-vision-centric. Chellappa stressed the need to understand the metrics and think more about what would work for AI-based systems. The workshop planning committee also discussed the idea of an algorithm deployed in the field that continuously learns during deployment. Bieber mentioned SmartSensor,3 which does have the ability to retrain rapidly. Trevor Darrell (University of California, Berkeley; workshop planning committee member) asked for Bieber’s thoughts on the idea of merging the culture of testing and development. He also asked for thoughts regarding identifying specific entity labels and not just a broad category, such as identifying a T-72 versus a tank. Bieber responded that he had seen the opposite problem, where they have tried to use computer vision to detect too far down the ontological hierarchy. Bieber also stressed the need for continuous testing. He stated, “We have to have the ability, if we’re doing continuous development, to do testing at the same speed as the development process.” He also spoke about competency testing and the different ways one can do it.
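As an illustrative sketch, not from Bieber’s talk, the following Python snippet computes average precision for a hypothetical dog-detection task by sweeping a confidence threshold over scored detections; mAP would simply average this quantity over object classes. All data and names are made up.

```python
import numpy as np

def average_precision(scores, is_true_positive, n_ground_truth):
    """Average precision (area under the precision-recall curve).

    scores: detector confidence for each predicted box
    is_true_positive: 1 if the prediction matched a ground-truth dog, else 0
    n_ground_truth: number of real dogs in the imagery
    """
    order = np.argsort(scores)[::-1]                 # sort detections by confidence
    tp = np.asarray(is_true_positive)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1 - tp)
    precision = cum_tp / (cum_tp + cum_fp)
    recall = cum_tp / n_ground_truth
    # integrate precision over recall (simple step-wise sum)
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))

# Hypothetical detections from a dog-finding UAV sortie
scores = [0.95, 0.90, 0.80, 0.60, 0.40]
matched = [1, 1, 0, 1, 0]        # whether each detection matched a real dog
print("AP:", average_precision(scores, matched, n_ground_truth=4))
```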

The discussion then shifted to autonomous vehicles and testing metrics, such as the number of user interrupts, to measure operational performance. Darrell spoke about how coming up with a commonsense tool that could look at and summarize a performance dump4 could be helpful. He also spoke about how it would be valuable to require some disclosure and the ability to benchmark against open systems. Bieber followed on and spoke about the challenges of a black-box system and being unable to look inside it. He did say that, while it would be useful to have full access to everything, the financial cost of full access may make it infeasible. Darrell suggested model cards and documentation confidentiality as middle-ground solutions. Bieber stated that the CDAO requires model cards. In closing, Kolda and Bieber engaged in discussion regarding data sequestration. Kolda also asked about model learning and whether the models that Bieber’s group receives are already trained. Bieber responded that once the algorithm was delivered and deployed, it did not change.

ROBUST AND RESILIENT AI

Olivia Brown (Massachusetts Institute of Technology [MIT] Lincoln Laboratory [LL]) spoke about how AI systems have great promise for DoD. However, they are demonstrably brittle and often vulnerable to different forms of data corruption, Brown said. She specifically named post-sensor digital perturbations as a form of corruption. Brown explained that there are natural and adversarial sources of vulnerability. A natural source could be an AI model trained only on images of upright chairs: when a chair is tipped over, the model could suffer a significant performance drop. Adversarial forms of vulnerability could involve deliberately manipulating an image’s pixels, causing the model to fail in correctly classifying inputs, according to Brown.
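To make the adversarial case concrete, here is a minimal sketch, not from Brown’s talk, of the classic fast gradient sign method (FGSM), in which an image’s pixels are nudged in the direction that most increases the model’s loss. The model, image, and label below are placeholders.

```python
import torch
import torch.nn as nn

def fgsm_perturb(model, image, label, epsilon=0.01):
    """Return an adversarially perturbed copy of `image` (FGSM).

    A small, pixel-wise change in the direction of the loss gradient can be
    enough to flip the model's classification while looking unchanged to a human.
    """
    image = image.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Hypothetical usage with a tiny stand-in classifier
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
image = torch.rand(1, 3, 32, 32)          # placeholder image in [0, 1]
label = torch.tensor([3])                 # placeholder true class
adv = fgsm_perturb(model, image, label)
print("max pixel change:", (adv - image).abs().max().item())
```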

__________________

3 Smart Sensor is a CDAO project delivering an on-platform, AI-enabled autonomy package that allows a UAV to conduct automated surveillance and reconnaissance functions in contested environments. Satnews, 2022, “DoD CDAO Partners with USAF to Conduct Developmental Test Flight of AI and Autonomy-Enabled Unmanned Aerial Vehicle,” Satnews, https://news.satnews.com/2022/06/23/dod-cdao-partners-with-usaf-to-conduct-developmental-test-flight-of-ai-and-autonomy-enabled-unmanned-aerial-vehicle.

4 A performance dump of the system is a collection of data from a service processor after a failure of the system, an external reset of the system, or a manual request. IBM, 2021, “Initiating the Performance Dump,” https://www.ibm.com/docs/en/power9/0009-ESS?topic=menus-initiating-performance-dump.


Brown then talked about the way machine learning models are currently trained. First, they undergo a design phase, where training data are collected and validated. The model is then tested on a test data set similar to the one on which it was trained, and the system is deployed. She noted that they often observed that performance of the deployed system in the operational domain was much worse than predicted during the test phase. This degraded performance reduces operator and user trust and results in the system going offline, reoptimization, and ultimately redesign, Brown noted.

Brown stated that the path to creating a more robust system starts at the opposite side of the development process. Talking to operators at the beginning of the system’s design phase is essential. In this way, the developer understands the operational environment into which the system will deploy. This awareness allows the programmers to consider potential sources of variation in the data that the system is likely to encounter. Next, the developer should establish a testing process that avoids testing only against a set similar to the training data; instead, one should test against perturbations of the training data or training distribution. Last, Brown advised training the model to perform better against perturbed data. Brown then spoke about the work at MIT LL in robust AI research that addresses new ways to tackle natural and adversarial sources of vulnerabilities. Brown highlighted tools like HydraZen5 and the Responsible AI Toolbox (rAI-toolbox),6 which will help Brown’s team at MIT continue its research on evaluating AI robustness. The workshop planning committee conversation then shifted toward different use cases that utilized these tools. Brown concluded by describing MIT LL’s next steps in supporting the development of robust and responsible AI.
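A minimal sketch of the kind of perturbation-based evaluation Brown described follows; this is an assumed illustration, not MIT LL’s actual pipeline or the rAI-toolbox API. Accuracy is measured on the clean test set and again under a small suite of natural corruptions, and the gap indicates brittleness.

```python
import numpy as np

# A suite of simple "natural" perturbations applied to image arrays in [0, 1]
def add_noise(x, sigma=0.1):
    return np.clip(x + np.random.normal(0, sigma, x.shape), 0, 1)

def darken(x, factor=0.5):
    return x * factor

def occlude(x, size=8):
    x = x.copy()
    x[..., :size, :size] = 0.0          # block out a corner patch
    return x

PERTURBATIONS = {"noise": add_noise, "darken": darken, "occlude": occlude}

def robustness_report(predict_fn, images, labels):
    """Compare clean accuracy with accuracy under each perturbation."""
    report = {"clean": np.mean(predict_fn(images) == labels)}
    for name, perturb in PERTURBATIONS.items():
        report[name] = np.mean(predict_fn(perturb(images)) == labels)
    return report

# Hypothetical usage with a stand-in classifier
images = np.random.rand(100, 3, 32, 32)
labels = np.random.randint(0, 10, size=100)
predict_fn = lambda x: (x.mean(axis=(1, 2, 3)) * 10).astype(int) % 10  # placeholder model
print(robustness_report(predict_fn, images, labels))
```

In practice the perturbation suite would be far richer and matched to the operational environment the developer learned about from operators.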

Longstaff asked if there was a way to specify a requirement that would allow them to test against the requirement once the robustness training was complete. He also asked how well the robustness pipeline works with non-vision-oriented AI. Brown responded that she does not necessarily have an answer, but that setting the requirements is very important. She added that MIT LL was exploring ways to use simulators and to figure out how to integrate those into the training process. Regarding the second question, Brown stated that MIT LL is moving beyond natural images and looking at radar and time series. Longstaff followed up and asked about data augmentation strategies. Brown responded that such strategies exist for training against a single type of perturbation, but in practice one normally faces a suite of them.

AI TRUST AND TRANSPARENCY

Michael Wellman (University of Michigan) opened his presentation with a brief discussion of his past work. He started with how trust and transparency are nothing new for AI. Trust in an AI system is ill-defined because people have different ways of defining it, Wellman said. Moreover, trust goes beyond AI systems—it applies to any software system or system that generates recommendations, information, or decisions. To Wellman, however, trust is not a necessary condition to use a system. Many instances exist where people use technology without understanding its full consequences, Wellman said.

Wellman then discussed an example of autonomous AI: stock trading. In certain instances, companies have employed AI to control large trading accounts that act autonomously in financial markets. In such settings, inserting a human in the loop is not feasible; by the time a human can do anything, the opportunity evaporates. He cited Knight Capital, a company where a software configuration error led to a loss of around $400 million that took the company down. Nevertheless, even with that kind of outcome, people did not stop trusting or using this technology, Wellman said.

Wellman then discussed transparency in AI systems. Specifically, he spoke about the common approach, called the explanation approach, used to interrogate the underlying model so that one can explain the decision or recommendation it produces. However, this approach has some dangers—mainly that it is easy to come up with an explanation that seems plausible and could be the reason for an underlying decision but that might not necessarily have a causal connection.


Wellman then presented a different approach: to limit oneself to models that are interpretable in the first place. In other words, the model has a certain simplicity or structure that one can discern directly, so the explanation one deduces from the model is causally related to an actual decision or recommendation. He maintained, however, that it is not always possible to do this.
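As a minimal illustration of the interpretable-by-construction idea (my example, not Wellman’s), a shallow decision tree can be read directly as rules, so the stated explanation is the actual decision logic rather than a post-hoc rationalization.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a deliberately small tree so its structure stays human-readable
data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# The printed rules ARE the model: every prediction follows one of these paths,
# so the "explanation" is causally tied to the decision by construction.
print(export_text(tree, feature_names=list(data.feature_names)))
```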

Wellman then introduced strategic domains. This approach considers decisions in worlds where the outcome depends on other agents’ actions. The finance and trading example discussed earlier is one example. He mentioned cybersecurity as a strategic domain because an attack or defense is always relative to the other party’s actions. Negotiation, monitoring, war gaming, or anything in conflict is also considered a strategic domain. Strategic domains present a transparency challenge—the decisions made in a strategic situation often require unpredictability. So, something like debugging is more challenging. Wellman concluded that from a designer’s perspective, it requires extra care to preserve transparency.

Wellman ended his presentation and opened up the discussion. Chris Serrano (HRL Laboratories) pointed out that, while we may not have a theorem on whether an attacker or a defender of a system will win, there is undoubtedly an idea of how the cost grows when defending a system versus attacking one. Wellman added that cost also determines who wins in the end. Longstaff and Wellman discussed counterfactuals and how to utilize them in dealing with the issue of inferring intent. Wellman explained that counterfactual queries could be used to infer intent—in this instance, identifying whether someone is a scammer. Shanahan asked a question regarding Wellman’s statement on trust not being a prerequisite for adoption. He asked if we are getting too detailed or “cute” with some of our existing systems, particularly given how basic their capabilities are right now. Wellman said he framed his stock trading example as a cautionary tale to show that it may not be possible to stop the use of a system that lacks full trust or confidence because the system will be too compelling. Sometimes that is worth embracing. However, there is always going to be a matter of measured risk. Chellappa said that there are four things to look at: domain adaptation, adversarial attacks, bias, and privacy. Wellman responded by discussing adversarial approaches in black-box situations. He said that the risk with domain adaptation is that things get deployed in situations for which they were not designed.

EVALUATION OF AI

Thomas Strat (DZYNE Technologies) started by going through the background of his company and presented some case studies about the types of AI work in which DZYNE Technologies is currently involved.

DZYNE Technologies is a small company that designs, builds, and operates autonomous aircraft, Strat said. These aircraft can range anywhere from small 6-pound aircraft to the largest UAVs that the military currently operates. The company also has an AI group of around 25 individuals who help to deploy AI capabilities on their aircraft, Strat said.

Strat then discussed semantic labeling from satellite imagery. Specifically, he talked about determining which algorithm is better, given two different image classifications. A qualitative approach to answering the question is useful because it allows you to look at the data from a visual perspective and ask whether they match your intuition. A quantitative approach allows one to use several metrics to determine accuracy. Strat pointed out that while there are many commonly used metrics, there is not one single obvious best metric to use in any given situation. He then stated that it is seldom clear from the outset which metrics to use, as the choice depends on many factors. Strat also pointed out that to do an evaluation, you must have some form of ground truth to compare against, and ground truth is not always complete and correct. The quality of that ground truth makes an important difference, he said. Overall, the key challenges concern trade-offs among the evaluation metrics one chooses and the quality of the ground truth, both of which make a big difference. Strat stated that some solutions to help with these challenges include having multiple metrics and pretraining a model without annotation.


Longstaff asked if any metrics explain the quality of ground truth. Strat responded that he was not sure, but during his time as a Defense Advanced Research Projects Agency (DARPA) program manager, they experimented with that. Chellappa stated that there are some models of label noise, but it is hard to figure out how good the ground truth is.
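As an illustrative sketch of the quantitative side of that comparison (hypothetical data, not DZYNE’s workflow), two candidate labelings can be scored against the same ground-truth mask with more than one metric, since no single metric is obviously best; disagreements between metrics are themselves informative.

```python
import numpy as np

def pixel_accuracy(pred, truth):
    return np.mean(pred == truth)

def iou(pred, truth, class_id=1):
    """Intersection over union for one class (e.g., 'building')."""
    p, t = pred == class_id, truth == class_id
    union = np.logical_or(p, t).sum()
    return np.logical_and(p, t).sum() / union if union else 1.0

# Hypothetical ground truth and two competing semantic labelings
rng = np.random.default_rng(1)
truth = (rng.random((64, 64)) > 0.7).astype(int)
algo_a = truth.copy()
algo_a[:8, :] = 0                                    # misses a strip of buildings
algo_b = (rng.random((64, 64)) > 0.65).astype(int)   # noisier overall

for name, pred in [("A", algo_a), ("B", algo_b)]:
    print(name, "accuracy:", round(pixel_accuracy(pred, truth), 3),
          "IoU:", round(iou(pred, truth), 3))
```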

Strat spoke about another case study regarding the area of building extraction. This case study aimed to highlight all of the buildings in an image and use brightness to determine building height. Strat pointed out that as one gets toward the perimeter of the image, off-axis pixels increase. He posed the question, “How do you evaluate the accuracy of these data sets?” You do not have ground truth that covers city-size areas with any accuracy, according to Strat.

Additionally, said Strat, any algorithm’s accuracy will not be uniform across something the size of an entire city. Cities are not uniform and have many factors that could affect the algorithm. For example, he stated that the heights of trees in a certain area could affect the ability to extract data properly.

Strat then shifted the discussion toward autonomous vehicles. First, he considered how to evaluate progress, noting that speed may not necessarily be the right measure. He then talked about the DARPA Grand Challenge. This challenge aimed to put autonomous vehicles to the test in a real-world environment—in particular, an operationally relevant environment such as a desert. Strat said that he favors attempting system-level tests in operational environments whenever possible, as he believes that there is nothing more convincing than doing that. He then covered autonomous aircraft—specifically, a long-endurance air platform (LEAP). LEAP has been in operation since 2016 in the Middle East in numerous combat operations. First, it was evaluated and tested using simulation and takeoff tests at military bases. LEAP then moved on to formal operational assessments in theater in the hands of the military, where it has been continually reassessed since its first use. At this point, Strat said, it has amassed more than 50,000 hours of operational use by the military in the Middle East. Strat then talked about the mishaps. Most of them have been mechanical, some were owing to hostile action, and a number were attributed to operator error or the human in the loop. According to Strat, zero mishaps were attributed to AI error. Instead, problems occurred when operators did not trust the AI and intervened, for example, in the aircraft’s landing approach. Last, Strat presented a video of ROBOPilot, an autonomous system that can fly an airplane. Over the span of a few years, the system was developed and trained by Strat’s team to fly an airplane with no human in the cockpit. Strat then spoke about the potential application of this robotic technology for the military.

Chellappa asked for Strat’s assessment of the efficacy of simulations. Strat stated that the answer to whether it is useful for AI algorithms is complicated, but why would it not be? The more veracity the simulation has, the less reason there is to doubt its efficacy for training or evaluating AI algorithms. Longstaff asked about ROBOPilot and how it compares to the full auto function that a 787 Max has. Strat responded that there is a market for autonomous flight. He specifically mentioned aircraft that may have been deemed unsafe for human flight. Strat also said that fly by wire is the way to go and that he would not necessarily put ROBOPilot up against a fully integrated autopilot system. He ended his talk by briefly discussing how interfacing with a human being is one of the most difficult challenges for AI because the human brain is so complex. It is much easier to interface with physics than it is with humans. As such, the ROBOPilot program is a much easier challenge to solve than what DARPA set out to do with the ALIAS program.

ASSURANCE: THE ROAD TO AUTONOMY

Jim Bellingham (Johns Hopkins Institute for Assured Autonomy) discussed his background in marine robotics and autonomous marine vehicles. He also talked extensively about the application of AI and autonomy in many vehicles that he had helped develop and utilize. In this talk, he also referenced several tools.


Bellingham explained that autonomy is everywhere: finance, logistics, military systems, the medical environment, and others. The big problem is assurance. He explained that for industry, it is a trade-off between assurance and insurance. He also said that assurance is a key to accelerating AI and autonomy.

Bellingham wrapped up his presentation by stating that robotics and autonomy will transform society. He reviewed some current societal drivers regarding future conflict, such as the lack of guidelines for managing escalation and the changing geography for future conflict (land, sea, air, space, cyber, etc.). Bellingham also shared that an enormous amount of research needs to be done regarding the connection between humans and AI. He ended by noting that getting ahead of the curve is important to slow down adversaries.

AI: CURRENT AND FUTURE CHALLENGES

Matt Turek (DARPA) began discussing current AI breakthrough applications, such as AlphaGo, Deep Blue, and more. However, even with all of this success, we may not be on the right trajectory with AI. For example, he brought up self-driving cars—specifically Tesla’s autopilot feature, and how it relies on computer vision, not multimodal sensing. He also spoke about how even when autopilot mode is engaged, Tesla holds the human drivers responsible. He continued by saying that users are not at a point where they can reliably delegate critical decisions to autonomy. He mentioned that there has been excitement in parts of the AI/ML community, but these systems are not working in ways comparable to humans. He noted that state-of-the-art large language models lack basic comprehension, fail to answer simple but uncommon questions or match simple captions and pictures, do not understand social reasoning, do not understand psychological reasoning, and do not understand how to track objects.

Turek then commented on his belief that the evaluation of AI/ML systems is broken. He talked about how we are chasing very narrow benchmarks and optimizing performance against those benchmarks. According to Turek, this just reinforces the building of narrow and relatively fragile AI systems. He then spoke about how current evaluation techniques do not encourage AI/ML systems to generalize. He also explained how current evaluation techniques do not reveal AI/ML fragilities.

Turek identified how DoD needs do not align with the focus of the AI/ML industry, as follows:

Industry is profit-driven, has access to massive amounts of data, has a low cost of errors, and faces threats from commercial adversaries. DoD is purpose-driven, has access to limited amounts of data, has a high cost of errors, and faces threats from active nation-state-level adversaries.

Turek then highlighted what may be possible as future national security–relevant capabilities:

  • Trustworthy autonomous agents who sense and act with superhuman speed and can adapt to new situations;
  • Intuitive AI teammates who can communicate fluently in human–native forms;
  • Agents that promote national security; and
  • Knowledge navigation for intelligence and accelerating defense technology development.

To realize some of these capabilities, Turek stressed the importance of investment in AI engineering, human context, and theory to help build robust DoD AI systems.

Turek closed by stressing the importance of:

  • Developing theories of deep learning and ML;
  • Measuring real-world performance by developing a rigorous experimental design that measures fundamental capabilities and produces generalizable systems;
  • Focusing not only on performance but also on resource efficiency;
  • Developing compositional models using principled approaches to exchange knowledge; and
  • Developing appropriately trustable AI systems that have predictable adherence to agreed-upon principles, processes, and alignment of purpose.

Shanahan responded to Turek’s comment that the trial-and-error approach to AI testing is no longer acceptable by saying that human beings do an awful lot of trial-and-error learning. He then asked if DARPA has been looking at hybrid approaches to solving the AI T&E problem. Turek responded that DARPA is interested in hybrid AI, particularly across statistical and symbolic approaches. Last, Longstaff asked Turek if we are going in the right direction in regard to making advances in the fundamentals of AI. Turek responded that he does not have a magic solution, but that his team is trying to set a vision for things that they think need to be done.

HUMAN–AI TEAMING IS UBIQUITOUS

Nancy Cooke (Arizona State University) was the day’s final speaker. She began by discussing human-AI teaming. She stated that AI could not be effectively developed or implemented without consideration of the human. AI does not operate in a vacuum and will interface with multiple humans and other AI agents.

Cooke then spoke about different aspects of teaming, specifically regarding team composition and role assignment, processes, development, and effectiveness measurement. She then talked about the Synthetic Teammate Project on which she is working. The project’s objective is to develop a synthetic AI teammate to take the place of air vehicle operators and work with two humans in the remotely piloted aircraft system task, Cooke said.

Cooke affirmed the importance of taking human–machine teaming seriously. She defined a team as two or more teammates with heterogeneous roles and responsibilities who work interdependently toward a common goal. She then commented on what is currently known regarding human–AI teaming. Her first point was that team members have different roles and responsibilities, which argues against having AI replicate humans. It also supports narrower AI, allowing AI to do what it is best at, such as big data analytics and visualization for humans. Her second point was that effective teams understand that each member has different roles and responsibilities, which avoids role confusion while members still back each other up as necessary. Cooke stated that AI should understand the whole task to provide effective backup. Her third point was that effective teams share knowledge about the team goals and the current situation; over time, this facilitates coordination and implicit communication. Cooke stated that human–AI team training should be considered and that we should not expect a human to be matched with an AI system and immediately know how to work well together. Her fourth point was that effective team members are interdependent and thus need to interact or communicate, even when direct communication is not possible. Cooke said this argues not necessarily for natural language but perhaps some other communication model. The fifth point was that interpersonal trust is important to human teams. Cooke stated that AI needs to explain, provide a reason for its decision, and be explicable.

Cooke then spoke about the challenge with human–AI teaming. She stated that research on human–AI teaming cannot wait until AI is developed; it is then too late to provide meaningful input. Instead, a research environment, or testbed, is needed to get ahead of the curve and conduct research that can guide AI development. She then discussed different examples of physical and virtual synthetic test environments. Next, she introduced a concept called the “Wizard of Oz” paradigm in which a human plays the role of the AI or even remotely operates a robot to simulate very intelligent AI in a task environment. She also spoke about the importance of measures and models when measuring different aspects of human–AI teaming effectiveness.

Longstaff asked, regarding the Synthetic Teammate Project, how she would write a requirement for someone else to develop that pilot program. He also asked how she would test what she got back from the developer to know if she got the right product. Cooke responded that she would write down the details and results of her team’s experiment and ask the developer to make it better so that it succeeds. She also stated that they would test it the same way her team tested it the first time.


Robin R. Murphy (Texas A&M University; workshop planning committee member) asked if the way to ensure that the synthetic agent is aware of its team responsibilities would be to develop an Adaptive Control of Thought—Rational (ACT-R) model of the entire operational space. Cooke responded that she thinks so. Chellappa asked how she sees human–AI teaming as different, better, or more complicated than human–computer interaction (HCI). Cooke responded that much is known about human systems integration that can be brought to bear on human–AI systems but that is not typically considered in HCI, where one person interacts with a product. She also stated that HCI had not done much in massive system areas such as JADC2. Shanahan asked if Cooke had run a separately controlled experiment with a three-AI team and no humans involved. He also asked if there were any takeaways regarding their work with HSI. Cooke responded that they had not done any three-agent teaming. She also stated that one of her takeaways was that AI is too often optimized on task work when it is important, in these complex systems, to optimize teamwork. Casterline commented that there are concepts of multiple agents learning to work better toward an objective function in robotics, game theory, and other fields. Robin Murphy asked about metrics. Specifically, how can one estimate whether one team is more likely to produce the right answers than another? Cooke responded that they have metrics and are trying to develop more, such as domain-independent measures. Longstaff and Cooke talked about AI training in the context of human–animal teaming. Robin Murphy asked about the feasibility of predicting the performance of a human–AI team. Cooke responded that she could not think of a way to do it without seeing the team perform, potentially in a training scenario.

DAY 2: WRAP-UP DISCUSSION

Casterline started the day’s wrap-up discussion by noting the contrast between one presentation’s view that people will not really be able to trust AI and so simply have to accept the risk, and cases in which AI trust really is a requirement and more rigor is necessary. David S. Rosenblum (George Mason University; workshop planning committee member) said that he was struck by the fact that many presentations made it seem hard to separate any discussion of T&E from the requirements against which the T&E is being performed. Owing to the narrow statement of task, Rosenblum questioned the extent to which the workshop planning committee would be concerned about saying anything about requirements. The workshop planning committee also broadly discussed the inability to avoid the question or discussion surrounding requirements when it comes to T&E. The workshop planning committee also spoke about other sectors where the consequences of AI would be high. Serrano mentioned that the question of liability resonated with him throughout the day and that we build these systems of systems inside some organization designed by committee. He asked, “Who is going to take ownership for how this thing should perform?” The workshop planning committee ended the day by discussing the disconnect between what happens in the development community and the research community.

VERTICAL DATA SCIENCE

Bin Yu (University of California, Berkeley) defined vertical data science as the process of extracting reliable and reproducible information from data with an enriched, technical language to communicate and evaluate empirical evidence in the context of human decisions and domain knowledge. Yu introduced the predictability, computability, and stability (PCS) framework. She stated that PCS is a way to unify, streamline, and expand on ideas and best practices in both ML and statistics. She also spoke about the importance of documentation.

Yu broke down each part of the PCS framework. Concerning problem formulation, predictability reminds us to keep in mind, while developing AI/ML algorithms, the future situations in which those algorithms will be used; the same reminder applies to data collection and data comparability. Concerning data comparability, stability reminds us that there are multiple reasonable ways to clean or curate a given data set from the current situation. Regarding data partitioning, stability reminds us that there could be multiple reasonable ways to partition a given data set from the current situation, and that the test set should be as similar to future situations as possible. Last, regarding other forms of data perturbations, stability reminds us that such perturbations should reflect future situations.
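The stability element of PCS lends itself to a simple computational check. The sketch below is an illustration of ours, not drawn from Yu's presentation: it trains the same model under two plausible data-curation choices and several train/test partitions and reports the spread of predictive performance, with a small spread suggesting that conclusions are stable to those choices. The cleaning functions and the choice of data set are hypothetical.

```python
# Illustrative sketch of a PCS-style stability check (not from the workshop):
# train the same model under several reasonable curation and partitioning
# choices and examine how much the predictive performance varies.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)

# Two hypothetical, equally reasonable "cleaning" choices.
def clean_identity(X):
    return X

def clean_standardize(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

scores = []
for clean in (clean_identity, clean_standardize):   # perturb the curation step
    for seed in range(5):                            # perturb the train/test partition
        Xtr, Xte, ytr, yte = train_test_split(
            clean(X), y, test_size=0.3, random_state=seed
        )
        model = LogisticRegression(max_iter=5000).fit(Xtr, ytr)
        scores.append(roc_auc_score(yte, model.predict_proba(Xte)[:, 1]))

# Predictability: is the average performance acceptable?
# Stability: is performance insensitive to the curation/partition choices?
print(f"mean AUC = {np.mean(scores):.3f}, spread = {np.max(scores) - np.min(scores):.3f}")
```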

Regarding comparing different predictive techniques, Longstaff asked if any quantitative measures were currently incorporated into the framework to help choose the best algorithm or technique. Yu responded by identifying two measures, sensitivity and specificity. Longstaff followed up by asking how to reason about the trade-offs between predictability and stability. Yu responded that her team screens for predictive performance before seeking stability. Casterline and Yu discussed translating operational requirements into the mathematical statistics to which Yu referred. Shanahan asked about justified confidence and how doctors and nurses attain that. Yu responded that it is important to understand their work and profession as much as possible when developing models. Kolda and Yu talked about embedding T&E with operations. Yu ended her talk by speaking about the importance of documentation and metrics.
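For reference, sensitivity and specificity are standard confusion-matrix quantities; the definitions below are general background rather than part of Yu's formal framework, and the example numbers are invented for illustration.

```latex
% Standard definitions, with TP, FP, TN, FN the counts of true/false positives/negatives.
\[
\text{sensitivity} = \frac{TP}{TP + FN},
\qquad
\text{specificity} = \frac{TN}{TN + FP}.
\]
% Example: with TP = 90, FN = 10, TN = 80, FP = 20,
% sensitivity = 90/100 = 0.90 and specificity = 80/100 = 0.80.
```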

AN APPLIED RESEARCH PERSPECTIVE ON ADVERSARIAL ROBUSTNESS AND TESTING

Nathan VanHoudnos (Software Engineering Institute) spoke about AI security and how adversaries can make systems do the wrong thing. First, he introduced the Bieler taxonomy, in which an adversary can make a system learn the wrong thing, do the wrong thing, or reveal the wrong thing. He then related these categories to data poisoning, adversarial patches, model inversion, and membership inference attacks. He stated that in their laboratory, they focus on training systems to learn correctly, do things correctly, and not reveal secrets. Regarding verifying a system, VanHoudnos spoke about their “Train and Verify” project, where they try to make robust ML systems that do not reveal secrets, as well as private ML systems that are not fooled as easily. Last, he introduced several other projects focused on protecting systems from the adversarial techniques mentioned above.

VanHoudnos defined AI corruption as a decrease in a quality attribute of an AI system. He then spoke about the different roles that different people play as teams try to accomplish different missions. The discussion then shifted toward evaluating ML models—specifically, that evaluations should reflect how models will be used in practice and the specific scenarios of importance to the application of the model. Thought should also be given to which metrics matter for the evaluation. Chellappa commented that one reason many benchmarks are averages is a desire to avoid someone optimizing the algorithm for just one point on the plot that may be operationally relevant. VanHoudnos then spoke about different examples of AI corruption. Casterline and VanHoudnos discussed adversarial patches in classification and the idea of using an adversarial patch to test against a retrained model. Casterline compared it to a cat-and-mouse game and wondered whether constantly devising counterattacks for a continually evolving stream of adversarial attacks is truly the right approach. Longstaff asked where in the requirements process one would know that a certain quality attribute of an AI-enabled system will be tested. VanHoudnos responded that he would have to defer to the DevSecOps folks. Longstaff then offered his own thoughts on the question: creating the quality attributes is a collaboration among operators, testers, and developers. VanHoudnos wrapped up by discussing the concept of justified confidence7 with Longstaff.
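The point that evaluations should reflect specific operational scenarios, and that benchmark averages can hide operationally relevant weak points, can be made concrete with a small sketch of ours (not the speakers'): compute the metric per scenario slice rather than only as a single aggregate. The scenario labels and records below are hypothetical.

```python
# Illustrative only: report accuracy per operational scenario instead of a single
# aggregate number, so that a weak point is not hidden inside the average.
from collections import defaultdict

# Hypothetical evaluation records: (scenario, true_label, predicted_label)
records = [
    ("clear_day", "vehicle", "vehicle"),
    ("clear_day", "clutter", "clutter"),
    ("night",     "vehicle", "clutter"),
    ("night",     "vehicle", "vehicle"),
    ("rain",      "vehicle", "clutter"),
    ("rain",      "clutter", "clutter"),
]

totals, correct = defaultdict(int), defaultdict(int)
for scenario, truth, pred in records:
    totals[scenario] += 1
    correct[scenario] += int(truth == pred)

overall = sum(correct.values()) / sum(totals.values())
print(f"overall accuracy: {overall:.2f}")
for scenario in totals:
    print(f"  {scenario}: {correct[scenario] / totals[scenario]:.2f}")
```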

DEFENDING AI SYSTEMS AGAINST ADVERSARIAL ATTACKS

Bruce Draper (DARPA) began his presentation by talking about different types of adversarial attacks against data models. He then spoke about algorithmic defenses for AI systems—specifically, regarding five best practices.

The first best practice discussed was cyber defense. Draper stated that networks are vulnerable, and most AI systems are attached to a network. He also stated that it is easier to attack a network than an AI; therefore, he suggested that to defend the AI, the focus must be on defending the network.

__________________

7 Justified confidence is about developing AI systems that are robust, reliable, and accountable, and ensuring that these attributes can be verified and validated. Northrop Grumman, 2021, “AI Development Aligns with US Department of Defense’s Ethics Principles,” https://news.northropgrumman.com/news/features/northrop-grumman-building-justified-confidence-for-integrated-artificial-intelligence-systems.

The second best practice discussed was protecting the input data—specifically, sensor data. Draper described two types of attacks: digital attacks, in which the adversary has access to the actual digital signal, which makes spoofing very easy; and physical attacks, which involve altering items in the physical world to trick a system. Draper noted that physical attacks are harder for an adversary to launch and easier to defend against. He ended by stressing the importance of protecting one's data.

The third best practice was about collecting inputs from multiple sources. Draper stated that it is harder to spoof multiple sensors than a single sensor. He also noted that using different types of sensors makes disruption even harder, and that having multiple instances of the same sensor type also offers some benefit.

The fourth best practice discussed was protecting model development. Draper urged everyone to be wary of externally acquired models: they may have back doors, whether unintentional or introduced through poisoning, and if an adversary has access to the model, white-box attacks become possible. Draper also stated that when training models, one should avoid using untrusted training data, avoid bootstrapping from untrusted models, and keep information about the training data private.

The fifth best practice was quality assurance post-fielding. Draper advised, when possible, to have a person double-check sampled AI outputs.

Draper then spoke about how one can increase system robustness. He introduced a few methods, such as adversarial training and randomized smoothing. In adversarial training, samples are attacked during the training process; this does not slow the system down at runtime, but it slows training. In randomized smoothing, the system waits for the input and then creates multiple perturbed versions of it; the trade-off is reversed, leaving training fast but slowing the system down at runtime. Draper noted that both of these methods require a known threat model; the danger is that if the adversary does something unanticipated, the methods will not work. Draper also spoke about some methods against physical attacks, such as tile-based defenses and patch detection defenses.
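As a rough illustration of the runtime cost described above, the sketch below (ours, not from the presentation) shows the basic randomized-smoothing idea: classify many noisy copies of the input and take a majority vote over the predictions. The `classify` function is a hypothetical stand-in for any trained model.

```python
# Minimal sketch of randomized smoothing at inference time (illustrative only):
# the extra forward passes over noisy copies are exactly the runtime cost noted above.
import numpy as np

def classify(x):
    """Hypothetical stand-in for a trained classifier's predicted label (0 or 1)."""
    return int(x.sum() > 0)

def smoothed_classify(x, sigma=0.25, n_samples=100, seed=0):
    """Majority vote over predictions on noisy copies of the input."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(2, dtype=int)   # two classes in this toy example
    for _ in range(n_samples):       # each iteration is an extra forward pass at runtime
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        votes[classify(noisy)] += 1
    return int(votes.argmax())

x = np.array([0.1, -0.05, 0.2])
print(smoothed_classify(x))
```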

Draper ended his talk by speaking about evaluation software and tools. He concluded with Armory, an evaluation tool from DARPA’s Guaranteeing AI Robustness Against Deception (GARD) program that allows analysts to run adversarial AI experiments at scale, quickly and repeatedly.

RECOGNITION SYSTEM EVALUATION

Ed Zelnio (Air Force Research Laboratory) spoke about imaging systems and the different types of data: sensor data, metadata, and labels. He also spoke about labeling—specifically, regarding granularity. Zelnio then spoke about different categories of target data, introducing the categories of library mission targets, library confusers, out-of-library confusers, and clutter. He also went over the difference between developmental and operational data.

Zelnio introduced “some things that would be nice to measure in terms of evaluation.” He spoke about measuring the reliability and confidence in a system, measuring understandability and trust of a system, measuring the robustness of a system, measuring the effectiveness of out-of-library confusers, measuring the performance of a performance model, figuring out what to do with limited data, and the need to talk about a sustainable end-to-end training process.

Zelnio ended by speaking about best practices. The first best practice is coming up with an expectation management agreement, which specifies the operating conditions under which a given system can be expected to work. The second best practice is the use of a test harness that can help reproduce training and aid in evaluating the algorithm and the training process. The third best practice is testing to break, to see what does and does not work.
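A test harness of the kind described above might, at minimum, pin random seeds, capture the data and configuration versions, and log metrics so that a training-and-evaluation run can be reproduced and compared later. The sketch below is a generic illustration under those assumptions, not a description of AFRL's harness; the `train_and_evaluate` function and the identifiers in the configuration are hypothetical.

```python
# Illustrative sketch of a minimal, reproducible test-harness run record
# (not AFRL's actual harness): fix seeds, capture configuration, log metrics.
import json
import random
import time

def train_and_evaluate(config):
    """Hypothetical stand-in for the real training-and-evaluation pipeline."""
    random.seed(config["seed"])
    return {"accuracy": round(random.uniform(0.7, 0.9), 3)}

config = {
    "seed": 1234,
    "dataset_version": "targets_v1",     # hypothetical identifier
    "model": "baseline_classifier",      # hypothetical identifier
    "operating_conditions": ["clear_day", "night"],
}

metrics = train_and_evaluate(config)
run_record = {
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    "config": config,
    "metrics": metrics,
}

# Persist the record so the run can be reproduced and compared against later runs.
with open("run_record.json", "w") as f:
    json.dump(run_record, f, indent=2)
print(run_record)
```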


Longstaff asked what has helped get customers over the barrier that allows them to increase their confidence in using an experimental system in an operational environment. Zelnio responded that the expectation management agreements are important in increasing that confidence. He also spoke about keeping demonstrations as relevant as possible to excite operators. Longstaff also asked if they could receive additional feedback from an operational customer over time that would allow retraining or retesting opportunities. Zelnio responded that it would be great to have a laboratory in the loop to help with this, but it happens more informally.

AFTC AI INFRASTRUCTURE NEEDS

Eileen Bjorkman (AFTC) was the workshop’s final speaker. Bjorkman spoke about AFTC’s current objective of looking at the unique infrastructure needs within the test center and across different organizations to set itself up to test autonomous systems.

Bjorkman spoke about three main things to think about in the testing process. First, test safety, particularly in making sure that an operator can contain a system if it begins to perform in ways that they do not expect. Second, early tester involvement and how testing strategies must be built into system design. The final need revolved around test infrastructure—specifically, instrumentation, data collection and storage, and range support.

Bjorkman also spoke about current T&E needs and how there is no enterprise-level T&E infrastructure to support autonomy testing. She also stated that there is no DoD enterprise-level software T&E infrastructure that supports the testing of AI. She then spoke about the different investment areas that she thinks are needed; these focus on architectures, frameworks, modular subsystems, data management, virtual ranges, an agile workforce, and surrogate platforms.

Bjorkman ended by discussing an autonomous system and AI roadmap that showed funding for different programs over a 7-year timeline. Next, Casterline and Bjorkman discussed the use of simulation work in their virtual environments. Longstaff asked whether they have begun incorporating digital twins into the testing process for autonomous systems. Bjorkman responded that she had seen that happening. Next, Shanahan, Casterline, and Bjorkman discussed data and the use of virtual testing environments. Bjorkman commented that one cannot get enough replications of events in the real world to test a system fully; she offered the example of autonomous cars, where one cannot simply go and drive every road 100 times. She also pointed out that in many of the tests they perform, they cannot collect anywhere near sufficient live data, which forces them into a virtual or even a constructive environment.
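The replication point can be made concrete with a standard statistical argument (ours, not Bjorkman's): to demonstrate with roughly 95 percent confidence that a per-trial failure probability is below a small value p using only failure-free live trials, the "rule of three" requires on the order of 3/p trials.

```latex
% Rule of three: if n independent trials all succeed, a ~95% upper confidence
% bound on the per-trial failure probability p satisfies (1 - p)^n <= 0.05,
% which gives approximately p <= 3/n, i.e., roughly n >= 3/p trials are needed.
\[
(1 - p)^n \le 0.05 \;\Longrightarrow\; n \ge \frac{\ln 0.05}{\ln(1 - p)} \approx \frac{3}{p}.
\]
% Example: demonstrating p below 10^{-4} requires on the order of 30{,}000
% failure-free trials, which is rarely achievable with live testing alone.
```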

WORKSHOP WRAP-UP AND DISCUSSION

The workshop planning committee began its wrap-up by discussing its final thoughts from the workshop. Casterline commented that she does not think any of the systems are prepared for the iteration that they will have to facilitate. She also said that there is a lot to “grab from” and apply here from the DevSecOps model for software. Shanahan commented on the culture shift toward iteration and adaptability; if the Air Force does not get that right, everything else is just another discussion about T&E. Chellappa commented on the idea of a centralized facility to test AI. He also commented that he does not think we know what it means to test AI right now. He specifically pointed out the metrics mean average precision (mAP) and recall and commented that these are ideas from the 1970s, questioning why they had become metrics for current AI systems. Kolda stressed that not everything is in the data. She also warned against the idea of trusting AI too much.
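For readers unfamiliar with the metrics Chellappa mentioned, the standard definitions are given below as general background; they were not derived during the workshop.

```latex
% Standard definitions: TP, FP, FN are counts of true positives, false positives,
% and false negatives; P(r) is precision as a function of recall on a ranked output.
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
\text{AP} = \int_{0}^{1} P(r)\, dr,
\]
% and mAP is the mean of AP over the classes (or queries) being evaluated.
```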

Additionally, Kolda commented that humans need to evaluate the answers that come out of an AI system and not just blindly accept them. Casterline commented on a gap in vernacular between algorithmic tests and measurements on the one hand and operationally relevant tests on the other. Kolda questioned how the process of continuous integration, evaluation, and feedback would work. Last, Robin Murphy offered her thoughts and spoke about how human work processes need to be considered in the discussion regarding AI corruption. She also commented on her fear that security is viewed as an algorithmic problem, and that the response will simply be to come up with another algorithm that detects when the AI is not working correctly.

Longstaff discussed the major questions from the statement of task. His first point, regarding the task of evaluating and contrasting current T&E, was that there is very little overlap between the way T&E is done commercially and the way that the workshop planning committee experienced it through the examples in the workshop presentations. Next, he focused on AI corruption and stated that the third question from the statement of task goes beyond being just a scientific technology question. It could also ask how DoD and the Air Force can improve the nature of how technological advances are incorporated. Rosenblum commented that, regarding the third question, he is worried that anything the workshop planning committee says will be outdated in a year or two owing to the rapid pace of technological change. Longstaff responded and stated that the workshop planning committee could point toward general trends instead of specific scientific advances. Last, Longstaff thanked everyone for their contributions, and staff member Ryan Murphy officially closed the workshop.


DISCLAIMER This Proceedings of a Workshop—in Brief was prepared by EVAN ELWELL as a factual summary of what occurred at the workshop. The statements made are those of the rapporteur or individual workshop participants and do not necessarily represent the views of all workshop participants; the planning committee; or the National Academies of Sciences, Engineering, and Medicine.

WORKSHOP PLANNING COMMITTEE MAY CASTERLINE (Co-Chair), NVIDIA; THOMAS A. LONGSTAFF (Co-Chair), Software Engineering Institute; CRAIG R. BAKER, Baker Development Group; ROBERT A. BOND, MIT Lincoln Laboratory; RAMA CHELLAPPA, Johns Hopkins University; TREVOR DARRELL, University of California, Berkeley; MELVIN GREER, Intel Corporation; TAMARA G. KOLDA, MathSci.ai; NANDI O. LESLIE, Raytheon Technologies; ROBIN R. MURPHY, Texas A&M University; DAVID S. ROSENBLUM, George Mason University; JOHN N. SHANAHAN, U.S. Air Force (Retired); HUMBERTO SILVA III, Sandia National Laboratories; REBECCA WILLETT, University of Chicago.

STAFF ELLEN CHOU, Director; GEORGE COYLE, Senior Program Officer; EVAN ELWELL, Research Associate; AMELIA GREEN, Senior Program Assistant (through July 2022); MARTA HERNANDEZ, Program Coordinator; RYAN MURPHY, Program Officer; ALEX TEMPLE, Program Officer; DONOVAN THOMAS, Finance Business Partner; CHARLES YI, Research Assistant.

REVIEWERS To ensure that it meets institutional standards for quality and objectivity, this Proceedings of a Workshop—in Brief was reviewed by LIDA BENINSON, National Academies of Sciences, Engineering, and Medicine; TED BOWLDS, U.S. Air Force (Retired); and JOHN N. SHANAHAN, U.S. Air Force (Retired). KATIRIA ORTIZ, National Academies of Sciences, Engineering, and Medicine, served as the review coordinator.

SPONSOR This workshop was supported by the U.S. Air Force.

For additional information regarding the workshop, visit https://www.nationalacademies.org/event/06-27-2022/testing-evaluating-and-assessing-artificial-intelligence-enabled-systems-under-operational-conditions-for-the-department-of-the-air-force-workshop.

Suggested citation: National Academies of Sciences, Engineering, and Medicine. 2023. Testing, Evaluating, and Assessing Artificial Intelligence–Enabled Systems Under Operational Conditions for the Department of the Air Force: Proceedings of a Workshop—in Brief. Washington, DC: The National Academies Press, https://doi.org/10.17226/26885.

Division on Engineering and Physical Sciences

Copyright 2023 by the National Academy of Sciences. All rights reserved.
