Testing, Evaluating, and Assessing Artificial Intelligence-Enabled Systems Under Operational Conditions for the Department of the Air Force

Proceedings of a Workshop-in Brief

On June 28-30, 2022, the National Academies of Sciences, Engineering, and Medicine's Air Force Studies Board (AFSB) convened a hybrid workshop in support of its consensus study on testing, evaluating, and assessing artificial intelligence (AI)-enabled systems under operational conditions. The goals of the study are as follows:

1. Evaluate and contrast current testing and assessment methods employed by the Department of the Air Force and in commercial industry.
2. Consider examples of AI corruption under operational conditions and against malicious cyber-attacks.
3. Recommend promising areas of science and technology that may lead to improved detection and mitigation of AI corruption.

The information summarized in this Proceedings of a Workshop-in Brief reflects the opinions of individual workshop participants. It should not be viewed as a consensus of the workshop's participants, the AFSB, or the National Academies. The workshop planning committee heard from a wide range of experts from government, industry, and academia to help inform them about the Air Force Test Center's (AFTC's) ability to test, evaluate, and assess AI-enabled systems. The purpose of this workshop was to hear about how the U.S. Air Force (USAF) currently approaches AI testing and evaluation (T&E), industry approaches to testing AI, and challenges to AI testing. Exploration into other topic areas from the statement of task will be done in future data-gathering meetings by the workshop planning committee.

47TH CYBER TEST SQUADRON OVERVIEW

The first speaker was Jacob Martinez (47th Cyberspace Test Squadron [CTS]). Martinez began by giving a brief overview of the 47th CTS, which is part of the AFTC, and its two primary mission areas: providing test environments for hardware and software type cloud environments and conducting cybersecurity and resiliency activities for the Air Force's kinetic and non-kinetic weapons. In essence, the 47th CTS looks at not only the physical capabilities but also software capabilities. Martinez also noted that the 47th CTS is primarily a "fee for service" organization. He explained that the squadron relies on normal, agile, and continuous methods of T&E, with the intent to focus on continuous T&E in the future. He also stated that the 47th CTS is primarily a developmental testing (DT) organization.
The discussion shifted toward Unified Platform (UP), a project that aims to integrate cyber capabilities, systems, infrastructure, and data analytics while allowing cyber operators to conduct numerous tasks across the full spectrum of cyber operations. It is also one of the five elements of the Joint Cyber Warfare Architecture. The 47th CTS worked on this project and looked at several vendors to help support the application of AI/machine learning (ML) in UP. It determined that investments to begin integrating AI/ML into UP are estimated to be anywhere from $75,000 to $255,000 per year in licensing costs alone.

Thomas A. Longstaff (Software Engineering Institute; workshop planning committee co-chair) was curious if, within UP especially, Martinez's group is focusing more on the tools and techniques within UP or on what is within the development, security, and operations (DevSecOps) chain on the testing side from the software factory. Martinez responded that the 47th CTS is tied into the DevSecOps pipeline process. Discussion then ensued about ownership and responsibility. Martinez stated that, ultimately, the end user is the one who assumes the risk and takes responsibility. Rama Chellappa (Johns Hopkins University; workshop planning committee member) asked for an explanation of how they currently recruit people who can be a step ahead and fully understand the implications of the system design, AI, and so on. Martinez responded that industry is paying individuals with that level of expertise more than what he can provide. Instead of using high salaries to entice talent, he suggested using the PALACE Acquire (PAQ) Internship Program.1 Martinez stated, "by embracing and offering training positions and PAQ internship positions, we not only get the latest training from academia, but we also can hold those individuals for 2 or 3 years and invest in them, in their education, and they invest in us by providing us new techniques and capability." This policy is not official, but an idea proposed by Martinez, he clarified. Longstaff asked a final question regarding applying resilience testing to things that may have adaptive behavior. Martinez responded that the 47th CTS does have a mission in which they do cyber resilience testing. The point was also made that resiliency testing will, in Martinez's opinion, probably become integrated with future AI/ML requirements as they may develop. The only issue is that acquiring and funding technological concepts takes a long time. Martinez has usually seen, within the Department of Defense (DoD), a 2- to 3-year gap between the time it takes for a concept to be accepted, funded, and explored.

1 The PAQ Internship Program is a paid, full-time, 2- to 3-year USAF program for graduates interested in a number of disciplines. More information can be found at the AFCS website, at https://afciviliancareers.com/recentgraduates.

46TH TEST SQUADRON "KILL CHAIN DEVELOPMENTAL TEST"

Dave Coppler (46th Test Squadron [TS]) talked to the workshop planning committee about the 46th TS, a subordinate to the AFTC's 96th Test Wing, and the importance of DT. He noted the squadron's importance in considering all stages of the "kill cycle," also using the term "kill chain DT." He gave an overview of the organization's chain of command and mission statements. One of the squadron's primary focuses is on the testing of kill chain-relevant systems.

Coppler transitioned to talking about DT and why it is essential. He stated that DT is necessary government work that helps to accelerate acquisition by leveraging unique expertise, facilities, equipment, and capabilities. The 46th TS supports the entire system life cycle to ensure that upgraded systems do not break any of the system's initial capabilities. They can also provide upgrades to the software with new capabilities and ensure that they work properly. They also provide highly qualified experts with proper clearances to engage customers on any level and provide the necessary support. Coppler also discussed the importance of the test environment that the 46th TS provides for DT.

Coppler fielded questions from the workshop planning committee. Longstaff asked if, within the simulated and emulated systems that the 46th TS is already using, it is considering incorporating more AI systems behavior into its simulated systems (e.g., the F-15E). Coppler responded that until the F-15Es, F-22s, and F-35s start incorporating AI into their platforms, the TS has no desire to do that. Longstaff followed up by asking if the TS is thinking about doing any automated AI-based behavior within the hardware for testing. Coppler stated that he thinks that is way off in the future. The TS is still in the very early stages of developing the art of the possible. Chellappa asked about annotation and who does it. Coppler responded that the 46th TS does provide the truth data for physical things in real time, but it is not involved when the AI, for example, takes a deeper look at how data are being generated and used.
AI DT FOR COMMAND AND CONTROL

The next speaker, Marshall Kendrick (Air Operations Center Combined Test Force), opened by saying that the 45th TS is just getting started in the AI business. He then talked about the different efforts that the 45th TS is undertaking, many of which are in the big data/algorithm stage. In the future, he noted that most of the efforts have the potential to move into full AI/ML capabilities. Last, Kendrick posed two questions that his organization has been tracking for the past few years: how to test AI and how to use AI to test and test better.

Kendrick talked about some of the squadron-level flight programs his organization is involved in, such as Air Ops Command and Control (C2), a space flight that uses DoD's Kobayashi Maru C2 program, and other programs. He also discussed the need for real-time data processing, as everything is constantly changing (potential threats, environments, etc.). AI can assist in this effort, particularly with the Advanced Battle Management System (ABMS) and the Joint All-Domain Command and Control (JADC2) vision. Kendrick then talked about ongoing efforts within the 45th TS. Lt. Gen. (Ret.) John N. Shanahan (USAF; workshop planning committee member) asked if the 45th TS would play a role in helping to develop some of the C2 capabilities that the Air Force is working on. Kendrick responded that they could absolutely play a role, particularly on the software side. Kendrick and Shanahan also discussed the operator's role throughout the test process, the need to identify risk, and who accepts the risk.

Kendrick also discussed other ongoing efforts, such as cloud-based C2. He explained that this effort comes from the Air Force's Rapid Capabilities Office as part of the ABMS work. They have already built the test data sets and are working directly with developers. Kendrick mentioned that he has people meeting with the developers to ensure that the 45th TS understands the developers' test methodology and their data. The goal is to see how they can extend the test data and ensure that it covers all of the operational boundaries.

AI ASSURANCE

Jane Pinelis (Office of the DoD Chief Digital and Artificial Intelligence Officer) led the final presentation of the first day. She opened by defining AI assurance. She described AI assurance as the combination of T&E and responsible AI. She explained that the AI assurance process provides arguments and evidence to establish trustworthiness and justified confidence. She defined the goal as providing stakeholders with "justified confidence" that DoD AI-enabled systems meet requirements and support missions through ethical action. Stakeholders include the warfighter, commander, program manager, regulators, taxpayers, and others. She also talked extensively about the existing partnerships that the Chief Digital and Artificial Intelligence Office (CDAO) has and the different support it provides to these stakeholders.

Pinelis moved on to talk about the AI T&E process. The first step, algorithmic testing, is when reserved test data are used against a vendor's model in a laboratory environment. Next, the model is tested in four areas: integrity testing, confidence assessment, robustness, and resilience. Integrity testing shows the model's effectiveness using metrics such as the number of false positives, F1 score, precision, recall, and other data points. Pinelis also talked about a new method called calibrating "model competency," where someone uses trained data on a specific data set deployed in an operational environment. She noted the importance of the model competency step in assessing "domain adaptation," or the model's ability to perform in different operational environments at the same level as observed in the bench testing environment. Confidence assessment calculates the distance between a data point and everything on which the model has previously been trained. Pinelis mentioned that this type of test helps with things such as label prioritization. She then talked about robustness, specifically natural perturbations, and how they are transitioning a tool from the Test Research Management Center that will help identify edge cases in a test set. Resilience was the final test, where they specifically focused on adversarial action and whether it comes through adversarial AI or cyber. It also measures the data set's ability to diagnose and recover from those attacks.
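The proceedings do not include the underlying arithmetic, but a minimal sketch of the kind of calculation behind the integrity metrics named above, and of a simple nearest-neighbor style confidence assessment, might look like the following for a binary classification task. The function names and the distance choice are purely illustrative and are not drawn from CDAO practice.

    import numpy as np

    def integrity_metrics(y_true, y_pred):
        """Error counts plus precision, recall, and F1 for a binary task."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        tp = int(np.sum((y_pred == 1) & (y_true == 1)))
        fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
        fn = int(np.sum((y_pred == 0) & (y_true == 1)))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return {"false_positives": fp, "precision": precision, "recall": recall, "f1": f1}

    def confidence_score(x_new, x_train):
        """Distance from a new input to its nearest neighbor in the training data;
        a large value flags an input unlike anything the model was trained on."""
        return float(np.min(np.linalg.norm(np.asarray(x_train) - np.asarray(x_new), axis=1)))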
The second step is system integration, Pinelis said. This measures how well a model performs when plugged into a legacy system not intended to interact with AI. The key things that the CDAO looks for are functionality, reliability, interoperability, compatibility, and security.

The third step Pinelis described is human-system integration (HSI). This step involves inserting a human in the loop, that is, when a model is mounted to a platform and works. They tied the observe, orient, decide, act loop to DoD AI ethical principles to describe the HSI framework. She emphasized that human interactions with machines need to be maximally informative.

The final step is an operational test. Pinelis described this as both the easiest and the most challenging step. It is the toughest because, in her opinion, operationally testing AI-enabled systems, particularly autonomous ones, is very difficult. It is also the easiest because the CDAO always gets to collaborate with somebody when doing it. She then stated that the theory and methods behind operational testing are extraordinarily well developed and established. With AI, things have changed slightly. Tactical testing is an important part of the culture shift that avoids doing one big test at the end of the process and instead focuses on doing smaller but more frequent tests in multiple contexts and environments. There is also a push to evaluate the quality of decision making as a measure of performance. The final point focused on the idea that one cannot test for everything and that test culture needs to shift to becoming more risk accepting rather than risk averse.

Pinelis noted that the CDAO was working with the Office of the Secretary of Defense Director, Operational Test and Evaluation, on various AI T&E products that would be available throughout DoD (to include T&E best practices, cloud-native test harnesses, a T&E bulk purchasing agreement, T&E tools, test products, and an AI Red Teaming handbook, among others). She ended her presentation by briefly discussing the different challenges of T&E and responsible AI. Longstaff asked how industry best practices would interact with a newly established T&E factory.2 Pinelis responded that they would absolutely continue to get industry's tools and host them in the factory. She also stated that they try to keep the CDAO's tools available to industry for items they build for the CDAO, but they do not share the test data. However, some tools are ones that the CDAO does not want widely advertised, for national security purposes. Discussion took place about how there are lessons to be learned from the private sector's safety community for using AI in safety systems. Chellappa asked about domain adaptation and how Pinelis's group will tackle it. Pinelis responded that they will do their best to train the system with the data that they have but that a lot of emphasis should be placed on learning after the system is fielded. She also talked about privacy and how data transformation and governance can be significant in keeping data useful while ensuring that identity is not recoverable. Last, Shanahan asked about the cultural shift between traditional developmental testing/operational testing and how that is coming along. Pinelis responded that integrative testing had been discussed for a long time but had not yet been implemented. Shanahan also touched on an AI mishap database and whether any thought had been put into that. Pinelis affirmed that they had thought about that and are establishing a database for responsible AI that will be a repository not just for incidents but also for tools and data.

2 A T&E factory is a broad set of tools to empower non-experts in DoD to test a model when it arrives as a black box (i.e., when the model's inner workings are difficult to understand). K. Foy, 2022, "Graph Exploitation Symposium Emphasizes Responsible Artificial Intelligence," Massachusetts Institute of Technology, https://www.ll.mit.edu/news/graph-exploitation-symposium-emphasizes-responsible-artificial-intelligence.
DAY 1: WORKSHOP PLANNING COMMITTEE DISCUSSION

May Casterline (NVIDIA; workshop planning committee co-chair) raised a go-back question to Kendrick on whether or not the testing rigor that Pinelis described in her presentation was captured in their requirements. Kendrick responded that he has assessed whether rigor has been properly addressed, but that his assessment would probably not be the same as what Pinelis described in her presentation. Kendrick pointed out that it is difficult for a fee-for-service organization to solve a problem when they need a contract before hiring, tasking, building, and testing are available to address the problem. Shanahan observed that a philosophical question needs answering at the Air Force level, writ large, on establishing "who owns what part" of this difficulty and looking into the requirements process. Another point was that some of the language used, such as F1, F2 scores, ROC curves, and false positives, is new for many people involved with Air Force T&E. He noted that this is not a typical T&E discussion. He followed on by saying that it sounds like the Air Force would like these terms to become part of the T&E discussion, but wondered how the Air Force builds toward that.

Chellappa and Shanahan discussed how someone would know if a new AI system is performing much better than what is already out there. This thought was a central question for some in figuring out "what is good enough?", something that is still unresolved. Coppler commented that during his time on active duty with the 53rd wing, they would test "good enough" by measuring against what they already had. Chad Bieber (CDAO) agreed with Coppler and added that there are many ways to be good. Coppler jumped back in and posited that if an AI/ML algorithm does not perform as expected when tested, it may be doing something better than one ever thought possible. Longstaff resonated with that point and brought up his concern that sticking with the old regime of "testing to requirements" may result in the discarding of systems that yield surprisingly better results.

A final discussion ensued regarding the testing of large systems. Longstaff used JADC2 as an example: once one starts incorporating more AI capabilities, the nature of the entire system changes. How does one test that and begin to think about what to do to test an integrated system of that size and scale, an integrated system incorporating behavior and change based on how an adversary changes?

DAY 2: MORNING DISCUSSION

The workshop planning committee opened the day with a recap and discussion of the previous day. There was discussion regarding unknowns, such as the lack of ownership regarding liability and requirements. One workshop planning committee member commented on a contrast in approaches between the CDAO's office and the test squadrons. Shanahan commented that at the end of the day, the Air Force has to come in at an Air Force level and decide the best way forward (the test center versus the warfare center) regarding roles and responsibilities. Tamara G. Kolda (MathSci.ai; workshop planning committee member) asked if there was a way to audit decisions and collect them as AI systems deploy. Coppler responded that without hooks in the AI algorithms, the test community has no idea how to look into those algorithms and understand what they are doing. Kolda asked if the inputs and outputs of an AI system are logged. Coppler responded that they were. Bieber added that it is not always a given that one can check the inputs and outputs of an AI in a box. AI might be a larger component of the software, and it has a fundamental problem: it cannot be instrumented after the fact because that might change how the software operates.

PRACTICAL GUIDE TO AI TESTING

Bieber spoke briefly about his background as a tester and his previous work at the Joint Artificial Intelligence Center.

Initially, Bieber spoke about metrics and metric development. When developing metrics, one needs to understand what metrics developers are using, understand how program management has defined requirements, and understand how to measure operational success, that is, the importance of soliciting the end user's assessments of operational performance. He also talked about standards and how everyone makes their own tools and products. Unfortunately, this does not allow much in the way of interchangeability. Bieber also talked about tools and the CDAO's work establishing a T&E software factory, as well as its vision for developing a "suitcase" test kit, which would allow AI T&E in situ. Bieber explained that such a capability would not only be invaluable in assessing competency (domain adaptation) but would likely also lead to the ability to "tune results" under operational conditions. He then touched on modeling and simulation (M&S). He stated that M&S is vital to AI T&E. He talked specifically about the common worry, or complaint, regarding the exploding state of the AI space. Bieber does not think of that as the biggest problem. AI's unique problem is that it does not understand the performance across that space well enough to predict behavior between two points, much less outside the area it tests.
Bieber then presented a scenario regarding a dog-finding uncrewed aerial vehicle (UAV) used by emergency services. Within this scenario, he talked about different metrics and their uses, such as mean average precision (mAP), average precision, recall, and f-scores. Longstaff asked if Bieber could contrast mAP to accuracy. Bieber responded that precision, in the computer vision world, has a smaller, less overloaded definition than accuracy.
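As a purely illustrative aside (not presented at the workshop), the reason detection work leans on average precision rather than accuracy can be seen in a short sketch: average precision summarizes the precision-recall trade-off across every confidence threshold a detector might use, rather than scoring one fixed threshold.

    import numpy as np

    def average_precision(scores, matched, num_ground_truth):
        """scores: detection confidences; matched: True where a detection matched a
        real object; num_ground_truth: number of real objects in the test imagery."""
        order = np.argsort(scores)[::-1]              # rank detections by confidence
        matched = np.asarray(matched, dtype=bool)[order]
        tp = np.cumsum(matched)
        fp = np.cumsum(~matched)
        recall = tp / num_ground_truth
        precision = tp / (tp + fp)
        # area under the precision-recall curve via a simple step-wise sum
        ap = precision[0] * recall[0] + np.sum(precision[1:] * np.diff(recall))
        return float(ap)

Mean average precision (mAP) is then simply this quantity averaged over the object classes of interest.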
Casterline and Chellappa discussed the applicability of some of the metrics that Bieber mentioned and commented that they are very computer-vision-centric. Chellappa stressed the need to understand the metrics and think more about what would work for AI-based systems. The workshop planning committee also discussed the idea of an algorithm deployed in the field that continuously learns during deployment. Bieber mentioned Smart Sensor,3 which does have the ability to retrain rapidly. Trevor Darrell (University of California, Berkeley; workshop planning committee member) asked for Bieber's thoughts on the idea of merging the culture of testing and development. He also asked for thoughts regarding identifying specific entity labels and not just a broad category, such as identifying a T-72 versus a tank. Bieber responded that he had seen the opposite problem, where they have tried to use computer vision to detect too far down the ontological hierarchy. Bieber also stressed the need for continuous testing. He stated, "We have to have the ability, if we're doing continuous development, to do testing at the same speed as the development process." He also spoke about competency testing and the different ways one can do it.

The discussion then shifted to autonomous vehicles and testing metrics, such as the number of user interrupts, to measure operational performance. Darrell spoke about how coming up with a commonsense tool that could look at and summarize a performance dump4 could be helpful. He also spoke about how it would be valuable to require some disclosure and the ability to benchmark against open systems. Bieber followed on and spoke about the challenges of a black-box system and being unable to look inside it. He did say that while it would be useful to have full access to everything, the financial cost of having full access may not be feasible. Darrell suggested model cards and documentation confidentiality as middle-ground solutions. Bieber stated that the CDAO requires model cards. In closing, Kolda and Bieber engaged in discussion regarding data sequestration. Kolda also asked about model learning and whether the models that Bieber's group receives are already trained. Bieber responded that once the algorithm was delivered and deployed, it did not change.

3 Smart Sensor is a CDAO project delivering an on-platform, AI-enabled autonomy package that allows a UAV to conduct automated surveillance and reconnaissance functions in contested environments. Satnews, 2022, "DoD CDAO Partners with USAF to Conduct Developmental Test Flight of AI and Autonomy-Enabled Unmanned Aerial Vehicle," Satnews, https://news.satnews.com/2022/06/23/dod-cdao-partners-with-usaf-to-conduct-developmental-test-flight-of-ai-and-autonomy-enabled-unmanned-aerial-vehicle.

4 A performance dump of the system is a collection of data from a service processor after a failure of the system, an external reset of the system, or a manual request. IBM, 2021, "Initiating the Performance Dump," https://www.ibm.com/docs/en/power9/0009-ESS?topic=menus-initiating-performance-dump.

ROBUST AND RESILIENT AI

Olivia Brown (Massachusetts Institute of Technology [MIT] Lincoln Laboratory [LL]) spoke about how AI systems have great promise for DoD. However, they are demonstrably brittle and often vulnerable to different forms of data corruption, Brown said. She specifically named post-sensor digital perturbations as a form of corruption. Brown explained that there are sources of natural and adversarial forms of vulnerability. A natural source could be when an AI model trains on upright chairs; when the chairs are tipped over, the model could suffer a significant performance drop. Adversarial forms of vulnerability could involve deliberately manipulating an image's pixels, causing the model to fail in correctly classifying inputs, according to Brown.
Brown then talked about the current way that machine models train. First, they undergo a design phase, where training data are collected and validated. The model is then tested on a test data set similar to the one on which it trained. The system then deploys. She noted that they often observed that performance of the deployed system in the operational domain was much worse than predicted during the test phase. This degraded performance reduces operator and user trust and results in the system going offline, reoptimization, and ultimately redesign, Brown noted.

Brown stated that the path to creating a more robust system starts at the opposite side of the development process. Talking to operators at the beginning of the system's design phase is essential. In this way, the developer understands the operational environment into which the system will deploy. This awareness allows the programmers to consider potential sources of variation in the data that the system is likely to encounter. Next, the developer should establish a testing process that avoids experimenting against a test set similar to the system's training. Instead, one should test against perturbations of that training data or of that training distribution. Last, Brown advised training the model to perform better against perturbed data. Brown then spoke about the work at MIT LL in robust AI research that addresses new ways to tackle natural and adversarial sources of vulnerabilities. Brown highlighted tools like Hydra-Zen5 and the Responsible AI Toolbox (rAI-toolbox),6 which will help Brown's team at MIT continue its research on evaluating AI robustness. The workshop planning committee conversation then shifted toward different use cases that utilized these tools. Brown concluded by describing MIT LL's next steps in supporting the development of robust and responsible AI.

5 See the Hydra-Zen site, at https://github.com/mit-ll-responsible-ai/hydra-zen.

6 See the Responsible AI Toolbox site, at https://github.com/mit-ll-responsible-ai/responsible-ai-toolbox.
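A minimal sketch of the kind of perturbation-based evaluation described above, assuming an image classifier with a simple predict interface, might look like the following. The perturbation function and the model object are placeholders for illustration only and do not represent MIT LL's tooling.

    import numpy as np

    def perturb(images, severity=0.1, seed=0):
        """Toy natural perturbation: a brightness shift plus sensor-style noise."""
        rng = np.random.default_rng(seed)
        noisy = images * (1.0 + severity) + rng.normal(0.0, severity, images.shape)
        return np.clip(noisy, 0.0, 1.0)

    def robustness_gap(model, x_test, y_test, severity=0.1):
        """Accuracy on clean test data minus accuracy on perturbed test data; a large
        gap is the kind of brittleness that otherwise shows up only after deployment."""
        clean_acc = np.mean(model.predict(x_test) == y_test)
        perturbed_acc = np.mean(model.predict(perturb(x_test, severity)) == y_test)
        return float(clean_acc - perturbed_acc)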
Longstaff asked if there was a way to specify a requirement that would allow them to test against the requirement once the robustness training was complete. He also asked how well the robustness pipeline works with non-vision-oriented AI. Brown responded that she does not necessarily have an answer, but that setting the requirements is very important. She added that MIT LL was exploring ways to use simulators and to figure out how to integrate those into the training process. Regarding the second question, Brown stated that MIT LL is moving beyond natural images and looking at radar and time series. Longstaff followed up and asked about data augmentation strategies. Brown responded that these strategies exist to train against a single type of perturbation, but you will (normally) have a suite of them.

AI TRUST AND TRANSPARENCY

Michael Wellman (University of Michigan) opened his presentation with a brief discussion of his past work. He started with how trust and transparency are nothing new for AI. Trust in an AI system is ill-defined because people have different ways of defining it, Wellman said. Moreover, trust goes beyond AI systems; it applies to any software system or system that generates recommendations, information, or decisions. To Wellman, however, trust is not a necessary condition to use a system. Many instances exist where people use technology without understanding its full consequences, Wellman said.

Wellman then discussed an example of autonomous AI: stock trading. In certain instances, companies have employed AI to control large trading accounts that act autonomously in financial markets. Indeed, inserting a human in the loop is not feasible; by the time a human can do anything, the opportunity evaporates. He cited a company, Knight Capital, where a software configuration error led to a loss of around $400 million that took the company down. Nevertheless, even with that kind of outcome, people did not stop trusting or using this technology, Wellman said.

Wellman then discussed transparency in AI systems. Specifically, he spoke about the common approach, called the explanation approach, used to interrogate the underlying model so that one can explain the decision or recommendation it produces. However, this approach has some dangers, mainly that it is easy to come up with an explanation that seems plausible and could be the reason for an underlying decision, but that might not necessarily have a causal connection. Wellman then presented a different approach: to limit oneself to models that are interpretable in the first place. In other words, the model has a certain simplicity or structure that one can discern directly, so the explanation that the model deduces is causally related to an actual decision or recommendation. He maintained, however, that it is not always possible to do this.
Wellman then introduced strategic domains. This approach considers decisions in worlds where the outcome depends on other agents' actions. The finance and trading example discussed earlier is one example. He mentioned cybersecurity as a strategic domain because an attack or defense is always relative to the other party's actions. Negotiation, monitoring, war gaming, or anything in conflict is also considered a strategic domain. Strategic domains present a transparency challenge: the decisions made in a strategic situation often require unpredictability. So, something like debugging is more challenging. Wellman concluded that from a designer's perspective, it requires extra care to preserve transparency.

Wellman ended his presentation and opened up the discussion. Chris Serrano (HRL Laboratories) pointed out that, while we may not have a theorem on whether an attacker or a defender of a system will win, there is undoubtedly an idea of how the cost grows when defending a system versus attacking one. Wellman added that cost also determines who wins in the end. Longstaff and Wellman discussed counterfactuals and how to utilize them in dealing with the issue of inferring intent. Wellman explained that using counterfactual queries could infer intent, in this instance, identifying whether someone is a scammer. Shanahan asked a question regarding Wellman's statement on trust not being a prerequisite for adoption. He asked if we are getting too detailed or "cute" with some of our existing systems, particularly given how basic their capabilities are right now. Wellman said he framed his stock trading example as a cautionary tale to show that it may not be possible to stop a system without full trust or confidence because it will be compelling. Sometimes that is worth embracing; however, there is always going to be a matter of measured risk. Chellappa said that there are four things to look at: domain adaptation, adversity of attacks, bias, and privacy. Wellman responded with adversarial approaches in black-box situations. He said that the risk with domain adaptation is that things get deployed in situations for which they were not designed.

EVALUATION OF AI

Thomas Strat (DZYNE Technologies) started by going through the background of his company and presented some case studies about the types of AI work in which DZYNE Technologies is currently involved.

DZYNE Technologies is a small company that designs, builds, and operates autonomous aircraft, Strat said. These aircraft can range anywhere from small 6-pound aircraft to the largest UAVs that the military currently operates. The company also has an AI group of around 25 individuals who help to deploy AI capabilities on their aircraft, Strat said.

Strat then discussed semantic labeling from satellite imagery. Specifically, he talked about determining which algorithm is better, given two different image classifications. A qualitative approach to answering the question is useful because it allows you to look at the data from a visual perspective and ask if they represent your intuition. A quantitative approach allows one to use several metrics to determine accuracy. Strat pointed out that while there are many commonly used metrics, there is not one single obvious best metric to use in any given situation. He then stated that it is seldom clear from the outset what metrics to use, which depends on many factors. Strat also pointed out that to do an evaluation, you must have some form of ground truth to compare against, and ground truth is not always complete and correct. The quality of that ground truth makes an important difference, he said. Overall, some key challenges concern the trade-offs among the evaluation metrics that someone chooses and the quality of the ground truth, both of which make a big difference. Strat stated that some solutions to help with these challenges include having multiple metrics and pretraining a model without annotation.
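To make the metric trade-off concrete, here is a small, hypothetical illustration (not from Strat's talk) of two standard ways of scoring a predicted label map against ground truth; because they weight errors differently, they can rank two competing algorithms differently.

    import numpy as np

    def pixel_accuracy(pred, truth):
        """Fraction of pixels labeled the same as ground truth."""
        return float(np.mean(pred == truth))

    def mean_iou(pred, truth, classes):
        """Intersection-over-union averaged over classes; misses on a small class
        cost far more here than they do under overall pixel accuracy."""
        ious = []
        for c in classes:
            intersection = np.sum((pred == c) & (truth == c))
            union = np.sum((pred == c) | (truth == c))
            if union:
                ious.append(intersection / union)
        return float(np.mean(ious)) if ious else 0.0

An algorithm that ignores a rare but operationally important class can still score well on overall pixel accuracy while scoring poorly on mean intersection-over-union.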
Longstaff asked if any metrics explain the quality of ground truth. Strat responded that he was not sure, but during his time as a Defense Advanced Research Projects Agency (DARPA) program manager, they played around with that. Chellappa stated that there are some models of label noise, but it is hard to figure out how good the ground truth is.

Strat spoke about another case study regarding the area of building extraction. This case study aimed to highlight all of the buildings in an image and use brightness to determine building height. Strat pointed out that as one gets toward the perimeter of the image, off-axis pixels increase. He posed the question, "How do you evaluate the accuracy of these data sets?" You do not have ground truth that covers city-size areas with any accuracy, according to Strat.

Additionally, said Strat, any algorithm's accuracy will not be uniform across something the size of an entire city. Cities are not uniform and have many factors that could affect the algorithm. For example, he stated that the heights of trees in a certain area could affect the ability to extract data properly.

Strat then shifted the discussion toward autonomous vehicles. First, he considered how to evaluate progress, noting that speed may not necessarily be the way to measure it. He then talked about the DARPA Grand Challenge. This challenge aimed to put autonomous vehicles to the test in a real-world environment, in particular, an operationally relevant environment such as a desert. Strat said that he favors attempting system-level tests in operational environments whenever possible, as he believes that there is nothing more convincing than doing that. He then covered autonomous aircraft, specifically, a long-endurance air platform (LEAP). LEAP has been in operation since 2016 in the Middle East in numerous combat operations. First, it was evaluated and tested using simulation and takeoff tests at military bases. LEAP then moved on to formal operational assessments in theater in the hands of the military, where it has been continually reassessed since its first use. At this point, Strat said, it has amassed more than 50,000 hours of operational use by the military in the Middle East. Strat then talked about the mishaps. Most of them have been mechanical, some were owing to hostile action, and a number were attributed to operator error or the human in the loop. According to Strat, zero mishaps were attributed to AI error. Instead, when the operators did not trust the AI, problems occurred, such as intervening in the aircraft's landing approach. Last, Strat presented a final video showing off ROBOPilot, an autonomous system that can fly an airplane. Over the span of a few years, the system was developed and trained by Strat's team to fly an airplane with no human in the cockpit. Strat then spoke about the potential application of this robotic technology for the military.

Chellappa asked for Strat's assessment of the efficacy of simulations. Strat stated that the answer to whether it is useful for AI algorithms is complicated, but why would it not be? The more veracity the simulation has, the less reason there is to doubt its efficacy for training or evaluating AI algorithms. Longstaff asked about ROBOPilot and how it compares to the full auto function that a 787 Max has. Strat responded that there is a market for autonomous flight. He specifically mentioned aircraft that may have been deemed unsafe for human flight. Strat also said that fly by wire is the way to go and that he would not necessarily put ROBOPilot up against a fully integrated autopilot system. He ended his talk by briefly discussing how interfacing with a human being is one of the most difficult challenges for AI because the human brain is so complex. It is much easier to interface with physics than it is with humans. As such, the ROBOPilot program is a much easier challenge to solve than what DARPA set out to do with the ALIAS program.

ASSURANCE: THE ROAD TO AUTONOMY

Jim Bellingham (Johns Hopkins Institute for Assured Autonomy) discussed his background in marine robotics and autonomous marine vehicles. He also talked extensively about the application of AI and autonomy in many vehicles that he had helped develop and utilize. In this talk, he also referenced several tools.
Bellingham explained that autonomy is everywhere: finance, logistics, military systems, the medical environment, and others. The big problem is assurance. He explained that for industry, it is a trade-off between assurance and ensurance. He also said that assurance is a key to accelerating AI and autonomy.

Bellingham wrapped up his presentation by stating that robotics and autonomy will transform society. He reviewed some current societal drivers regarding future conflict, such as the lack of guidelines for managing escalation and the changing geography for future conflict (land, sea, air, space, cyber, etc.). Bellingham also shared that an enormous amount of research needs to be done regarding the connection between humans and AI. He ended by noting that getting ahead of the curve is important to slow down adversaries.

AI: CURRENT AND FUTURE CHALLENGES

Matt Turek (DARPA) began discussing current AI breakthrough applications, such as AlphaGo, DeepBlue, and more. However, even with all of this success, we may not be on the right trajectory with AI. For example, he brought up self-driving cars, specifically Tesla's autopilot feature, and how it relies on computer vision, not multimodal sensing. He also spoke about how even when autopilot mode is engaged, Tesla holds the human drivers responsible. He continued by saying that users are not at a point where they can reliably delegate critical decisions to autonomy. He mentioned that some people have been excited by progress in parts of the AI/ML community, but these systems are not working in ways comparable to humans. He noted that state-of-the-art large language models lack basic comprehension, fail to answer simple but uncommon questions or match simple captions and pictures, do not understand social reasoning, do not understand psychological reasoning, and do not understand how to track objects.

Turek then commented on his belief that the evaluation of AI/ML systems is broken. He talked about how we are chasing very narrow benchmarks and optimizing performance against those benchmarks. According to Turek, this just reinforces the building of narrow and relatively fragile AI systems. He then spoke about how current evaluation techniques do not encourage AI/ML systems to generalize. He also explained how current evaluation techniques do not reveal AI/ML fragilities.

Turek identified how DoD needs do not align with the focus of the AI/ML industry, as follows: industry is profit-driven, has access to massive amounts of data, has a low cost of errors, and faces threats from commercial adversaries. DoD is purpose-driven, has access to limited amounts of data, has a high cost of errors, and faces threats from active nation-state-level adversaries.

Turek then highlighted what may be possible as future national security-relevant capabilities:

* Trustworthy autonomous agents who sense and act with superhuman speed and can adapt to new situations;
* Intuitive AI teammates who can communicate fluently in human-native forms;
* Agents that promote national security; and
* Knowledge navigation for intelligence and accelerating defense technology development.

To realize some of these capabilities, Turek stressed the importance of investment in AI engineering, human context, and theory to help build robust DoD AI systems.

Turek closed by stressing the importance of:

* Developing theories of deep learning and ML;
* Measuring real-world performance by developing a rigorous experimental design that measures fundamental capabilities and produces generalizable systems;
* Focusing not only on performance but also on resource efficiency;
* Developing compositional models using principled approaches to exchange knowledge; and
* Developing appropriately trustable AI systems that have predictable adherence to agreed-upon principles, processes, and alignment of purpose.
⢠Developing appropriately trustable AI systems for humans. Her second point was that effective teams that have predictable adherence to agreed-upon understand that each member has different roles and principles, processes, and alignment of purpose. responsibilities that avoid role confusion but back each other up as necessary. Cooke stated that AI should Shanahan responded to Turekâs comment that the trial- understand the whole task to provide effective backup. and-error approach to AI testing is no longer acceptable Her third point was that effective teams share knowledge by saying that human beings do an awful lot of trial- about the team goals and the current situation; over time, and-error learning. He then asked if DARPA has been this facilitates coordination and implicit communication. looking at hybrid approaches to solving the AI T&E Cooke stated that humanâAI team training should be problem. Turek responded that DARPA is interested in considered and that we should not expect a human to be hybrid AI, particularly across statistical and symbolic matched with an AI system and immediately know how approaches. Last, Longstaff asked Turek if we are going to work well together. Her fourth point was that effective in the right direction in regard to making advances in the teams have team members who are independent and fundamentals of AI. Turek responded that he does not thus need to interact or communicate, even when direct have a magic solution, but that his team is trying to set a communication is not possible. Cooke said this argues vision for things that they think need to be done. not necessarily for natural language but maybe some other communication model. The fifth thing we know is HUMAN AI: TEAMING IS UBIQUITOUS that interpersonal trust is important to human teams. Nancy Cooke (Arizona State University) was the dayâs Cooke stated that AI needs to explain, provide a reason final speaker. She began by discussing human-AI for its decision, and be explicable. teaming. She stated that AI could not be effectively developed or implemented without consideration of Cooke then spoke about the challenge with humanâ the human. AI does not operate in a vacuum and will AI teaming. She stated that research on humanâAI interface with multiple humans and other AI agents. teaming cannot wait until AI is developed; it is then too late to provide meaningful input. Instead, a research Cooke then spoke about different aspects of teaming, environment, or testbed, is needed to get ahead of specifically regarding team composition and role the curve and conduct research that can guide AI assignment, processes, development, and effectiveness development. She then discussed different examples measurement. She then talked about the Synthetic of physical and virtual synthetic test environments. Teammate Project on which she is working. The projectâs Next, she introduced a concept called the âWizard of objective is to develop a synthetic AI teammate to take Ozâ paradigm in which a human plays the role of the the place of air vehicle operators and work with two AI or even remotely operates a robot to simulate very humans in the remotely piloted aircraft system task, intelligent AI in a task environment. She also spoke about Cooke said. the importance of measures and models when measuring Cooke affirms taking humanâmachine teaming different aspects of humanâAI teaming effectiveness. seriously. 
She defined a team as two or more teammates Longstaff asked, regarding the Synthetic Teammate with heterogeneous roles and responsibilities who Project, how she would write a requirement for someone work independently toward a common goal. She then else to develop that pilot program? He also asked how commented on what is currently known regarding she would test what she got back from the developer humanâAI teaming. Her first point was that team to know if she got the right product. Cooke responded members have different roles and responsibilities and that she would write down the details and results about that this argues against having AI replicate humans. It her teamâs experiment and say, make it better so that also upholds that narrower AI allows AI to do what it it succeeds. She also stated that they would test it the is best at, such as big data analytics and visualization March 2023 | 11
Robin R. Murphy (Texas A&M University; workshop planning committee member) jumped in and asked if the way to ensure that the synthetic agent is aware of its team responsibilities would be to develop an Adaptive Control of Thought-Rational model of the entire operational space. Cooke responded that she thinks so. Chellappa asked how she sees human-AI as different, better, or more complicated than human-computer interaction (HCI). Cooke responded that much is known about human systems integration that can be brought to bear on human-AI systems that people do not consider in HCI when one person is interacting with a product. She also stated that HCI had not done much in massive system areas such as JADC2. Shanahan asked if Cooke had a separately controlled experiment where it was a three-AI team with no humans involved. He also asked if there were any takeaways regarding their work with HSI. Cooke responded that they had not done any three-agent teaming. She also stated that one of her takeaways was that AI is too often optimized on task work when it is important, in these complex systems, to optimize teamwork. Casterline commented that there are concepts of multiple agents being able to learn how to work better to serve an objective function in robotics, game theory, and more. Robin Murphy asked about metrics, specifically, how can they estimate whether one team is more likely to produce the right answers than another? Cooke responded that they have metrics and are trying to develop more, such as domain-independent measures. Longstaff and Cooke talked about AI training in the context of human-animal teaming. Robin Murphy asked about the feasibility of predicting the performance of a human-AI team. Cooke responded that she could not think of a way to do it without seeing them perform, potentially in a training scenario.

DAY 2: WRAP-UP DISCUSSION

Casterline started the day's wrap-up discussion by stating that she found it interesting how one presentation spoke about how people will not really be able to trust AI, so they just have to accept the risk, versus when AI trust is really a requirement and more rigor is necessary. David S. Rosenblum (George Mason University; workshop planning committee member) said that he was struck by the fact that many presentations made it seem hard to separate any discussion of T&E from the requirements against which the T&E is being performed. Owing to the narrow statement of task, Rosenblum questioned the extent to which the workshop planning committee would be concerned about saying anything about requirements. The workshop planning committee also broadly discussed the inability to avoid the question or discussion surrounding requirements when it comes to T&E. The workshop planning committee also spoke about other sectors where the consequences of AI would be high. Serrano mentioned that the question of liability resonated with him throughout the day and that we build these systems of systems inside some organization designed by committee. He asked, "Who is going to take ownership for how this thing should perform?" The workshop planning committee ended the day by discussing the break between what happens in the development community and the research community.

VERIDICAL DATA SCIENCE

Bin Yu (University of California, Berkeley) defined veridical data science as the process of extracting reliable and reproducible information from data with an enriched, technical language to communicate and evaluate empirical evidence in the context of human decisions and domain knowledge. Yu introduced the predictability, computability, and stability (PCS) framework. She stated that PCS is a way to unify, streamline, and expand on ideas and best practices in both ML and statistics. She also spoke about the importance of documentation.

Yu broke down each part of the PCS framework. Concerning problem formulation, predictability reminds us to keep in mind future situations where AI/ML algorithms will be used while developing AI/ML algorithms. Concerning data collection and data comparability, predictability again reminds us to keep in mind the future situations where the algorithms will be used. Concerning data comparability, stability reminds us that there are multiple reasonable ways to clean or curate a given data set from the current situation. Regarding data partitioning, stability reminds us that there could be multiple reasonable ways to partition a given data set from the current situation to ensure that the test set is as similar to future situations as possible. Last, regarding other forms of data perturbations, stability reminds us that data perturbations should reflect future situations.
Regarding comparing different predictive techniques, Longstaff asked if any quantitative measures were currently incorporated into the framework to help choose the best algorithm or technique. Yu responded by identifying two measures, sensitivity and specificity. Longstaff followed up by asking how to reason about the trade-offs between predictability and stability. Yu responded that her team screens for predictive performance before seeking stability. Casterline and Yu discussed translating operational requirements into the mathematical statistics to which Yu referred. Shanahan asked about justified confidence and how doctors and nurses attain that. Yu responded that it is important to understand their work and profession as much as possible when developing models. Kolda and Yu talked about embedding T&E with operations. Yu ended her talk by speaking about the importance of documentation and metrics.

AN APPLIED RESEARCH PERSPECTIVE ON ADVERSARIAL ROBUSTNESS AND TESTING

Nathan VanHoudnos (Software Engineering Institute) spoke about AI security and making systems do the wrong thing. First, he introduced the Bieler taxonomy, where an adversary can make you learn, do, and reveal the wrong thing. Next, he compared these different things to data poisoning, adversarial patches, model inversion, and membership inference attacks. He then stated that in their laboratory, they focus on training systems to learn correctly, do things correctly, and not reveal secrets. Next, when it comes to verifying a system, VanHoudnos spoke about their "Train and Verify" project, where they try to make robust ML systems that are not fooled as easily, as well as private ML systems that do not reveal secrets. Last, he introduced several other projects focusing on protecting systems from many of the adversarial techniques mentioned above.

VanHoudnos defined AI corruption as a decrease in a quality attribute of an AI system. He then spoke about the different roles played by different people as teams try to accomplish different missions. The discussion then shifted toward evaluating ML models; specifically, the evaluations should reflect how models will be used in practice and specific scenarios of importance to the application of the model. Thought should also be given to the metrics you care about when evaluating. Chellappa commented that one of the reasons he thinks many benchmarks are averages is because there is a desire to avoid somebody optimizing the algorithm for just one point on the plot that may be operationally relevant.
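A small, hypothetical sketch of what "evaluate the way the model will be used" can mean in practice is to slice metrics by operational scenario rather than reporting only a benchmark average; the scenario labels here are illustrative.

    import numpy as np

    def per_scenario_accuracy(y_true, y_pred, scenarios):
        """Accuracy broken out by operational scenario (e.g., night, rain, heavy clutter),
        so a strong overall average cannot hide a weak but mission-critical slice."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        scenarios = np.asarray(scenarios)
        return {s: float(np.mean(y_pred[scenarios == s] == y_true[scenarios == s]))
                for s in np.unique(scenarios)}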
VanHoudnos then spoke about different examples of AI corruption. Casterline and VanHoudnos discussed adversarial patches in classification and the idea of using an adversarial patch to test against a retrained model. Casterline compared it to a cat-and-mouse game and wondered if this is truly the right approach, to constantly devise counterattacks for the continual stream of adversarial attacks that continue to evolve no matter what is done. Longstaff asked where in the requirements process they would know that a certain quality attribute of an AI-enabled system will be tested. VanHoudnos responded that he would have to defer to the DevSecOps folks. Longstaff followed and provided his thoughts on the question: creating the quality attributes is a collaboration between a team of operators, testers, and development folks. VanHoudnos wrapped up by discussing the concept of justified confidence7 with Longstaff.

7 Justified confidence is about developing AI systems that are robust, reliable, and accountable, and ensuring that these attributes can be verified and validated. Northrop Grumman, 2021, "AI Development Aligns with US Department of Defense's Ethics Principles," https://news.northropgrumman.com/news/features/northrop-grumman-building-justified-confidence-for-integrated-artificial-intelligence-systems.

DEFENDING AI SYSTEMS AGAINST ADVERSARIAL ATTACKS

Bruce Draper (DARPA) began his presentation by talking about different types of adversarial attacks against data models. He then spoke about algorithmic defenses for AI systems, specifically, regarding five best practices.

The first best practice discussed was cyber defense. Draper stated that networks are vulnerable, and most AI systems are attached to a network. He also stated that it is relatively easier to attack a network than an AI. Therefore, he suggested that to defend the AI, the focus must be on defending the network.
The second best practice discussed was protecting the input data, specifically, sensor-inspired data. Draper described two types of attacks, one revolving around having access to the actual digital signal, which makes spoofing very easy. The other type of attack is physical. These attacks revolve around altering items in the physical world to trick a system. Draper noted that physical attacks are harder for an adversary to launch and easier to defend. He ended by stressing the idea of protecting your data.

The third best practice was about collecting inputs from multiple sources. Draper stated that it is harder to spoof multiple sensors than one sensor. He also noted that different types of sensors make it even harder to disrupt, and having different instances of sensors also offers some benefits.

The fourth best practice discussed was about protecting model development. Draper urged everyone to be wary of externally acquired models. They may have back doors, either unintentionally or from poisoning, and if an adversary has access to the model, it enables white-box attacks. Draper also stated that when training your models, you should avoid using untrusted training data, avoid boot-strapping from untrusted models, and keep information about training data private.

The fifth best practice was quality assurance post-fielding. Draper advised, when possible, to have a person double-check sampled AI outputs.

Draper then spoke about how one can increase system robustness. He introduced a few methods, such as adversarial training and randomized smoothing. Adversarial training is when you attack your sample during the training process. It does not slow you down at runtime but slows training. Randomized smoothing is where you wait to get the input and then make different versions of that input. It has the opposite pros and cons, where it makes training faster but will slow you down at runtime. Draper noted that both of these methods require a known threat model. The danger is that if the adversary does something you do not anticipate, these methods will not work. Draper also spoke about some methods against physical attacks, such as tile-based defenses and patch detection defenses.
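A compact sketch of the two robustness methods named above, assuming a differentiable PyTorch classifier, is shown below. The epsilon and sigma values are placeholders, and real implementations (for example, certified randomized smoothing) add considerably more machinery than this illustration.

    import torch
    import torch.nn.functional as F

    def adversarial_training_step(model, optimizer, x, y, epsilon=0.03):
        """Attack the training batch first (FGSM-style), then train on the attacked copy."""
        x_adv = x.clone().detach().requires_grad_(True)
        F.cross_entropy(model(x_adv), y).backward()
        x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()  # worst-case nudge to the input
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()
        optimizer.step()

    def smoothed_predict(model, x, sigma=0.25, n_samples=100):
        """Randomized-smoothing flavor: vote over many noisy copies of the input at runtime."""
        with torch.no_grad():
            preds = torch.stack([model(x + sigma * torch.randn_like(x)).argmax(dim=1)
                                 for _ in range(n_samples)])
        return preds.mode(dim=0).values

The extra work in the first function happens during training, while the sampling loop in the second happens at inference, matching the trade-off Draper described.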
Draper also spoke about some methods against physical attacks, such as tile-based defenses and patch detection defenses.

Draper ended his talk by speaking about evaluation software and tools. He concluded with DARPA's Guaranteeing AI Robustness Against Deception (GARD) Armory, an evaluation tool that allows analysts to run adversarial AI experiments at scale, quickly and repeatedly.

RECOGNITION SYSTEM EVALUATION
Ed Zelnio (Air Force Research Laboratory) spoke about imaging systems and the different types of data: sensor data, metadata, and labels. He also spoke about labeling, specifically regarding granularity. Zelnio then spoke about different categories of target data and introduced the categories of library mission targets, library confusers, out-of-library confusers, and clutter. He also went over the difference between developmental and operational data.

Zelnio introduced "some things that would be nice to measure in terms of evaluation." He spoke about measuring the reliability and confidence in a system, measuring the understandability and trust of a system, measuring the robustness of a system, measuring the effectiveness of out-of-library confusers, measuring the performance of a performance model, figuring out what to do with limited data, and the need to talk about a sustainable end-to-end training process.

Zelnio ended by speaking about best practices. The first best practice is coming up with an expectation management agreement; these tell you under what operating conditions you can expect a given system to work. The second best practice is the use of a test harness that can help to reproduce training and aid in evaluating the algorithm and the training process. Last, the third best practice is testing to break, to see what does and does not work.
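To make the first two of Zelnio's practices concrete, the following is a minimal sketch of how an expectation management agreement and a reproducible test harness might be captured in code. It is purely illustrative: the field names, thresholds, file layout, and the train_fn and eval_fn callables are hypothetical assumptions, not artifacts described at the workshop.

```python
# Illustrative sketch only: an "expectation management agreement" recorded
# alongside a reproducible training-and-evaluation run. All names are made up.
import json
import random
import time
from pathlib import Path

import numpy as np

EXPECTATION_AGREEMENT = {
    # Operating conditions under which the system is expected to work.
    "sensor_modes": ["EO", "IR"],
    "target_classes": ["library_mission_targets", "library_confusers"],
    "min_recall_at_operating_point": 0.85,
    "out_of_scope": ["out_of_library_confusers", "heavy_clutter"],
}

def run_harness(train_fn, eval_fn, config, out_dir="runs"):
    """Fix seeds, record the configuration, train, evaluate, and log results
    so that both the training process and the evaluation can be reproduced."""
    random.seed(config["seed"])
    np.random.seed(config["seed"])
    run_dir = Path(out_dir) / time.strftime("%Y%m%d-%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)

    model = train_fn(config)           # reproducible training
    metrics = eval_fn(model, config)   # evaluate the algorithm and the process

    record = {"config": config, "agreement": EXPECTATION_AGREEMENT, "metrics": metrics}
    (run_dir / "record.json").write_text(json.dumps(record, indent=2))
    return metrics
```

Recording the agreement next to the run configuration ties each evaluation result back to the operating conditions under which the system was expected to work, which is what makes the agreement useful during later retesting.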
Longstaff asked what has helped get customers over the barrier so that they can increase their confidence in using an experimental system in an operational environment. Zelnio responded that the expectation management agreements are important in increasing that confidence. He also spoke about keeping demonstrations as relevant as possible to excite operators. Longstaff also asked whether they could receive additional feedback from an operational customer over time that would allow retraining or retesting opportunities. Zelnio responded that it would be great to have a laboratory in the loop to help with this, but it happens more informally.

AFTC AI INFRASTRUCTURE NEEDS
Eileen Bjorkman (AFTC) was the workshop's final speaker. Bjorkman spoke about AFTC's current objective of looking at the unique infrastructure needs within the test center and across different organizations to set itself up to test autonomous systems.

Bjorkman spoke about three main things to think about in the testing process. The first is test safety, particularly making sure that an operator can contain a system if it begins to perform in ways that they do not expect. The second is early tester involvement and how testing strategies must be built into system design. The final need revolved around test infrastructure, specifically instrumentation, data collection and storage, and range support.
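The containment need Bjorkman listed first is, at its core, a runtime monitoring problem. The toy sketch below is an illustration rather than anything presented at the workshop: it wraps a hypothetical autonomous system's decision loop in a monitor that reverts to a safe fallback when a command leaves an agreed test envelope. The speed and geofence bounds, the system interface, and the fallback behavior are all made-up placeholders.

```python
# Illustrative sketch only: a runtime monitor that "contains" an autonomous
# system when its commands leave the test envelope. All values are made up.
class SafetyMonitor:
    def __init__(self, system, max_speed=10.0, geofence=(0.0, 100.0)):
        self.system = system          # hypothetical object with a decide() method
        self.max_speed = max_speed
        self.geofence = geofence
        self.contained = False        # flag the operator can observe

    def step(self, observation):
        command = self.system.decide(observation)
        # If the commanded behavior leaves the agreed envelope, stop deferring
        # to the autonomy and issue a safe hold instead.
        in_fence = self.geofence[0] <= command["position"] <= self.geofence[1]
        if command["speed"] > self.max_speed or not in_fence:
            self.contained = True
            return {"speed": 0.0, "position": command["position"]}
        return command
```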
Bjorkman also spoke about current T&E needs and how there is no enterprise-level T&E infrastructure to support autonomy testing. She also stated that there is no DoD enterprise-level software T&E infrastructure that supports the testing of AI. She then spoke about different areas in which she thinks investment needs to happen; these investment areas focused on architectures, frameworks, modular subsystems, data management, virtual ranges, an agile workforce, and surrogate platforms, Bjorkman said.

Bjorkman ended by discussing an autonomous system and AI roadmap that showed funding for different programs over a 7-year timeline. Next, Casterline and Bjorkman began discussing the use of simulation work in their virtual environments. Longstaff asked if they have begun incorporating digital twins work into the testing process for autonomous systems. Bjorkman responded that she had seen that happening. Next, Shanahan, Casterline, and Bjorkman discussed data and the use of virtual testing environments. Bjorkman commented on how you cannot get enough replications of things in the real world to test a system fully. She posed an example about autonomous cars and how you cannot just go and drive every road 100 times. She also pointed out that, in many of the different tests she has observed, testers cannot collect anywhere near sufficient live data; as a result, it forces them into a virtual or even a constructive environment.

WORKSHOP WRAP-UP AND DISCUSSION
The workshop planning committee began its wrap-up by discussing its final thoughts from the workshop. Casterline commented that she does not think that any of the systems are prepared for the iteration that they will have to facilitate. She also said that there is a lot to "grab from" and apply here regarding the DevSecOps model for software. Shanahan commented about the culture shift of iteration and adaptability; if the Air Force does not get that right, everything else is just another discussion about T&E. Chellappa commented about the idea of a centralized facility to test AI. He also commented that he does not think that we know what it means to test AI right now. He specifically pointed out the metrics mAP (mean average precision) and recall and commented how these are ideas from the 1970s. He questioned why they had become a metric for current AI systems.
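For context on the metrics Chellappa singled out, the snippet below shows, in simplified form, how recall and average precision (the per-class quantity averaged across classes to get mAP) are conventionally computed from ranked detection scores. The scores and labels are made-up example data, and this is the plain step-integration variant rather than any benchmark's exact protocol.

```python
# Illustrative sketch only: recall and average precision from ranked scores.
import numpy as np

def precision_recall_curve(scores, labels):
    """Sweep a score threshold from high to low and accumulate precision/recall.
    labels: 1 = true object, 0 = not an object."""
    order = np.argsort(-scores)
    labels = labels[order]
    tp = np.cumsum(labels)         # true positives at each threshold
    fp = np.cumsum(1 - labels)     # false positives at each threshold
    recall = tp / labels.sum()     # TP / (TP + FN)
    precision = tp / (tp + fp)     # TP / (TP + FP)
    return precision, recall

def average_precision(scores, labels):
    """Area under the precision-recall curve for one class; mAP is the mean
    of this quantity over classes."""
    precision, recall = precision_recall_curve(scores, labels)
    # Prepend recall = 0 so the first step of the curve is counted.
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))

scores = np.array([0.95, 0.80, 0.60, 0.40, 0.10])  # made-up detector scores
labels = np.array([1, 0, 1, 1, 0])                 # made-up ground truth
print(average_precision(scores, labels))
```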
Kolda wanted to stress that not everything is in the data. She also warned against the idea of trusting AI too much. Additionally, Kolda commented that humans need to evaluate the answers that come out of an AI system and not just blindly accept them. Casterline commented about a gap in the vernacular between algorithmic tests and measurements on the one hand and operational relevance and tests on the other. Kolda questioned how the process of continuous integration, evaluation, and feedback would work. Last, Robin Murphy offered her thoughts and spoke about how human work processes need to be considered in the discussion regarding AI corruption. She also commented about her fear that security is viewed as an algorithmic problem, as if someone will just come up with another algorithm that will detect when the AI is not working correctly.

Longstaff discussed the major questions from the statement of task. His first point, regarding the task of evaluating and contrasting current T&E, was that there is very little overlap between the way T&E is done commercially and the way that the workshop planning committee experienced it through the examples in the workshop presentations. Next, he focused on AI corruption and stated that the third question from the statement of task goes beyond being just a scientific technology question; it could also ask how DoD and the Air Force can improve the nature of how technological advances are incorporated. Rosenblum commented that, regarding the third question, he is worried that anything the workshop planning committee says will be outdated in a year or two owing to the rapid pace of technological change. Longstaff responded that the workshop planning committee could point toward general trends instead of specific scientific advances. Last, Longstaff thanked everyone for their contributions, and staff member Ryan Murphy officially closed the workshop.
DISCLAIMER
This Proceedings of a Workshop–in Brief was prepared by EVAN ELWELL as a factual summary of what occurred at the workshop. The statements made are those of the rapporteur or individual workshop participants and do not necessarily represent the views of all workshop participants, the planning committee, or the National Academies of Sciences, Engineering, and Medicine.

WORKSHOP PLANNING COMMITTEE
MAY CASTERLINE (Co-Chair), NVIDIA; THOMAS A. LONGSTAFF (Co-Chair), Software Engineering Institute; CRAIG R. BAKER, Baker Development Group; ROBERT A. BOND, MIT Lincoln Laboratory; RAMA CHELLAPPA, Johns Hopkins University; TREVOR DARRELL, University of California, Berkeley; MELVIN GREER, Intel Corporation; TAMARA G. KOLDA, MathSci.ai; NANDI O. LESLIE, Raytheon Technologies; ROBIN R. MURPHY, Texas A&M University; DAVID S. ROSENBLUM, George Mason University; JOHN N. SHANAHAN, U.S. Air Force (Retired); HUMBERTO SILVA III, Sandia National Laboratories; REBECCA WILLETT, University of Chicago.

STAFF
ELLEN CHOU, Director; GEORGE COYLE, Senior Program Officer; EVAN ELWELL, Research Associate; AMELIA GREEN, Senior Program Assistant (through July 2022); MARTA HERNANDEZ, Program Coordinator; RYAN MURPHY, Program Officer; ALEX TEMPLE, Program Officer; DONOVAN THOMAS, Finance Business Partner; CHARLES YI, Research Assistant.

REVIEWERS
To ensure that it meets institutional standards for quality and objectivity, this Proceedings of a Workshop–in Brief was reviewed by LIDA BENINSON, National Academies of Sciences, Engineering, and Medicine; TED BOWLDS, U.S. Air Force (Retired); and JOHN N. SHANAHAN, U.S. Air Force (Retired). KATIRIA ORTIZ, National Academies of Sciences, Engineering, and Medicine, served as the review coordinator.

SPONSOR
This workshop was supported by the U.S. Air Force.

For additional information regarding the workshop, visit https://www.nationalacademies.org/event/06-27-2022/testing-evaluating-and-assessing-artificial-intelligence-enabled-systems-under-operational-conditions-for-the-department-of-the-air-force-workshop.

Suggested citation: National Academies of Sciences, Engineering, and Medicine. 2023. Testing, Evaluating, and Assessing Artificial Intelligence–Enabled Systems Under Operational Conditions for the Department of the Air Force: Proceedings of a Workshop–in Brief. Washington, DC: The National Academies Press. https://doi.org/10.17226/26885.

Division on Engineering and Physical Sciences

Copyright 2023 by the National Academy of Sciences. All rights reserved.