Artificial Intelligence and Justified Confidence: Proceedings of a Workshop–in Brief

MARCH 2023

WORKSHOP OVERVIEW

On September 28-30, 2022, the National Academies of Sciences, Engineering, and Medicine's Board on Army Research and Development (BOARD) convened a workshop focused on Artificial Intelligence and Justified Confidence in the Army that was structured to address the three framing questions of the statement of task:

1. Examples of how industry and other branches of the military have successfully integrated ML/AI [machine learning/artificial intelligence] tools into a C2 [command and control] architecture, particularly in an MDO [Multi-Domain Operations] environment.

2. How does the Army define success? How does it measure progress in these areas? What gaps exist in the Army achieving success?

3. What obstacles exist to achieving success and how might the Army overcome them?

The workshop was organized and attended by the planning committee. This Proceedings of a Workshop–in Brief is a factual summary of the presentations and ensuing discussions. The statements made are those of the rapporteur or individual workshop participants and do not necessarily represent the views of all workshop participants, the planning committee, or the National Academies.

Jennie Hwang, H-Technologies Group, planning committee co-chair, commenced the workshop by examining the terminology of "justified confidence" in the context of AI. Although "confidence" carries connotations of an intangible feeling, Dr. Hwang asserted that justified confidence in AI requires a fusion of six fundamental components: software, hardware, data, computing, communication, and human integration. Furthermore, she stressed the significance of acknowledging and assessing uncertainty, as well as the anticipation of future developments in the field.

The current state of AI involves a fierce global competition, stated Dr. Hwang. It is a global race between two countries, the United States and China, in which the concept of "winning" is relative to the adversaries' capabilities at a particular point in time. Dr. Hwang noted that the Army is a key part of this competition and has recently developed competitive AI strategies as part of the Third Offset Strategy.1

1 U.S. Department of Defense, 2014, "Reagan National Defense Forum Keynote," November 15, https://www.defense.gov/News/Speeches/Speech/Article/606635.
Dr. Hwang characterized the Army's current AI efforts as technology driven and warfighter focused. Overall, she assessed that AI will play a critical role in the Army's new operating concept to be prepared to fight anytime, anywhere, and achieve overmatch. AI, she noted, is integral to the overmatch goal of "avoiding a fair fight."

Dr. Hwang delineated several overarching goals for the workshop: identify methods to improve the robustness of AI and ML tools in C2, as well as ways to foster soldier trust in the technology; study AI/ML vulnerabilities and limitations; and examine opportunities for materiel and non-materiel solutions to AI challenges.

ROBUST AND EQUITABLE UNCERTAINTY ESTIMATION

Aaron Roth, University of Pennsylvania, noted that while there currently exist many successful black box methods for making predictions, they are imperfect, and it can thus be desirable to predict ahead of time where these methods are likely to make mistakes. One way to achieve this, stated Dr. Roth, is by creating prediction sets. Prediction sets are sets of labels in which the true label is likely to fall, and they are useful when exact point prediction is not possible. For example, given three grainy images of small rodents, it may not be clear whether they are squirrels, weasels, or muskrats, but it can be confidently stated that they are not trucks. In addition to providing a reasonable range of answers, prediction sets convey uncertainty in two ways: the size of the prediction set quantifies the degree of uncertainty, and its contents indicate the location of that uncertainty. Overall, the goal is that the prediction set contains the true label with a selected probability (e.g., 95 percent).

Dr. Roth characterized conformal prediction as a simple, elegant method to affix prediction sets to black box models. He stated that conformal prediction serves as an add-on to existing point-prediction models. Conformal prediction takes several steps. First, start with an arbitrary model that makes point predictions. Second, pick a nonconformity score. The nonconformity score evaluates a feature vector at a potential label: large values indicate that a label is very different from what the model predicts, while small values indicate similarity to the model's predictions. Third, on a holdout set (a labeled data set drawn from the same distribution as the training data), compute the nonconformity score at each point and identify a threshold value such that a specified percentage (e.g., 95 percent) of the holdout nonconformity scores fall below it. After these steps, it is possible to compute the nonconformity score for any candidate label of a new, unlabeled example. The promise of conformal prediction is that there is a marginal guarantee (a probability statement that averages over the randomness of examples) that, with the specified probability (e.g., a 95 percent chance), the prediction set will contain the label on a new example.

Conformal prediction has shortcomings, including its marginal guarantees and its assumptions about distributions, argued Dr. Roth. Marginal guarantees are averages over all data points; that is, "for 95 percent of people on which we make predictions, our prediction set contains their true label." The issue is that a specific data point or subgroup may fall outside the guarantee. For instance, a demographic group comprising less than 5 percent of a population might have zero percent coverage under the model. One potential way to mitigate this, noted Dr. Roth, is by calibrating separately for each group. Dr. Roth pointed out, however, that groups of interest often overlap; the goal, he asserted, is to give meaningful statements about data points that are in multiple relevant groups. Furthermore, for conformal prediction to work, new data must be drawn from the same distribution as past data, posing a problem for unanticipated distribution shifts in new data.

Dr. Roth stated that prediction set multivalidity is one way to create stronger-than-marginal guarantees. Prediction set multivalidity involves dividing the data into different groups that might intersect, such that a particular data point can be in multiple groups simultaneously. For any prediction, the goal is to have the true label in the prediction set 95 percent of the time, not merely overall, but conditional on membership in any pre-specified set of groups.
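The split conformal recipe described above (arbitrary model, nonconformity score, holdout threshold) can be sketched in a few lines. This is an illustrative toy, not code from the workshop; the score function and data are invented.

```python
import numpy as np

def conformal_threshold(holdout_scores, coverage=0.95):
    # Threshold such that roughly `coverage` of holdout nonconformity
    # scores fall at or below it (finite-sample corrected quantile).
    n = len(holdout_scores)
    q = min(1.0, np.ceil((n + 1) * coverage) / n)
    return np.quantile(holdout_scores, q)

def prediction_set(score_fn, x, candidate_labels, threshold):
    # Keep every candidate label whose nonconformity score is below threshold.
    return [y for y in candidate_labels if score_fn(x, y) <= threshold]

# Toy example: the "model" predicts y = x, so nonconformity is |y - x|.
rng = np.random.default_rng(0)
residuals = np.abs(rng.normal(size=1000))      # holdout nonconformity scores
t = conformal_threshold(residuals, coverage=0.95)
ps = prediction_set(lambda x, y: abs(y - x), 0.0, [-3.0, -1.0, 0.0, 1.0, 3.0], t)
```

In this toy, the threshold lands near 2.0 (the 95th percentile of absolute standard normal draws), so nearby labels stay in the prediction set while distant ones are excluded.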
Dr. Roth presented an algorithm that can be parameterized with an arbitrary collection of intersecting groups. The algorithm takes, as input, any sequence of models for making point predictions, trains on historic data, and does not require a holdout set. No matter the sequence of examples, for any predicted threshold in any subset of groups, the difference between empirical coverage (how frequently the model covers the label) and target coverage (e.g., 95 percent) will tend to zero at the statistically optimal rate. Unlike split conformal prediction, which cannot train on the holdout set, this model can train on 100 percent of the data, enabling faster learning. This model provides correct coverage in individual and intersecting groups within a data set and can tolerate unanticipated distribution shifts, resulting in more informative (narrower) prediction intervals. Aaron Luttman, Pacific Northwest National Laboratory, planning committee member, questioned whether the model's tighter prediction intervals made a substantive difference in real-world applications. Dr. Roth insisted that the model's enhanced predictions are not merely of academic interest because there exist real-life examples where the tighter coverage enabled by increased attention to subgroups leads to different decisions. A more explicit explanation of the model and its results can be found in several papers.2

ACCELERATING AUTONOMY FOR REAL-WORLD ROBOTICS IN COMPLEX ENVIRONMENTS

Timothy Chung, Microsoft Corporation, discussed his previous work at the Defense Advanced Research Projects Agency (DARPA) leading programs to accelerate autonomy for real-world robotics in conflict environments. Dr. Chung noted that while robots currently operate successfully in isolated, designated safe zones (which give developers more control), developers are still learning to operate robots in congested environments with dynamic objects, more clutter, and hard physical limits. Looking ahead, particularly to applications of interest to the Army, robots will execute missions in contested environments featuring deliberately adversarial agents, challenging effects, and high levels of uncertainty.

Dr. Chung presented two DARPA programs that placed robots in complex, real-world environments: the Subterranean Challenge (Sub-T) and the Offensive Swarm-Enabled Tactics (OFFSET) program. Sub-T involved teams of robots conducting an underground scavenger hunt, with the aim of discovering robotic technologies to enable actionable situational awareness. Robots dealt with dynamic terrain, austere navigation, degraded sensing, severe communications limits, endurance limits, and terrain obstacles. DARPA binned tools for addressing these challenges into four technology impact areas: autonomy, perception, networking, and mobility, with AI playing a role in each, Dr. Chung said.

Dr. Chung highlighted several insights from the Sub-T program. First, the regular attrition of robots emphasized the importance of resilience. Dr. Chung stated that attrition prompted DARPA to consider strategies at the concept of operations (CONOPS) level for measuring faith in each element (in this case, each robot) within a system. Second, Sub-T demonstrated that data gathering and situational awareness are not synonymous: robots can explore, gather data, and generate maps without extracting any useful information. Third, Dr. Chung emphasized the growing importance of systems integration. Nearly all teams had high-quality component technologies, but superior systems integration distinguished the top-performing teams in Sub-T.

DARPA's OFFSET program, Dr. Chung explained, sought insights into human-machine teaming as well as the autonomy necessary to support an urban infantry mission. In the program, teams developed swarm systems architectures focused on higher-level representation of collaborative autonomy tasks (swarm tactics), resulting in simple designations of high-level swarm behavior. For example, commanders could scribble a circle on their tablet to request an overhead scan, and the system would identify air and ground robots with the appropriate sensor configurations to carry it out. This reduced the cognitive burden on swarm commanders. OFFSET also created a library of collaborative autonomy software, virtual swarm environments, and unique swarm data sets. Dr. Chung's overarching takeaways are summarized in Box 1.

2 V. Gupta, C. Jung, G. Noarov, M.M. Pai, and A. Roth, 2021, "Online Multivalid Learning: Means, Moments, and Prediction Intervals," arXiv preprint, arXiv:2101.01739; O. Bastani, V. Gupta, C. Jung, G. Noarov, R. Ramalingam, and A. Roth, 2022, "Practical Adversarial Multivalid Conformal Prediction," arXiv preprint, arXiv:2206.01067; C. Jung, G. Noarov, R. Ramalingam, and A. Roth, 2022, "Batch Multivalid Conformal Prediction," arXiv preprint, arXiv:2209.15145.
BOX 1
Accelerating Autonomy for Real-World Robotics in Complex Environments: Overarching Takeaways Discussed

• Test for the capabilities you want, not the technology you currently possess.
• Confidence rests within the system, not with the narrow AI application.
• Autonomy can help fill gaps in systems capabilities today. Autonomous behaviors operating within the constraints of current technology are significant augmenters.
• Industry remains focused on the Department of Defense (DoD) AI mission set, but would benefit from further guidance. DoD could offer up use cases, relevant data, and test environments to private industry.

PROMISE AND LIMITATIONS OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING: IMPLICATIONS FOR COMMAND AND CONTROL OPERATIONS

Azad Madni, University of Southern California, advocated for augmented intelligence as a solution to address challenges in C2. Dr. Madni noted that while AI/ML holds significant potential to serve as a force multiplier in C2, the operational context is very different from the controlled laboratory environment within which most AI/ML applications operate.

Dr. Madni delineated several concerns with AI/ML applications that are particularly salient in operational contexts. AI/ML applications deal with novel situations poorly, engage in abnormal system behavior when confronted with outliers, struggle to adapt to changing contexts, are ethically and legally unaware, lack causal reasoning capabilities (currently), and do not possess human imagination and creativity.

Augmented intelligence, argued Dr. Madni, has the potential to capitalize on the strengths of both humans and AI while overcoming their respective limitations. While AI offers fast computation, infallible recall, fast search, and pattern recognition, it struggles to contextualize information and process outliers, and it lacks causal and common sense reasoning. Humans can contextualize, generate creative options, and deal with outliers and ambiguity, yet they are prone to distraction and fatigue, and their recall and cognitive capacities are limited. Dr. Madni advocated for exploiting AI/ML in nominal situations, using humans to aid AI/ML in novel situations, and using AI/ML to aid humans in memory recall and computation-intensive tasks.

According to Dr. Madni, AI is most useful for reducing and eliminating stressful and repetitive tasks, integrating large quantities of data, detecting and responding to situations that are too fast for humans, and identifying infrequently occurring events and conditions. He delineated several high-payoff AI/ML applications for C2 (see Box 2).

Dr. Madni also highlighted two considerations for human-machine collaboration. He observed that there are often trade-offs between machine optimality ("creating the perfect algorithm") and human-machine optimality: many supposedly "optimal" algorithms are not amenable to incorporating the input of a human in the loop. Dr. Madni stressed the significance of common frameworks in response to a comment by Conrad Tucker, Carnegie Mellon University, planning committee co-chair, on the challenge of ensuring interoperability given that many algorithms require fixed data inputs. Dr. Madni pointed to the creation of a shared ontology as an important step in ensuring that the AI/ML community is making a common set of underlying physical and semantic assumptions across all models.
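The shared-ontology idea above can be made concrete with a small sketch. Everything here (type names, fields, units) is invented for illustration; the point is that interoperability requires every model to commit to the same vocabulary, units, and schema before outputs can be combined.

```python
from dataclasses import dataclass
from enum import Enum

class TrackClass(Enum):
    # Shared label vocabulary: every model maps its outputs into these classes.
    VEHICLE = "vehicle"
    PERSON = "person"
    UNKNOWN = "unknown"

@dataclass(frozen=True)
class Track:
    track_id: str
    track_class: TrackClass
    lat_deg: float        # WGS84 latitude in degrees: a shared physical assumption
    lon_deg: float        # WGS84 longitude in degrees
    time_utc_s: float     # seconds since the Unix epoch: a shared temporal assumption

def fuse_by_id(reports):
    # Fusion across models is only meaningful because every producer
    # committed to the same schema, units, and label vocabulary.
    fused = {}
    for r in reports:
        fused.setdefault(r.track_id, []).append(r)
    return fused

reports = [
    Track("t1", TrackClass.VEHICLE, 34.000, -117.000, 0.0),
    Track("t1", TrackClass.VEHICLE, 34.001, -117.000, 5.0),
    Track("t2", TrackClass.PERSON, 34.200, -117.100, 5.0),
]
fused = fuse_by_id(reports)
```

A model that reported longitude first, or time in milliseconds, would silently break this fusion, which is exactly the class of mismatch a shared ontology is meant to rule out.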
BOX 2
High-Payoff Artificial Intelligence/Machine Learning Applications for Command and Control

Azad Madni outlined the following high-payoff AI/ML applications for C2:
• Fusing data from multiple sensors to understand complicated tactical situations.
• Simulating multiple courses of action and quickly evaluating results.
• Generating a variety of options for commanders and presenting trade-off analyses.
• Providing dynamic context management.
• Applying multiple metrics to assess combat readiness in particularized contexts.
• Using image processing to detect and identify weapons, motion, scene changes, etc.

IMPACT OF WORLD STATE AWARENESS ON JOINT HUMAN-AUTOMATION DECISION MAKING

Karen Feigh, Georgia Institute of Technology, presented her findings on the impact of world state awareness in joint human-automation decision making. Dr. Feigh delineated the current conception of AI use, where the AI queries some set of sensors, generates a suggestion, presents the suggestion to the human for evaluation, and then the human either approves (the AI executes) or vetoes (the AI iterates again). Dr. Feigh stated that the human's role in this process is often difficult and sometimes even impossible. While much human-automation interaction (HAI) research focuses on supporting the human through improvements in AI suggestion evaluation and explainability, Dr. Feigh noted that there are ways to aid the human even if the AI and its suggestion-evaluation mechanisms are unaltered.

Dr. Feigh presented a study that examined ways to introduce transparency into black box AI deployments to improve humans' collaborative performance with an AI teammate.3 Two considerations grounded the study. First, decision making is merely one phase of a cognitive process cycle in which all phases are interdependent; thus, expecting a human to approve an isolated decision often produces poor results. Second, shared mental models are an integral part of HAI. Each agent (human or autonomous) possesses a unique mental model of its own capabilities and role on the team, and the shared mental model is the overlapping space in which agents understand each other's roles, capabilities, and informational constraints. From these considerations, the study focused on creating shared situational awareness by improving the human's understanding of the world state awareness (WSA) on which the automation based its suggestions.

The study augmented the common conception of AI development by incorporating steps to measure the degree of shared situational awareness between the human and the AI, shared assessments of suggestions, and shared assessments of final decisions. The study found that increasing WSA improved overall task performance and was statistically significant in predicting shared situational awareness, final agreement, the human's initial judgment, and the human's final decision. The results also demonstrated that as WSA increases, humans are less trusting of AI capabilities and better able to discern when the AI is mistaken.

3 D.K. Srivastava, J.M. Lilly, and K.M. Feigh, 2022, "Improving Human Situation Awareness in AI-Advised Decision Making," paper presented at the 2022 IEEE 3rd International Conference on Human-Machine Systems, https://2022.hci.international/ai-hci.
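The approve/veto loop described above, extended so the automation also surfaces the world state behind its suggestion, can be sketched as follows. All function names and the toy wiring are invented; this is a structural illustration, not the study's implementation.

```python
def advise_loop(sense, suggest, human_review, max_iters=5):
    # AI-advised decision making: sense -> suggest -> human approves or vetoes.
    # The key change is that the automation also surfaces `world_state`, so the
    # human can judge the premise of the suggestion, not just the suggestion.
    for _ in range(max_iters):
        world_state = sense()
        suggestion = suggest(world_state)
        if human_review(suggestion, world_state):
            return suggestion          # approved: the AI executes
    return None                        # vetoed every time: no action taken

# Toy wiring (behavior invented for illustration).
state = {"visibility": "low"}
decision = advise_loop(
    sense=lambda: state,
    suggest=lambda ws: "reroute" if ws["visibility"] == "low" else "proceed",
    human_review=lambda s, ws: ws["visibility"] == "low" and s == "reroute",
)
```

Passing `world_state` into `human_review` is the structural analogue of the study's intervention: the reviewer can reject a suggestion because its underlying world model is wrong, not merely because the suggestion looks wrong.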
BUILDING FOUNDATIONS FOR TRUST IN ARTIFICIAL INTELLIGENCE PRODUCTS

Heather Frase, Center for Security and Emerging Technology, discussed support systems for trust in AI. Dr. Frase asserted that trust involves multiple support systems working together, including communal resources, trusted companies, trusted products, and trusted users. According to Dr. Frase, the current support system for AI trust contains gaps in each of these areas: it has minimal recognition of different types of abuse and no comprehensive understanding of behavior in operational conditions. AI companies display inconsistent adoption of best practices, stated Dr. Frase. There is no standard method to assess trustworthiness across AI products. Furthermore, there is no mechanism to identify and restrict malicious users.

AI products are particularly difficult to trust, argued Dr. Frase, because they lack the historic understanding that comes with steady, incremental progress. While existing test design science is capable of handling complex systems with large numbers of variables, it relies on historic knowledge of systems for efficient testing. Radars, for example, have undergone 80 years of incremental changes, performance testing, and operator experience. AI, by contrast, has minimal historic understanding.

Dr. Frase argued that it is possible to accelerate the trust-building process for AI by creating the appropriate infrastructure to accumulate and share historic knowledge. Dr. Frase asserted that infrastructure is critical for a number of reasons: it ensures that information about AI performance and testing is discoverable and available, shares knowledge across programs, stores and sanitizes information for use across multiple classifications, and stores and monitors post-production data. To meet the needs of AI for C2 in particular, Dr. Frase recommended the following steps: identify related internal AI programs; identify similar joint AI programs; share, gather, and store information; and identify methods to store and leverage post-deployment AI monitoring, performance, and behavior data.
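One way to picture the infrastructure recommendation above is a shared store of post-deployment records that later programs can query. The schema and all field names below are invented for illustration; the workshop did not specify a data model.

```python
from dataclasses import dataclass

@dataclass
class MonitoringRecord:
    # A single piece of accumulated "historic knowledge" about an AI product.
    program: str
    model_version: str
    condition: str        # operational condition, e.g., "desert, degraded comms"
    metric: str           # e.g., "top-1 accuracy"
    value: float
    level: str = "UNCLASSIFIED"   # tag supporting storage across classification tiers

def find_history(records, condition_keyword):
    # Discovery step: surface how related models behaved under similar
    # operational conditions, instead of each program starting from zero.
    return [r for r in records if condition_keyword in r.condition]

records = [
    MonitoringRecord("prog-a", "1.2", "desert, degraded comms", "top-1 accuracy", 0.81),
    MonitoringRecord("prog-b", "0.9", "urban, clear", "top-1 accuracy", 0.93),
]
hits = find_history(records, "degraded comms")
```

Even this minimal shape captures the properties Dr. Frase listed: records are discoverable, shareable across programs, taggable by classification level, and suitable for post-production monitoring data.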
Dr. Frase also stressed the importance of instituting a tiered and triaged classification process for testing and demonstration of AI products, similar to the Food and Drug Administration's (FDA's) classification of medical devices. FDA regulates medical devices via a tiered system that accounts for device risk and complexity. The FDA system prioritizes breakthrough systems, ensures post-approval monitoring and knowledge building, and uses tiered ranking to assess risk and incremental change systematically. Dr. Frase suggested that DoD should adopt a similar process for AI C2 products. Dr. Hwang gestured to the efforts of the National Institute of Standards and Technology (NIST) as positive progress toward this goal. Dr. Frase asserted that NIST has made consistent improvements in its risk management framework and is increasingly emphasizing processes implemented by companies to achieve trusted AI.

HOW DO ORGANIZATIONS ACCELERATE MACHINE LEARNING INTEGRATION?

Benjamin Harvey, AI Squared and Johns Hopkins University, discussed ways that organizations can accelerate ML integration. Dr. Harvey approached the issue from a background in the Intelligence Community (IC) as well as private industry. While at the National Security Agency (NSA), Dr. Harvey oversaw the integration of AI into mission production applications as the chief of operations for data science. He recalled that data scientists at NSA were frustrated because, while they were achieving excellent results in a controlled experimental setting, the AI/ML capabilities were not getting to the end users (analysts and warfighters).

Dr. Harvey stated that while investments in AI are massive, two out of three AI projects fail, primarily for three reasons. First, deploying and integrating ML models is difficult, he asserted. At NSA, for example, integrating a single model required coordinating across ML engineers, data scientists, development operations, front-end developers, application managers, and so on; Dr. Harvey noted that it entailed aggravatingly large amounts of time and money. Second, it is a challenge to build ML applications that teams actually want to adopt. Dr. Harvey recalled the following assessment of an IC analyst: if the results of the model are not actionable, relevant, timely, and contextualized, analysts will not use the model. Dr. Harvey stressed the importance of building applications that communicate effectively to end users. Third, Dr. Harvey averred that most organizations focus on the front end of the ML pipeline (data preparation and labeling, building sophisticated models) and neglect the crucial "last mile" of integration and optimization.
The last mile of ML, stated Dr. Harvey, presents several significant challenges. Most organizations, including DoD, seek to integrate AI/ML into legacy applications. Often, there is minimal access to the code base of such older systems, making integration a challenge and adding months to the process. Additionally, Dr. Harvey argued that unsatisfactory results dissuade end users from trusting or using ML applications. Siloed teams and long timelines are often the culprits behind failures to quickly acquire feedback and iterate on the model to better address end user needs, stipulated Dr. Harvey.

To overcome obstacles in integration and optimization (the last mile), Dr. Harvey recommended that organizations take the following four steps:

1. Align multiple data sources and models to integrate collectively into legacy applications.
2. Embed ML results directly into web applications to put ML in the hands of users, eschewing the traditional development approach of trying to perfect the model prior to deployment.
3. Create the governance tools necessary to customize how applications display ML results.
4. Continuously acquire collaborative feedback on ML model performance.

During the discussion, Dr. Harvey remarked on accelerating integration through observational studies and on the differences between the integration approaches of industry and the IC. Dr. Harvey touted observational studies as a method to accelerate integration of ML models into user workflows. In observational studies, data scientists train a model and quickly get it into the hands of a select number of users, a far different approach than the usual method of endlessly fine-tuning the model prior to deployment. Data scientists observe for a few weeks as the select users test the model, then quickly iterate and deploy at a larger scale. Dr. Luttman commented that moving faster requires greater pre-existing confidence and trust on the part of the users. Dr. Harvey posited that observational studies facilitate trust because end users become part of the development process and gain observational and experimental experience with ML models. The core lesson, Dr. Harvey stated, was that the data and model do not need to be perfect prior to putting the application in the hands of end users.

Dr. Harvey also observed that the IC often favors a top-down approach to requirements, rather than the bottom-up approach of observational studies. He stated that the top-down approach slows the pace of innovation so that by the time applications reach end users in the IC, they are often no longer relevant. He recommended that the IC and DoD employ agile ML development operations to assess what is of value to the end user, a direction in which industry is already moving.

EVALUATING, INTERPRETING, AND MONITORING MACHINE LEARNING MODELS

Ankur Taly, Google, Inc., presented on evaluating, interpreting, and monitoring ML models, most of which are black boxes to humans. Dr. Taly noted that current procedures for evaluating ML models center on aggregate accuracy over a held-out test set. Dr. Taly pointed out several shortcomings of this practice, however, such as variations in test accuracy and test sets that are not representative of deployment. The goal, Dr. Taly argued, is to have the capability to drill down to individual slices of the data and to assess whether the test data are representative of what the user would see in deployment.

Dr. Taly characterized several existing approaches to interpreting model predictions as "naive." In interpreting model predictions, the aim is to understand why the model made a prediction and to be able to attribute a model's prediction to features of the input. Unsatisfactory approaches, in Dr. Taly's estimation, include ablations and feature gradients. Ablations, which drop individual features and assess how predictions change, are computationally expensive, require unrealistic inputs, and are misleading when features interact, he argued. Dr. Taly assessed simple feature gradients to be insufficient as well.
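The slice-level evaluation Dr. Taly described earlier, drilling down past aggregate test accuracy, can be sketched as follows. The data are invented; the point is that an overall accuracy of 0.5 can hide a slice on which the model always fails.

```python
def per_slice_accuracy(examples):
    # examples: iterable of (slice_name, prediction, label) triples.
    totals, correct = {}, {}
    for s, pred, label in examples:
        totals[s] = totals.get(s, 0) + 1
        correct[s] = correct.get(s, 0) + (pred == label)
    return {s: correct[s] / totals[s] for s in totals}

# Invented toy results: overall accuracy is 0.5, which hides the fact that
# every "night" example is wrong.
examples = [
    ("day", 1, 1), ("day", 0, 0),
    ("night", 1, 0), ("night", 0, 1),
]
overall = sum(p == l for _, p, l in examples) / len(examples)
by_slice = per_slice_accuracy(examples)
```

Here the aggregate number looks mediocre but uniform, while the per-slice view shows perfect daytime performance and total nighttime failure, the kind of finding that changes deployment decisions.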
As an alternative to either of these, Dr. Taly noted with approval the method of "integrated gradients." This method integrates the model's gradients along a straight path from a baseline to the input, with the goal of providing a post hoc justification for a model's prediction (baselines are information-less inputs for the model, such as black images for image models or empty text for text models). Dr. Taly touted integrated gradients as an easy-to-apply, widely applicable, and agreed-upon technique for attributing a deep network's prediction to its input features.

Continuous monitoring of models is significant, affirmed Dr. Taly, because prediction data may differ significantly from test data and because the distribution of features and labels may drift over the course of production (due to outliers, bugs, etc.). He noted two common ways to monitor models: monitoring of the feature distribution and monitoring of the prediction distribution. Dr. Taly assessed these approaches to have at least three problems. First, they have difficulties dealing with multiple feature representations. Second, large feature drifts do not necessarily result in substantive performance changes. Third, it is difficult to assess drift that occurs in the correlation between features. Dr. Taly noted that an alternative is attribution-based monitoring, which monitors trends in feature attribution scores for every feature. Assessing feature-attribution drift (comparing the distribution of feature attributions between a serving window and a reference window) offers several benefits, according to Dr. Taly: it is importance weighted, applies to all feature representations, accounts for feature interactions, is applicable to many feature groups, and stabilizes the monitoring of features across models. In his concluding remarks and in response to several queries during the discussion, Dr. Taly offered some of his main takeaways (see Box 3).

BOX 3
Evaluating, Interpreting, and Monitoring Machine Learning Models: Key Takeaways Discussed

• Test accuracy alone can be a misleading indication of a model's performance.
• Probe the model's reasoning and individual predictions, not just the aggregate results.
• Monitor models while they are in production.
• Utilize training data attributions, which identify which examples in the training data help or hurt the predictive capability of the model.
• Explainability is about counterfactuals. Undo a feature of an input and signify "if not for that feature, then the result would have been this prediction." Each explanation is about a counterfactual, which also makes it context dependent.
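The integrated-gradients computation described above can be sketched numerically for a toy model. The model `F` and its gradient are invented for illustration; real uses rely on a framework's automatic differentiation rather than a hand-written gradient.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=100):
    # Average the model's gradient along the straight path from the
    # information-less baseline to the input, then scale by (x - baseline).
    alphas = (np.arange(steps) + 0.5) / steps            # midpoint rule in [0, 1]
    path = baseline + alphas[:, None] * (x - baseline)   # (steps, n_features)
    avg_grad = grad_fn(path).mean(axis=0)
    return (x - baseline) * avg_grad

# Toy differentiable "model" F(x) = x0**2 + 3*x1, with its exact gradient.
def grad_F(points):
    g = np.empty_like(points)
    g[:, 0] = 2.0 * points[:, 0]
    g[:, 1] = 3.0
    return g

x = np.array([2.0, 1.0])
baseline = np.zeros(2)              # analogue of a black image or empty text
attr = integrated_gradients(grad_F, x, baseline)
# Completeness property: attributions sum to F(x) - F(baseline) = 7.
```

The completeness property, attributions summing to the difference between the model's output at the input and at the baseline, is what makes the per-feature numbers interpretable as shares of the prediction.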
institutions would ensure that those interacting with the system throughout its design, development, testing, training, and employment phases understand its capabilities and limitations, and they foster a culture of responsibility.

Dr. Scharre asserted that DoD maintains effective procedures for creating well-calibrated trust during the development pipeline, which consists of five phases: development, testing, operator training, authorization, and use. Furthermore, DoD possesses significant experience in the deployment of complex systems in high-risk applications, noted Dr. Scharre. But ML, he argued, presents a new set of challenges. In addition to the challenge of integrating ML into broader digital systems, the unique ML pipeline (consisting of data, model training, testing and refinement, use, and model refinement) exposes new vulnerabilities at each phase.

Dr. Scharre provided the following takeaways on the ML pipeline:

• Data: Securing training data is paramount because adversarial access could expose the model to poisoning or exploitation.
• Model training: Models should have appropriate goals and sufficient compute power to achieve them.
• Testing and refinement: Emergent behaviors are common, and models are vulnerable to manipulation. Thus, it is essential to red-team prior to deployment and fine-tune post-deployment.
• Use: Minute changes in AI goals can result in drastically altered behavior by the model. A model can also possess features that upset or aggravate users, even if they do not inhibit its ability to accomplish the mission.
• Model refinement: AI systems do not react well to novelty and will experience performance degradation in operational environments.

Dr. Scharre concluded with remarks on several crosscutting themes from his presentation. Despite its enormous promise, he noted that ML remains an immature technology. Dr. Scharre suggested that DoD adjust its existing (and admittedly successful) processes for building trustworthy systems to account for the unique challenges of ML. This will take new tools, infrastructure, processes, and greater investments of resources. For example, Dr. Scharre recommended that DoD implement the necessary processes and authorities to retrain algorithms continuously through field deployment (real-life data). He also encouraged greater flexibility from DoD in incorporating industry best practices. Overall, Dr. Scharre concluded that well-calibrated trust is a worthy goal for DoD, which ensures that those interacting with AI systems are cognizant of their limits and understand appropriate settings in which to deploy them.

PLANNING COMMITTEE DISCUSSION

During closing discussion, planning committee members returned to a few general topics from the first day of the workshop: the technological maturity of AI, the special challenges of AI in C2, the Army's efforts to accelerate the AI pipeline, and the Army's integration and testing of AI.

Dr. Hwang assessed that the current state of the art is far from full autonomy: the continued necessity for a human in the loop was a crosscutting theme of the presentations. She stipulated that the human would remain a factor for C2 applications in particular. Marvin Langston, Langston Associates LLC, planning committee member, stated that the presentations reflected the reality that while AI can be highly successful at explicit and specific tasks, it is not yet capable of presenting courses of action in the dynamic and unpredictable environment of C2. In fact, he asserted, C2 may be the most complex application of AI. Emphasizing this point, Catherine Crawford, IBM Corporation, planning committee member, noted that deployed AI will confront not merely a challenging operating environment, but the efforts of an AI-capable adversary. AI, still in its nascent stages, can leverage insights from HAI and natural language processing, noted Dr. Langston and Erin Chiou, Arizona State University, planning committee members.
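Returning to Dr. Scharre's ML pipeline takeaways, his "model refinement" point, that AI systems degrade when they encounter novelty in operational environments, is the kind of condition a simple production monitor can surface. The sketch below is illustrative only, with invented window sizes and thresholds; it is not a DoD process or anything presented at the workshop.

```python
# Illustrative production monitor: flag a deployed model for retraining
# when its rolling accuracy on recent (possibly novel) inputs falls well
# below the baseline established during testing. All numbers are invented.

from collections import deque

class DegradationMonitor:
    def __init__(self, window=100, baseline_accuracy=0.95, tolerance=0.10):
        self.window = deque(maxlen=window)   # recent correct/incorrect flags
        self.baseline = baseline_accuracy    # accuracy measured at test time
        self.tolerance = tolerance           # allowed drop before flagging

    def record(self, prediction_correct: bool) -> None:
        """Log whether the model's latest fielded prediction was correct."""
        self.window.append(prediction_correct)

    def needs_refinement(self) -> bool:
        """True when rolling accuracy falls more than `tolerance`
        below the baseline established during testing."""
        if not self.window:
            return False
        rolling = sum(self.window) / len(self.window)
        return rolling < self.baseline - self.tolerance

if __name__ == "__main__":
    monitor = DegradationMonitor(window=10)
    for correct in [True] * 9 + [False]:
        monitor.record(correct)            # 90% rolling accuracy
    print(monitor.needs_refinement())      # within tolerance: False
    for correct in [False] * 5:
        monitor.record(correct)            # novelty drives errors up
    print(monitor.needs_refinement())      # degraded: True
```

A fielded version would need a trustworthy source of ground-truth labels, which is itself one of the hard problems of retraining through field deployment that Dr. Scharre raised.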
Basic questions about AI in C2 remain unanswered, noted Dr. Tucker. For example, he wondered whether point estimation or distribution (two methods discussed during the presentations) was more relevant to the C2 perspective. Dr. Luttman asserted that although distributions do not often come into play in practical applications, having a distribution can be a useful tool for an operator in explaining why they took a specific course of action. Dr. Crawford posited that for some decision makers, the variability presented in a distribution might be unnecessary, unhelpful, and even unsettling. Thus, Dr. Crawford stated, it is imperative that models cognitively align with the humans that use them.

Several planning committee members voiced concerns with the existing AI pipeline. Dr. Tucker stated that DoD has trouble attracting talented students, who are reluctant to work on the government's longer timeframes. The faster tempo of industry appeals to those looking to accelerate their careers. Dr. Luttman suggested that some DoD researchers at the National Laboratories might find existing regulations burdensome.

Several planning committee members discussed possibilities to accelerate the Army's integration and testing of AI systems. Dr. Crawford noted that there is currently a "chasm" between development and deployment. To bridge this gap, Dr. Hwang suggested that the Army tailor its training and use cases to specific environments, similar to the environmental focus of DARPA's SubT and OFFSET programs. Environmental factors inevitably influence the complexity of tasks that AI is required to accomplish, she noted. Dr. Crawford pointed out possible improvements through greater interface between DoD and industry. DoD can provide industry with more explicit descriptions of actual use cases for deployed AI, while industry can accelerate the DoD space by advising on the state of the art. Furthermore, Dr. Luttman highlighted the presentation of Dr. Harvey to illustrate the potential of field experimentation. Dr. Luttman suggested that the Army would benefit from integrating AI with end users as early as possible and from constantly iterating its models in light of real-world experience.
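The point-estimate versus distribution question the committee debated can be made concrete with a small sketch. This is a hypothetical illustration, not any system discussed at the workshop: an invented ensemble of model outputs yields a predictive distribution, and the summary shows how a lone point estimate hides disagreement that the spread would reveal to an operator.

```python
# Illustrative contrast between a point estimate and a predictive
# distribution, using invented ensemble outputs (not a fielded system).

from statistics import mean, stdev

def summarize_predictions(ensemble_outputs):
    """Reduce an ensemble's predictions to a point estimate plus spread.

    A decision maker who wants only a point estimate reads `point_estimate`;
    an operator explaining a course of action can also cite the spread.
    """
    return {
        "point_estimate": round(mean(ensemble_outputs), 3),
        "std_dev": round(stdev(ensemble_outputs), 3),
        "range": (min(ensemble_outputs), max(ensemble_outputs)),
    }

if __name__ == "__main__":
    # Five ensemble members estimating, say, a probability of success.
    outputs = [0.72, 0.68, 0.75, 0.41, 0.70]
    summary = summarize_predictions(outputs)
    # The point estimate alone (0.652) hides the dissenting member (0.41)
    # that the full distribution exposes.
    print(summary)
```

This also illustrates Dr. Crawford's caution: whether the extra spread information helps or unsettles depends on the decision maker, so which summary to surface is a human-machine interface choice, not just a modeling one.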
DISCLAIMER
This Proceedings of a Workshop—in Brief was prepared by CLEMENT MULOCK as a factual summary of what occurred at the workshop. The statements made are those of the rapporteur or individual workshop participants and do not necessarily represent the views of all workshop participants; the planning committee; or the National Academies of Sciences, Engineering, and Medicine.

PLANNING COMMITTEE
JENNIE S. HWANG (NAE) (Co-Chair), H-Technologies Group; CONRAD TUCKER (Co-Chair), Carnegie Mellon University; ERIN K. CHIOU, Arizona State University; CATHERINE H. CRAWFORD, IBM Corporation; MARVIN J. LANGSTON, Langston Associates LLC; NANDI O. LESLIE, Raytheon Technologies; AARON B. LUTTMAN, Pacific Northwest National Laboratory; JOHN M. MATSUMURA, RAND Corporation; TODD D. MURPHEY, Northwestern University.

STAFF
WILLIAM "BRUNO" MILLONIG, Director/Scholar, BOARD; NIA JOHNSON, Program Officer, Intelligence Community Studies Board; THO NGUYEN, Senior Program Officer, Computer Science and Telecommunications Board; CAMERON MALCOM, Research Associate, BOARD; TINA M. LATIMER, Program Coordinator, BOARD; TRAVON C. JAMES, Senior Program Assistant, BOARD; CLEMENT (MAC) MULOCK, Program Assistant, BOARD.

REVIEWERS
To ensure that it meets institutional standards for quality and objectivity, this Proceedings of a Workshop—in Brief was reviewed by JENNIE S. HWANG (NAE), H-Technologies Group; RYAN MURPHY, National Academies of Sciences, Engineering, and Medicine; and CONRAD TUCKER, Carnegie Mellon University. JAYDA WADE, National Academies of Sciences, Engineering, and Medicine, served as the review coordinator.

SPONSOR
This workshop was sponsored by the Deputy Assistant Secretary of the Army. For additional information regarding the workshop, visit https://www.nationalacademies.org/our-work/artificial-intelligence-and-justified-confidence-a-workshop.

Suggested citation: National Academies of Sciences, Engineering, and Medicine. 2023.
Artificial Intelligence and Justified Confidence: Proceedings of a Workshop—in Brief. Washington, DC: The National Academies Press. https://doi.org/10.17226/26887.

Division on Engineering and Physical Sciences

Copyright 2023 by the National Academy of Sciences. All rights reserved.