Suggested Citation:"Artificial Intelligence and Justified Confidence: Proceedings of a Workshop - in Brief." National Academies of Sciences, Engineering, and Medicine. 2023. Artificial Intelligence and Justified Confidence: Proceedings of a Workshop–in Brief. Washington, DC: The National Academies Press. doi: 10.17226/26887.

Proceedings of a Workshop—in Brief

Artificial Intelligence and Justified Confidence

WORKSHOP OVERVIEW

On September 28-30, 2022, the National Academies of Sciences, Engineering, and Medicine’s Board on Army Research and Development (BOARD) convened a workshop focused on Artificial Intelligence and Justified Confidence in the Army that was structured to address the three framing questions of the statement of task:

Examples of how industry and other branches of the military have successfully integrated ML/AI [machine learning/artificial intelligence] tools into a C2 [command and control] architecture, particularly in an MDO [Multi-Domain Operations] environment.
How does the Army define success? How does it measure progress in these areas? What gaps exist in the Army achieving success?
What obstacles exist to achieving success and how might the Army overcome them?

The workshop was organized and attended by the planning committee. This Proceedings of a Workshop—in Brief is a factual summary of the presentations and ensuing discussions. The statements made are those of the rapporteur or individual workshop participants and do not necessarily represent the views of all workshop participants; the planning committee; or the National Academies.

Jennie Hwang, H-Technologies Group, planning committee co-chair, commenced the workshop by examining the terminology of “justified confidence” in the context of AI. Although “confidence” carries connotations of an intangible feeling, Dr. Hwang asserted that justified confidence in AI requires a fusion of six fundamental components: software, hardware, data, computing, communication, and human integration. Furthermore, she stressed the significance of acknowledging and assessing uncertainty, as well as the anticipation of future developments in the field.

The current state of AI involves a fierce global competition, stated Dr. Hwang. It is a global race between two countries, the United States and China, in which the concept of “winning” is relative to the adversaries’ capabilities at a particular point in time. Dr. Hwang noted that the Army is a key part of this competition and has recently developed competitive AI strategies as part of the Third Offset Strategy.¹ Dr. Hwang characterized the Army’s current AI efforts as technology driven and warfighter focused. Overall, she assessed that AI will play a critical role in the Army’s new operating concept to be prepared to fight anytime, anywhere, and achieve overmatch. AI, she noted, is integral to the overmatch goal of “avoiding a fair fight.”

__________________

¹ U.S. Department of Defense, 2014, “Reagan National Defense Forum Keynote,” November 15, https://www.defense.gov/News/Speeches/Speech/Article/606635.

Dr. Hwang delineated several overarching goals for the workshop: identify methods to improve the robustness of AI and ML tools in C2, as well as ways to foster soldier trust in the technology; study AI/ML vulnerabilities and limitations; and examine opportunities for materiel and non-materiel solutions to AI challenges.

ROBUST AND EQUITABLE UNCERTAINTY ESTIMATION

Aaron Roth, University of Pennsylvania, noted that while many successful black box methods for making predictions exist, they are imperfect, and it is therefore desirable to predict ahead of time where these methods are likely to make mistakes. One way to achieve this, stated Dr. Roth, is by creating prediction sets. A prediction set is a set of labels that is likely to contain the true label; such sets are useful when an exact point prediction is not possible. For example, given three grainy images of small rodents, it may not be clear if they are squirrels, weasels, or muskrats, but it can be confidently stated that they are not trucks. In addition to providing a reasonable range of answers, prediction sets quantify uncertainty in two ways. First, the size of the prediction set quantifies the degree of uncertainty, and second, its contents indicate where the uncertainty lies. Overall, the goal is that the prediction set contains the true label with a selected probability (e.g., 95 percent).

Dr. Roth characterized conformal prediction as a simple, elegant method to affix prediction sets to black box models. He stated that conformal prediction serves as an add-on to existing point-prediction models. Conformal prediction takes several steps. First, start with an arbitrary model that makes point predictions. Second, pick a nonconformity score. The nonconformity score evaluates a feature vector at a potential label. Large values of the nonconformity score indicate that a label is very different from what the model predicts, while small values demonstrate similarity to the model’s predictions.

Third, on a holdout set (a labeled data set drawn from the same distribution as the data on which the model will be used), compute the nonconformity score of the true label at each point and identify a threshold value such that a specified percentage (e.g., 95 percent) of the holdout nonconformity scores fall below it. After these steps, it is possible to compute the nonconformity score for any candidate label on a new example whose label is unknown; the prediction set consists of the candidate labels whose scores fall below the threshold. The promise of conformal prediction is a marginal guarantee (a probability statement that averages over the randomness of examples): for instance, a 95 percent chance that the prediction set will contain the true label on a new example.
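The three steps described above can be sketched in a few lines of Python. This is a minimal illustration of split conformal prediction under stated assumptions, not Dr. Roth's implementation: the nonconformity score (one minus the probability the model assigns to a label), the helper `toy_probs`, the label names, and all numbers are hypothetical, and the example uses an 80 percent target rather than 95 percent only because the toy holdout set has ten points.

```python
import math

LABELS = ["squirrel", "weasel", "muskrat", "truck"]

def nonconformity(probs, label):
    """Assumed score: large when the model gives `label` low probability."""
    return 1.0 - probs[label]

def calibrate_threshold(holdout, coverage=0.95):
    """Find the score threshold that covers `coverage` of the holdout set.

    `holdout` is a list of (probs, true_label) pairs. The rank
    ceil((n + 1) * coverage) is the usual finite-sample correction.
    """
    scores = sorted(nonconformity(probs, label) for probs, label in holdout)
    n = len(scores)
    rank = math.ceil((n + 1) * coverage)
    if rank > n:
        return float("inf")  # too few holdout points: keep every label
    return scores[rank - 1]

def prediction_set(probs, threshold):
    """Candidate labels whose nonconformity score is within the threshold."""
    return {label for label in probs if nonconformity(probs, label) <= threshold}

def toy_probs(true_label, p_true):
    """Hypothetical model output: p_true on one label, the rest split evenly."""
    rest = (1.0 - p_true) / (len(LABELS) - 1)
    return {lab: (p_true if lab == true_label else rest) for lab in LABELS}

# Hypothetical holdout set: probability the model gave to each true label.
holdout_p_true = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.45, 0.4, 0.25, 0.1]
holdout = [
    (toy_probs(LABELS[i % 3], p), LABELS[i % 3])
    for i, p in enumerate(holdout_p_true)
]
threshold = calibrate_threshold(holdout, coverage=0.8)

# A new, grainy image: the model is unsure which rodent it is.
new_probs = {"squirrel": 0.40, "weasel": 0.33, "muskrat": 0.26, "truck": 0.01}
rodent_set = prediction_set(new_probs, threshold)
print(threshold, sorted(rodent_set))
```

With these toy numbers the set keeps the three rodent labels and excludes "truck", mirroring the example in the text. The size of the set (three labels) reflects how uncertain the underlying model is on this input.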

Conformal prediction has shortcomings, including its marginal guarantees and assumptions about distributions, argued Dr. Roth. Marginal guarantees are averages over all data points, that is, “for 95 percent of people on which we make predictions, our prediction set contains their true label.” The issue is that the guarantee need not hold for any specific data point or subgroup. For instance, a demographic group comprising less than 5 percent of a population might have zero percent coverage even though the overall 95 percent target is met. One potential way to mitigate this, noted Dr. Roth, is by separately calibrating for each group. Dr. Roth pointed out, however, that groups of interest often overlap. The goal, he asserted, is to give meaningful statements about data points that are in multiple relevant groups. Furthermore, for conformal prediction to work, new data must be drawn from the same distribution as past data, which poses a problem when unanticipated distribution shifts occur in new data.

Dr. Roth stated that prediction set multivalidity is one way to create stronger than marginal guarantees. Prediction set multivalidity involves dividing the data into different groups that might intersect, so that a particular data point can be in multiple groups simultaneously. For any prediction, the goal is to have the true label in the prediction set 95 percent of the time, not merely overall, but conditional on membership in any pre-specified set of groups.
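The gap between marginal and group-conditional coverage can be made concrete with a small audit. The sketch below is illustrative only: the record fields (`minority`, `even_id`, `pred_set`) and the data are hypothetical, and the code measures coverage on overlapping groups rather than implementing the multivalid algorithm itself.

```python
def coverage(records):
    """Fraction of records whose prediction set contains the true label."""
    return sum(r["label"] in r["pred_set"] for r in records) / len(records)

def coverage_by_group(records, groups):
    """Empirical coverage conditional on membership in each group.

    `groups` maps a group name to a membership test; groups may overlap.
    """
    report = {}
    for name, is_member in groups.items():
        subset = [r for r in records if is_member(r)]
        report[name] = coverage(subset) if subset else None
    return report

# Hypothetical population of 100 people; the last 4 form a small minority.
records = []
for i in range(100):
    minority = i >= 96
    records.append({
        "minority": minority,
        "even_id": i % 2 == 0,  # a second, overlapping group
        "label": "weasel",
        # Majority records are covered; minority records are not.
        "pred_set": {"squirrel"} if minority else {"weasel", "squirrel"},
    })

groups = {
    "minority": lambda r: r["minority"],
    "even_id": lambda r: r["even_id"],
    "minority_and_even_id": lambda r: r["minority"] and r["even_id"],
}

overall = coverage(records)
report = coverage_by_group(records, groups)
print(overall, report)
```

Here the marginal coverage exceeds a 95 percent target, yet the 4 percent minority group has zero coverage, and so does its intersection with the second group. A multivalid guarantee would require coverage near the target on every listed group, including the intersecting ones.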

Dr. Roth presented an algorithm that can be parameterized by an arbitrary collection of intersecting groups. The algorithm takes, as input, any sequence of models for making point predictions, trains on historic data, and does not require a holdout set. No matter the sequence of examples, for any predicted threshold in any subset of groups, the difference between empirical coverage (how frequently the prediction set covers the label) and target coverage (e.g., 95 percent) will tend to zero at the statistically optimal rate. Unlike split conformal prediction, which cannot train on the holdout set, this approach can train on 100 percent of the data, enabling faster learning. It provides correct coverage in individual and intersecting groups within a data set and can tolerate unanticipated distribution shifts, resulting in more informative (narrower) prediction intervals. Aaron Luttman, Pacific Northwest National Laboratory, planning committee member, questioned whether the tighter prediction intervals made a substantive difference in real-world applications. Dr. Roth insisted that the enhanced predictions are not merely of academic interest because there exist real-life examples where the tighter coverage enabled by increased attention to subgroups leads to different decisions. A more explicit explanation of the model and its results can be found in several papers.²

ACCELERATING AUTONOMY FOR REAL-WORLD ROBOTICS IN COMPLEX ENVIRONMENTS

Timothy Chung, Microsoft Corporation, discussed his previous work at the Defense Advanced Research Projects Agency (DARPA) leading programs to accelerate autonomy for real-world robotics in complex environments. Dr. Chung noted that while robots currently operate successfully in isolated, designated safe zones (which give developers more control), developers are still learning to operate robots in congested environments with dynamic objects, more clutter, and hard physical limits. Looking ahead, particularly to applications of interest to the Army, robots will execute missions in contested environments featuring deliberately adversarial agents, challenging effects, and high levels of uncertainty.

Dr. Chung presented two DARPA projects that placed robots in complex, real-world environments: the Subterranean Challenge (Sub-T) and the Offensive Swarm Enabled Tactics (OFFSET) program. Sub-T involved teams of robots conducting an underground scavenger hunt, with the aim of discovering robotic technologies that enable actionable situational awareness. Robots dealt with dynamic terrain, austere navigation, degraded sensing, severe communications, endurance limits, and terrain obstacles. DARPA binned tools for addressing these challenges into four technology impact areas (autonomy, perception, networking, and mobility), with AI playing a role in each, Dr. Chung said.

Dr. Chung highlighted several insights from the Sub-T program. First, the regular attrition of robots emphasized the importance of resilience. Dr. Chung stated that attrition prompted DARPA to consider strategies at the concept of operations (CONOPS) level for measuring faith in each element (in this case, each robot) within a system. Second, Sub-T demonstrated that data gathering and situational awareness are not synonymous: robots can explore, gather data, and generate maps without extracting any useful information. Third, Dr. Chung emphasized the growing importance of systems integration. Nearly all teams had high-quality component technologies, but superior systems integration distinguished the top-performing teams in Sub-T.

DARPA’s OFFSET program, Dr. Chung explained, sought insights into human–machine teaming as well as the autonomy necessary to support an urban infantry mission. In the program, teams developed swarm systems architectures focused on higher-level representation of collaborative autonomy tasks (swarm tactics), resulting in simple designations of high-level swarm behavior. For example, commanders could scribble a circle on their tablet to request an overhead drone scan that would identify air and ground robots with the appropriate sensor configuration. This reduced the cognitive burden on swarm commanders. OFFSET also created a library of collaborative autonomy software, virtual swarm environments, and unique swarm data sets. Dr. Chung’s overarching takeaways are summarized in Box 1.

__________________

² V. Gupta, C. Jung, G. Noarov, M.M. Pai, and A. Roth, 2021, “Online Multivalid Learning: Means, Moments, and Prediction Intervals,” arXiv preprint, arXiv:2101.01739; O. Bastani, V. Gupta, C. Jung, G. Noarov, R. Ramalingam, and A. Roth, 2022, “Practical Adversarial Multivalid Conformal Prediction,” arXiv preprint, arXiv:2206.01067; C. Jung, G. Noarov, R. Ramalingam, and A. Roth, 2022, “Batch Multivalid Conformal Prediction,” arXiv preprint, arXiv:2209.15145.

PROMISE AND LIMITATIONS OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING: IMPLICATIONS FOR COMMAND AND CONTROL OPERATIONS

Azad Madni, University of Southern California, advocated for augmented intelligence as a solution to address challenges in C2. Dr. Madni noted that while AI/ML holds significant potential to serve as a force multiplier in C2, the operational context is very different from the controlled laboratory environment within which most AI/ML applications operate.

Dr. Madni delineated several concerns with AI/ML applications that are particularly salient in operational contexts. AI/ML applications deal with novel situations poorly, engage in abnormal system behavior when confronted with outliers, struggle to adapt to changing contexts, are ethically and legally unaware, lack causal reasoning capabilities (currently), and do not possess human imagination and creativity.

Augmented intelligence, argued Dr. Madni, has the potential to capitalize on the strengths of both humans and AI while overcoming their respective limitations. While AI offers fast computation, infallible recall, fast search, and pattern recognition, it struggles to contextualize information and process outliers, and it lacks causal and common sense reasoning. Humans can contextualize, generate creative options, and deal with outliers and ambiguity, yet they are prone to distraction and fatigue, and their recall and cognitive capacities are limited. Dr. Madni advocated for exploiting AI/ML in nominal situations, using humans to aid AI/ML in novel situations, and using AI/ML to aid humans in memory recall and computation-intensive tasks.

According to Dr. Madni, AI is most useful for reducing and eliminating stressful and repetitive tasks, integrating large quantities of data, detecting and responding to situations that are too fast for humans, and identifying infrequently occurring events and conditions. He delineated several high payoff AI/ML applications for C2 (see Box 2).

Dr. Madni also highlighted two considerations for human–machine collaboration. He observed that there are often trade-offs between machine optimality (“creating the perfect algorithm”) and human–machine optimality. Many supposedly “optimal” algorithms are not amenable to incorporating the input of a human in the loop. Dr. Madni stressed the significance of common frameworks in response to a comment by Conrad Tucker, Carnegie Mellon University, planning committee co-chair, on the challenge of ensuring interoperability given that many algorithms require fixed data inputs. Dr. Madni pointed to the creation of a shared ontology as an important step in ensuring that the AI/ML community is making a common set of underlying physical and semantic assumptions across all models.

IMPACT OF WORLD STATE AWARENESS ON JOINT HUMAN–AUTOMATION DECISION MAKING

Karen Feigh, Georgia Institute of Technology, presented her findings on the impact of world state awareness in joint human–automation decision making. Dr. Feigh delineated the current conception of AI use, where the AI queries some set of sensors, generates a suggestion, presents the suggestion to the human for evaluation, and then the human either approves (AI executes) or vetoes (AI iterates again). Dr. Feigh stated that the human’s role in this process is often difficult and sometimes even impossible. While much human–automation interaction (HAI) research focuses on supporting the human through improvements in AI suggestion evaluation and explainability, Dr. Feigh noted that there are ways to aid the human even if the AI and its suggestion-evaluation mechanisms are unaltered.

Dr. Feigh presented a study that examined ways to introduce transparency into black box AI deployments to improve humans’ collaborative performance with an AI teammate.³ Two considerations grounded the study. First, decision making is merely one phase of a cognitive process cycle in which all phases are interdependent. Thus, expecting a human to approve an isolated decision often yields poor results. Second, shared mental models are an integral part of HAI. Each agent (human or autonomous) possesses a unique mental model of its own capabilities and role on the team. The shared mental model is the overlapping space in which agents understand each other’s roles, capabilities, and informational constraints. From these considerations, the study focused on creating a shared situational awareness by improving the human’s understanding of the world state awareness (WSA) on which the automation based its suggestions.

The study augmented the common conception of AI development by incorporating steps to measure the degree of shared situational awareness between the human and AI, shared assessments of suggestions, and shared assessments of final decisions. The study found that increasing WSA improved overall task performance and was a statistically significant predictor of shared situational awareness, final agreement, the human’s initial judgment, and the human’s final decision. The results also demonstrated that as WSA increases, humans are less trusting of AI capabilities and better able to discern when the AI is mistaken.

BUILDING FOUNDATIONS FOR TRUST IN ARTIFICIAL INTELLIGENCE PRODUCTS

Heather Frase, Center for Security and Emerging Technology, discussed support systems for trust in AI. Dr. Frase asserted that trust involves multiple support systems working together, including communal resources, trusted companies, trusted products, and trusted users. According to Dr. Frase, the current support system for AI trust contains gaps in each of these areas. There is minimal recognition of different types of abuse and no comprehensive understanding of behavior in operational conditions. AI companies display inconsistent adoption of best practices, stated Dr. Frase. There is no standard method to assess trustworthiness across AI products. Furthermore, there is no mechanism to identify and restrict malicious users.

__________________

³ D.K. Srivastava, J.M. Lilly, and K.M. Feigh, “Improving Human Situation Awareness in AI-Advised Decision Making,” paper presented at 2022 IEEE 3rd International Conference on Human-Machine Systems, https://2022.hci.international/ai-hci.

AI products are particularly difficult to trust, argued Dr. Frase, because they lack the historic understanding that comes with steady, incremental progress. While existing test design science is capable of handling complex systems with large numbers of variables, it relies on historic knowledge of systems for efficient testing. Radars, for example, have undergone 80 years of incremental changes, performance testing, and operator experience. AI, by contrast, has minimal historic understanding.

Dr. Frase argued that it is possible to accelerate the trust-building process for AI by creating the appropriate infrastructure to accumulate and share historic knowledge. Dr. Frase asserted that infrastructure is critical for a number of reasons. It ensures that information about AI performance and testing is discoverable and available, shares knowledge across programs, stores and sanitizes information for use across multiple classifications, and stores and monitors post-production data. To meet the needs of AI for C2 in particular, Dr. Frase recommended the following steps: identify related internal AI programs; identify similar joint AI programs; share, gather, and store information; and identify methods to store and leverage post-deployment AI monitoring, performance, and behavior data.

Dr. Frase also stressed the importance of instituting a tiered and triaged classification process for testing and demonstration of AI products, similar to the Food and Drug Administration’s (FDA’s) classification of medical devices. FDA regulates medical devices via a tiered system that accounts for device risk and complexity. The FDA system prioritizes breakthrough systems, ensures post-approval monitoring and knowledge building, and uses tiered ranking to assess risk and incremental change systematically. Dr. Frase suggested that DoD should adopt a similar process for AI C2 products. Dr. Hwang gestured to the efforts of the National Institute of Standards and Technology (NIST) as positive progress toward this goal. Dr. Frase asserted that NIST has made consistent improvements in its risk management framework and is increasingly emphasizing processes implemented by companies to achieve trusted AI.

HOW DO ORGANIZATIONS ACCELERATE MACHINE LEARNING INTEGRATION?

Benjamin Harvey, AI Squared and Johns Hopkins University, discussed ways that organizations can accelerate ML integration. Dr. Harvey approached the issue from a background in the Intelligence Community (IC) as well as private industry. While at the National Security Agency (NSA), Dr. Harvey oversaw the integration of AI into mission production applications as the chief of operations for data science. He recalled that data scientists at NSA were frustrated because while they were achieving excellent results in a controlled experimental setting, the AI/ML capabilities were not getting to the end users (analysts and warfighters).

Dr. Harvey stated that while investments in AI are massive, two out of three AI projects fail, primarily for three reasons. First, deploying and integrating ML models is difficult, he asserted. At NSA, for example, integrating a single model required coordinating across ML engineers, data scientists, development operations, front-end developers, application managers, and so on. Dr. Harvey noted that it entailed aggravatingly large amounts of time and money. Second, it is a challenge to build ML applications that teams actually want to adopt. Dr. Harvey recalled the following assessment from an IC analyst: If the results of the model are not actionable, relevant, timely, and contextualized, analysts will not use the model. Dr. Harvey stressed the importance of building applications that effectively communicate to end users. Third, Dr. Harvey averred that most organizations focus on the front end of the ML pipeline (data preparation and labeling, building sophisticated models) and neglect the crucial “last mile” of integration and optimization.

The last mile of ML, stated Dr. Harvey, presents several significant challenges. Most organizations, including DoD, seek to integrate AI/ML into legacy applications. Often, there is minimal access to the code base of such older systems, making integration a challenge and adding months to the process. Additionally, Dr. Harvey argued that unsatisfactory results dissuade end users from trusting or using ML applications. Siloed teams and long timelines are often the culprits behind failures to quickly acquire feedback and iterate on the model to better address end user needs, stipulated Dr. Harvey.

To overcome obstacles in integration and optimization (the last mile), Dr. Harvey recommended that organizations take the following four steps:

Align multiple data sources and models to integrate collectively into legacy applications.
Embed ML results directly into web applications to put ML in the hands of users, eschewing the traditional development of trying to perfect the model prior to deployment.
Create the governance tools necessary to customize how their applications display ML results.
Continuously acquire collaborative feedback on ML model performance.

During the discussion, Dr. Harvey remarked on accelerating integration through observational studies and on the differences between the integration approaches of industry and the IC. Dr. Harvey touted observational studies as a method to accelerate integration of ML models into user workflows. In observational studies, data scientists train a model and quickly get it into the hands of a select number of users (a far different approach than the usual method of endlessly fine-tuning the model prior to deployment). Data scientists observe for a few weeks as the select users test the model, then quickly iterate and deploy at a larger scale. Dr. Luttman commented that moving faster requires greater pre-existing confidence and trust on the part of the users. Dr. Harvey posited that observational studies facilitate trust, because end users become part of the development process and gain observational and experimental experience with ML models. The core lesson, Dr. Harvey stated, was that the data and model do not need to be perfect prior to putting the application in the hands of end users.

Dr. Harvey also observed that the IC often favors a top-down approach to requirements, rather than the bottom-up approach of observational studies. He stated that the top-down approach slows the pace of innovation so that by the time applications reach end users in the IC, they are often no longer relevant. He recommended that the IC and DoD employ agile ML development operations to assess what is of value to the end user—a direction in which industry is already moving.

EVALUATING, INTERPRETING, AND MONITORING MACHINE LEARNING MODELS

Ankur Taly, Google, Inc., presented on evaluating, interpreting, and monitoring ML models, most of which are black boxes to humans. Dr. Taly noted that the current procedures for evaluating ML models entail back learning. Dr. Taly pointed out several shortcomings, however, such as variations in test accuracy and test sets that are not representative of deployment. The goal, Dr. Taly argued, is to have the capability to drill down to individual slices of the data and to assess whether the test data are representative of what the user would see in deployment.

Dr. Taly characterized several existing approaches to interpreting model predictions as “naïve.” In interpreting model predictions, the aim is to understand why the model made a prediction and to be able to attribute a model’s prediction to features of the input. Unsatisfactory approaches, in Dr. Taly’s estimation, include ablation and feature gradients. Ablations, which drop individual features and assess how predictions change, are computationally expensive, require unrealistic inputs, and are misleading when features interact, he argued. Dr. Taly assessed simple feature gradients to be insufficient as well. As an alternative to either of these, Dr. Taly noted with approval the method of “integrated gradients.” This method integrates gradients of the model’s output along a straight path from a baseline to the input, with the goal of providing a post hoc justification for a model’s prediction (baselines are information-less inputs for the model, such as black images for image models or empty text for text models). Dr. Taly touted integrated gradients as an easy to apply, widely applicable, and agreed-upon technique for attributing a deep network’s prediction to its input features.
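The path integral behind integrated gradients can be approximated numerically. The following is a minimal sketch under stated assumptions: the "model" is a toy logistic function with a hand-written gradient (a stand-in for a deep network), the weights and inputs are hypothetical, and the integral is approximated with a midpoint Riemann sum over 200 steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    """Toy differentiable model: a logistic score over the input features."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def gradient(x, w, b):
    """Analytic gradient of the model output with respect to each feature."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate the gradient integral along the straight path to x."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for i, g in enumerate(gradient(point, w, b)):
            totals[i] += g
    # Scale the averaged gradient by the distance travelled in each feature.
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, totals)]

# Hypothetical weights and input; the all-zeros baseline plays the role
# of an "information-less" input such as a black image.
w, b = [2.0, -1.0, 0.5], -0.5
x, baseline = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w, b)
output_change = model(x, w, b) - model(baseline, w, b)
print(attributions, sum(attributions), output_change)
```

A useful sanity check is that the attributions sum (up to numerical error) to the difference between the model's output at the input and at the baseline, which is the completeness property of the method. In this toy case the first feature receives positive credit and the second negative credit, matching the signs of their weights.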

Continuous monitoring of models is significant, affirmed Dr. Taly, because prediction data may differ significantly from test data and because the distribution of features and labels may drift over the course of production (due to outliers, bugs, etc.). He noted two ways to monitor ML models: monitoring of feature distribution and monitoring of prediction distribution. Dr. Taly assessed these approaches to have at least three problems. First, they have difficulties dealing with multiple-feature representations. Second, large feature drifts do not necessarily result in substantive performance changes. Third, it is difficult to assess drift that occurs in correlation between features. Dr. Taly noted that an alternative is attribution-based monitoring, which monitors trends of feature attribution scores for every feature. Assessing feature-attribution drift (comparing the distribution of feature attributions between a serving window and a reference window) offers several benefits, according to Dr. Taly. He stated that it is importance weighted, applies to all feature representations, accounts for feature interactions, is applicable to many feature groups, and stabilizes the monitoring of features across models. In his concluding remarks and in response to several queries during the discussion, Dr. Taly offered some of his main takeaways (see Box 3).
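As a rough illustration of attribution-based monitoring, the sketch below compares a reference window with a serving window. It rests on simplifying assumptions that are not from the presentation: drift is measured as the change in each feature's share of total absolute attribution (a stand-in for a fuller distributional comparison), the feature names and scores are hypothetical, and the 0.10 tolerance is arbitrary.

```python
def attribution_shares(window):
    """Share of total absolute attribution carried by each feature.

    `window` is a non-empty list of dicts mapping feature name to an
    attribution score; at least one score is assumed to be nonzero.
    """
    totals = {}
    for row in window:
        for feature, score in row.items():
            totals[feature] = totals.get(feature, 0.0) + abs(score)
    grand_total = sum(totals.values())
    return {feature: t / grand_total for feature, t in totals.items()}

def attribution_drift(reference, serving):
    """Per-feature change in attribution share between two windows."""
    ref = attribution_shares(reference)
    cur = attribution_shares(serving)
    return {f: cur.get(f, 0.0) - ref.get(f, 0.0) for f in set(ref) | set(cur)}

def drifted_features(reference, serving, tolerance=0.10):
    """Features whose attribution share moved by more than `tolerance`."""
    drift = attribution_drift(reference, serving)
    return sorted(f for f, d in drift.items() if abs(d) > tolerance)

# Hypothetical attribution scores for two predictions in each window.
reference_window = [
    {"speed": 0.6, "altitude": 0.3, "heading": -0.1},
    {"speed": 0.5, "altitude": -0.4, "heading": 0.1},
]
serving_window = [
    {"speed": 0.2, "altitude": 0.7, "heading": 0.1},
    {"speed": 0.1, "altitude": -0.8, "heading": -0.1},
]

flagged = drifted_features(reference_window, serving_window)
print(attribution_drift(reference_window, serving_window), flagged)
```

Because the comparison is made on attributions rather than raw feature values, a feature that drifts but carries little weight in the model's predictions contributes little to the signal, which is one sense in which such monitoring is importance weighted.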

BUILDING WELL-CALIBRATED TRUST IN ARTIFICIAL INTELLIGENCE SYSTEMS

Paul Scharre, Center for a New American Security, articulated his vision of well-calibrated trust in AI systems. In well-calibrated trust, designers and testers understand the actual capabilities and limitations of a system, its performance boundaries, and potential points of failure. They modify the system to ensure that unavoidable failures are not hazardous for the user. Operators undergo appropriate levels of training to acquire a full understanding of the system’s abilities and limitations. They absorb its performance information through an appropriately communicative human–machine interface. Operators take responsibility for the system and employ it in appropriate settings. Policymakers and commanders understand the capabilities and risks of systems that they authorize for employment. Finally, Dr. Scharre explained, defense institutions ensure that those interacting with the system throughout its design, development, testing, training, and employment phases understand its capabilities and limitations, and they foster a culture of responsibility.

Dr. Scharre asserted that DoD maintains effective procedures for creating well-calibrated trust during the development pipeline, which consists of five phases: development, testing, operator training, authorization, and use. Furthermore, DoD possesses significant experience in the deployment of complex systems in high-risk applications, noted Dr. Scharre. But ML, he argued, presents a new set of challenges. In addition to the challenge of integrating ML into broader digital systems, the unique ML pipeline (consisting of data, model training, testing and refinement, use, and model refinement) exposes new vulnerabilities at each phase.

Dr. Scharre provided the following takeaways on the ML pipeline:

Data: Securing training data is paramount because adversarial access could expose the model to poisoning or exploitation.
Model training: Models should have appropriate goals and sufficient compute power to achieve them.
Testing and refinement: Emergent behaviors are common, and models are vulnerable to manipulation. Thus, it is essential to red-team prior to deployment and fine-tune post-deployment.
Use: Minute changes in AI goals can result in drastically altered behavior by the model. A model can also possess features that upset or aggravate users, even if they do not inhibit its ability to accomplish the mission.
Model refinement: AI systems do not react well to novelty and will experience performance degradation in operational environments.
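
One narrow, commonly used safeguard related to the data takeaway above is an integrity check on stored training records. The Python sketch below is only an assumption-laden illustration and was not presented at the workshop: the record IDs, the record contents, and the `fingerprint` and `audit` helpers are hypothetical. It detects changes made after a trusted manifest was recorded; it does not detect data that was already poisoned when the manifest was created.

```python
import hashlib


def fingerprint(records):
    """Map each record ID to the SHA-256 hex digest of that record's bytes."""
    return {rid: hashlib.sha256(data).hexdigest() for rid, data in records.items()}


def audit(records, manifest):
    """Compare current records against a trusted manifest of digests."""
    current = fingerprint(records)
    return {
        "modified": sorted(r for r in manifest if r in current and current[r] != manifest[r]),
        "missing": sorted(r for r in manifest if r not in current),
        "added": sorted(r for r in current if r not in manifest),
    }


# Hypothetical training records, keyed by ID, stored as raw bytes.
trusted = {
    "rec_001": b"sensor=12.4;label=vehicle",
    "rec_002": b"sensor=3.1;label=clutter",
    "rec_003": b"sensor=9.8;label=vehicle",
}
manifest = fingerprint(trusted)  # recorded while the data is known to be good

# Later, before training: one label flipped, one record dropped, one injected.
on_disk = dict(trusted)
on_disk["rec_002"] = b"sensor=3.1;label=vehicle"
del on_disk["rec_003"]
on_disk["rec_999"] = b"sensor=0.0;label=clutter"

report = audit(on_disk, manifest)
```

Here `report` lists `rec_002` as modified, `rec_003` as missing, and `rec_999` as added, which gives a reviewer a concrete starting point before any retraining proceeds.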

Dr. Scharre concluded with remarks on several crosscutting themes from his presentation. Despite its enormous promise, he noted, ML remains an immature technology. Dr. Scharre suggested that DoD adjust its existing (and admittedly successful) processes for building trustworthy systems to account for the unique challenges of ML. This will take new tools, infrastructure, and processes, as well as greater investments of resources. For example, Dr. Scharre recommended that DoD implement the necessary processes and authorities to retrain algorithms continuously on real-world data gathered during field deployment. He also encouraged greater flexibility from DoD in incorporating industry best practices. Overall, Dr. Scharre concluded that well-calibrated trust is a worthy goal for DoD, one that ensures that those interacting with AI systems are cognizant of their limits and understand the appropriate settings in which to deploy them.

PLANNING COMMITTEE DISCUSSION

During closing discussion, planning committee members returned to a few general topics from the first day of the workshop: the technological maturity of AI, the special challenges of AI in C2, the Army’s efforts to accelerate the AI pipeline, and the Army’s integration and testing of AI.

Dr. Hwang assessed that the current state of the art is far from full autonomy: the continued necessity for a human in the loop was a crosscutting theme of the presentations. She stipulated that the human would remain a factor for C2 applications in particular. Marvin Langston, Langston Associates LLC, planning committee member, stated that the presentations reflected the reality that while AI can be highly successful at explicit and specific tasks, it is not yet capable of presenting courses of action in the dynamic and unpredictable environment of C2. In fact, he asserted, C2 may be the most complex application of AI. Emphasizing this point, Catherine Crawford, IBM Corporation, planning committee member, noted that deployed AI will confront not merely a challenging operating environment, but the efforts of an AI-capable adversary. AI, still in its nascent stages, can leverage insights from HAI and natural language processing, noted Dr. Langston and Erin Chiou, Arizona State University, planning committee members.

Basic questions about AI in C2 remain unanswered, noted Dr. Tucker. For example, he wondered whether point estimation or distribution (two methods discussed during the presentations) was more relevant to the C2 perspective. Dr. Luttman asserted that although distributions do not often come into play in practical applications, having a distribution can help an operator explain why they took a specific course of action. Dr. Crawford posited that for some decision makers, the variability presented in a distribution might be unnecessary, unhelpful, and even unsettling. Thus, Dr. Crawford stated, it is imperative that models cognitively align with the humans who use them.
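
To make the distinction between a point estimate and a distribution concrete, the Python sketch below summarizes two sets of model outputs both ways. It is an illustration only: the quantity (a predicted time in minutes), the sample values, the 90 percent interval, and the function names are all hypothetical and were not part of the workshop discussion.

```python
import numpy as np


def point_estimate(samples):
    """Collapse a set of predictions into a single number (the mean)."""
    return float(np.mean(samples))


def distribution_summary(samples, level=0.9):
    """Report the median plus a central interval covering `level` of the samples."""
    tail = (1.0 - level) / 2.0
    low, median, high = np.quantile(samples, [tail, 0.5, 1.0 - tail])
    return {"low": float(low), "median": float(median), "high": float(high)}


# Hypothetical predictions of the same quantity in two situations.
rng = np.random.default_rng(1)
noise = rng.normal(0.0, 1.0, size=1000)
noise = noise - noise.mean()        # center so that both cases share one mean
confident = 30.0 + 1.0 * noise      # tightly clustered predictions
uncertain = 30.0 + 10.0 * noise     # widely spread predictions

confident_point = point_estimate(confident)
uncertain_point = point_estimate(uncertain)
confident_range = distribution_summary(confident)
uncertain_range = distribution_summary(uncertain)
```

Both cases yield essentially the same point estimate of about 30, yet the interval for the second case is roughly ten times wider. Whether that extra information helps or unsettles a decision maker is the question the committee members raised.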

Several planning committee members voiced concerns with the existing AI pipeline. Dr. Tucker stated that DoD has trouble attracting talented students, who are reluctant to work on the government’s longer timeframes. The faster tempo of industry appeals to those looking to accelerate their careers. Dr. Luttman suggested that some DoD researchers at the National Laboratories might find existing regulations burdensome.

Several planning committee members discussed ways to accelerate the Army’s integration and testing of AI systems. Dr. Crawford noted that there is currently a “chasm” between development and deployment. To bridge this gap, Dr. Hwang suggested that the Army tailor its training and use cases to specific environments, similar to the environmental focus of DARPA’s Sub-T and OFFSET programs. Environmental factors inevitably influence the complexity of tasks that AI is required to accomplish, she noted. Dr. Crawford pointed out possible improvements through closer interaction between DoD and industry. DoD can provide industry with more explicit descriptions of actual use cases for deployed AI, while industry can accelerate DoD’s progress by advising on the state of the art. Furthermore, Dr. Luttman highlighted the presentation of Dr. Harvey to illustrate the potential of field experimentation. Dr. Luttman suggested that the Army would benefit from integrating AI with end users as early as possible and from constantly iterating its models in light of real-world experience.

DISCLAIMER This Proceedings of a Workshop—in Brief was prepared by CLEMENT MULOCK as a factual summary of what occurred at the workshop. The statements made are those of the rapporteur or individual workshop participants and do not necessarily represent the views of all workshop participants; the planning committee; or the National Academies of Sciences, Engineering, and Medicine.

PLANNING COMMITTEE JENNIE S. HWANG (NAE) (Co-Chair), H-Technologies Group; CONRAD TUCKER (Co-Chair), Carnegie Mellon University; ERIN K. CHIOU, Arizona State University; CATHERINE H. CRAWFORD, IBM Corporation; MARVIN J. LANGSTON, Langston Assoc. LLC; NANDI O. LESLIE, Raytheon Technologies; AARON B. LUTTMAN, Pacific Northwest National Laboratory; JOHN M. MATSUMURA, RAND Corporation; TODD D. MURPHEY, Northwestern University.

STAFF WILLIAM “BRUNO” MILLONIG, Director/Scholar, BOARD; NIA JOHNSON, Program Officer, Intelligence Community Studies Board; THO NGUYEN, Senior Program Officer, Computer Science and Telecommunications Board; CAMERON MALCOM, Research Associate, BOARD; TINA M. LATIMER, Program Coordinator, BOARD; TRAVON C. JAMES, Senior Program Assistant, BOARD; CLEMENT (MAC) MULOCK, Program Assistant, BOARD.

REVIEWERS To ensure that it meets institutional standards for quality and objectivity, this Proceedings of a Workshop—in Brief was reviewed by JENNIE S. HWANG (NAE), H-Technologies Group; RYAN MURPHY, National Academies of Sciences, Engineering, and Medicine; and CONRAD TUCKER, Carnegie Mellon University. JAYDA WADE, National Academies of Sciences, Engineering, and Medicine, served as the review coordinator.

SPONSOR This workshop was sponsored by the Deputy Assistant Secretary of the Army.

For additional information regarding the workshop, visit https://www.nationalacademies.org/our-work/artificial-intelligence-and-justified-confidence-a-workshop.

Suggested citation: National Academies of Sciences, Engineering, and Medicine. 2023. Artificial Intelligence and Justified Confidence: Proceedings of a Workshop—in Brief. Washington, DC: The National Academies Press. https://doi.org/10.17226/26887.

Division on Engineering and Physical Sciences