Suggested Citation:"3 Research and Development Priorities." National Academies of Sciences, Engineering, and Medicine. 2023. Charting a Path in a Shifting Technical and Geopolitical Landscape: Post-Exascale Computing for the National Nuclear Security Administration. Washington, DC: The National Academies Press. doi: 10.17226/26916.

3
Research and Development Priorities

Research and development (R&D) activities have been a critical piece of the National Nuclear Security Administration (NNSA) strategy in Science-Based Stockpile Stewardship, including the development of better mathematical models, numerical algorithms, parallel programming tools, high-performance computing (HPC) operating systems, and more. These investments include higher-risk research activities to explore new approaches, and are key to developing more sophisticated computation models, addressing computing technology challenges, and attracting top talent into the NNSA program.

MATHEMATICS AND COMPUTATIONAL SCIENCE R&D

As discussed in Chapter 1, NNSA has significant requirements for HPC beyond exascale to support efforts on all aspects of the weapons life cycle, including design, production, certification, and safety issues. The overall drivers for these requirements stem from pursuing future considerations of possible new designs, new manufacturing processes, aging of the stockpile, and potential needs for rapid response to new global threats. As discussed in Chapter 2, the requirements for increased HPC must be addressed against a backdrop of both technological and ecosystem disruptions. This section discusses the role of applied mathematics and computational science R&D in delivering the computational capabilities needed to meet NNSA mission requirements. Applied mathematics and computational science have played a significant role in the development of a wide range of simulation technologies that have been critical to the success of the Advanced Simulation and Computing program,¹ a role that the committee anticipates will become increasingly important in the future. However, it is also important to appreciate that applied mathematics and computational science have advanced computational capabilities “beyond forward simulation,” including large-scale optimization (for design and control), solution of inverse problems (for parameter estimation), and uncertainty quantification. Continued advances in these areas are required to support future mission needs across all aspects of the NNSA life cycle, from discovery, design exploration, and optimization, to manufacturing and certification, as well as deployment and surveillance.

Applied Mathematics and Computational Science to Enable Forward Simulations for NNSA Mission Problems

Simulation plays a critical role in scientific discovery and forms the backbone of engineering design and analysis. High-fidelity simulations are critical for the design of nuclear explosive packages (NEPs) as well as for supporting other aspects of the weapons life cycle. Original designs of nuclear weapons were tightly coupled to tests. Early simulations were calibrated to test data with little predictive capability outside a narrow envelope defined by available data. Requirements to explore new design concepts and to ascertain reliability of the existing stockpile in the absence of testing have driven a need for improved predictive capability. For the past three decades, increasing computational capability, primarily in terms of number of processors, has driven a wave of weak-scaling-based advances where larger HPC systems allowed researchers to solve larger problems with higher-fidelity physics models. NNSA effectively exploited this trend, developing a sophisticated simulation methodology that significantly improved the quality of simulations. Applied mathematics and computational science contributed to this success in two ways: by developing robust, accurate, and scalable discretizations for complex multiphysics applications that effectively utilize massively parallel HPC architectures, and by developing verification, validation, and uncertainty quantification (VVUQ) methodology targeted at assessing the fidelity of simulation results.

In spite of major advances in simulation capability, many problems remain that are beyond reach even with exascale computing. Examples of these types of problems include not only high-fidelity simulations of NEPs, but also simulations of more fundamental problems in weapons science that inform models in full-system simulations. Examples of the latter, as discussed earlier, include modeling of high explosives, behavior of materials under extreme conditions, turbulence and turbulent mixing, and radiation/matter interaction.

___________________

¹ It is well documented that advances in algorithms over the past decades have led to computational speedups that have paralleled the exponential growth in computing power according to Moore’s law. U. Rüde, K. Willcox, L.C. McInnes, and H. De Sterck, 2018, “Research and Education in Computational Science and Engineering,” SIAM Review 60(3):707–754.

A fundamental concern in many areas such as the examples mentioned above is that relying on a weak-scaling approach based on larger versions of architectures currently being deployed at the exascale is no longer a viable strategy. Increasing fidelity by increasing spatial resolution necessitates an increase in temporal resolution. Consequently, even if one assumes ideal scaling, this weak-scaling paradigm results in increasing time to solution. For example, increasing the resolution of a three-dimensional high-explosive detonation simulation by an order of magnitude will increase resources needed to store the solution by three orders of magnitude and increase the time to solution by an order of magnitude (or more if any aspect of the simulation fails to scale ideally). For simulations that require several weeks or more to complete now, the time-to-solution for a higher-resolution version of the same problem will be on the order of a year or more for a single simulation, making it, for all practical purposes, infeasible. Strong scaling, in which additional resources are used for a fixed-size problem, has been shown to have only limited success because performance quickly becomes limited by communication costs. Several of the problems mentioned above, as well as many of the engineering analyses relevant to the complete weapons life cycle, share this characterization.
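The arithmetic behind the detonation example can be sketched as follows. This is a minimal illustration, assuming an explicit time-stepping scheme whose stable time step shrinks in proportion to the mesh spacing, and ideal weak scaling in which the processor count grows with the number of cells. The function name `refinement_cost` and its parameters are hypothetical and do not come from the report.

```python
def refinement_cost(refine: float, dims: int = 3) -> dict:
    """Cost multipliers when the mesh spacing shrinks by `refine` in each dimension.

    Assumptions (illustrative only): the time step scales with the mesh
    spacing, and the processor count grows with the cell count, i.e. ideal
    weak scaling with a fixed number of cells per processor.
    """
    cells = refine ** dims          # storage grows with the number of cells
    steps = refine                  # smaller cells force proportionally more time steps
    total_work = cells * steps      # cell updates summed over all time steps
    processors = cells              # ideal weak scaling: machine grows with the problem
    wall_time = total_work / processors
    return {"memory": cells, "steps": steps, "work": total_work, "wall_time": wall_time}


cost = refinement_cost(10, dims=3)
print(cost)
```

Under these assumptions, a tenfold refinement in three dimensions multiplies storage by 1,000 and time to solution by 10 even on a machine 1,000 times larger, so a run that takes five weeks today would take roughly fifty weeks.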

This breakdown of the weak-scaling paradigm coupled with the disruptions in NNSA computing discussed in Chapter 2 will necessitate rethinking how simulations are performed. Advancing the state of the art in forward simulation will rely on significantly different algorithmic approaches and methods to exploit novel hardware advances effectively. Improved methodology for traditional simulation such as more accurate discretization, faster solvers, and better coupling approaches will undoubtedly play a role, but these need to be augmented with other types of approaches.

Exactly what other approaches will ultimately prove useful remains an open question. One potential area would be the incorporation of machine learning or other types of statistical data-driven approaches. The basic idea would be to replace some computationally expensive component of a simulation with a relatively inexpensive data-driven model based on either experiment or data computed from a separate simulation. Although potentially expensive to train, the resulting model could dramatically reduce simulation costs as discussed in the artificial intelligence (AI) R&D section later in this chapter. Another potential approach would be to develop multiscale techniques for different processes that are able to capture the effects of finer-scale behavior on the larger-scale dynamics, reducing the range of scales needed for simulations. In both cases, the fidelity of the new model and its impact on the uncertainty of the overall simulation need to be quantified.
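The idea of replacing an expensive component with a cheap data-driven model can be illustrated with a deliberately small sketch. Everything here is an assumption made for illustration: `expensive_model` is a hypothetical stand-in for a costly physics kernel, and the surrogate is a simple piecewise-linear table rather than a machine-learning model. The held-out error check at the end reflects the point that the fidelity of the surrogate has to be measured.

```python
import bisect
import math


def expensive_model(x: float) -> float:
    """Hypothetical costly kernel: midpoint-rule integral of exp(-t^2) on [0, x]."""
    n = 5_000
    h = x / n
    return h * sum(math.exp(-((i + 0.5) * h) ** 2) for i in range(n))


def train_surrogate(lo: float, hi: float, samples: int):
    """Evaluate the expensive model on a grid once; return a cheap interpolant."""
    xs = [lo + (hi - lo) * i / (samples - 1) for i in range(samples)]
    ys = [expensive_model(x) for x in xs]

    def surrogate(x: float) -> float:
        # Refuse to extrapolate: outside the training range the error is unknown.
        if not lo <= x <= hi:
            raise ValueError("surrogate queried outside its training range")
        j = min(max(bisect.bisect_right(xs, x), 1), samples - 1)
        t = (x - xs[j - 1]) / (xs[j] - xs[j - 1])
        return (1 - t) * ys[j - 1] + t * ys[j]

    return surrogate


surrogate = train_surrogate(0.0, 2.0, samples=41)

# Held-out points that do not coincide with the training grid.
test_points = [0.013 + 0.0497 * k for k in range(40)]
max_error = max(abs(surrogate(x) - expensive_model(x)) for x in test_points)
print(f"max held-out error: {max_error:.2e}")
```

The training cost is paid once, after which each query is a table lookup. The range check and the held-out error estimate are the simplest possible forms of the fidelity assessment that the text calls for.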

Another issue that arises within the context of improved forward simulations is the need for improved models for key physical processes (discussed in Chapter 1). Data-driven multiscale approaches based on finer-scale simulations are needed to systematically define coarse-grained models for use in systems-level simulations with quantified fidelity. Of particular interest in this context are situations where there is important fine-scale behavior that cannot be readily captured by larger-scale models. Microstructure, small voids, and the presence of cracks have significant impact on the dynamic response of solids, which can alter the behavior of solid high explosives as well as other aspects of the weapons systems. Traditional models for fluid mechanics fail to accurately predict the internal structure of strong shock waves. Simulating systems of this type will require simulation methodology that uses different physical descriptions at different scales with systematic ways to identify what model is appropriate in a given part of the problem and to couple different types of representations dynamically.

VVUQ is also a critical element of NNSA mission problems and becomes even more important with the increased use of data-driven modeling and machine learning. Models must have quantified fidelity—with a clearly defined metric of what it means to be trusted—and the impact of an individual model’s fidelity on overall simulation accuracy must be characterized. While the AI methods discussed in the section on AI R&D later in this chapter may provide new opportunities to augment, enhance, and accelerate physics-based simulators, their overall utility will be severely limited without rigorous VVUQ.

Meeting these challenges will require significant investment in applied mathematics and computational science. However, it is also important to recognize that this investment must support a broader range of mathematical sciences than in the past. There is a continued need for the mathematics that supports scalable multiphysics algorithms, but there is also an increasing need for the mathematics that supports multiscale modeling, machine learning, AI, and statistical and data-driven modeling. There is also a need to address the challenges and opportunities of integrating data-driven models with traditional simulation methodologies to develop more effective predictive capabilities, recognizing the essential role that VVUQ must play in NNSA applications.

FINDING 3: Bold and sustained research and development investments in hardware, software, and algorithms—including higher-risk research activities to explore new approaches—are critical if NNSA is to meet its future mission needs.

FINDING 3.1: Physics-based simulators will remain essential as the core of NNSA predictive simulation. However, given disruptions in computing technology and the HPC ecosystem combined with the end of the weak-scaling era, novel mathematical and computational science approaches will be needed to meet NNSA mission requirements.

FINDING 3.2: VVUQ and trustworthiness remain of paramount importance to NNSA applications. VVUQ will become increasingly important as simulation methodology shifts toward more complex systems that incorporate models of different fidelity, including data-driven approaches.

Algorithms for Novel Architectures

Future computer architectures are expected to be considerably more heterogeneous than today’s systems. Systems may incorporate a variety of different capabilities such as accelerators designed for machine learning. Applied mathematics research will be needed to develop new algorithmic approaches to effectively utilize these novel architectures and integrate those approaches into multiphysics/multiscale simulations. Custom-designed hardware targeted toward specific applications provides another novel approach to obtain improved performance. In this case, co-design of hardware and algorithms will be essential. Designing effective custom hardware will require a close partnership between hardware architects and applied mathematicians and computational science researchers. The emergence of highly heterogeneous architectures will also drive a need for theory and methods to achieve optimal management of heterogeneous models/data over hierarchical and distributed compute and network resources.

FINDING 3.3: Novel architectures can have a significant impact on NNSA computing; however, mathematical research will be needed to effectively exploit these new architectures. Involvement of applied mathematicians and computational scientists early in the development cycle for novel architectures will be important for reducing development time for these types of systems.

FINDING 3.4: An end to transistor density scaling is likely to motivate industry to develop novel computer architectures for which today’s numerical algorithms, software libraries, and programming models are ill suited.

Applied Mathematics and Computational Science Beyond Forward Simulation

As noted above, computational science has long encompassed more than just forward simulation. The past decades have seen advances in large-scale optimization, inverse problems, and uncertainty quantification (sometimes referred to as “outer-loop” applications of computational science because they require many forward model evaluations²,³,⁴), with impact on NNSA design and VVUQ workflows. As the sophistication of predictive simulations continues to increase, there is a corresponding need to advance the mathematics of these outer-loop methods. Even as high-fidelity multiphysics simulations continue to mature, they are typically much too expensive to be used routinely in outer-loop applications such as design optimization, control, or autonomous experimentation. Surrogate and reduced-order modeling have received considerable attention in the past decades in the applied mathematics and computational science research communities, and provide a class of approaches that lead to increased simulation speed to address these challenges; however, it remains an outstanding mathematical challenge to quantify the limitations of these types of models and characterize their overall fidelity, especially for nonlinear multiphysics systems. As discussed in the section on AI R&D later in this chapter, AI-based approaches provide exciting opportunities to take computational technologies such as surrogates to a new level, but continued investment in applied mathematics and computational science that have physics-based modeling at their core remains essential.

Another nontraditional application of computation within NNSA is support for experimental facilities. As the capabilities of NNSA experimental facilities continue to advance, HPC can play a major role in experimental design, optimizing experimental controls, and analyzing the flood of data being generated by these facilities. Mathematics in support of facilities is still in its infancy and substantial development will be needed to realize this potential.

Integration of predictive simulation capabilities together with experimental data paves the way for digital twins, another key opportunity area for NNSA. A digital twin is a computational model or set of coupled models that evolves over time to persistently represent the structure, behavior, and context of a unique physical system or process.⁵ Digital twins are characterized by a dynamic and continuous two-way flow of information between the computational model and the physical system. Data streams from the physical system are assimilated into the computational model to reduce uncertainties and improve its predictions, which in turn are used as a basis for controlling the physical system, optimizing data acquisition, and providing decision support. Digital twins have the potential to support the NNSA mission in a number of ways, including monitoring the condition of the stockpile, real-time monitoring and adaptive control of manufacturing processes, and improving control of hypersonic vehicles. While digital twins provide an exciting opportunity to drive improved decision making, realizing a digital twin at the scale, fidelity, and level of trustworthiness required for NNSA mission problems requires investment to address foundational applied mathematical and computational science challenges. Such challenges include managing and representing data, models, and decisions that cross multiple temporal and spatial scales; predictive modeling of complex systems that comprise multiple interacting subsystems; VVUQ for predictive digital twins; scalable algorithms for data assimilation, prediction, and control; and integrating complex data streams within the digital twin.⁶

___________________

² D.E. Keyes, 2011, “Exaflop/s: The Why and the How,” Comptes Rendus Mécanique 339(2):70–77, https://doi.org/10.1016/j.crme.2010.11.002.

³ Department of Energy, 2014, DOE Advanced Scientific Computing Advisory Subcommittee (ASCAC) Report: Top Ten Exascale Research Challenges, February 10, https://www.osti.gov/servlets/purl/1222713.

⁴ B. Peherstorfer, K. Willcox, and M. Gunzburger, 2018, “Survey of Multifidelity Methods in Uncertainty Propagation, Inference, and Optimization,” SIAM Review 60(3):550–591, https://doi.org/10.1137/16m1082469.

⁵ AIAA Digital Engineering Integration Committee, 2020, “Digital Twin: Definition and Value,” AIAA and AIA position paper, American Institute of Aeronautics and Astronautics (AIAA) and Aerospace Industries Association (AIA), https://www.aiaa.org/advocacy/Policy-Papers/Institute-Position-Papers.
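The data-assimilation half of the digital-twin loop described above can be sketched with a scalar Kalman filter, one standard technique for blending a model prediction with a measurement. This is a toy under stated assumptions, not a description of any NNSA system: the state is a single number, the model is a constant drift, and the sensor readings and variances are made up. The names `TwinState`, `predict`, and `assimilate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TwinState:
    mean: float      # current estimate of the tracked quantity
    variance: float  # uncertainty in that estimate


def predict(state: TwinState, drift: float, process_var: float) -> TwinState:
    """Advance the model one step; uncertainty grows by the assumed model error."""
    return TwinState(state.mean + drift, state.variance + process_var)


def assimilate(state: TwinState, measurement: float, sensor_var: float) -> TwinState:
    """Scalar Kalman update: weight prediction and measurement by their uncertainties."""
    gain = state.variance / (state.variance + sensor_var)
    mean = state.mean + gain * (measurement - state.mean)
    variance = (1.0 - gain) * state.variance
    return TwinState(mean, variance)


state = TwinState(mean=0.0, variance=4.0)
for z in [1.2, 1.9, 3.1, 4.0]:  # made-up sensor readings
    state = predict(state, drift=1.0, process_var=0.1)
    state = assimilate(state, z, sensor_var=0.5)
    print(f"mean={state.mean:.3f}  variance={state.variance:.3f}")
```

Each update leaves the estimate with less variance than either the prediction or the sensor alone, which is the sense in which assimilated data "reduce uncertainties." The return path, in which the updated prediction drives control or data-acquisition decisions, is omitted here.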

FINDING 3.5: Recent advances in applied mathematics and computational science have the potential for impact on NNSA mission problems far beyond traditional roles in physics-based simulation.

RECOMMENDATION 2: NNSA should foster and pursue high-risk, high-reward research in applied mathematics, computer science, and computational science to cultivate radical innovation and ensure future intellectual leadership needed for its mission.

RECOMMENDATION 2.1: NNSA should strengthen efforts in applied mathematics and computational science research and development. Potential areas include using novel architectures, data-driven modeling, optimization, inverse problems, uncertainty quantification, reduced-order modeling, multiscale modeling, mathematical support for experiments, and digital twins.

COMPUTER SCIENCE R&D

In this section, the committee considers the computer science R&D questions, which are driven from below by the technology considerations and from above by the application and algorithm requirements. The first set of issues are based on the role of computer science research in co-designing future systems, methods, and applications with an even deeper understanding of the technology and ecosystem constraints than in previous generations. The following section looks at computer science research beyond traditional high-performance modeling and simulation. The last section acknowledges the important but separate problem of maintaining a robust software engineering program in the future.

___________________

⁶ S.A. Niederer, M.S. Sacks, M. Girolami, and K. Willcox, 2021, “Scaling Digital Twins from the Artisanal to the Industrial,” Nature Computational Science 1(5):313–320, https://doi.org/10.1038/s43588-021-00072-5.

Co-Design of Future High-Performance Computing Systems

More than 50 years of advances in HPC have taught us that each generation of HPC systems is accompanied by major new challenges of scale and complexity. Overcoming these challenges often necessitates changes in algorithmic approaches, systems software architecture, programming models, compilation techniques, and application development methods. For example, although the message passing interface remains an important and stable part of the HPC programming environment, changes in node architecture, memory systems, resilience characteristics, and usage models continue to require innovations in programming systems. Furthermore, if device-level performance benefits stall and architectural specialization becomes commonplace—as it has for machine learning—then NNSA will need to take a leadership role in the design and testing of scientific computing hardware and the development of a retargetable software stack.

Co-design has been a tenet of the HPC approach leading to exascale but has highly leveraged commodity components and placed much of the burden of hardware and systems software on vendors. This balance will shift toward the laboratories in the post-exascale era, likely requiring them to take on hardware accelerator design, system integration, and expanded system software development. Such roles are not new. In the past, NNSA has been a leader in transitioning supercomputing technologies and architectures from vector supercomputers to microprocessor-based parallel computers in the 1990s, multicore nodes in the 2000s, and graphics processing units (GPUs) in the 2010s. NNSA laboratories have also built HPC operating systems, developed compilers to support standard and novel programming languages, designed runtime systems to manage different types of parallelism, created communication libraries for both production and exploratory use, and built autotuners to intelligently search through possible implementations to find one best suited to a given piece of hardware. They have also deployed systems with low-power processors customized for HPC (e.g., the BlueGene line), with server processors adapted to add lightweight communication, and with processors from gaming or graphics markets (RoadRunner and GPU-based systems). The decisions behind each of these deployments may seem obvious in retrospect, but they required visionary leaders who were able to forge a path amid technology and business risks and achieve success with the support of creative research teams who solved anticipated and unanticipated challenges to effectively utilize each generation of systems.

Research into algorithms that exploit new levels or degrees of parallelism, avoid data movement at any level of the memory and communication hierarchy, tolerate various kinds of hardware failures, or dynamically balance computational or communication load is not a stand-alone computer science problem but must often take into account higher-level application structures and mathematics. For example, optimized versions may preserve all of the original program dependencies and therefore leave numerical properties unchanged, or they may rely on properties like associativity of addition, which is true in exact arithmetic but not when floating-point roundoff is considered, or they may compute different but equally useful answers. Computations moved to hardware accelerators may not compute the same result as the main processor (central processing unit), especially if narrower data types are used (e.g., 8-bit floating-point formats are now being considered for machine-learning algorithms). Performance optimizations must also take into account hierarchical data structures, matrices that are so sparse (filled with mostly zeros) that only nonzero values and their locations are stored, or unstructured meshes representing the intricacies of a complex mechanical device.
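The two numerical effects mentioned above, loss of associativity under roundoff and the coarser rounding of narrower data types, can be seen in a few lines. This is a small illustration using only the Python standard library; the input values are chosen to make the effect obvious, and IEEE 754 half precision (16 bits) is used as a stand-in for the 8-bit formats mentioned in the text, which the standard library does not provide.

```python
import math
import struct

# Four terms whose exact sum is 2.0.
values = [1.0e16, 1.0, -1.0e16, 1.0]

# Serial order: the first 1.0 is absorbed when added to 1e16, because
# neighbouring double-precision values near 1e16 are 2 apart.
serial = 0.0
for v in values:
    serial += v

# A reordered (for example, tree-shaped) reduction pairs the terms differently.
reordered = (values[0] + values[2]) + (values[1] + values[3])

# math.fsum returns the correctly rounded sum and serves as a reference.
reference = math.fsum(values)


def to_half(x: float) -> float:
    """Round a double to IEEE 754 half precision (16 bits) and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


print(serial, reordered, reference)
print(to_half(0.1) - 0.1)  # rounding error introduced by the narrower format
```

Here the serial loop yields 1.0 while the reordered sum and the reference yield 2.0, so an optimization that changes the order of additions changes the answer. Whether such a difference matters depends on the application, which is why these optimizations cannot be evaluated without the numerical context.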

Addressing foreseeable technology challenges requires a spectrum of high- and low-risk approaches and the technical expertise to design, build, and resolve challenges that arise. There are many open research questions. For example, can NNSA leverage AI hardware for non-AI workloads? Even if the AI architectures cannot be used, can some features, such as low-precision arithmetic, save memory and increase computational rates on NNSA problems? Will future semiconductor devices and packaging, such as devices that reduce energy consumption at the cost of higher latency, require new algorithms in response to changes in the relative cost of memory and computing operations? Are there new classes of parallel algorithms or cost models better suited to future machines? Can machine learning be used to produce performance-portable software? How will detectable system failures and silent errors affect future post-exascale systems, and will they require new algorithms and software? Should new storage technology be integrated into a hardware-controlled memory hierarchy or exposed to software control, and at what benefit to NNSA applications? Should NNSA computing infrastructure be configured to be resilient to natural and human-induced disasters? Are there ideas, tools, or lessons from cloud computing (specifically, PaaS [platform as a service] and SaaS [software as a service] models) that can aid in answering these questions?

The technology challenges facing the field of computing writ large will require advanced research on all levels of computer system design and use, from semiconductor device technology and computer architectures to the programming tools and abstract cost models used to design efficient algorithms. To ensure that NNSA has access to highly capable, world-leading computing systems that are suitable to their future workloads, they will need to consider much more aggressive models of co-design and strategically partner with industry, universities, and other laboratories.

FINDING 3.6: Co-design of hardware and systems for high-performance scientific computing applications has been a modest success to date; it will become more important in the future and will need to be deeper. Technological and market trends are likely to shift the balance of co-design to the laboratories, requiring more innovation and engineering in the areas of hardware design, system integration, and system software.

Computer Science Research Beyond High-Performance Computing

The DOE laboratories are among the world leaders in high-performance computing, especially as it pertains to modeling and simulation. Other areas of computer science research such as networking, distributed systems, computer architecture, cybersecurity, user interfaces, databases, graphics, and software engineering have relevance to NNSA problems but are not as well represented in the laboratories. For example, NNSA supports work on languages like FORTRAN and C++, but major innovations in high-productivity scientific computing for non-HPC problems, such as Python and Julia, as well as new models for collaborative science such as Jupyter notebooks, have come primarily from outside groups. These technology developments shape how students are trained and how data-intensive scientific research is conducted outside the laboratories, raising the possibility that future generations of weapons designers will demand higher-level programming interfaces and semi-automated tools. Methods for synthesizing code or hardware designs from higher-level specifications or generating test sets automatically using program verification techniques are largely absent in laboratory research, as are new models of wide area networking, hardware support for secure computing, and platforms for cleaning and analyzing large, messy data sets.

There are also many problems related to the management and analysis of large-scale data sets from experiments (such as the National Ignition Facility, the Dual Axis Radiographic Hydrodynamic Test Facility, the Z Machine, as well as other experiments) and simulations. The challenges of analyzing massive scientific data sets are compounded by data complexity that results from heterogeneous methods and devices for data generation and capture and the inherently multiscale, multiphysics nature of many sciences, resulting in data with hundreds of attributes or dimensions and spanning multiple spatial and temporal scales.⁷ Research in the management and analysis of extreme-scale scientific data may overlap with HPC, but it has different hardware requirements than modeling and simulation applications. For example, the need for high-speed input/output drives hardware configuration, raises opportunities for new storage technologies, and requires the operating systems and system libraries to effectively use such a system. The data rates from future experiments and simulations, as well as feedback between them, may require new tools and methods for data management and data analytics, including visual analysis, and scientific workflow tools for scientific discovery.

___________________

⁷ Office of Science Financial Assistance, 2010, “Scientific Data Management and Analysis at Extreme Scale,” https://science.osti.gov/ascr/Funding-Opportunities/-/media/grants/pdf/foas/2010/DE-FOA-0000256.pdf.

Even more striking, machine-learning algorithms and quantum algorithms described in later sections are not historical areas of strength, although both are growing within the laboratories. This delay emphasizes the need to have a vibrant research program, allowing for progress on known research challenges, but also allowing for the exploration of completely different approaches to the high-level mission problems in NNSA.

Software Development Is Not Computer Science Research

The practice of high-quality software engineering is essential to producing and maintaining computer applications and the underlying levels of software that can reliably make predictions about weapons systems and various component problems. There is a natural tendency to equate software engineering practice with computer science research. Software engineering practice is about reducing risk through use of known tools and techniques, defining clear interfaces, adhering to standards, rigorously testing and documenting code, and having a robust process for software management and releases. Computer science research, even in software engineering, is about exploring new ideas, testing hypotheses, and taking risks. NNSA needs a cadre of both software engineers and computer science researchers, as each plays a distinct role in meeting NNSA’s mission.

Numerical libraries are important for many high-performance scientific applications and offer the potential to exploit the underlying computer systems without the application developer understanding the architectural details. Existing numerical libraries will need to be rewritten and new algorithms developed in light of emerging architectural changes, including increased concurrency, heterogeneous components, power management, and multiple types of memory. Because of the enhanced levels of concurrency on future systems, algorithms will need to embrace asynchrony to generate the number of required independent operations. New and evolving application requirements also require extensions to libraries, just as linear algebra libraries primarily created for simulation have been adapted for machine learning.

RECOMMENDATION 2.2: NNSA should strengthen efforts in computer science R&D to build a substantial, sustained, and broad-based intramural research program that is positioned to address the technological challenges associated with post-exascale systems and co-design of those systems to ensure that the laboratories are positioned for leadership in computing breakthroughs relevant to NNSA mission problems.

ARTIFICIAL INTELLIGENCE RESEARCH AND DEVELOPMENT

The computational demands of machine-learning applications and the availability of optimized hardware for this workload should be considered when planning for post-exascale NNSA computing systems and activities. One important class of machine-learning methods, deep learning, has dominated recent successes in AI (Box 3-1) on perceptual tasks (e.g., image recognition, image classification, machine translation); demonstrated human-level, or better, skill in areas thought to require deep expertise (e.g., Go and chess); and produced intriguing results in scientific problems (e.g., protein structure prediction⁸) and other areas (e.g., automated generation of text and software using large language models⁹). A series of town halls led by the Department of Energy laboratories in 2019 on AI for Science (AI4Sci) and in 2022 on AI for Science, Energy, and Security (AI4SES) covered many of the opportunities and challenges of using AI in science, energy, and security applications, and a report was produced for the earlier AI4Sci meetings.¹⁰ The security aspects of the 2022 meetings covered problems of relevance to NNSA, although necessarily limited in scope owing to the unclassified nature of the meetings. The NNSA laboratories have also had internal discussions about the use of these methods.

The scale and speed of advances in AI applications have been remarkable, but equally important is a growing understanding that not all problems have sufficient observational data or the necessary constraints for automated training and that open problems remain when using AI in complex environments with multiple physics constraints or for safety-critical problems requiring strict confidence metrics. In these situations, AI methods may be used in concert with traditional simulations. For example, neural networks can be trained on data from simulations to produce surrogates to computational functions (or even entire simulations), achieving nonlinear improvements of multiple orders of magnitude in time-to-solution for HPC applications. Such surrogates can be used to accelerate design space exploration for problems such as materials.¹¹ As this example illustrates, AI methods can augment conventional computational simulation, enabling new approaches to old problems and providing paths to tackling previously intractable problems. From a workload perspective, both high-fidelity simulations and AI methods must be supported.
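The surrogate idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and none of it comes from the report: the `simulate` function is a toy stand-in for an expensive simulation, and the surrogate is a deliberately simple one-hidden-layer network with fixed random hidden weights whose output layer is fit by least squares, not a trained deep network of the kind used in practice.

```python
import math
import numpy as np

def simulate(x, steps=1000):
    """Toy stand-in for an expensive simulation (hypothetical): explicit Euler
    integration of dy/dt = -x*y + sin(t), y(0) = 1, returning y at t = 2."""
    dt = 2.0 / steps
    y, t = 1.0, 0.0
    for _ in range(steps):
        y += dt * (-x * y + math.sin(t))
        t += dt
    return y

rng = np.random.default_rng(0)

# Training data: run the "simulation" at randomly sampled input parameters.
x_train = rng.uniform(0.0, 3.0, size=200)
y_train = np.array([simulate(x) for x in x_train])

# One hidden layer of tanh units with fixed random weights (an assumption
# chosen for simplicity); only the linear output layer is fit.
hidden_units = 50
w = rng.normal(0.0, 2.0, size=hidden_units)
b = rng.normal(0.0, 2.0, size=hidden_units)

def features(x):
    """Hidden-layer activations plus a constant column, one row per input."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    hidden = np.tanh(np.outer(x, w) + b)
    return np.hstack([np.ones((x.size, 1)), hidden])

# Least-squares fit of the output weights to the simulation data.
coef, *_ = np.linalg.lstsq(features(x_train), y_train, rcond=None)

def surrogate(x):
    """Cheap approximation of simulate(x) inside the training range [0, 3]."""
    return features(x) @ coef

# Compare surrogate and simulation on inputs not used for training.
x_test = np.linspace(0.1, 2.9, 50)
y_true = np.array([simulate(x) for x in x_test])
max_err = float(np.max(np.abs(surrogate(x_test) - y_true)))
print(f"max abs error on held-out inputs: {max_err:.2e}")
```

Once fit, the surrogate evaluates a whole array of inputs with a few matrix operations instead of one time-stepping loop per input, which is the source of the time-to-solution gains described above. The sketch says nothing about accuracy outside the sampled range, which is one reason the confidence questions raised later in this section matter.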

The recent success of deep-learning methods in the AI community can be attributed in large part to the growth in computing performance in the past few decades. The demand for large-scale, highly optimized computing systems for deep-learning applications has already motivated substantial use of DOE HPC systems for model training on scientific problems, from the exploration of novel materials and cancer treatments to the identification of extreme climate events and rare astronomical phenomena. The NNSA laboratories are exploring the use of machine learning methods and hardware optimized for deep learning. This research needs to continue, and if AI proves to be broadly applicable, it should be a part of the workload used to design and select future computing systems.

___________________

⁸ J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, et al., 2021. “Highly Accurate Protein Structure Prediction with AlphaFold,” Nature 596(7873):583–589.

⁹ “Introducing ChatGPT,” https://openai.com/blog/chatgpt.

¹⁰ R. Stevens, V. Taylor, J. Nichols, A.B. MacCabe, K. Yelick, and D. Brown, 2020, “AI for Science,” Argonne Scientific Publications, Argonne National Laboratory, https://www.anl.gov/ai-for-science-report.

¹¹ A. Agrawal and A. Choudhary, 2019, “Deep Materials Informatics: Applications of Deep Learning in Materials Science,” MRS Communications 9(3):779–792.

BOX 3-1 Artificial Intelligence and Machine Learning Terminology

The term artificial intelligence, or AI, is used broadly to refer to the design and construction of intelligent agents that realize perceptual and goal-seeking activities, and includes computer vision, speech processing, and more.^a Confusingly, AI has also become synonymous in many circles with machine learning, a class of methods often used in AI that “learn” via automated extraction of models, either from training data or within a constrained setting such as optimizing for game play. Yet more specifically, AI is sometimes used to refer to deep learning, the subclass of machine-learning methods based on multilayer neural networks, which have achieved success in AI problems such as computer vision, speech recognition, natural language translation, robotics, and playing games of strategy.

Applications of deep learning have also had tremendous commercial impact, resulting in a market for so-called AI hardware. While graphics processing units are commonly used for deep learning, these more specialized AI chips are optimized for matrix and tensor operations, often using low-precision arithmetic.

In a report for the AI4Sci town halls, the AI terminology is very broad: “In this report and in the Department of Energy laboratory community, we use the term ‘AI for Science’ to broadly represent the next generation of methods and scientific opportunities in computing, including the development and application of AI methods (e.g., machine learning, deep learning, statistical methods, data analytics, automated control, and related areas) to build models from data and to use these models alone or in conjunction with simulation and scalable computing to advance scientific research.”^b

_____________

^a S.J. Russell and P. Norvig, 2021, Artificial Intelligence: A Modern Approach, 4th ed., Hoboken, NJ: Pearson.

^b R. Stevens, V. Taylor, J. Nichols, A.B. MacCabe, K. Yelick, and D. Brown, 2020, “AI for Science,” Argonne Scientific Publications, Argonne National Laboratory, https://www.anl.gov/ai-for-science-report.

Opportunities

There are several areas for exploration of AI methods in the NNSA mission, using the AI term broadly as in the AI4Sci report. In some cases, AI may provide a solution to replace manual processes, to augment traditional simulations, or to provide useful tools to aid human decision makers. Examples of AI-enabled capabilities that could advance the NNSA mission include the following:

AI as surrogates for simulations of physical systems, ranging from practical problems such as optimized representations of neutron group structures, as already demonstrated by the laboratories, to building emulators for the three-dimensional evolution of engineered systems, a capability that is not realizable today and is estimated to require hundreds of exaflop years to train with current methods.
AI in the loop for automated adaptive real-time control of experimental facilities and manufacturing processes that currently do not admit to real-time control or that require increasingly scarce human expertise owing to an aging workforce.
AI to connect across the life cycle for end-to-end intelligent decisions, such as designing to explicitly account for manufacturing and aging, rather than suboptimal proxies for manufacturing/aging issues.
AI to accelerate all stages of the complex physics cycle, encompassing hypothesis, design, execution/control, diagnosis, and analysis. For example, AI-based surrogates may be used to screen large collections of candidates (e.g., materials, structures) that cannot reasonably be evaluated via conventional simulation.
AI to shorten the time to solution at all stages of nuclear weapon design and deployment: Discover, Design, Manufacture, Deploy with AI injection at all phases.
AI for managing surveillance via automated analysis of multimodal data, leveraging, for example, self-supervised learning methods to detect unusual events.
AI for enabling digital twins that integrate heterogeneous models and data from multiple sources, while leveraging edge computing and integration across edge/HPC.

A crosscutting theme is the use of AI methods to learn previously unknown relationships among entities (“serendipitous models”¹²) that can be evaluated far faster than by conventional means and/or without explicit programming—in the process, automating and accelerating previously manual steps to enable more rapid exploration of far larger design spaces.

FINDING 3.7: Rapid innovation in AI methods, driven by advances in computing performance and growth in data sets, is producing frequent technological surprises that NNSA should continue to investigate and track. These advances may benefit the NNSA mission but will likely complement rather than replace traditional physics-based simulations in the post-exascale era.

___________________

¹² K.E. Willcox, O. Ghattas, and P. Heimbach, 2021, “The Imperative of Physics-Based Modeling and Inverse Theory in Computational Science,” Nature Computational Science 1:166–168. https://doi.org/10.1038/s43588-021-00040-z.

Challenges

Realizing these potential advances will be far from straightforward. The following factors, in particular, tend to make the application of AI methods to NNSA problems challenging:

VVUQ is paramount in NNSA applications. Explainability, predictive capability, and trust are essential. For example, if AI-based surrogates are used directly to issue predictions and support decisions, they must use techniques that capture the underlying physics, including statistical properties, with known levels of confidence.
Data are typically sparse and indirect in NNSA applications. Many problems are data-poor. Sensing technology is advancing, but many problems within the NNSA mission will never have abundant experimental data (e.g., previously archived nuclear tests). Simulation data may be explored for training, but the cost of generating sufficient simulation-based training data may be prohibitive with current methods, and confidence levels must be considered carefully.
NNSA applications often involve complex systems that engage multiple physics across multiple scales. Coupling among components and between physical phenomena can lead to nonlinear (emergent) behavior. Nonlinearities are often most severe in the most critical conditions (e.g., conditions approaching failure).
Many NNSA applications are characterized by long life cycles, which from concept to design to manufacturing to deployment can span decades. The challenges of computational modeling of multiscale, multiphysics complex systems are exacerbated by the long time horizons over which predictions must remain accurate. For example, computational techniques that support surveillance applications must accurately characterize and predict system performance over decadal time scales.
Rare events drive decision making in weapons design. Data around rare events (e.g., failures) are typically sparse, indirect, and expensive to acquire. Rare events also pose the largest challenges for predictive simulations. Acceptable probabilities of weapon system failure are typically orders of magnitude smaller than for mainstream AI applications.
The classified nature of many elements of the NNSA mission is likely to hinder the large-scale sharing and aggregation of data across components and life-cycle elements that will be important for effective AI, especially when combined with other institutional drivers for data silos (organizational, security, proprietary).
The development and application of AI methods in the NNSA context will require new data infrastructure and AI-ready instrumentation that can interface with the rest of the AI ecosystem.¹³ For example, manufacturing data availability, data resources, and data management processes must be advanced in order to realize the benefits of machine learning in enabling manufacturing for NNSA applications, especially as new sensors produce data at unprecedented rates.
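The rare-event challenge in the list above can be made concrete with a small calculation. The sketch below is an illustration under stated assumptions, not a method from the report: it treats trials as independent and asks how many failure-free trials (experiments or simulations) are needed before a failure probability can be bounded at a chosen confidence level. The function name `failure_free_trials_needed` and the example numbers are hypothetical.

```python
import math

def failure_free_trials_needed(p_max, confidence=0.95):
    """Smallest number n of independent trials, all without failure, needed to
    conclude at the given confidence that the failure probability is below
    p_max. Derived from requiring (1 - p_max)**n <= 1 - confidence."""
    if not 0.0 < p_max < 1.0:
        raise ValueError("p_max must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # log1p keeps the result accurate when p_max is very small.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_max))

# Illustrative targets only: a loose tolerance and a far stricter one.
for p_max in (1e-2, 1e-6):
    n = failure_free_trials_needed(p_max)
    print(f"p_max = {p_max:g}: about {n:,} failure-free trials")
```

At 95 percent confidence the required count is close to 3 divided by the target probability (the so-called rule of three), so each order of magnitude of added stringency costs roughly an order of magnitude more data. This is one way to see why acceptable failure probabilities far below those of mainstream AI applications are hard to demonstrate from data alone.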

Overcoming these challenges to realize capabilities such as those sketched earlier will require sustained investment in both foundational and applied AI R&D. The AI4SES effort is an important opportunity within the post-exascale computing landscape, but it cannot be pursued in isolation. Future methods that advance AI for science, energy, and security must capitalize on advances in mainstream AI while at the same time deeply integrating physics and VVUQ via the computational science methods described in the section on Mathematics and Computational Science R&D earlier in this chapter. Opportunities at the intersection of AI and computational science abound. The NNSA laboratories have been exploring the use of AI for certain mission problems, but the combination of enormous opportunities and enormous unanswered questions suggests that the current level of effort is insufficient.

RECOMMENDATION 2.3: NNSA should expand research in AI to explore the use of these methods both for predictive science and for emerging applications, such as manufacturing and control of experiments, and develop machine learning techniques that provide the confidence in results required for NNSA applications.

QUANTUM COMPUTING AND QUANTUM TECHNOLOGY R&D

In presentations to the study committee, the national laboratories asserted that quantum technology (Box 3-2) may play an important future role in meeting their mission-driven computational requirements, but that most of that impact would come near the end of the time frame of this study. Specifically, the Lawrence Livermore National Laboratory (LLNL) team suggested (see Figure 3-1) hybrid techniques that might be useful for calculation of physics model inputs such as equation of state, transport coefficients, and plasma turbulence in about 10 years from the date of this study. The LLNL team also suggested that partial differential equation (PDE) solutions could be relevant on fully error-corrected machines in about 20 years (Figure 3-1). The quantum advantage for PDEs is still an open question, but may be in accuracy of solution rather than speed.¹⁴ Los Alamos National Laboratory noted that while no practical speedups have been observed to date, some quantum simulation algorithms show promise. All laboratories highlighted an appropriate level of investment in various areas of quantum research.

___________________

¹³ “Complex Physics Report-Out,” from AI4SES Workshop, June 2022.

BOX 3-2 Quantum Computing

It is important to recognize that the phrase “quantum computing” is used to refer to a variety of technologies with markedly different computational abilities and expected dates of availability for use. Most familiar (with availability furthest in the future) is what is called error-corrected gate-level quantum computing. This technology offers the ability to execute quantum algorithms, most famously Shor’s algorithm for factoring numbers; a machine built on it is called a fault-tolerant quantum computer. Because of the technical challenges in realizing such a computer, scientists are developing two other technologies that may have applicability sooner. One such technology was envisioned by Richard Feynman in 1981, and essentially uses a quantum computer to mimic the behavior of complex physical molecules and then uses the quantum computer to observe the quantum behavior of the molecule. One method to mimic such molecular behavior is analog quantum simulation, in which the analog quantum behavior of a machine is used to model a molecule. Last, there is a new model of computing being investigated by researchers called noisy intermediate-scale quantum (NISQ) computers, which tries to use hybrid combinations of non-error-corrected gate-level quantum computers and classical computers to achieve some interesting computations. One such interesting computation is an alternative method of emulating molecules in which the attributes of the molecular system are mapped in a more controlled way onto a digital quantum simulation. Mappings of more general optimization problems are also possible, leading to potential applications in areas such as logistics and finance. An accessible discussion of these concepts can be found in Preskill’s 2021 paper, “Quantum Computing 40 Years Later.”^a

_____________

^a J. Preskill, 2021, “Quantum Computing 40 Years Later,” ArXiv 2106.10522v2.

Importantly, post-exascale classical HPC will still be required to solve applications in a hybrid classical-quantum computing model in which quantum hardware accelerates key core kernels, while classical computing provides the full solution by integrating and computing upon results from quantum solutions to many small subproblems.

Furthermore, practical quantum applications will likely emerge on quantum machines using a continuum of error-mitigation techniques ranging from partially fault-tolerant to fully fault-tolerant methods. NNSA and DOE algorithms and software research should examine this continuum to bridge the gap between current noisy intermediate-scale quantum (NISQ) machines and future fully error-corrected machines.¹⁵

___________________

¹⁴ A.M. Childs, J.-P. Liu, and A. Ostrander, 2021, “High-Precision Quantum Algorithms for Partial Differential Equations,” Quantum 5:574, https://doi.org/10.22331/q-2021-11-10-574.

¹⁵ National Academies of Sciences, Engineering, and Medicine, 2019, Quantum Computing: Progress and Prospects, Washington, DC: The National Academies Press, https://doi.org/10.17226/25196.

**FIGURE 3-1** Quantum computing hopes.
SOURCE: From a briefing provided to the committee by LLNL on April 8, 2022.

Sustained research in quantum algorithms, software, and hardware is needed to adapt NNSA applications to a hybrid classical-quantum computing model. In particular, classical optimization algorithms need to be separated into classical and quantum components that best leverage quantum hardware, and the classical component needs to be adapted to the output of the quantum hardware. For example, the classical component can both take advantage of quantum solutions and adjust for errors in the quantum computation.
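The division of labor described above can be sketched as a small hybrid loop. This is an illustration only, with loud assumptions: no quantum hardware or quantum software library is used, and `noisy_expectation` is a hypothetical classical stand-in that mimics a one-qubit quantum kernel by sampling measurement outcomes. The classical component is a plain gradient-descent optimizer that consumes those noisy estimates, averaging over many shots to reduce sampling error.

```python
import math
import numpy as np

rng = np.random.default_rng(42)

def noisy_expectation(theta, shots=2000):
    """Hypothetical stand-in for a quantum kernel. For one qubit prepared by an
    RY(theta) rotation, the exact expectation of Z is cos(theta); here it is
    estimated from a finite number of simulated +1/-1 measurement outcomes."""
    p_plus = 0.5 * (1.0 + math.cos(theta))   # probability of measuring +1
    n_plus = rng.binomial(shots, p_plus)     # simulated shot counts
    return (2.0 * n_plus - shots) / shots    # sample mean of the outcomes

def parameter_shift_gradient(theta):
    """Gradient of the expectation with respect to theta, built only from two
    additional kernel evaluations (the parameter-shift rule for RY)."""
    shift = math.pi / 2.0
    return 0.5 * (noisy_expectation(theta + shift) - noisy_expectation(theta - shift))

# Classical component: gradient descent on the noisy quantum-style estimates.
theta = 0.3
learning_rate = 0.4
for _ in range(60):
    theta -= learning_rate * parameter_shift_gradient(theta)

print(f"optimized angle: {theta:.3f} (exact minimizer is pi = {math.pi:.3f})")
```

Even in this toy setting the classical optimizer never sees an exact value, only shot-limited estimates, so step sizes and shot counts have to be chosen with the noise in mind. Real devices add gate and readout errors that this sketch does not model.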

Within this context, DOE quantum test beds are a key resource for this research, and a diversity of quantum technologies should be made available to scientists in order to future-proof algorithms and software as these technologies develop.

Given the technical and economic limits of scaling classical computing, quantum approaches should be explored to determine if specific problems or subproblems could be solved more efficiently or accurately. However, by the very nature of hybrid classical-quantum calculations, if a significant performance gain is expected, the vast majority of the work must be performed by the quantum kernels, and achieving this will require substantial reengineering of software and algorithms.
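One way to see why most of the work must move into the quantum kernels is Amdahl's law, which is a framing added here for illustration and is not named in the source. The sketch below assumes that a fixed fraction of the original runtime is offloaded and sped up by a constant factor, and it ignores data-transfer and error-mitigation overheads. The function name `hybrid_speedup` and the example values are hypothetical.

```python
def hybrid_speedup(quantum_fraction, kernel_speedup):
    """Amdahl's law: overall speedup when `quantum_fraction` of the original
    runtime is accelerated by a factor `kernel_speedup` and the rest of the
    computation runs at its original speed."""
    if not 0.0 <= quantum_fraction <= 1.0:
        raise ValueError("quantum_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    classical_part = 1.0 - quantum_fraction
    return 1.0 / (classical_part + quantum_fraction / kernel_speedup)

# Illustrative values: the same 1000x kernel speedup applied to different
# fractions of the workload.
for fraction in (0.5, 0.9, 0.99):
    print(f"fraction offloaded = {fraction:.2f}: "
          f"overall speedup = {hybrid_speedup(fraction, 1000.0):.1f}x")
```

With half of the runtime offloaded, even a 1000-fold kernel speedup yields an overall gain of just under 2; with 99 percent offloaded the gain is roughly 91. The overall speedup can never exceed the reciprocal of the fraction that remains classical.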

Last, in order to justify large-scale deployment of quantum accelerators, algorithms and software research is needed to broaden the applications that can benefit from this technology.

FINDING 3.8: Quantum technology has the potential to improve the fundamental understanding of material properties needed by important NNSA applications. Analog quantum simulation or digital quantum simulation will likely be available before general quantum computers.

FINDING 3.9: Major breakthroughs in quantum algorithms and systems are needed to make quantum computing practical for multiphysics stockpile modeling. Quantum computing is more likely to serve as a special-purpose accelerator than to replace today’s broadly applicable HPC systems.

RECOMMENDATION 2.4: NNSA should continue to invest in and track quantum computing research and development for future integration into its computational toolkit; these technologies should be considered an additional computational tool rather than a replacement for current approaches.