Automated Research Workflows for Accelerated Discovery
Closing the Knowledge Discovery Loop
Committee on Realizing Opportunities for Advanced and Automated Workflows in Scientific Research
Board on Research Data and Information
Board on Mathematical Sciences and Analytics
Committee on Applied and Theoretical Statistics
Computer Science and Telecommunications Board
Division on Engineering and Physical Sciences
Policy and Global Affairs
Consensus Study Report
NATIONAL ACADEMIES PRESS 500 Fifth Street, NW, Washington, DC 20001
This activity was supported by contracts between the National Academy of Sciences and Schmidt Futures. Any opinions, findings, conclusions, or recommendations expressed in this publication do not necessarily reflect the views of any organization or agency that provided support for the project.
International Standard Book Number-13: 978-0-309-68652-5
International Standard Book Number-10: 0-309-68652-0
Digital Object Identifier: https://doi.org/10.17226/26532
This publication is available from the National Academies Press, 500 Fifth Street, NW, Keck 360, Washington, DC 20001; (800) 624-6242 or (202) 334-3313; http://www.nap.edu.
Copyright 2022 by the National Academy of Sciences. National Academies of Sciences, Engineering, and Medicine and National Academies Press and the graphical logos for each are all trademarks of the National Academy of Sciences. All rights reserved.
Printed in the United States of America.
Suggested citation: National Academies of Sciences, Engineering, and Medicine. 2022. Automated Research Workflows for Accelerated Discovery: Closing the Knowledge Discovery Loop. Washington, DC: The National Academies Press. https://doi.org/10.17226/26532.
The National Academy of Sciences was established in 1863 by an Act of Congress, signed by President Lincoln, as a private, nongovernmental institution to advise the nation on issues related to science and technology. Members are elected by their peers for outstanding contributions to research. Dr. Marcia McNutt is president.
The National Academy of Engineering was established in 1964 under the charter of the National Academy of Sciences to bring the practices of engineering to advising the nation. Members are elected by their peers for extraordinary contributions to engineering. Dr. John L. Anderson is president.
The National Academy of Medicine (formerly the Institute of Medicine) was established in 1970 under the charter of the National Academy of Sciences to advise the nation on medical and health issues. Members are elected by their peers for distinguished contributions to medicine and health. Dr. Victor J. Dzau is president.
The three Academies work together as the National Academies of Sciences, Engineering, and Medicine to provide independent, objective analysis and advice to the nation and conduct other activities to solve complex problems and inform public policy decisions. The National Academies also encourage education and research, recognize outstanding contributions to knowledge, and increase public understanding in matters of science, engineering, and medicine.
Learn more about the National Academies of Sciences, Engineering, and Medicine at www.nationalacademies.org.
Consensus Study Reports published by the National Academies of Sciences, Engineering, and Medicine document the evidence-based consensus on the study’s statement of task by an authoring committee of experts. Reports typically include findings, conclusions, and recommendations based on information gathered by the committee and the committee’s deliberations. Each report has been subjected to a rigorous and independent peer-review process and it represents the position of the National Academies on the statement of task.
Proceedings published by the National Academies of Sciences, Engineering, and Medicine chronicle the presentations and discussions at a workshop, symposium, or other event convened by the National Academies. The statements and opinions contained in proceedings are those of the participants and are not endorsed by other participants, the planning committee, or the National Academies.
Rapid Expert Consultations published by the National Academies of Sciences, Engineering, and Medicine are authored by subject-matter experts on narrowly focused topics that can be supported by a body of evidence. The discussions contained in rapid expert consultations are considered those of the authors and do not contain policy recommendations. Rapid expert consultations are reviewed by the institution before release.
For information about other products and activities of the National Academies, please visit www.nationalacademies.org/about/whatwedo.
COMMITTEE ON REALIZING OPPORTUNITIES FOR ADVANCED AND AUTOMATED WORKFLOWS IN SCIENTIFIC RESEARCH
DANIEL ATKINS (NAE) (Chair), W. K. Kellogg Professor Emeritus of Information and Professor Emeritus of Electrical Engineering and Computer Science, University of Michigan
ILKAY ALTINTAS, Chief Data Science Officer, San Diego Supercomputer Center and Founding Fellow, Halicioglu Data Science Institute, University of California, San Diego
SHREYAS CHOLIA, Group Leader, Usable Software Systems Group, Lawrence Berkeley National Laboratory
MERCÈ CROSAS, Secretary of Open Government, Government of Catalunya
ALFRED HERO, R. Jamison and Betty Williams Professor of Engineering, Department of Electrical Engineering and Computer Science, University of Michigan
REBECCA LAWRENCE, Managing Director, F1000 Research Ltd, London, UK
BRADLEY MALIN (NAM), Accenture Professor of Biomedical Informatics, Biostatistics and Computer Science, Vanderbilt University
LARA MANGRAVITE, President, Sage Bionetworks
BRIAN NOSEK, Executive Director, Center for Open Science*
TAPIO SCHNEIDER, Theodore Y. Wu Professor of Environmental Science and Engineering and Jet Propulsion Laboratory Senior Research Scientist, California Institute of Technology
*Resigned from committee, May 26, 2020
Principal Project Staff
TOM ARRISON, Study Director and Director, Board on Research Data and Information
EMI KAMEYAMA, Program Officer, Board on Research Data and Information
GEORGE STRAWN, Scholar, Board on Research Data and Information
ESTER SZTEIN, Deputy Director, Board on Research Data and Information
OLIVIA TORBERT, Senior Program Assistant, Board on Research Data and Information (until March 2022)
JON EISENBERG, Senior Director, Computer Science and Telecommunications Board
MICHELLE SCHWALBE, Director, Board on Mathematical Sciences and Analytics
PAULA TARNAPOL WHITACRE, Consultant Writer
BOARD ON RESEARCH DATA AND INFORMATION
SARAH M. NUSSER (Chair), Professor, Department of Statistics, Iowa State University
AMY BRAND, Director, MIT Press
BONNIE CARROLL, Retired Founder and Strategic Consultant, Information International Associates, Inc. (CODATA Secretary General)*
STUART I. FELDMAN, Chief Scientist, Schmidt Futures
IAN T. FOSTER, Senior Scientist, Argonne National Laboratory and Distinguished Fellow and the Arthur Holly Compton Distinguished Service Professor of Computer Science, University of Chicago
RAMANATHAN GUHA, Google Fellow and Vice President, Google, Inc.
JOHN HILDEBRAND (NAS), Regents Professor of Neuroscience, University of Arizona (NAS Foreign Secretary)*
SALLIE A. KELLER (NAE), Distinguished Professor in Biocomplexity, Director of the Social and Decision Analytics Division within the Biocomplexity Institute and Initiative, University of Virginia
MARY LEE KENNEDY, Executive Director, Association of Research Libraries
BAREND MONS, Chair in Biosemantics, Leiden University Medical Center
MICHAEL STEBBINS, President, Science Advisors, LLC
*Denotes ex-officio member
COMPUTER SCIENCE AND TELECOMMUNICATIONS BOARD
LAURA M. HAAS (NAE) (Chair), Dean, Manning College of Information & Computer Sciences, University of Massachusetts, Amherst
DAVID E. CULLER (NAE), Professor, Electrical Engineering and Computer Science, University of California, Berkeley
ERIC HORVITZ (NAE), Chief Scientific Officer, Microsoft Research
CHARLES ISBELL, Dean of Computing and John P. Imlay Jr. Chair, Georgia Institute of Technology
ELIZABETH MYNATT, Distinguished Professor and Executive Director, College of Computing, Georgia Institute of Technology
CRAIG PARTRIDGE, Professor and Department Chair, Colorado State University
DANIELA RUS (NAE), Andrew (1956) and Erna Viterbi Professor, Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology
FRED B. SCHNEIDER (NAE), Samuel B. Eckert Professor of Computer Science, Cornell University
MARGO I. SELTZER (NAE), Canada 150 Research Chair and Cheriton Family Chair, Computer Science, University of British Columbia
NAMBIRAJAN SESHADRI (NAE), Professor of Practice, Electrical and Computer Engineering, University of California, San Diego
MOSHE Y. VARDI (NAS/NAE), George Distinguished Service Professor in Computational Engineering, Rice University
BOARD ON MATHEMATICAL SCIENCES AND ANALYTICS
MARK L. GREEN (Chair), Distinguished Research Professor in the Department of Mathematics, University of California, Los Angeles
HÉLÈNE BARCELO, Deputy Director, Mathematical Sciences Research Institute
BONNIE BERGER (NAS), Simons Professor of Mathematics, Department of Mathematics, Computer Science and AI Lab, Massachusetts Institute of Technology
RUSSEL E. CAFLISCH (NAS), Director of the Courant Institute, Professor in the Mathematics Department, New York University
DAVID CHU, Adjunct Staff, Institute for Defense Analyses
DUANE COOPER, Associate Professor of Mathematics, Morehouse College
JAMES (JIM) CURRY, Professor, Applied Mathematics Department, University of Colorado Boulder
RONALD FRICKER, Professor of Statistics and Associate Dean, College of Science, RR
TRACHETTE JACKSON, Professor, University of Michigan
LYDIA KAVRAKI (NAM), Noah Harding Professor of Computer Science, Bioengineering, Electrical and Computer Engineering, and Mechanical Engineering, Rice University
TAMARA KOLDA, Distinguished Member of Technical Staff, Sandia National Laboratories
PETER KOUMOUTSAKOS, Professor for Computational Science, ETH Zurich
RACHEL KUSKE, Professor of Mathematics and Department Chair, Georgia Institute of Technology
YANN A. LECUN (NAS/NAE), Professor, Courant Institute of Mathematical Sciences and Center for Data Science, New York University
JILL C. PIPHER, Vice President for Research and Elisha Benjamin Andrews Professor of Mathematics, Department of Mathematics, Brown University
YORAM SINGER, Chief AI Scientist, WorldQuant
TATIANA TORO, Craig McKibben & Sarah Merner Professor of Mathematics, University of Washington
LANCE WALLER, Rollins Professor and Chair, Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University
AMIE WILKINSON, Professor of Mathematics, University of Chicago
KAREN E. WILLCOX, Director, Institute for Computational Engineering and Sciences, Professor of Aerospace Engineering and Engineering Mechanics, University of Texas at Austin
Preface
Exponential improvement in computing and communication technology has revolutionized the conduct of research, beginning in 1946 when ENIAC began computing ballistic trajectories at a 3,000× speedup over manual methods. The tools for research have evolved from a few warehouse-size automated calculators to a distributed global cyberinfrastructure supporting observation, control, calculation, information access, publishing, and collaboration. The growing capacity and capability of computers and data repositories enable multiscale, multiscience analysis, modeling, and prediction at ever-increasing temporal and spatial resolution. The concept of e-science, or the research collaboratory (laboratories without walls), which reduces barriers of participation, time, and distance (both physical and disciplinary), is pervasive. These concepts have taken us beyond just automating traditional research methods to the possibility of tackling new problems in new ways.
While this stepwise revolution is enabled by hardware and software engineering advances, their meaningful use in research requires intentional nurturing. As the technology advances, early adopters find ways to use it for competitive advantage, and others take note. But broad adoption of new tools and methods, and especially supporting the required cyberinfrastructure (e-research infrastructure), the human resources, and changes in the community of practice require community-driven leadership from research sponsors, the researchers themselves, and their institutions. A combination of bottom-up, community-initiated workshops together with institutionally commissioned studies strives to articulate new opportunities and how to realize them. It is hoped that these motivate research sponsors and performers, scholarly publishers, professional societies, archivers, indexers, and other stakeholders to work together to bring the new capabilities to scale and into meaningful applications. Past examples of such cooperation include the National Science Foundation (NSF) and Department of Energy supercomputer centers, NSFNet, the NSF-ARPA (Advanced Research Projects Agency)-NIH (National Institutes of Health) digital library initiatives, and many more recent initiatives around cyberinfrastructure-enhanced research and data science.
In this spirit, this National Academies of Sciences, Engineering, and Medicine (National Academies) consensus study, commissioned by Schmidt Futures, is intended as a contribution to a next step in the transformative application of computing to scientific discovery. Leveraging what has come before, researchers are now incorporating artificial intelligence (AI) and the automation of scientific instruments into the research workflow. The focus of this report is not only the use of methods of AI and machine learning (ML) as a component in a workflow, but also the use of these methods to design experiments and to automatically control them. The goal is to use them in an iterative loop by using experiments or observation data to test and learn about a model, and then to use AI and ML methods to generate the design for the next data collection. This closed loop iterates and, in examples we present in the report, accelerates discovery by orders of magnitude. We refer to these as automated research workflows (ARWs).
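To make the closed loop concrete, the following is a minimal sketch, not a method taken from this report. It assumes a single design variable and a hypothetical simulated instrument, `run_experiment`, whose response peaks at x = 2.5. The "model" is simply a parabola fitted through the three best observations so far, standing in for the AI and ML methods an actual ARW would use. All function names and parameters are illustrative assumptions.

```python
from typing import Callable, List, Tuple

Observation = Tuple[float, float]  # (design x, measured response y)


def run_experiment(x: float) -> float:
    """Hypothetical stand-in for an automated instrument; response peaks at x = 2.5."""
    return 10.0 - (x - 2.5) ** 2


def propose_next(observations: List[Observation], lo: float, hi: float) -> float:
    """Learn a simple model (parabola through the three best points) and
    return the design it predicts to be optimal, clipped to [lo, hi]."""
    best_three = sorted(observations, key=lambda p: p[1], reverse=True)[:3]
    (x1, y1), (x2, y2), (x3, y3) = best_three
    num = (x2 - x1) ** 2 * (y2 - y3) - (x2 - x3) ** 2 * (y2 - y1)
    den = (x2 - x1) * (y2 - y3) - (x2 - x3) * (y2 - y1)
    if den == 0.0:
        # Degenerate model (e.g., collinear points): fall back to the midpoint.
        return 0.5 * (lo + hi)
    x_new = x2 - 0.5 * num / den  # vertex of the fitted parabola
    return min(max(x_new, lo), hi)


def closed_loop(
    experiment: Callable[[float], float],
    initial_designs: List[float],
    lo: float,
    hi: float,
    max_iters: int = 20,
    tol: float = 1e-6,
) -> Observation:
    """Iterate: collect data, update the model, design the next experiment."""
    observations = [(x, experiment(x)) for x in initial_designs]
    for _ in range(max_iters):
        x_next = propose_next(observations, lo, hi)
        best_x = max(observations, key=lambda p: p[1])[0]
        if abs(x_next - best_x) < tol:
            break  # the model no longer suggests a new design
        observations.append((x_next, experiment(x_next)))
    return max(observations, key=lambda p: p[1])


if __name__ == "__main__":
    best_x, best_y = closed_loop(run_experiment, [0.0, 1.0, 5.0], lo=0.0, hi=5.0)
    print(f"best design: x = {best_x:.3f}, response = {best_y:.3f}")
```

In a real ARW, `run_experiment` would drive an instrument or simulation and `propose_next` would be replaced by a learned model with an explicit experimental-design strategy. The sketch only illustrates the control flow of the loop.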
Realizing the potential of ARWs involves a complex mix of technology, funding, policy, regulation, ethics, education, reward structures, and the overall sociology of varied research communities of practice. ARWs offer benefits that go beyond accelerating exploration, including enhanced capture of provenance, integrity, reproducibility, and dissemination. But achieving the benefit of ARWs depends upon progress in addressing the same fundamental issues that persist in most other explorations of the next big thing in cyberinfrastructure-enhanced computing. These include long-term sustainability of cyberinfrastructure (computing, networking, and now critically, data and programs); reducing barriers and increasing incentives for interdisciplinary collaboration; addressing security challenges; and educating current practitioners and students to design and responsibly use ARWs.
This report is built upon the contributions of many people: 9 dedicated members of the study committee, the 23 agenda speakers and many other participants in a workshop on March 16–17, 2020, and National Academies staff. This report, like many things since March 2020, has been produced under unusual circumstances. We converted the workshop from onsite in Washington to distance independent on 3 days’ notice with surprisingly good results. We engaged 120 people for 2 days over 9 time zones and gathered more input than usual through real-time recordings and the mining of chat streams. The subsequent report was then produced through a series of distributed meetings. Although a report like this can be successfully produced in a never-in-the-same-room way, we hope that post-pandemic studies will include some use of trust-building face-to-face meetings and social events.
Sixty years ago, I programmed my first computer, an IBM 1620, and since then have had the enormous privilege to participate in the astonishing computer revolution. As part of that, I have had the honor of chairing several study committees of extraordinary people, such as this committee, to help describe emerging opportunities and challenges for transforming research or learning. My thanks to all who have made this possible. I conclude with the hope that this report will indeed be a contribution to empowering transformative research and its application to a better world.
Daniel E. Atkins
Chair, Committee on Realizing Opportunities for Advanced and Automated Workflows in Scientific Research
Acknowledgments
This Consensus Study Report was reviewed in draft form by individuals chosen for their diverse perspectives and technical expertise. The purpose of this independent review is to provide candid and critical comments that will assist the National Academies of Sciences, Engineering, and Medicine in making each published report as sound as possible and to ensure that it meets the institutional standards for quality, objectivity, evidence, and responsiveness to the study charge. The review comments and draft manuscript remain confidential to protect the integrity of the deliberative process.
We thank the following individuals for their review of this report: Carole Goble, University of Manchester; Brooks Hanson, American Geophysical Union; Daniel Katz, University of Illinois; Gary King, Harvard University; Robert Murphy, Carnegie Mellon University; Kristin Persson, Lawrence Berkeley National Laboratory; Beth Plale, Indiana University; and Margo Seltzer, University of British Columbia.
Although the reviewers listed above provided many constructive comments and suggestions, they were not asked to endorse the conclusions or recommendations of this report, nor did they see the final draft before its release. The review of this report was overseen by Philip Neches, California Institute of Technology. He was responsible for making certain that an independent examination of the report was carried out in accordance with the standards of the National Academies and that all review comments were carefully considered. Responsibility for the final content rests entirely with the authoring committee and the National Academies.
Contents
Scope of the Study and Organization of This Report
2 CONTEXT FOR AUTOMATED RESEARCH WORKFLOWS
Building Automated Research Workflows: Current State of the Art
Implementing Automated Research Workflows: A Changing Scientific Paradigm
Policy and Industry Context for Automated Research Workflows
3 AUTOMATED RESEARCH WORKFLOWS IN ACTION
Social and Behavioral Sciences
Abbreviations and Acronyms
ADHO | Alliance of Digital Humanities Organizations |
AI | artificial intelligence |
ARW | automated research workflow |
ATLAS | A Toroidal LHC ApparatuS |
CSTCloud | China Science and Technology Cloud |
CWL | Common Workflow Language |
DARPA | Defense Advanced Research Projects Agency |
DOD | Department of Defense |
DOE | Department of Energy |
DUA | data use agreement |
EC | European Commission |
EOSC | European Open Science Cloud |
EU | European Union |
FAIR | findable, accessible, interoperable, and reusable |
GFLOPS | billion floating point operations per second |
HPC | high-performance computing |
HTC | high-throughput computing |
iREDS | Institutional Re-Engineering of Ethical Discourse in STEM |
LHC | Large Hadron Collider |
ML | machine learning |
NASEM | National Academies of Sciences, Engineering, and Medicine |
NIH | National Institutes of Health |
NSF | National Science Foundation |
OSTP | Office of Science and Technology Policy |
RDA | Research Data Alliance |
SDSS | Sloan Digital Sky Survey |
UKRI | United Kingdom Research and Innovation |