Suggested Citation:"Summary." National Academies of Sciences, Engineering, and Medicine. 2022. Automated Research Workflows for Accelerated Discovery: Closing the Knowledge Discovery Loop. Washington, DC: The National Academies Press. doi: 10.17226/26532.

Summary

Human society increasingly looks to the research enterprise to help address a wide range of complex and interrelated challenges, including climate change and other environmental issues, emerging infectious diseases and other threats to health, food insecurity, and economic and social disparities. In addition, researchers are working to expand our understanding of fundamental scientific questions, including the origins of the universe, the nature of matter, and the evolution of language. The expanding capability, capacity, integration, and ubiquity of advanced information technology supporting research, also known as cyberinfrastructure, provide ever more opportunity for discovery through modeling, simulation, and prediction, now aided by the practical application of machine learning (ML) and artificial intelligence (AI). Accelerating progress depends on leveraging the exponential growth in the amount and variety of data available through the development and use of sophisticated computational approaches. Computation and automation of data acquisition allow automated systems to analyze data and extract knowledge, to create predictive models, and to use those models to guide the acquisition of additional data, closing and automating the loop in a typical research workflow.

The committee uses the term automated research workflow (ARW) to describe scientific research processes emerging across a variety of disciplines and fields. ARWs integrate computation, laboratory automation, and tools from AI in the performance of tasks that make up the research process, such as designing experiments, observations, and simulations; collecting and analyzing data; and learning from the results to inform further experiments, observations, and simulations. Although the specific tools and resources used and the tasks performed vary by field, the common goal of researchers implementing ARWs is to accelerate scientific knowledge generation, potentially by orders of magnitude, while achieving greater control and reproducibility in the scientific process.

The tools and techniques being developed under the large umbrella of ARWs promise to transform the centuries-old serial method of research investigation into processes in which thousands or even millions of simulations or experiments are iterated rapidly in closed loops, with the analysis of data and even the design of experiments or controlled observations being assisted by ML or optimization techniques. Simultaneously, ARWs provide a way to satisfy pressing demands across fields to increase interoperability, reproducibility, replicability, and trustworthiness by better tracking results, recording data, establishing provenance, and creating more consistent metadata than even the most dedicated researchers can provide themselves.

To explore the benefits and challenges, as well as to suggest opportunities to move forward, the National Academies of Sciences, Engineering, and Medicine’s Board on Research Data and Information, in collaboration with the Board on Mathematical Sciences and Analytics and the Computer Science and Telecommunications Board, launched a study aimed at examining current efforts to develop advanced and automated workflows to accelerate research progress. A committee of nine members undertook the study, with support provided by Schmidt Futures. To accomplish its task, committee members held a public workshop in March 2020 and numerous additional discussions, conducted a review of the literature, and drew on their own expertise in specific domains and cross-disciplinary areas such as open science and digital transformation.

Although impressive strides are being made to apply ARWs in a variety of fields, there are also significant barriers to progress. Development and widespread adoption of ARWs require new skills, supportive funding mechanisms, and shifts in culture. Concerns about the role of humans in the discovery loop, privacy of data, and the impact on current incentive systems need to be addressed. What unforeseen technical and ethical issues may arise? Who “owns” the data and discoveries that are produced by automated and distributed systems? How should researchers evolve their practices to reap the benefits of automation while not losing the serendipity of human inspiration and creativity? What goals are best achieved by human scientists (such as invention of new techniques) and which are better left to automation (such as driving data collection to optimize models)?

Given the newness and rapidly evolving nature of this topic, the committee’s findings and recommendations are necessarily broad and future oriented. Fully realized ARWs are not common at present, and so the study examines how and where progress is being made in areas such as advanced computation, use of workflow management systems, laboratory automation, and the use of AI both as a workflow component and in directing the “outer loop” of the research process. This report constitutes an initial effort to create awareness, momentum, and synergies to help realize the potential of ARWs in scientific discovery. The committee hopes that this report will stimulate further discussion, transformations, and, most important, investments and meaningful use.

FINDING A: ACCELERATING DISCOVERY

In many disciplines, the emergence of automated research workflows (ARWs), built upon contemporary cyberinfrastructure, is demonstrating the potential to vastly increase the speed and efficiency of a range of research activities. These include designing and conducting experiments, analyzing data, and observing natural phenomena. These improvements can be realized at scale by implementing infrastructure and practices that facilitate the application of artificial intelligence and machine learning and related technologies to research. Realizing the potential of ARWs could accelerate the pace of scientific discovery by orders of magnitude and thereby expand the research enterprise’s contribution to society.

Developments in multiple domains are beginning to deliver on this potential. Below are some examples:

In materials science, research groups are building systems in which a combination of laboratory automation and ML is cutting the time required for synthesis and testing of materials from 9 months to 5 days (Service, 2019).
In particle physics, new approaches to drawing inferences can be combined in workflows that implement various inference algorithms. The new approaches hold the possibility of significantly advancing the productivity of research by allowing experiments to achieve, for example, a given sensitivity using half the data (Cranmer, 2020).
In drug discovery, an active learning algorithm identified 57 percent of the active compounds by performing 2.5 percent of the possible experiments, compared with 20 percent identified through a traditional approach of building a model for each target (Kangas et al., 2014).
Researchers working in biochemistry are using robotics and data science to automate high-throughput synthesis and screening (Cernak, 2020).
Astronomers are using ML and increasingly fine controls on telescopes to automate target selection so that observations are optimally informative given the observational constraints and scientific objectives (Szalay, 2020).
In climate science, the generation of high-resolution local simulations to inform lower-resolution global climate models about important small-scale processes can be automated, closing the loop of generating computational experiments and informing a global model with them (Schneider et al., 2017a).
In digital humanities, scholars are using deep neural annotation networks to tackle the complexities of compiling information from huge volumes of words and across multitudes of languages over the centuries to see patterns in how ideas have spread and changed over time, and to understand the development of human thought (Crane, 2020).

Researchers in the social and behavioral sciences are using new data resources and advanced analytics to better understand and address a range of pressing problems, including poverty alleviation and strengthening the delivery of public services in cities (O’Brien et al., 2017).¹
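
The closed-loop pattern behind several of these examples, in which results from completed experiments choose the next experiment, can be illustrated with a toy sketch. It is not the method used in any of the cited studies. The candidate pool, the assumption that active candidates cluster together, the nearest-known-active selection rule, and the names `run_closed_loop` and `assay` are all hypothetical and chosen only to keep the example small.

```python
from typing import Callable, List, Set


def run_closed_loop(n_candidates: int,
                    run_experiment: Callable[[int], bool],
                    seed_step: int,
                    budget: int) -> List[int]:
    """Return the indices of actives found within the experiment budget."""
    tested: Set[int] = set()
    hits: List[int] = []

    def measure(i: int) -> None:
        tested.add(i)
        if run_experiment(i):
            hits.append(i)

    # Step 1: coarse seeding experiments spread across the candidate pool.
    for i in range(0, n_candidates, seed_step):
        if len(tested) < budget:
            measure(i)

    # Step 2: closed loop, where results so far choose the next experiment.
    while len(tested) < budget and len(tested) < n_candidates:
        untested = [i for i in range(n_candidates) if i not in tested]
        if hits:
            # Exploit: test the untested candidate nearest to a known active
            # (ties go to the lower index).
            nxt = min(untested,
                      key=lambda i: (min(abs(i - h) for h in hits), i))
        else:
            # Explore: test the candidate farthest from anything tested.
            nxt = max(untested,
                      key=lambda i: min(abs(i - t) for t in tested))
        measure(nxt)
    return sorted(hits)


# Hypothetical ground truth: 20 of 200 candidates are active and adjacent.
ACTIVES = set(range(120, 140))
experiments_run: List[int] = []


def assay(i: int) -> bool:
    """Stand-in for a laboratory measurement; records each call."""
    experiments_run.append(i)
    return i in ACTIVES


found = run_closed_loop(200, assay, seed_step=20, budget=30)
print(f"{len(experiments_run)} experiments found "
      f"{len(found)} of {len(ACTIVES)} actives")
```

In this contrived setup the loop recovers all 20 actives with 30 of the 200 possible experiments, because the selection rule matches the assumed structure of the data. Real gains depend on how well the model captures the domain.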

FINDING B: ADDITIONAL BENEFITS

In addition to increasing the speed and efficiency of research, the effective development and implementation of the technical and human infrastructure for automated research workflows (ARWs) will contribute to strengthening the research process in other ways. For example, the greater transparency and repeatability made possible by automating and capturing specific steps in the research process—advances that underlie the development of ARWs—can foster reproducibility, replicability, and responsibility in research. Adoption of common and interoperable tools and platforms—which could be accelerated by the advance of ARWs but depends on other developments as well—can facilitate international and interdisciplinary research collaboration. Broader access to research workflows and results and the enhanced ability to uncover and correct errors can contribute to greater confidence in research findings and the research enterprise and reduce redundancy among research efforts. In addition, incorporating emerging principles and guidelines for responsible artificial intelligence and machine learning advocated by various organizations, such as building in human review of algorithms, uncovering and addressing bias, and supporting transparency and reproducibility, will also help to secure the benefits of ARWs.

ARWs can help strengthen transparency, rigor, and reliability in research in several ways. These include the use of scientific workflow engines, software that provides a formalization of the computational analysis pipeline, as well as platforms to facilitate data flows and data management pipelines (ODSC Community, 2021). These tools provide a structured and repeatable process to automate complex ARWs at scale.
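
A minimal sketch can show what "structured and repeatable" means in practice. It does not represent any particular workflow engine; the step functions, the sample readings, and the names `run_workflow` and `fingerprint` are assumptions for illustration. Each step is a named function, and the runner records a hash of what went into and came out of every step, so a rerun can be checked against the log.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List, Tuple


def fingerprint(obj: Any) -> str:
    """Short, stable hash of any JSON-serializable value."""
    blob = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def run_workflow(steps: List[Tuple[str, Callable[[Any], Any]]],
                 data: Any) -> Tuple[Any, List[Dict[str, str]]]:
    """Run the steps in order and log the input and output of each one."""
    provenance: List[Dict[str, str]] = []
    for name, func in steps:
        result = func(data)
        provenance.append({
            "step": name,
            "input": fingerprint(data),
            "output": fingerprint(result),
        })
        data = result
    return data, provenance


# Hypothetical analysis steps for a list of temperature readings.
def drop_missing(values):
    return [v for v in values if v is not None]


def to_celsius(values):
    return [round((v - 32) * 5 / 9, 2) for v in values]


def mean(values):
    return round(sum(values) / len(values), 2)


steps = [("drop_missing", drop_missing),
         ("to_celsius", to_celsius),
         ("mean", mean)]
raw = [68.0, None, 71.6, 75.2]

result, log = run_workflow(steps, raw)
print(result)
for entry in log:
    print(entry)
```

Because every step is deterministic, running the same workflow on the same input yields an identical log, which is the property that makes an analysis checkable by someone else.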

An additional trend offering an on-ramp to ARWs is the use of interactive computational laboratory notebooks, such as Jupyter, that allow researchers to capture a set of discrete analysis steps and track them within a single user interface. There were about 9.7 million Jupyter notebooks stored on GitHub when this was written in November 2020, with the number growing by about 8,500 per day (Project Jupyter, 2020). Increasingly sophisticated platforms, such as Google’s AI Platform and Red Hat OpenShift AI/ML workflows, which integrate Jupyter Notebooks or similar interactive interfaces with scalable compute and data resources, are enabling complex, iterative AI/ML pipelines (Google Cloud, 2020; RedHat Openshift, 2020).

___________________

¹ For more information, see also https://opportunityinsights.org.

RECOMMENDATION 1: Design Principles

Organizations that fund, perform, and disseminate research, along with scientific societies, should support and enable automated research workflows (ARWs) that embody the following design principles:

ARWs and the systems, tools, and platforms that comprise them should facilitate openness, reproducibility, and transparency.
ARWs should facilitate the effective use of artificial intelligence (AI) and machine learning (ML) as research tools and incorporate principles of responsible AI and ML to mitigate the risks from various human and technological deficiencies, such as confirmation and sampling biases, inappropriate application of statistics, and challenges to interpretability of results and quantification of confidence and uncertainties when drawing inferences from ML analyses.
The associated research objects (data, code, even entire workflows) for ARWs should be FAIR (findable, accessible, interoperable, and reusable), not only by humans but also by machines, to facilitate automated reuse and collaboration.
ARWs should prioritize reuse and sustainability of existing tools and systems when possible and appropriate, reducing costly duplication of effort and facilitating the extension of capabilities through integration or federation of systems and agreement on standards. Designs should allow for specialization into specific domains, but avoid unnecessary rebuilding.
While proprietary services and components can enhance the utility of ARWs, key ARW infrastructure should be controlled by and be accessible to the research community itself, with the community developing standards and practices to facilitate this.

These principles constitute a vision for future ARW development that emerged from the study. They support and expand upon the recommendations of several National Academies reports that underline the importance of openness and reproducibility in research (NASEM, 2017, 2018b, 2019b). Although ARWs can be implemented without open research, open data, code, and workflows can speed their adoption and enhance their impact.

As these principles and other recommendations demonstrate, the focus of the committee’s effort went beyond the use of AI as a component in a workflow to the use of AI methods to design experiments and to automatically control them. We offer our findings and recommendations related to design principles, infrastructure sustainability, human resources, culture and incentives, and privacy protection as contributions to the next step in the transformative application of computing to scientific discovery.

There was a great deal of discussion at the workshop about mitigating various risks to the effective utilization of ARWs that can arise from technological and human deficiencies. Workflows do not guarantee transparency or, more broadly, integrity. An overreliance on naive ML approaches can lead to severe biases and lack of replicability. An example is p-hacking, which occurs when successive ML algorithms are applied to data and only the algorithm with the best performance is reported (Nugent, 2020).
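
The effect of reporting only the best-performing algorithm can be seen in a small simulation. This is an illustrative sketch, not an example taken from the workshop: the data are pure noise by construction, each "algorithm" is the hypothetical rule "predict the label from one feature," and the sample sizes and random seed are arbitrary assumptions.

```python
import random


def make_data(rng, n_samples, n_rules):
    """Pure noise: labels are coin flips, unrelated to any feature."""
    features = [[rng.getrandbits(1) for _ in range(n_rules)]
                for _ in range(n_samples)]
    labels = [rng.getrandbits(1) for _ in range(n_samples)]
    return features, labels


def accuracy(features, labels, rule):
    """Accuracy of the 'algorithm' that predicts label = features[rule]."""
    correct = sum(1 for row, y in zip(features, labels) if row[rule] == y)
    return correct / len(labels)


rng = random.Random(0)
N_RULES = 200

# Try every candidate algorithm on the same small data set.
X, y = make_data(rng, 100, N_RULES)
scores = [accuracy(X, y, r) for r in range(N_RULES)]

# Report only the winner, which is the step that constitutes p-hacking.
best_rule = max(range(N_RULES), key=lambda r: scores[r])
reported = scores[best_rule]

# Re-evaluate the winner on fresh data it was not selected on.
X_new, y_new = make_data(rng, 5000, N_RULES)
fresh = accuracy(X_new, y_new, best_rule)

print(f"reported accuracy of best rule: {reported:.2f}")
print(f"accuracy of same rule on fresh data: {fresh:.2f}")
```

Since no rule can genuinely predict a coin flip, the reported score is inflated purely by selection, and the fresh-data score falls back toward chance. Holding out data that plays no role in choosing the algorithm is one common safeguard.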

Examples discussed at the March 2020 workshop point to the benefits of building ARWs on common lower-level, interoperable infrastructure to the extent possible as opposed to being project specific. The Common Workflow Language (CWL) is a good example of a standard aimed at facilitating interoperability (CWL, 2020).

FINDING C: RESEARCH ENTERPRISE

Realizing the potential of automated research workflows (ARWs) will require modification of the research enterprise, including sustainable funding for the necessary hardware, software, and human resources, educating the scientific workforce, reporting and sharing research results, and structuring researcher rewards and incentives. Multidisciplinary, multirole collaboration is essential to realize the potential of ARWs.

Because solutions to the biggest problems of our time are complex, they require end-to-end workflows for integrated management of many technical steps in addition to extensive knowledge of the application domain. These integrated steps require multidisciplinary team members to combine their expertise and collaborate on (1) methods to manage, integrate, and interpret “big” data; (2) modeling and simulation tools executing on scalable computing platforms; (3) methods and interfaces for domain-specific analysis, communication, and visualization of results; and (4) technologies to make the process FAIR, that is, portable, transparent, repeatable, and reproducible.² Such a multidisciplinary collaboration to solve a problem is different from the way an individual conducts a scientific study, shifting the paradigm from individual science to team science.

In addition to creating a more supportive environment for the development and implementation of ARWs, the actions identified here will support more effective use of AI/ML and advanced computation in research more broadly.

RECOMMENDATION 2: Infrastructure, Code, and Data Sustainability

Research funders, working with other stakeholders such as societies, research institutions, and publishers, should place greater priority on approaches to ensuring the creation and sustainability of key systems, tools, platforms, and data archives for automated research workflows (ARWs). Priorities include

Funding support for efforts by research institutions and societies to link disciplines so they can share and benefit from the expertise in statistics, machine learning or data science, and engineering and computer science that is required to build and maintain sustainable infrastructure for ARWs.
Funders and research communities structuring funding for cyberinfrastructure projects such as large scientific instruments so as to maximize the potential for innovation in ARWs and the reuse of data and other outputs.
Funders and research communities supporting open data standards and open interfaces for scientific instruments.
Funders and research institutions enabling reuse, reproducibility, and long-term sharing of FAIR data and software resources through support of repositories that make archival and updated versions of these resources available within and across disciplines, and providing approaches to sustain those repositories.
Publishers updating their data-sharing requirements by directly associating articles to data in FAIR repositories.

___________________

² GO FAIR, https://www.go-fair.org/. Accessed August 21, 2020.

Although providing sustained support for the development and operation of shared cyberinfrastructure across multiple disciplines has been challenging in the United States, there are numerous historical success stories such as the National Science Foundation (NSF) National Supercomputer Centers, the development of GenBank and other digital data resources in the life sciences, and NSF’s advanced cyberinfrastructure program. Current and recent U.S. programs aimed at establishing shared resources include Harnessing the Data Revolution (NSF, 2020b), the National Institutes of Health (NIH) Data Commons Consortium (NIH, 2018), and a series of efforts to advance strategic computing and related technologies across agencies under the auspices of the Networking and Information Technology Research and Development Program³ and its predecessors. Currently under way is the Global Open Research Commons Interest Group⁴ under the Research Data Alliance (RDA) (Jones, 2021), led by some members of the now terminated Data Commons Consortium. It plans to address the issues related to avoiding silos and agreeing on standards to build globally oriented data and science environments.

On January 1, 2021, Congress passed the National Artificial Intelligence Initiative Act.⁵ The act’s language reflects many of the issues raised in this report.

___________________

³ For more information, see https://www.nitrd.gov/.

⁴ For more information, see https://www.rd-alliance.org/groups/global-open-research-commons-ig.

⁵ 15 U.S.C. Chapter 119, https://uscode.house.gov/view.xhtml?path=/prelim@title15/chapter119&edition=prelim.

The 2021 federal budget included $868 million (a 76 percent increase) to NSF for AI-related grants and interdisciplinary research initiatives; $125 million to the U.S. Department of Energy’s Office of Science for AI research; and $50 million to NIH for research on chronic disease using AI and related approaches (CRS, 2020). For example, NIH is planning to invest $23 million per year over 7 years to support Artificial Intelligence for Biomedical Excellence, which will generate new biomedically relevant data sets amenable to ML (NIH, 2020).

Advancing AI as a research tool and building complementary data stewardship capacities are a focus internationally. In 2020, United Kingdom Research and Innovation, the major public funder of research in the UK, targeted computational and e-infrastructure as a priority theme for research infrastructure investment. The Alan Turing Institute, the UK’s national institute for data science and AI, has been strategically investing in key areas as part of a shared vision for the future of AI and data science in the UK. The European Open Science Cloud⁶ is still in development but is designed to be a new shared infrastructure with long-term investment to provide access to data repositories as well as resources such as cloud services, high-performance computing, and data analysis tools (EOSC, 2020). And in China, the CSTCloud (China Science and Technology Cloud) is emerging as a national infrastructure of data and computation for accelerating science discovery (CSTCloud, 2020).

Given the significant investments that governments and other research funders are making in data-driven science, it makes sense to leverage these investments across borders and domains to the extent possible. Goals of enhanced international collaboration would include facilitating access to tools and resources and ensuring the interoperability of national implementations. Distributed international efforts such as GO FAIR, RDA, the Research Data Framework, CODATA, FORCE11, and the Research Software Alliance are working to develop standards and approaches to facilitate research data management and sharing and FAIR data and software.

RECOMMENDATION 3: Human Resources

Research funders, higher education, research institutions, and scientific and professional societies should support the development and implementation of educational programs and career pathways aimed at building the workforce needed to develop and utilize automated research workflows (ARWs), including the creation of career tracks that support ARW capabilities. Examples of what is needed include

Programs that foster integration of domain expertise with data science and software engineering skills.
Programs that inculcate data literacy and computational analytical skills in all areas of research.
Developing the human resources needed to build, maintain, and operate ARW hardware and software, including hardware and software engineers who build, maintain, and operate automated laboratories and the software needed to learn from data and to design experiments.
Fostering collaborative research that aims at developing and using ARWs and that facilitates sharing workflows, code, data, and data products in ways that respect and protect privacy considerations.

___________________

⁶ See https://ec.europa.eu/info/research-and-innovation/strategy/goals-research-and-innovation-policy/open-science/european-open-science-cloud-eoscen.

Discussion at the March 2020 workshop revealed a number of education and training needs and challenges to effective development and implementation of ARWs that are common across multiple domains. For example, a need for more researchers who combine domain knowledge and data science or software development expertise was expressed in just about all use case discussions. Knowledge of mathematical and computational methods for designing experiments or controlling observations automatically and for learning from data is becoming essential for researchers (NASEM, 2018a, 2020b).

RECOMMENDATION 4: Culture and Incentives

Research funders, research institutions, and disciplines should work to create an automated research workflow (ARW)-friendly culture by making changes in incentive and reward structures aimed at encouraging behaviors that are central to realizing the potential of ARWs. These include

Encouraging team science and multidisciplinary teams.
Using funding support and provisions for data management plans to encourage development and curation of FAIR, responsible, and good-quality data resources.
Developing, improving, and sharing software resources.
Reporting reproducible results.
Helping others adopt ARW practices.
Pursuing international collaboration when possible in order to accelerate progress toward implementing the above changes at scale.

Misalignment of the incentives and priorities of researchers, research institutions, and research funders with the actions and efforts needed to effectively develop and implement ARWs was a major theme of the March 2020 workshop discussion and manifested itself in a variety of ways. Indeed, the rewards and incentives built into the research enterprise as it exists today and their negative impact on efforts to modernize research through greater transparency and rigor go beyond ARWs and have been discussed in several recent National Academies reports (NASEM, 2017, 2018b, 2019b, 2020a). Current conditions are not conducive to creating research environments that encourage transparency, sharing, the formation of interdisciplinary teams, and other behaviors that would boost the effectiveness of ARWs.

The COVID-19 experience illustrates that the prerequisites for automated and rapid research responses to public health threats are the same as those needed to effectively utilize ARWs in general (RDA, 2020). These include common data models, approaches to mitigate bias in data collection and analysis, and support for the infrastructure necessary to facilitate near-real-time scientific investigation and dissemination of findings that lead to guidance on how best to mitigate public health threats. Currently, there are strong incentives for launching rapid research responses, but weak incentives for sharing the outputs and ensuring that such responses are rigorous and reliable.

A shortage of shared domain resources, particularly well-characterized FAIR data and related infrastructure such as repositories and active curated services, is apparent in drug discovery, the digital humanities, and other fields. These resources are ultimately critical to the implementation of ARWs, because data are needed to develop and train ML algorithms, which in turn enable the development of closed-loop systems.

The European Union has prioritized open research data. Recent cost-benefit analysis estimated that by not having FAIR data, Europe incurs an opportunity cost of about $10 billion per year in categories such as additional time spent on research, higher storage costs, higher license costs, and higher rates of retracted research findings (EC, 2019). The EU also increased its investments in AI by 70 percent from 2018 to 2020, to about €1.5 billion (EC, 2021a).

Although the FAIR principles are important overall, their most salient contribution in the context of ARWs is making research products such as data, software, and workflows machine actionable. Beyond FAIR, data quality and, in some cases, data privacy are of great importance, and the FAIR principles do provide for authorization to access private data.

FINDING D: LEGAL AND POLICY ISSUES

In addition to barriers to progress that exist within the research process itself, there are legal and policy issues that affect implementation of automated research workflows in specific domains that will require international multistakeholder efforts to address.

Data collected for use in ARWs will increasingly include data generated outside of a traditional research setting, such as personal health data collected from wearables and medical visits or behavioral data collected online from social media. Use of such data is subject to additional challenges, catalyzed by documented cases of misuse, including public mistrust, institutional policies, and government regulation designed to protect personal data privacy. ARWs will have to be designed to comply with these terms and provide transparency in results from data use. For example, in response to concerns about the security and use of personal data, exacerbated by well-publicized examples of data breaches and misuse, policy makers and public interest groups have pushed to allow individuals to have greater control over the use, storage, and reuse of their data. Prime examples are the European Union’s General Data Protection Regulation and the California Consumer Privacy Act. ARWs will need to comply with such policies.

The development and implementation of ARWs are also affected by broader issues raised by the growing use of AI in a variety of policy-making and decision-making contexts. It will be a challenge to maintain transparency and trust in outcomes with significant real-life consequences when an algorithm determines those outcomes (Stoyanovich et al., 2020a).

RECOMMENDATION 5: Preserving Privacy

Research enterprise funders, performers, publishers, and beneficiaries should work with governments, data privacy experts, and other entities to address the legal, policy, and associated technical barriers to implementing automated research workflows in specific domains. They should explore solutions to make the outputs available through privacy-preserving algorithms and federated learning approaches to using data.

Privacy, ethics, and similar socially based topics will likely emerge as (a) more data are added into the workflow cycle and (b) the questions being asked are modified. Building mechanisms into ARWs that recognize and are sensitive to authorized data use issues is new territory to explore and develop, for example, to be able to distinguish between permitted and prohibited queries (Kusnezov, 2020). Ideas proposed at the workshop included embedding compliance in the design of the software for open research data services and standards for the architecture of the sharing and access system (Burgelman, 2020). Examples of current work on privacy-preserving approaches for social sciences include how to reduce privacy loss when dealing with small sample sizes (Chetty and Friedman, 2019) and development of a mathematical framework to quantify and manage privacy risks (Wood et al., 2018).
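
One way to picture a workflow that distinguishes permitted from prohibited queries and releases only privacy-protected outputs is the sketch below. It is an illustration under stated assumptions, not a description of any approach cited above: the policy set `PERMITTED_FIELDS`, the record layout, and the function names are hypothetical, and the noise step is the standard Laplace mechanism from the differential privacy literature, applied to a counting query whose answer changes by at most 1 when one person's record is added or removed.

```python
import random

# Hypothetical policy: only these attributes may be queried.
PERMITTED_FIELDS = {"age_group", "region"}


def laplace_noise(rng: random.Random, scale: float) -> float:
    """Laplace(0, scale) noise, drawn as the difference of two
    independent exponential variates with mean `scale`."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, field, value, epsilon, rng):
    """Answer 'how many records have field == value?' with added noise.

    Queries on fields outside the policy are refused. Smaller epsilon
    means more noise and stronger privacy protection.
    """
    if field not in PERMITTED_FIELDS:
        raise PermissionError(f"queries on '{field}' are not permitted")
    true_count = sum(1 for r in records if r.get(field) == value)
    # A counting query has sensitivity 1, so the noise scale is 1/epsilon.
    return true_count + laplace_noise(rng, 1.0 / epsilon)


# Hypothetical records, for illustration only.
records = [
    {"name": "A", "age_group": "30-39", "region": "north"},
    {"name": "B", "age_group": "40-49", "region": "north"},
    {"name": "C", "age_group": "30-39", "region": "south"},
    {"name": "D", "age_group": "50-59", "region": "north"},
]

rng = random.Random(42)
print(private_count(records, "region", "north", epsilon=0.5, rng=rng))

try:
    private_count(records, "name", "A", epsilon=0.5, rng=rng)
except PermissionError as err:
    print("refused:", err)
```

The field check runs before any data are touched, so a refusal reveals nothing about the records themselves. A production system would also need to track the cumulative privacy budget across queries, which this sketch omits.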
