Suggested Citation:"2 Context for Automated Research Workflows." National Academies of Sciences, Engineering, and Medicine. 2022. Automated Research Workflows For Accelerated Discovery: Closing the Knowledge Discovery Loop. Washington, DC: The National Academies Press. doi: 10.17226/26532.


2 Context for Automated Research Workflows

THE OPPORTUNITY

As we enter the third decade of the 21st century, societal demands are converging that automated research workflows (ARWs) can help address. The volume and exponential growth of digital data, and of the ability to mine and generate those data, provide rich opportunities for progress. This growth has led to quantitative change in the way research is conducted. Pairing advances in artificial intelligence (AI), computing, and the automation of laboratories and observations can also lead to a qualitative step change.

Technological developments go hand in hand with scientific progress, and advances in computing and automation are no exception. Computing plays a central role throughout research workflows, from computerized models used for simulation and prediction, to control of equipment and data analysis, to publication. Laboratories and observational devices are increasingly controlled by computers and automated. For example, telescopes are now routinely controlled remotely by computers, and increasingly the observational process is automated following workflows predefined by humans. Biological and chemical laboratories are making increasing use of microfluidic devices that enable automated experimentation at higher throughput and a faster pace than is possible by hand. Computers and automation have led to a quantitative increase in research productivity over the past half century.

PREPUBLICATION COPY—Uncorrected Proofs

As an indicator of growth, the “global datasphere”—the amount of data that is created, captured, copied, or consumed in a given year—was estimated to have reached 64 zettabytes (64 trillion gigabytes) in 2020 and is projected to grow to around 175 zettabytes by 2025 (Woodie, 2020). Computing performance has also exploded. The peak performance of the world’s fastest computers has increased from 1 GFLOPS (1 billion floating point operations per second) in the 1980s to 10^8 GFLOPS in 2020 (for more information, see https://www.top500.org, accessed April 19, 2022).

More radical changes are under way. Consider the evolution of the basic research workflow since the scientific revolution of the 17th century, which put science on an empirical footing. A scientist was not only to observe “nature in the raw,” but also, in Francis Bacon’s (1620) words, to “twist the lion’s tail,” that is, “manipulate our world in order to learn its secrets” (Hacking, 1983). Science since then has advanced through a virtuous circle in which measurements, observations, and, increasingly, simulations generate data. Harnessing the data leads to an update of existing models or the formulation of new models for the data (Figure 2-1). Knowledge generation can begin at any point in this loop: for example, either with a new model that prompts the generation of new data, or with new data that prompts the generation of a new model or the modification of an existing one.

The process of starting from a model and devising an observational or experimental way of generating new data is called experimental design (with an experiment understood broadly to include, for example, the collection of observational data). As we use it here, experimental design does not need to imply constructing a comprehensive set of experiments to cover some experimental space or creating a new experimental procedure. It can be as simple as choosing a single set of experimental parameters (e.g., a particular combination of materials, chemical reactants, or drugs and targets) to test next using an established procedure. The process of using data to inform a model can be called learning about a model.

We are using the term “model” broadly, to include instantiations of general theories (e.g., how the collision of black holes gives rise to gravitational waves according to the general theory of relativity) and empirical or semiempirical models (e.g., in economics or the environmental sciences). “Data” is similarly taken as an all-encompassing term—for example, data generated in simulation studies as well as in laboratory experiments, or, in the digital humanities, original text sources, images, maps, social media, and much more.

FIGURE 2-1 Knowledge discovery loop. NOTE: Automated research workflows can automate and close the loop of scientific discovery. On one side of the loop, artificial intelligence (AI) and machine learning (ML) algorithms harness the experimental or observational data to learn about a model; on the other side of the loop, AI and ML are used to generate the study design for the next data collection. The loop goes on iteratively.

The potential for greater progress lies in exploiting this proven success model of science, but accelerating it by orders of magnitude through faster and continuous iteration. Given a model, not just one but many (thousands or more) experiments may be designed automatically and can be optimized to be maximally informative about the model. The learning step can likewise be automated. Algorithms for designing experiments to be maximally informative and for learning about a model from data can become broad purpose and transcend individual disciplinary fields (in much the same way that least squares estimation has become broad purpose). ARWs can be structured so that data collection, analysis, and hypothesis revision and refinement are undertaken as a continuous process, with updates occurring as new data are generated or discovered (Gil et al., 2017).

However, models and data, and the means of acquiring data, will likely remain domain-specific. That is, the edges of the graph in Figure 2-1 can be automated with methods that transcend individual disciplines, but the nodes—data and models—will remain domain-specific. The loop remains open to human intervention, for example, to identify variables relevant to measurement and modeling, and to analyze serendipitous results.

To cast the discussion in modern ML terms, the closed-loop research workflow in Figure 2-1 encapsulates a form of reinforcement learning (Sutton and Barto, 1998), in which a model is used to design a manipulation or observation of an environment to generate data (experimental design), and the model subsequently learns from the data so generated. Reinforcement learning in essence is how science has progressed for centuries. Similar gains from rapid iteration in what may be understood to be a reinforcement learning loop—but may include a variety of techniques from ML, Bayesian learning, and experimental design—are now possible in some scientific fields.
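The design-measure-learn loop described above can be sketched in a few lines of code. The following is a minimal illustration, not drawn from the report: a hypothetical noisy exponential-decay "experiment," a discretized Bayesian update for the learning step, and a design step that picks the measurement where the candidate models disagree most. All function names and numbers are invented for illustration.

```python
import math
import random

random.seed(0)

# Hypothetical ground truth the loop is trying to learn; in a real
# ARW this would be the physical system under study.
TRUE_DECAY = 0.7

def run_experiment(x):
    """Stand-in for a lab measurement: noisy observation of exp(-k * x)."""
    return math.exp(-TRUE_DECAY * x) + random.gauss(0, 0.01)

# Discretized prior over the decay constant k (the "model" node of the loop).
ks = [i / 100 for i in range(1, 201)]      # candidate k values in (0, 2]
logp = {k: 0.0 for k in ks}                # uniform log-prior

def posterior():
    m = max(logp.values())
    w = {k: math.exp(v - m) for k, v in logp.items()}
    z = sum(w.values())
    return {k: v / z for k, v in w.items()}

def choose_next_x(candidates):
    """Experimental design: pick the x where candidate models disagree most
    (maximum predictive variance), so the measurement is maximally informative."""
    post = posterior()
    def pred_var(x):
        preds = [(math.exp(-k * x), p) for k, p in post.items()]
        mean = sum(y * p for y, p in preds)
        return sum(p * (y - mean) ** 2 for y, p in preds)
    return max(candidates, key=pred_var)

# The knowledge discovery loop of Figure 2-1, iterated automatically.
for _ in range(8):
    x = choose_next_x([i / 4 for i in range(1, 21)])   # design an experiment
    y = run_experiment(x)                              # generate data
    for k in ks:                                       # learn about the model
        logp[k] += -((y - math.exp(-k * x)) ** 2) / (2 * 0.01 ** 2)

post = posterior()
k_hat = max(post, key=post.get)
print(f"estimated decay constant: {k_hat:.2f}")
```

After a handful of automatically designed experiments, the posterior concentrates near the true decay constant. The design and learning steps here are generic (they would work for any parametric model), while the experiment itself is domain-specific, mirroring the edges-versus-nodes distinction above.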

A new generation of workflows is making extensive use of AI, ML, and, in general, automation. (See Box 2-1 for working definitions of these and other key terms used in this report.) AI and ML are increasingly used as components of ARWs across many domains: examples include understanding protein folding in biology and analyzing sparse data in the geosciences (Gil et al., 2019; Hey et al., 2020). Beyond that, AI and ML are beginning to be used to automate the design and operation of elements that are traditionally considered part of the workflow itself, such as the design of experiments (Deelman et al., 2019). This offers the opportunity to produce a next generation of workflows that are dynamic, intelligent, and self-governing.

Distinct uses of AI, ML, and workflows play a role in the many phases of a research project (e.g., planning, exploration, scale-up, and publishing). AI and ML techniques deployed within ARWs not only can drive an experiment and mine the literature to suggest future experiments, but also may enhance research reliability and productivity by facilitating the reuse of workflows and improving the ability of researchers to monitor workflow execution and detect anomalies (Deelman et al., 2019). Similar concepts in nonscientific domains are being called intelligent or cognitive workflows (Bellissimo, 2019).

ARW opportunities are vast but are accompanied by technical and mathematical challenges and, perhaps even more so, by organizational, economic, policy and political, social, and incentive issues. Wide adoption of ARWs requires consideration of these interrelated concerns. In the United States, leadership in advancing computing for research and higher education has come from the National Science Foundation, the Department of Defense (DOD), the Department of Energy (DOE), and the National Institutes of Health (NIH), sometimes coordinated by the Office of Science and Technology Policy (OSTP).
Evolution of the technology alone is not enough to ensure adoption, access, and meaningful transformative uptake

by the various specialized science communities. What is required, and what these agencies have nurtured in the past, is the building of compelling visions of possibilities and scientific impact and then the use of these visions, backed by targeted funding, to motivate innovation in the science communities they serve. Federal financial support can stimulate related support from the private sector, nongovernmental foundations, and universities. Examples include cyberinfrastructure research and development (R&D), pilot application projects, and related human resource development. The discussion of use cases, barriers, and opportunities in this report aims to illustrate how ARWs will become essential to the exploration of the frontiers of discovery and the grand challenges facing our world.

BOX 2-1
Key Terms Used in This Report

Automated Research Workflows
ARWs integrate computation, laboratory automation, and tools from artificial intelligence in the performance of tasks that make up the research process, such as designing experiments, observations, and simulations; collecting and analyzing data; and learning from the results to inform further experiments, observations, and simulations.

Artificial Intelligence
While specific definitions vary, artificial intelligence is, generally speaking, any method for programming computers to enable them to carry out tasks or behaviors that would require intelligence if performed by humans (NAS, 2018).

Cyberinfrastructure
The concept of cyberinfrastructure first emerged in the late 1990s and early 2000s. The term has come to encompass a spectrum of computational, data, software, networking, and security resources, tools and services, and computational and data skills and expertise that can be seamlessly integrated and used, and collectively enable new, transformative discoveries across science and engineering (NSF, 2019).

Machine Learning
Machine learning draws from a variety of fields, including computer science, statistics, engineering, cognitive science, and neuroscience. Researchers in machine learning develop both the mathematical foundations and the practical applications of systems that learn from data (NAS, 2018). In the context of this report, we use the term machine learning broadly, to comprise any form of learning from data, be that Bayesian learning about parameters, parametric functions, or nonparametric functions in scientific models, or learning with artificial neural networks.
Open Research
Open Research (which incorporates Open Science and Open Scholarship) aims at increasing research quality, boosting collaboration, speeding up the research process, making the assessment of research more transparent, promoting public access to scientific results, and introducing more people to academic research. It is a set of principles and practices that fosters openness throughout the entire research life cycle (EC, 2018; NASEM, 2018).

Reproducibility and Replicability
Reproducibility is obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis. Replicability is obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data. Two studies may be considered to have replicated if they obtain consistent results given the level of uncertainty inherent in the system under study (NASEM, 2019).

References

EC (European Commission). 2018. OSPP-REC, EC open science policy platform recommendations 2018. https://ec.europa.eu/info/research-and-innovation_en#view=fit&pagemode=none. Accessed May 21, 2021.

NAS (National Academy of Sciences). 2018. The frontiers of machine learning: 2017 Raymond and Beverly Sackler U.S.-U.K. Scientific Forum. Washington, DC: The National Academies Press. doi: 10.17226/25021.

NASEM (National Academies of Sciences, Engineering, and Medicine). 2018. Open science by design: Realizing a vision for 21st century research. Washington, DC: The National Academies Press. doi: 10.17226/25116.

NASEM. 2019. Reproducibility and replicability in science. Washington, DC: The National Academies Press. doi: 10.17226/25303.

NSF (National Science Foundation). 2019. Transforming science through cyberinfrastructure. https://www.nsf.gov/cise/oac/vision/blueprint-2019/nsf-aci-blueprint-v10-508.pdf. Accessed July 12, 2021.

BUILDING AUTOMATED RESEARCH WORKFLOWS: CURRENT STATE OF THE ART

As outlined above, the confluence of several technological advancements is driving the development and implementation of ARWs. Fully realized ARWs are not common at present, and so this study examines how and where progress is being made in areas such as advanced computation, use of workflow management systems and notebooks, laboratory automation, and use of AI as a workflow component as well as in directing the “outer loop” of the research

process. These separate developments are producing positive changes in research processes. For example, the broader use of workflow management systems is delivering gains in reproducibility and reliability (Ferreira da Silva et al., 2021a,b). Likewise, the use of AI and ML in a range of disciplines is proceeding rapidly, whether or not they are used in conjunction with workflow management systems (Royal Society and Alan Turing Institute, 2019).

Scientific Workflow Engines and Related Software Tools

The concept of a “workflow” as a series of computational steps in the scientific discovery pipeline is not a new one. In a sense, any computational analysis pipeline that involves multiple stages and dependencies across those stages can informally be considered a workflow. These steps are often linked and chained together by a set of scripts to provide some degree of automation. As research problems and the cyberinfrastructure platforms for exploring them have become more powerful and complex, scientific workflow engines have played a crucial role in harnessing and coordinating distributed computing and data resources.

Scientific workflow engines are software tools that capture the computational analysis pipeline of a research project, providing provenance tracking and other functions that facilitate automation, reproducibility, and reusability. They are proliferating; one community effort has compiled a list of more than 280 scientific workflow engines and acknowledges that its list is probably incomplete (GitHub, 2021). Table 2-1 provides a representative list. The recent report from the Workflows Community Summit held in January 2021 emphasizes the need for building a workflows community (Ferreira da Silva et al., 2021a,b).

Scientific workflow engines formalize a workflow construct, in which a user defines a set of steps and the dependencies between those steps through configuration files and code. This allows more complex relationships, such as directed acyclic graphs or loops, to be expressed in the workflow, and it is key to enabling automation and reusability in the scientific pipeline. The same workflow can then be reexecuted under different initial conditions, against multiple data sets, and at different scales. These workflow engines have become key in large-scale analysis pipelines, both in academia (e.g., Kepler, Parsl, Pegasus, Fireworks, and Cromwell) and the commercial sector (e.g., Airflow and Luigi). Some of these tools are used to scale up AI or ML runs for automated parameter optimization or for training models on very large data sets.

There are also several distributed computing and automation frameworks with a narrower focus that capture specific execution patterns; these may also be intrinsically part of the “workflow.” Examples include tools such as Spark or Hadoop that execute large numbers of data processing tasks at scale, or cloud data stores such as Bigtable that can execute queries across a large distributed data set.

Also worth mentioning are tools that implicitly encapsulate interactive workflows: “interactive notebooks” such as Jupyter, Wolfram, MATLAB, and RStudio provide a user-friendly interface to capture a set of discrete analysis steps in a single document, where each step can be run interactively and intermediate results between steps can be examined and visualized. The popularity of these approaches is a testament to the fact that workflows are ubiquitous in scientific computing even if scientists do not formally use the term. These digital notebooks may serve as an on-ramp for students and researchers to a new generation of production-level, customizable tools that could be widely adopted.
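To make the "steps plus dependencies" construct concrete, the following sketch (not the API of any engine named above) defines a hypothetical three-step pipeline as a dependency graph and executes it in topological order, the core scheduling idea that workflow engines build on. All step names and functions are invented for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical steps; real ones might submit jobs or query a database.
def fetch():
    return [3, 1, 2]

def clean(data):
    return sorted(data)

def report(data):
    return f"n={len(data)}, max={data[-1]}"

# Declarative description: step name -> (callable, names of upstream steps).
steps = {
    "fetch":  (fetch,  []),
    "clean":  (clean,  ["fetch"]),
    "report": (report, ["clean"]),
}

def run_workflow(steps):
    """Tiny stand-in for a workflow engine: order the DAG topologically,
    run each step on its dependencies' outputs, and log what ran."""
    graph = {name: set(deps) for name, (_, deps) in steps.items()}
    results, log = {}, []
    for name in TopologicalSorter(graph).static_order():
        func, deps = steps[name]
        results[name] = func(*(results[d] for d in deps))
        log.append(name)
    return results, log

results, log = run_workflow(steps)
print(log)                   # execution order respects the dependencies
print(results["report"])
```

Because the pipeline is described declaratively rather than as a hard-coded script, the same definition can be rerun against different inputs or scheduled across distributed resources, which is exactly what production engines add on top of this core idea.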

Looking ahead, there is an opportunity for workflow engines to integrate with the “outer loop” of research discovery, where some level of input or feedback is required to drive the process, whether from human input, an external interaction with the physical world, or a computational process outside the automation engine. Electronic notebooks such as Jupyter enable human-in-the-loop interactivity and have thus provided scientists with the ability to tie human insight and iteration into this process. AI and ML workflows in particular have benefited from this, because many of the steps (model training, hyperparameter optimization, etc.) are human-centric processes that involve examination of intermediate results to inform future iterations. Building on the interactive workflow paradigm, Galaxy, with over 30,000 users as of April 2021, is an example of a workflow system where user setups are recorded and where notebooks and workflows may be routinely intertwined (Galaxy, 2021a). We also note the emerging “Canonical Workflow Frameworks for Research” initiative, which seeks to capture canonical workflows and workflow patterns built on some of these technologies. This effort aims to improve reuse of workflow components, increase the efficiency of workflows, and allow incorporation of machinery that automatically generates FAIR data (Hardisty and Wittenburg, 2020).

As the next generation of scientific workflow engines expands, automation of the scientific process can lead to a step change in the rate of discovery in many fields. Room for serendipity and human ingenuity remains essential, and so interactivity and integration of external input must be a core part of the system. In essence, we need to go beyond a closed automation loop and enable interaction points for modifying and driving the system and for identifying relevant variables. In other words, next-generation workflows require tools that provide both interactivity and automation at scale.
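One way to picture such an interaction point is an outer loop that calls a pluggable feedback hook between automated runs. In the illustrative sketch below, the "reviewer" is a scripted stand-in for a human at a notebook (or an instrument, or another program); every name and number is hypothetical, chosen only to show the control structure.

```python
def inner_step(param):
    """Stand-in for an automated analysis or training run; returns a
    pretend "loss" score we would like to drive toward zero."""
    return (param - 3.0) ** 2

def scripted_reviewer(param, score):
    """Stand-in for human review at the interaction point: accept when the
    score is good enough, otherwise suggest a revised parameter."""
    if score < 0.01:
        return None        # None means accept the result and stop the loop
    return param + 0.5     # steer the next iteration toward a larger value

def outer_loop(feedback, param=0.0, max_iter=20):
    """Automated inner steps, with a pluggable feedback hook between them."""
    history = []
    for _ in range(max_iter):
        score = inner_step(param)
        history.append((param, score))
        nxt = feedback(param, score)   # the interaction point in the loop
        if nxt is None:
            break
        param = nxt
    return param, history

best, history = outer_loop(scripted_reviewer)
print(best, len(history))
```

Swapping `scripted_reviewer` for a function that pauses for notebook input, polls an instrument, or calls an external optimizer changes who drives the loop without changing the automated machinery, which is the kind of interactivity-plus-automation the paragraph above calls for.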

TABLE 2-1 Examples of Workflow Engines and Related Tools

Airflow                        https://airflow.apache.org
Bigtable                       https://cloud.google.com/bigtable
Chimera                        https://github.com/hysds/chimera
Cromwell                       http://cromwell.readthedocs.io/
Cyverse Discovery Environment  https://cyverse.org/discovery-environment
Fireworks                      https://materialsproject.github.io/fireworks
Hadoop                         https://hadoop.apache.org
Galaxy                         https://galaxyproject.org
iRODS                          https://irods.org
Jupyter                        https://jupyter.org
Kepler                         https://kepler-project.org
Nextflow                       https://www.nextflow.io
Open Science Framework         https://osf.io
Luigi                          https://luigi.readthedocs.io/en/stable/workflows.html
Parsl                          http://parsl-project.org
Pegasus                        https://pegasus.isi.edu
Snakemake                      https://snakemake.readthedocs.io/en/stable
Spark                          https://spark.apache.org
Starfish Storage               https://starfishstorage.com
Wolfram                        https://www.wolframcloud.com

NOTE: Many of these tools are tracked by workflow community initiatives such as WorkflowHub (https://workflowhub.eu) and WorkflowsRI (https://workflowsri.org).

Data Resources

In 2014, academic and private researchers interested in overcoming obstacles to data discovery and reuse met in the Netherlands. From that meeting grew a set of principles calling for all research objects to be findable, accessible, interoperable, and reusable (FAIR). GO FAIR (https://www.go-fair.org, accessed August 21, 2020) formed as a stakeholder initiative to implement these principles and is funded by the Ministries of Science in France, Germany, and the Netherlands. The European Open Science Cloud (EOSC), based on the FAIR principles, has as one of its priority actions to “develop and sustain core data assets for the EOSC and make them available to the community under well-defined conditions. These assets may include workflows, analytics, programmes and notably existing data sets with FAIR status” (Wilkinson et al., 2016).

While the FAIR principles call for all research objects to be FAIR, they are somewhat specific to data, and there have been subsequent efforts to address other types of objects, such as research software and workflows, as well as the start of an effort focused on ML models (Goble et al., 2020; RDA, 2021a,b). A key aspect of FAIR data that enables ARWs is machine readability. Making more FAIR data available allows ARWs to find the data relevant to the research task in question and incorporate those data into the analysis. Wider reuse in turn encourages researchers to make more FAIR data available, creating a virtuous cycle.

Workflows can adhere to and advance FAIR data principles “by processing data according to established metadata, creating metadata themselves during the processing of data, and by tracking and recording data provenance” (Goble et al., 2020). Properly designed ARWs support FAIR data principles because they can capture the associated metadata and provenance necessary to describe their data products in a formalized and completely traceable way. They can provide more accurate curation of the data to support both data reuse and data review (to support assessment of the reproducibility or robustness of conclusions), significantly reducing the time and hence cost of making data FAIR. Workflows can also be research products in their own right, encapsulating methodological know-how that needs to be published, accessed and cited, exchanged and combined with others, and reused as well as adapted.

Data resources and needs related to specific domains are discussed in Chapter 3. Progress is being made in the number and diversity of domain-specific and general data repositories that support FAIR principles and provide archival functionality for long-term access to data and related research objects. Examples can be found in the Registry of Research Data Repositories (see https://www.re3data.org).
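A minimal sketch of the kind of automatic provenance capture described above: each step records a content hash of its inputs, a timestamp, and its output as machine-readable JSON. The field names here are illustrative, not a formal provenance vocabulary, and the step itself is a toy stand-in.

```python
import hashlib
import json
from datetime import datetime, timezone

def run_step_with_provenance(name, func, inputs):
    """Run one workflow step and emit a machine-readable provenance record:
    which step ran, on which inputs (identified by content hash), when, and
    what it produced. Field names are illustrative only."""
    blob = json.dumps(inputs, sort_keys=True).encode()   # canonical input bytes
    result = func(inputs)
    record = {
        "step": name,
        "input_sha256": hashlib.sha256(blob).hexdigest(),
        "executed_at": datetime.now(timezone.utc).isoformat(),
        "output": result,
    }
    return result, record

def mean(values):
    """Toy analysis step."""
    return sum(values) / len(values)

result, record = run_step_with_provenance("mean", mean, [2.0, 4.0, 6.0])
print(result)
print(json.dumps(record, indent=2))
```

Because the record is generated as a side effect of execution rather than written up afterward, the workflow itself produces the metadata that makes its data products findable and traceable, which is the cost reduction the paragraph above points to.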

Progress in Domain-Relevant Artificial Intelligence and Machine Learning

Another key factor in building ARWs is the continued advance of learning algorithms for specific domains. New and better algorithms, defined as “encoded procedures for solving a problem by transforming input data into a desired output, based on specified calculations and procedures,” have been fundamental to the progress of ML across many domains within and outside of research (Gillespie et al., 2014; NAS, 2018). A report from a 2018 DOE workshop on basic research needs for advancing ML in science suggests a focus on “creating domain-aware, interpretable, and robust ML formulations, methods, and algorithms” (DOE, 2019). It identifies incorporating domain knowledge into ML as a key challenge:

    In many of the most successful ML examples, such as image recognition, system developers know the “ground truth” sufficiently well to check the results, often even while training the models. Almost by definition, the most interesting scientific applications of [scientific ML] are those, such as materials discovery or high-energy physics, where the answers are unknown beforehand or the results of an automated system are not easily verified.

Understanding and managing the interplay between models derived from domain knowledge, ML, and how the system iteratively drives experimental design constitute a continuing task for ARW development across domains.

IMPLEMENTING AUTOMATED RESEARCH WORKFLOWS: A CHANGING SCIENTIFIC PARADIGM

Over the past two decades, scientific workflow systems have matured as powerful tools, especially for “resource allocation, task scheduling, performance optimization, and static coordination of tasks on a potentially heterogeneous set of resources” (Altintas et al., 2019). As a platform for these software capabilities, existing cyberinfrastructure provides important

components that can be incorporated into ARWs to translate new advances into repeatable, scalable solutions (Altintas et al., 2019). Much of the progress in developing workflow management systems took place in the mid-2000s (Taylor et al., 2007). Their use was initially limited by factors such as difficulties in incorporating human decision making into the loop, difficulties in adding new components, lack of interoperability, and the need for large commitments of time and effort to manage and maintain them. Developments now combining to encourage greater use include the maturation of these systems, AI as a component in the workflow, better interoperability of systems and components, and the promise of open science and FAIR, which raises the value of workflows broadly and supports their evolution into ARWs.

Scientific workflow engines have historically targeted applications in scalable computing where users chain together multiple steps in a complex computational process (e.g., job submission to a supercomputer, access to a database, execution of a web service) to express a dataflow. Such use of workflow engines is akin to following an existing recipe: the workflow designer, often an individual using a graphical or script-based user interface, programs a known dataflow of tasks for scale and reuse. However, solving the biggest problems of our time requires two main changes from this use of workflows: (1) shifting from an individual user to teams for designing the workflow and (2) capturing the evolution of a workflow and its data-driven explorations so that they can later be scaled in various forms. Reproducibility of the process under this new paradigm is an additional consideration.
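The recipe-like dataflow use described above can be sketched in a few lines: each task names the tasks it consumes from, and a small engine runs everything in dependency order. This is a toy sketch, not the API of any real workflow engine; production systems add scheduling, distributed execution, and provenance tracking on top of this basic pattern.

```python
def run_workflow(tasks):
    """tasks: {name: (func, [dependency names])}. Returns {name: result}.

    Runs tasks in dependency order; each func receives the results of
    its dependencies as positional arguments. A toy dataflow engine.
    """
    results = {}

    def run(name):
        if name in results:          # already computed
            return results[name]
        func, deps = tasks[name]
        results[name] = func(*(run(d) for d in deps))
        return results[name]

    for name in tasks:
        run(name)
    return results

# Example dataflow: fetch -> clean -> analyze (task names are illustrative).
workflow = {
    "fetch":   (lambda: [3, 1, None, 2], []),
    "clean":   (lambda xs: [x for x in xs if x is not None], ["fetch"]),
    "analyze": (lambda xs: sum(xs) / len(xs), ["clean"]),
}
```

Because the dataflow is declared as data rather than hard-coded as a script, the same description can be rerun, shared with a team, or handed to a scheduler for execution at scale.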

From Workflow to Teamflow: Emergence of Teams as the User

Solving big, complex problems requires end-to-end workflows for integrated management of many technical steps, in addition to extensive knowledge of the application domain. These integrated steps require multidisciplinary team members to collaborate on (1) methods to manage, integrate, and interpret “big” data; (2) modeling and simulation tools executing on scalable computing platforms; (3) methods and interfaces for domain-specific analysis, communication, and visualization of results; and (4) technologies to make the process FAIR, as well as portable, transparent, repeatable, and reproducible (GO FAIR). Such multidisciplinary collaboration shifts the paradigm from individual to team science, that is, “research conducted by more than one individual in an interdependent fashion, including research conducted by small teams and larger groups” (NASEM, 2015). Team science requires tools for managing, capturing, and advancing team collaboration, contribution, and communication as an open process, in addition to the discovery process and its reproducibility.

FIGURE 2-2 Team science workflow process.

As an example, the scenario in Figure 2-2 illustrates the scientific collaboration process among team members with complementary scientific and technical expertise in areas ranging from scientific domain expertise to data engineering, data analysis, and computational science. Research ideation and design are typically initiated by a domain scientist, triggering modeling and simulation. The experimentation toward scalable modeling and simulation is often complemented by the work of a data engineer, who ensures that data from simulations and experiments are “acquired, modeled, and queried effectively for analysis and computational modeling. The data scientist generates insights from the data so that the computational model can be parameterized effectively” (Altintas et al., 2019). For example, the data and computational scientists might collaborate on parameter estimation, ML, or data assimilation methods so that the computational model benefits from the data analysis. The team members develop a process through exploratory activities and iterative communication, and once the exploratory activity results in a mature workflow, they scale the process to execute in an automated fashion on advanced computing environments.

These roles may overlap across individuals, and there may be other work functions to consider. In addition, these roles are evolving, and disciplines differ in how they are implemented. For example, the role of data curator is not currently established in all disciplines. There may be hybrid roles such as “workflow system administrator” that could span the tasks of a data engineer and software developer. The broader point is that multiple people are involved in the scientific discovery process across the workflow. To be sure, transitioning from individual to team-oriented research may in itself be a central obstacle to building the human collaboration needed to implement ARWs.
For example,

each of the research contributors shown in Figure 2-2 may come to the collaboration with a set of tools with which they are familiar or that work with the equipment at their disposal. Most workflow systems require that the collaboration adopt a specific set of tools and a specific methodology for its research. That is, the workflow engines or other enabling tools may embody ways of conducting the work that need to be aligned with the human participants. Successful groups will need to agree on a common approach, such as building up an environment around the tools they already use.

From Exploratory Activity to Scale

Once a research team agrees on its research methods through exploration, there is often a need to scale up execution with more data or for larger parameter sets, requiring automation and control. A big challenge in building ARWs is to sustain the linkage between the exploratory activities and the automated scalable process. It is inherently more difficult to fund development and maintenance of production-quality software (workflow engines, automated tools, etc.) that can be used broadly than to develop new software as part of a research project that may not be used outside that project. Often, following the exploratory activities, the process is reimplemented for scale as an automated process instead of reusing the software developed during the exploration phase. Because the exploratory and scalable components are separate, iterations between exploratory and scalable automated activities become difficult to manage (Reiter et al., 2021). ARWs can strengthen the link between exploration and scale in three ways: (1) capture key information (e.g., performance, accuracy of individual steps) during the exploration to enable seamless scalability of the final process; (2) enable auto-scalable converged applications through

communications with data and computing middleware; and (3) optimize resources and dynamically adapt to changes in the underlying cyberinfrastructure (Altintas et al., 2019). Automating data collection in order to analyze and use the data is key to building effective systems that bridge exploratory and scalable activities, make workflows more useful and aligned with the way teams of researchers collaborate, and support integrated applications.

Reproducibility of the Process and Team Science

Scientific workflow engines potentially provide “a programming model for deployment of computational and data science applications on all scales of computing and provide a platform for system integration of data, modeling tools, and computing while making the applications reusable and reproducible” (Altintas, 2018). Many research workflow systems today provide capabilities for provenance tracking, repeatability, and partial reproducibility support. However, the shift from individual workflow development to team science also creates the need for workflow systems to capture the process for validation, seamless integration, and repeatability of the team’s activity. Figure 2-3 illustrates in lighter blue the system hierarchy supporting the discovery loop, by which the research team interacts with the scientific workflow engine and other software tools to run ML or AI algorithms or methods in a computing infrastructure, using data to learn about the model and then to design new experiments based on what is learned. Increasingly, these research teams are distributed across disciplines, organizations, and geography. The dark blue lower box lists the best and responsible practices in research that should apply across all levels, as data and code are used throughout.
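The first of the linkage mechanisms noted under “From Exploratory Activity to Scale,” capturing key information such as per-step performance during exploration, can be sketched as a simple instrumentation wrapper. The decorator and log format here are hypothetical stand-ins for the much richer capture (accuracy, resource use, provenance) that workflow systems provide.

```python
import time

def instrument(step_func):
    """Wrap an exploratory step so each run logs its wall-clock time.

    A toy stand-in for the richer per-step capture that workflow
    systems record during exploration; names here are illustrative.
    """
    log = []

    def wrapped(*args, **kwargs):
        start = time.perf_counter()
        result = step_func(*args, **kwargs)
        log.append({"step": step_func.__name__,
                    "seconds": time.perf_counter() - start})
        return result

    wrapped.log = log  # timing records accumulate across runs
    return wrapped

@instrument
def simulate(n):
    # Placeholder for an exploratory computation.
    return sum(i * i for i in range(n))
```

Records gathered this way during exploration give the team evidence for which steps dominate cost, informing how the same process is later scheduled at scale.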

FIGURE 2-3 ARW components and context.
NOTE: The research team interacts with the workflow tools to run ML or AI algorithms and methods in a computational infrastructure to learn about the model using data and then to design new experiments based on what is learned. Responsible and best practices in scientific discovery apply across all components, as data and code are used throughout. All these components should be funded and sustained for the automated closed loop to work and advance.

POLICY AND INDUSTRY CONTEXT FOR AUTOMATED RESEARCH WORKFLOWS

Public Policy Readiness

Policy makers and funding agencies in the United States and Europe have articulated a research vision at a scale and complexity that implies robust support for the development and

sustainability of ARWs. That is, while not explicitly singling out “support for ARWs,” they point to the societal and economic benefits that AI and ML can bring about. In most cases, achieving these benefits requires making use of ARWs as described in this report, and funding has begun to address this reality.

At the committee’s March 2020 public workshop, Kelvin Droegemeier, then director of the White House OSTP, stressed the importance of AI as one of four “Industries of the Future,” together with advanced manufacturing, quantum information science, and 5G network capability. The 2021 federal budget included $868 million (a 76 percent increase) for the National Science Foundation’s AI-related grants and interdisciplinary research initiatives; $125 million for AI research in DOE’s Office of Science; $50 million for NIH research on chronic disease using AI and related approaches; $459 million for AI R&D at the Defense Advanced Research Projects Agency (DARPA); and $290 million for DOD’s Joint AI Center (CRS, 2020).

Experts from several agencies provided context to the committee about how they envision a role for workflows. For example, at DOE, the Artificial Intelligence & Technology Office has been established to “transform DOE into a world-leading AI enterprise by accelerating the research, development, and adoption of AI,” with ARWs as a core part of its efforts (Kusnezov, 2020). Within DOD, DARPA funds projects that rely on or support the development of workflows, including those related to automating molecular discovery and an AI exploration program (Russell, 2020). In addition, the National Science Foundation’s Big Ideas program encompasses 10 areas in which it is investing in pioneering research and pilot activities. While workflows are fundamental to many of these 10 areas, two are particularly salient here.

● Growing convergence research focuses on grand challenges that require multidisciplinary approaches: “From its inception, the convergence paradigm intentionally brings together intellectually-diverse researchers to develop effective ways of communicating across disciplines by adopting common frameworks, and a new scientific language.”

● Harnessing the data revolution involves support for a “cohesive, federated, national-scale approach to research data infrastructure, and the development of a 21st-century data-capable workforce.”

The priority here and across the government is on identifying large, complex problems and then finding the workflows and other processes to solve them, rather than first developing technologies and then finding an application.

On January 1, 2021, Congress passed the National Artificial Intelligence Initiative Act (NAIIA) as part of the National Defense Authorization Act. 4 The act’s language reflects many of the issues raised by committee members and presenters. It sets out a major role for the federal government in incentivizing and supporting AI, including access to data sets, computing resources, and real-world test environments; improved standards and benchmarking for AI systems; and removal of barriers to interdisciplinary collaboration. It also recognizes the need to educate an AI-savvy scientific workforce. In June 2021, OSTP announced the formation of the National Artificial Intelligence Research Resource (NAIRR) Task Force as part of implementing the NAIIA. The Task Force will provide recommendations for establishing and sustaining the NAIRR, including technical capabilities, governance, administration, and assessment, as well as

4 15 U.S.C. Chapter 119, https://uscode.house.gov/view.xhtml?path=/prelim@title15/chapter119&edition=prelim.

requirements for security, privacy, civil rights, and civil liberties. The Task Force will submit two reports to Congress that together will present a comprehensive strategy and implementation plan: an interim report in May 2022 and a final report in November 2022 (OSTP, 2021).

The European Union has prioritized open research data in its research and innovation policy making, driven by the need for reproducible science, appreciation of data as a strategic asset (akin to how a country might value its oil reserves), and a desire to avoid overdependence on companies such as Google and Facebook (Burgelman, 2020). In 2019, the EU’s Directorate-General for Research and Innovation conducted a cost-benefit analysis estimating that not having FAIR data costs Europe about $10 billion per year (Directorate-General for Research and Innovation, 2019). This has led to development of the EOSC as a shared infrastructure providing access to data repositories and resources such as cloud services, high-performance computing, and data analysis tools (EOSC, 2020).

To back up policy priorities, the recently concluded EU research and innovation programme, Horizon 2020, provided €80 billion in funding between 2014 and 2020, including funding for emerging technologies, e-infrastructure, and advanced computing. It increased its investments in AI by 70 percent from 2018 to 2020, to about €1.5 billion, aiming to increase total investment in AI (public and private combined) to €20 billion per year by the end of 2020 (EC, 2021a). In addition, the EU’s Artificial Intelligence and Blockchain Investment Fund is set up to make €100 million available to companies, with the idea that these funds will leverage additional private- and public-sector support. Moving forward, the new EU Framework programme, Horizon Europe, will provide €95.5 billion in research funding for 2021–2027.
As part of this, the European Commission announced a new €7.5 billion Digital Europe Programme 5 with focus

5 See https://ec.europa.eu/digital-single-market/en/europe-investing-digital-digital-europe-programme.

areas in supercomputing, AI, cybersecurity, advanced digital skills, and wider use of digital technologies.

The United Kingdom Industrial Strategy aims for the UK to become the “world’s most innovative economy,” with R&D investment to reach 2.4 percent of gross domestic product by 2027. Realizing this vision includes developing the infrastructure to support it—ranging from physical facilities such as research ships and satellites, to archives and repositories, to cyberinfrastructure. In 2020, United Kingdom Research and Innovation (UKRI), the major public funder of research, published a review of the country’s research and innovation infrastructure to guide funding decisions and other priorities until 2030 (UKRI, 2020). In addition to computational and e-infrastructure, one of the review’s six themes, the potential contributions of intelligent workflows suffuse the other themes (biological sciences, health, and food; physical sciences and engineering; social sciences, arts, and the humanities; environment; and energy). The review calls for a “sustained multi-year investment” in supercomputing; data infrastructure; cloud computing; network and cybersecurity; authentication, authorization, and accounting infrastructure; and software and skills. In January 2020, the government allocated £300 million to UKRI to fund research infrastructure. Many UK research institutes and infrastructures are also playing key roles and providing pivotal input into the EOSC, and they led the initial computational development work via the Science and Technology Facilities Council, part of UKRI (UKRI, 2017). The UK is also a world leader in open research data, for example, with the UK Data Archive, which has preserved social science and humanities data for almost 50 years, and the delivery of a Concordat on Open Research Data (UKRI, 2016).

These public affirmations of support and new banner initiatives related to AI, infrastructure, data sharing, and related capabilities in the United States, Europe, and the UK do not, in and of themselves, guarantee adequate funding. However, such efforts provide a positive policy climate in which advancements may flourish. Note that China is also making investments in cyberinfrastructure. For example, the CSTCloud (China Science and Technology Cloud) is emerging as a national infrastructure of data and computation for accelerating scientific discovery.

Industrial Use of Workflows

This discussion of industrial use of workflows focuses primarily on research applications. Industrial development and use of computational workflows extends beyond—and predates—the use of workflows in scientific research. Computational workflows in industry have automated many business processes, such as “generating an email response when a customer fills out a request form, transaction processing or communicating with multiple databases while processing an insurance claim” (IBM, 2021). There are barriers to translating the practices and tools developed for business-process automation to research applications. Business workflows have been built to perform a well-characterized series of tasks accurately and repetitively, while research applications may need to accommodate the collection and analysis of various types of experimental data (Barga and Gannon, 2007). Still, as ARWs become more widely used, it should be expected that more characteristics of industrial workflows—security and integrity in particular—will be valued and adapted at a larger scale into ARWs.

Several participants at the March 2020 public workshop commented that industry has been more open than academia to the use of ARWs in research, perhaps because workflows are

common on the business side of their operations or because of bottom-line imperatives to maximize efficiencies. Companies are involved with workflows in two primary ways: using them in their own research (e.g., pharmaceutical companies doing high-throughput screening) and developing and marketing workflow tools, services, and training. These might be offered as part of a suite of products (e.g., by IBM or Amazon Web Services), while a few companies have made scientific process development their primary focus. Several companies have also developed products to automate and accelerate the publishing process, as highlighted in Chapter 4. Examples of commercial workflow tools with academic uptake include KNIME, RapidMiner, and Rabix.

Riffyn was launched in 2014 with the mission to integrate process data from scientific experiments for ML. According to CEO Timothy Gardner, Riffyn was intentionally set up as a for-profit company to capitalize on “resources, sustainability, and usability,” which he said could best be achieved through a for-profit structure. One example related by Gardner is a biotech firm that used Riffyn’s intelligent workflows and other processes to accelerate its development of advanced yeast strains. The company brought four strains to market in 18 months, which had both discovery-enhancing and financial benefits. Although Riffyn provides its product at no cost to academic researchers, Gardner noted that uptake is low, which he attributed to the educational and cultural barriers discussed in Chapter 5.

Industry has some priorities that are distinct from those of governments or academia. For example, industry often uses open academic research data but opens its own research data much less often. Companies may also use proprietary workflow tools that store and manage data in nonstandard proprietary formats. Since there is little incentive for toolmakers to agree on standards among themselves, researchers may be unable to access or utilize data even when the data are technically open.
Several presenters at the workshop have worked in academia and the private

sector and provided ideas on how to strengthen links across different organizational cultures and constraints.


The needs and demands placed on science to address a range of urgent problems are growing. The world is faced with complex, interrelated challenges in which the way forward lies hidden or dispersed across disciplines and organizations. For centuries, scientific research has progressed through iteration of a workflow built on experimentation or observation and analysis of the resulting data. While computers and automation technologies have played a central role in research workflows for decades to acquire, process, and analyze data, these same computing and automation technologies can now also control the acquisition of data, for example, through the design of new experiments or decision making about new observations.

The term automated research workflow (ARW) describes scientific research processes that are emerging across a variety of disciplines and fields. ARWs integrate computation, laboratory automation, and tools from artificial intelligence in the performance of tasks that make up the research process, such as designing experiments, observations, and simulations; collecting and analyzing data; and learning from the results to inform further experiments, observations, and simulations. The common goal of researchers implementing ARWs is to accelerate scientific knowledge generation, potentially by orders of magnitude, while achieving greater control and reproducibility in the scientific process.

Automated Research Workflows for Accelerated Discovery: Closing the Knowledge Discovery Loop examines current efforts to develop advanced and automated workflows to accelerate research progress, including wider use of artificial intelligence. This report identifies research needs and priorities in the use of advanced and automated workflows for scientific research. Automated Research Workflows for Accelerated Discovery is intended to create awareness, momentum, and synergies to realize the potential of ARWs in scholarly discovery.
