National Academies Press: OpenBook

Automated Research Workflows For Accelerated Discovery: Closing the Knowledge Discovery Loop (2022)

Chapter 5: Overcoming Barriers to Wider Use of Automated Research Workflows

Suggested Citation:"5 Overcoming Barriers to Wider Use of Automated Research Workflows." National Academies of Sciences, Engineering, and Medicine. 2022. Automated Research Workflows For Accelerated Discovery: Closing the Knowledge Discovery Loop. Washington, DC: The National Academies Press. doi: 10.17226/26532.

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

5 Overcoming Barriers to Wider Use of Automated Research Workflows

With all the benefits that automated research workflows (ARWs) can provide, whether in discipline contexts as summarized in Chapter 3 or across disciplines as summarized in Chapter 4, the question remains: What inhibits their greater use? Beyond technical challenges, discussion at the March 2020 workshop and other evidence indicates that the same conditions that slow or prevent change in other aspects of the research enterprise are at play here as well. These conditions include the tendency to maintain academic silos and the focus of research funders on investigator-led projects rather than underlying infrastructure. A more distinctive barrier is an overlay of concern about machines taking over from humans, whether expressed as a worry about machines supplanting humans in the quest for discovery or as the more prosaic, yet understandable, worry about loss of jobs, changes in workplace practices, and obsolescence. As made clear throughout this report, and reiterated below, ARWs are a tool; people remain central to the process of discovery, no matter the level of automation or the terabytes of data involved. The committee identified five main challenges to wider use of ARWs and offers ideas to address them, related to the incentive system, the current research culture, education and training needs, sustainability, and privacy and ethical concerns.

PREPUBLICATION COPY—Uncorrected Proofs

REIMAGINING INCENTIVES

There has been extensive discussion in recent years about the perverse or misaligned incentives for researchers that result from hypercompetition and the inappropriate use of bibliometric measures in evaluation (Teitelbaum, 2008; Casadevall and Fang, 2012; Stephan, 2012; DORA, 2013; Alberts et al., 2014; NASEM, 2017). Hypercompetition has emerged from slower growth in research funding alongside continued high production of Ph.D. degrees in science and engineering, leading to a scarcity of tenure-track academic research positions relative to the number of qualified candidates. Some experts believe that the use of bibliometric measures such as the Journal Impact Factor in evaluation leads students and early career researchers to focus their efforts on publishing articles in the most prestigious journals and to choose currently fashionable topics where articles are likely to be highly cited, rather than to engage in riskier fundamental studies (Lawrence, 2007; Alberts, 2013; Ioannidis et al., 2014). Because publishing in the most prestigious journals figures so heavily in reward structures, the current research ecosystem tends to reward researchers for independence and innovation over practices that support rigor and reliability.

Publishers themselves have made significant efforts to encourage data sharing over the past decade and have established community norms for rigor and transparency in data generation (see discussion in Chapter 4). However, these expectations are not yet applied to other components of the research life cycle. As a result, independent and artisanal experimental procedures are typical in research and often lack the oversight needed to evaluate the validity of the experimental methodologies and implementations used.

The integration of ARWs into the research life cycle provides an opportunity to address these issues. Doing so, however, will require a shift in incentives for scientific productivity. Research culture and values are set by funders, academic institutions, publishers, regulators, and tooling platforms, and are often shaped by national policies. These entities will play a major role in changing researcher incentives in ways that foster more rapid and effective adoption of ARWs and ensure that community standards for quality assurance, transparency, and reproducibility are upheld.

Incentive structures were cited as barriers to progress across most of the use cases covered in Chapter 3. For example, many fields do not have a strong tradition of sharing data, and the task of curating data for others to use is not generally rewarded and can therefore be hard to justify for time-strapped researchers. This hinders progress in building the large, artificial intelligence (AI)-ready data resources needed for ARWs in most fields. Senior researchers may doubt that AI provides meaningful insight into underlying mechanisms or the reasons behind observed associations, and may discourage junior researchers from pursuing it. AI oriented toward causal reasoning may address this concern.

Examples of the sorts of shifts in incentives that would create a positive environment for implementing ARWs include:

● Weighing reliability and innovation equally when evaluating research productivity.
● Rewarding the sharing of a much wider range of outputs beyond standard narrative research articles (e.g., data notes, software tools, and methods articles), as well as valuing negative or null findings to ensure a balanced representation of research understanding and the generation of knowledge from those findings.

● Valuing transparency and reproducibility through all phases of the research life cycle.
● Rewarding the use of validated and documented experimental processes.
● Encouraging researchers to incorporate process design and systems analysis into their experimental workflows to ensure that all components of the research life cycle are performed and disseminated in a transparent and reproducible manner.
● Rewarding researchers for the measurable reuse of workflow system-enabled results in applications that employ algorithms and workflows.
● Incentivizing collaboration and team science.

OVERCOMING BARRIERS IN THE RESEARCH CULTURE

Cultural changes in the research enterprise are necessary for effective adoption and use of ARWs. It will be important to develop these processes in a way that promotes ARWs as tools that can support both reliability and innovation in discovery, rather than falling into the trope of "machines replacing humans." This inaccurate framing has been seen extensively with the advent of AI in medicine, including articles in the popular press asking whether "AI will replace doctors," and it has hampered progress.

Current research culture centers on the scientific laboratory and the principal investigator as an independent artisan responsible for inventing innovative solutions to all aspects of a problem. In practice, this approach introduces unnecessary uncertainty into experimental steps, particularly for processes that can easily be standardized and automated. ARWs rely on stitching together individual components, and each of these must, by necessity, be as performant as possible.

The explosion of fully annotated research data that will accompany the emergence of ARWs will facilitate collaborations between researchers and across laboratories, teams, or departments. One can easily imagine a future network of connected ARWs that use distributed AI and consensus learning to mutually coordinate massive experiments that would previously have been inconceivable or impractical. Training researchers to view themselves as the master regulators of the entire system should ensure that they innovate where innovation is necessary and useful, rather than devising artisanal solutions to solved problems that should be standardized. A combination of shifts in top-down expectations (policy, funding, and training) and bottom-up expectations (building standards into tools, directories for finding things, peer review expectations, and society-level shifts in the focus of innovation) should facilitate this culture shift. As noted above in the discussion of incentives, research funders can support this shift by emphasizing (and rewarding) transparency, reliability, and teamwork in their granting practices.

One central task for researchers as managers of ARWs will be to understand the interplay between reductionist science, with its immensely successful parameter-sparse models, and deep learning methods, whose success relies on overparameterization. What we need is the best of both worlds: machine learning (ML)-informed science. As discussed in Chapter 2, this requires continued methodological advances. We need to find good ways of imposing scientific structure on deep learning models or, conversely, of informing scientific models through ML.

Finally, as noted several times throughout this report, it will remain important to leave space for serendipity. Automated workflows, community standards, and collaborative approaches are tools designed to support researchers in reliable scientific innovation. They should be used where appropriate but should not replace active and individual human oversight where it is needed to detect the unexpected.
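The notion of ML-informed science can be made concrete with a minimal sketch. All names here, the linear "physics" law, and the quadratic residual term are illustrative assumptions, not methods from this report: scientific structure is imposed first through a parameter-sparse mechanistic model, and a data-driven component then learns only the residual that the structured model cannot explain.

```python
# Sketch: "ML-informed science" as a hybrid of a parameter-sparse
# mechanistic model and a data-driven residual correction.
# The "physics" law y = k*x and the quadratic residual term are toy
# stand-ins chosen only for illustration.

def fit_mechanistic(xs, ys):
    """Closed-form least squares for the one physical parameter k
    in the structural model y = k * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def fit_residual(xs, residuals):
    """Learn what the mechanistic model missed, here as a single
    data-driven coefficient c in r(x) = c * x**2 (least squares)."""
    return sum(x * x * r for x, r in zip(xs, residuals)) / sum(x ** 4 for x in xs)

def fit_hybrid(xs, ys):
    """Impose scientific structure first, then let the learned part
    explain only the residual."""
    k = fit_mechanistic(xs, ys)
    c = fit_residual(xs, [y - k * x for x, y in zip(xs, ys)])
    return lambda x: k * x + c * x * x

def sse(model, xs, ys):
    """Sum of squared errors of a model on the data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys))
```

In this toy setting the hybrid fit matches the data at least as well as the mechanistic model alone while retaining the interpretable physical parameter k, which is the tradeoff the hybrid approach aims for.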

CLOSING EDUCATION GAPS

Future generations of ARWs will have a significant impact on research practice and on scientific experimentation, analysis, and interpretation. To derive maximum benefit from these systems, however, future researchers will need additional training outside their discipline. This will require integrating domain science training with data science training and relevant software engineering in academic programs across all disciplines, at both the undergraduate and graduate levels. In addition, research teams will need specialized expertise from research software engineers, computational scientists, and data stewards.

The use cases discussed in Chapter 3 revealed education and training needs and challenges common across multiple domains. A need for more researchers who combine domain knowledge with data science or software development expertise was expressed in nearly all the use-case discussions. In several of the use cases, including materials research and chemical synthesis of pharmaceutical compounds, presenters described a growing need for researchers who are knowledgeable in working with automated laboratory equipment. For example, Rebecca Nugent described Carnegie Mellon's recently established graduate degree program in automated science, and Carole Goble noted the long-term return on investment in establishing a "career track for research software engineers." More broadly, knowledge of mathematical and computational methods for automatically designing experiments or controlled observations, and for learning from data, is becoming essential for researchers. Several universities (e.g., Massachusetts Institute of Technology and California Institute of Technology) are beginning to address this need by establishing cross-disciplinary graduate programs or cross-links between domain-specific programs and programs in the computational and mathematical sciences. By contrast, there will be less need for researchers to perform mechanical activities that have traditionally been part of standard training, such as pipetting in biology, and more need to understand the overall goals of a research program and to design a complex approach to reaching those goals using more automated tools and technologies.

The use cases discussed in Chapter 3 also illustrate the need for additional specialized expertise in data stewardship, software development, and other computational tasks. For example, research software engineers are key players in the development of ARWs and other research workflows, with their own career paths. Organizations such as the United States Research Software Engineer Association,1 the Society of Research Software Engineering,2 and the Campus Research Computing Consortium3 are working to build community among research computing and data professionals. A report from the Organisation for Economic Co-operation and Development emphasizes the critical importance of data-intensive science and the need to strengthen the digital capacity and skills of the scientific enterprise (OECD, 2020).

Students in different disciplines may require different types of supplementary data science training, at differing depths. However, all students are likely to benefit from increased exposure to workflow-related data science areas including, as described at the committee's workshop: statistical modeling of high-dimensional data (Nugent), scientific ML ("AI 3.0," Vidal), database management (Wietzner, 2020), sensor management (Beckman, 2020), FAIR workflows (Goble, 2020), and the application of AI for hypothesis generation and automation (Stodden, 2020). Advanced statistics training will enable researchers to cogently interpret the strength of evidence from the complex high-throughput experiments that are run by ARW systems. Training in managing and analyzing data, including how to model data and how to select the right tools to support particular models, will be required for managing and accessing the massive and diverse data repositories that workflow-aided science is expected to produce. Training in AI will support researchers in evaluating and selecting the appropriate black-box ML models used by automated workflows for extracting knowledge and for designing new experiments and studies.

Many disciplines have begun introducing data science as a core component of academic training, for example, experimental chemistry, astronomy, and biology (Cernak, Glotzer). Training in ARWs will be a natural extension of this process in these disciplines. The transformation of science education will foster new collaborations with faculty in computer science, statistics, and mathematics and will result in new courses that integrate data science and experimental science (Stodden, 2020). Faculty will need to grapple with difficult questions about which existing domain science topics to drop to make time for workflow topics. Yet doing so will equip the next generation of students to assess the tradeoffs between physical models and AI prototypes in designing scientific workflows, and to weigh the sacrifice in model interpretability against the convenience of black-box automation of experiments (Vidal). Although it is not necessary for all discipline experts to acquire expert proficiency in data science or coding, they should have enough background to critically assess the "black box" aspects of many workflow tools so they can understand, and adjust for, any biases inherent in the system.

1 See https://us-rse.org/.
2 See https://society-rse.org/.
3 See https://carcc.org/.
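As one concrete example of critically probing a black-box tool, the sketch below implements permutation importance in plain Python. The "black-box" model and the toy data are invented for illustration: shuffling one input at a time and measuring how much the model's error grows reveals which inputs the model actually relies on, without opening the box.

```python
import random

def permutation_importance(model, X, y, n_features, seed=0):
    """Estimate each feature's contribution to a black-box model's
    accuracy: shuffle that feature's column and measure the increase
    in mean squared error over the unshuffled baseline."""
    rng = random.Random(seed)  # fixed seed for reproducibility

    def mse(preds):
        return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

    base = mse([model(row) for row in X])
    importances = []
    for j in range(n_features):
        shuffled = [row[j] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:j] + [s] + row[j + 1:] for row, s in zip(X, shuffled)]
        importances.append(mse([model(row) for row in X_perm]) - base)
    return importances

# Hypothetical "black box" that in truth uses only feature 0:
model = lambda row: 3.0 * row[0]
X = [[float(i), float(i % 7)] for i in range(40)]
y = [3.0 * row[0] for row in X]
imps = permutation_importance(model, X, y, n_features=2)
```

A large importance for feature 0 and a near-zero importance for feature 1 would correctly expose that this particular model ignores its second input, which is exactly the kind of sanity check a discipline expert can run on an opaque workflow component.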
One of the March 2020 workshop speakers suggested a core curriculum, leading to a certification, for research and development skills relevant to ARWs that would include process design, measurement systems analysis, process qualification, design of experiments, and quality by design (Gardner, 2020). Such an approach would provide grounding in both the principles underlying ARWs and implementation issues. The curriculum could combine self-learning and classroom learning and include variants for different educational levels.

Other perspectives that might be incorporated into new ARW-friendly curricula include the potential of ARWs in translational research and convergence research. The concept of translational research, meaning the conversion of basic knowledge into products or processes that meet critical real-world needs, emerged several decades ago in the biomedical domain. Computational scientists have begun to conceive of priorities within their own discipline, including workflow management systems, as examples of translational research (Deelman et al., 2020). According to the National Science Foundation (NSF), convergence research is characterized by integrated work across disciplines directed at a specific, compelling problem (NSF, 2020a).

However, as noted above, changes in institutional culture and incentives will be needed to accelerate and sustain such transformations in education. At many research universities, existing tenure and promotion criteria do not favor faculty collaborations on programmatic or curricular transformations. Department chairs are likely to be concerned about letting their faculty divert effort away from departmental teaching commitments. Universities may lack the resources to support the major investments needed to revise domain science curricula to include advanced workflow training. These hurdles can be overcome with extramural financial support from foundations, federal agencies, and industry (Kusnezov, 2020). National Institutes of Health (NIH)-like training grant programs would enable faculty and student investment in workflows for research (Cernak, 2020). Partnerships between academic institutions and

industry will be especially fruitful, since industry has led the use of workflow technology in scientific research and development (Fox, 2020).

ENSURING SUSTAINABILITY

A common concern voiced at the workshop was how to fund ARW development and maintenance to support research within and across domains. Realizing the promise of ARWs requires investment in hardware, software, and human resources. Although there seems to be general agreement that such investment is important, the current level and structure of resource allocation fall short of these well-stated intentions.

Investment Priorities to Advance ARWs

For several of the use cases discussed in Chapter 3, the development of tools and technologies is a key enabler of accelerating progress. For example, materials researchers examined existing research workflow management systems and ended up building their own because they needed a system that enables dynamic rerouting, facilitates constant communication among researchers, incorporates error management capability, and is flexible. Building an advanced computational environment for wildfire monitoring and behavior prediction requires integrating numerous functions, such as collecting various types of data and performing multiple, complex modeling tasks. In particle physics, the development of computational tools that allowed for collaborative statistical modeling, in addition to workflow management and computational tools, was critical to confirming the existence of the Higgs boson. Laboratory automation technology is a major driver of advances in experimental domains such as chemical synthesis of pharmaceuticals and materials research.
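Two of the capabilities that motivated such build-your-own decisions, error management and dynamic rerouting, can be illustrated with a minimal sketch. The step names, retry policy, and fallback scheme below are hypothetical, not drawn from the materials systems described here.

```python
# Sketch of a workflow runner with error management (retries) and
# dynamic rerouting (fallback steps). All step names and policies
# are illustrative, not taken from any real ARW system.

class StepFailed(Exception):
    """Raised by a step to signal a recoverable failure."""

def run_step(step, inputs, retries=2):
    """Run one step, retrying up to `retries` extra times on failure."""
    last_err = None
    for _attempt in range(retries + 1):
        try:
            return step(inputs)
        except StepFailed as err:
            last_err = err
    raise last_err

def run_workflow(steps, fallbacks, inputs):
    """Run steps in order; when a step exhausts its retries, reroute
    to its registered fallback (a missing fallback aborts the run)."""
    data = inputs
    log = []
    for name, step in steps:
        try:
            data = run_step(step, data)
            log.append((name, "ok"))
        except StepFailed:
            fb_name, fb = fallbacks[name]  # KeyError -> workflow aborts
            data = run_step(fb, data)
            log.append((name, f"rerouted to {fb_name}"))
    return data, log

# Hypothetical usage: a step that always fails, with a working fallback.
def synthesize(data):
    raise StepFailed("instrument offline")  # always fails in this demo

def synthesize_alt(data):
    return data + 1

def measure(data):
    return data * 2

result, log = run_workflow(
    steps=[("synthesize", synthesize), ("measure", measure)],
    fallbacks={"synthesize": ("synthesize_alt", synthesize_alt)},
    inputs=1,
)
```

The run log makes the rerouting visible, a small-scale analogue of the constant communication and error-management features the materials researchers required.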

To realize the potential of ARWs, it is essential that software tools for aspects of workflows that transcend disciplines (for example, those involving AI and ML methods for designing experiments and learning from data) become interoperable and general purpose. This will require a level of software engineering not commonly found in academic environments. Research groups usually develop software for their own purposes but have neither the means nor the incentives to build scalable software platforms that can reach beyond an individual laboratory. Universities generally lack the software engineering infrastructure to develop and maintain the complex, interoperable, and performance-portable code bases needed to realize the full potential of ARWs. A sustainable infrastructure requires software engineers, test engineers, and release engineers, but none of these roles are typically funded in research grants. Several initiatives address software sustainability, such as the Software Sustainability Institute, WSSSPE (Working towards Sustainable Software for Science: Practice and Experiences), and related workshops on software sustainability; the general problem is that software stops working if it is not actively maintained (Hinsen, 2019).

The availability and utility of data and related infrastructure, such as repositories and actively curated services, are also critical to the implementation of ARWs in many fields. In both experimental and observational fields, including materials research and astronomy, FAIR data are needed to develop and train ML algorithms, which in turn enable the development of closed-loop systems in which the selection of experiments or instrument targeting can be automated. In some fields, such as particle physics, there is considerable experience with collecting and processing large amounts of data, but new approaches to instrument design and data are needed to allow simulation-based inference built on the reuse of data.
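The closed-loop pattern can be sketched in a few lines. The simulated "instrument" and the distance-to-nearest-measurement proxy for model uncertainty below are deliberate simplifications, not the ML methods real ARWs would use: measure, pick the least-explored candidate experiment, run it, and repeat.

```python
# Sketch of a closed-loop experiment-selection cycle: choose the most
# "uncertain" candidate -> run the (simulated) experiment -> repeat.
# The instrument function and the nearest-measurement-distance proxy
# for uncertainty are illustrative stand-ins for lab hardware and ML.

def instrument(x):
    """Simulated experiment: the 'true' response we can query."""
    return 2.0 * x + 1.0

def most_uncertain(candidates, measured_xs):
    """Pick the candidate farthest from any measured point
    (a crude proxy for where a model would be least informed)."""
    return max(candidates, key=lambda c: min(abs(c - x) for x in measured_xs))

def closed_loop(candidates, n_rounds=3):
    """Seed with the extreme candidates, then iteratively run the
    experiment the current data say the least about."""
    xs = [min(candidates), max(candidates)]
    ys = [instrument(x) for x in xs]
    chosen = []
    for _ in range(n_rounds):
        pool = [c for c in candidates if c not in xs]
        x_next = most_uncertain(pool, xs)
        xs.append(x_next)
        ys.append(instrument(x_next))
        chosen.append(x_next)
    return xs, ys, chosen
```

With candidates 0 through 10, the loop first probes the midpoint and then the largest remaining gaps, illustrating how automated selection concentrates experiments where existing data are thinnest.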
Across several of the fields examined, including the digital humanities, there is a growing need for shared community data resources, such as FAIR repositories. One workshop speaker cited digital music as an analogy: to implement ARWs, communities need to move to shared data resources in the cloud that are available for myriad uses, much as music streaming services are. Creating and sustaining community data resources involves many challenges, including funding, deciding which data sets should be stored and maintained, and facilitating interoperability between them. The lack of such resources ultimately limits the size and scope of collaborations.

Realizing FAIR Data, Software, and Workflows

Another aspect of fostering sustainability is support for community efforts to develop new tools, services, and frameworks aimed at realizing FAIR data, software, and workflows. As the use cases discussed in Chapter 3 show, data that are FAIR, well curated, discoverable, and actionable do not just appear. Significant directed investment and specific actions are needed to support the creation, sharing, and curation of such data at the volumes needed to implement ARWs and to use ML and AI as research tools at significant scale in many domains. Actions on the part of several stakeholder groups could help.

To begin, publishers can move away from accepting data supplements as adequate for fulfilling their data-sharing requirements and instead directly associate articles with data in FAIR repositories, with preference given to leading domain repositories. One initiative in the earth and space sciences is working to implement this approach, which could be adopted by other disciplines (Stall et al., 2019; COPDESS, 2021).

In addition, funders could increase support for leading domain repositories, for the creation of new repositories, and for the broader data ecosystem. Leading domain repositories provide quality FAIR curation and simplify discoverability. They also help develop leading practices around what data should and can be preserved. Quality curation and metadata will enable interoperability. Many domain repositories are poorly or inconsistently funded and are thus forced to spend significant staff time on fundraising that could instead be spent on data services. Support is also needed for related organizations that provide important infrastructure for the data ecosystem, such as Crossref, DataCite, the Research Data Alliance (RDA), the National Information Standards Organization, and FORCE11.

Research communities can work to ensure that domain and institutional repositories collaborate effectively. There are potential mutual benefits when institutions support leading repositories, as illustrated by the partnership between the California Digital Library (part of the University of California) and the Dryad Digital Repository (Waibel, 2018). Such collaboration will need to be supported.

Funders and institutions can also do more to incentivize researchers to practice FAIR for data and for other research outputs such as software (FORCE11 has developed software citation guidance, and considerable direction already exists). This can be done through the grant process and data management plans (DMPs), and through broader efforts to communicate the value and importance of FAIR as part of changing the institutional culture. One promising example is Duke University's effort to provide services and training opportunities aimed at improving researchers' skills in developing and implementing DMPs (Duke University, 2021). Funder requirements around data and DMPs can also evolve over time so that fulfilling data-sharing and management expectations is recognized and appreciated as a key outcome of the award.
Examples of language that funders can use to encourage sharing of data and other research products, such as software, methods, and samples, were discussed as part of a recent National Academies workshop and resulting proceedings on Developing a Toolkit for Fostering Open Science Practices (NASEM, 2021). Additional community thinking on these issues is contained in the responses to the White House Office of Science and Technology Policy Request for Information on Draft Desirable Characteristics of Repositories for Managing and Sharing Data Resulting from Federally Funded Research (OSTP, 2020).

Another set of tasks relates to realizing FAIR for workflows themselves. For example, there has been considerable progress in community efforts to develop standards in areas such as registries (Dockstore,4 an app store for bioinformatics, and WorkflowHub,5 a registry for describing, sharing, and publishing scientific computational workflows), services for monitoring and testing (LifeMonitor6 and OpenEBench7), standards for packaging (Workflow-RO-Crate8), and Bioschemas' schema.org definitions for workflows9 (Goble et al., 2020). Efforts to develop standards for describing workflows, such as the Common Workflow Language (CWL) and the Workflow Description Language (WDL), and more general abstractions for recovering workflow information from scripts are also important for achieving FAIR workflows (McPhillips et al., 2015; Perkel, 2019). The Global Alliance for Genomics and Health10 is creating standards for defining, sharing, and executing portable workflows. All of these services are implemented in the European EOSC-Life Workflow Collaboratory,11 for example, and similar services exist elsewhere. These are the necessary components of FAIR workflows. Another relevant standard is the IEEE 2791-2020 Standard for Bioinformatics Analyses, a

4 See https://dockstore.org/.
5 See https://workflowhub.eu/.
6 See https://crs4.github.io/life_monitor/.
7 See https://openebench.bsc.es/dashboard.
8 See https://about.workflowhub.eu/Workflow-RO-Crate/.
9 See https://bioschemas.org/.
10 See Global Alliance for Genomics and Health. 2021. GA4GH. Website. Available at: https://www.ga4gh.org/.
11 See https://www.eosc-life.eu/.

regulatory metadata framework for the U.S. Food and Drug Administration's High Throughput Sequencing workflows for precision medicine (IEEE, 2020).

Why Is It Difficult to Secure Sustained Support for These Priorities?

One strong theme of the March 2020 workshop discussion was the difficulty of securing sustained support for the sorts of efforts described above. Even in the case of successful enabling tools such as Jupyter, the funding situation is sometimes tenuous (Granger, 2020). Since World War II, the primary modes of federal research funding in the United States have included short-term competitive awards to individual investigators, organized by discipline, at agencies such as NSF and NIH; longer-term funding of large intramural and extramural projects that advance the missions of agencies such as the Department of Energy (DOE) and the National Aeronautics and Space Administration; and approaches aimed at catalyzing progress on particular issues relevant to an agency mission (e.g., the Defense Advanced Research Projects Agency [DARPA], IARPA, ARPA-E). Given this framework, providing sustained support for the development and operation of shared infrastructure that serves multiple disciplines has historically been challenging.

Nevertheless, since the launch of the Advanced Research Projects Agency Network (ARPANET) in the 1960s, there have been several examples of the U.S. federal government providing shared resources that have allowed research communities to harness information technologies to significantly advance their work. Examples include the establishment of national supercomputer centers in the 1980s by NSF in partnership with academic institutions, the development of GenBank and other digital data resources in the life sciences by NIH and the National Library of Medicine starting in the 1980s, and NSF's advanced cyberinfrastructure program launched in the early 2000s. Current programs aimed at bridging the gaps between investigator-focused projects include Harnessing the Data Revolution (NSF) and Big Data to Knowledge (NIH), as well as a series of efforts to advance strategic computing and related technologies across agencies under the auspices of the Networking and Information Technology Research and Development Program and its predecessors. Relevant programs and efforts by DOE and DARPA, along with international efforts on the part of the European Union and UK Research and Innovation, were also discussed at the workshop and are highlighted in Chapter 2.

It is difficult for individual institutions to treat cyberinfrastructure investments the way they treat, for example, a mass spectrometer or other large physical "thing." One possible model is the Harvard Dataverse, with tens of thousands of data sets deposited for sharing and over 1.5 million downloads. Internet2 is an example of a nonprofit university consortium that has provided sustained support for information technology infrastructure under a membership model.

Note that the above barriers to sustainable investments are not only relevant to ARWs but apply to all research that depends on software, which is a significant fraction of all research (Nangia and Katz, 2017). As discussed in Chapter 2, it is inherently more difficult to fund the development and maintenance of production-quality software (workflow engines, automated tools, etc.) that can be used broadly than to develop new software as part of a research project that may not be used outside that project.

The commercial and nonprofit sectors support ARWs in various ways. The challenge is finding a route that provides researchers with continued access and availability while recognizing that these backers have their own revenue- or mission-based goals to meet. Big tech companies have provided cloud computing services at a large scale to academic researchers at low cost.

There is a potential catch-22 in this support, however, in that at some future point these companies can raise costs or restrict use through patents. The workshop also revealed areas where the dominance of proprietary tools or data acts as a barrier to ARWs. One example from materials science is scanning transmission electron microscopy. Current commercial electron microscopes "tend to down-sample or discard the vast majority of the signal via averaging or decimation," and important sample properties are lost in the process (Somnath et al., 2019). Further, the "down-sampled data are usually written into proprietary file formats, which impede and sometimes even preclude access to data and metadata, complicate long-term archiving, obstruct sharing, and fracture the scientific communities along file formats" (Somnath et al., 2019).

Future Directions for Sustainability

It is beyond the scope of this committee to propose specific funding mechanisms or amounts; rather, the committee emphasizes the importance of support at all stages of the pipeline. Indeed, there are ways forward to ensure sustainability. Several speakers at the workshop suggested that providing continued support for ARWs could be a pivotal role for government. Two large consortia, one in Europe and one in the United States, have sustained their systems through collaborations and shared resources: CERN, the European laboratory for particle physics, and CERT, the Community Emergency Response Team in the United States. Another suggestion was to create a stable endowment, akin to the Smithsonian Institution, that could serve as a common resource.

Given the significant investments that governments and other research funders are making in data-driven science, it makes sense to leverage these investments across borders and domains to the extent possible. Goals of enhanced international collaboration would include facilitating access to tools and resources and ensuring the interoperability of national implementations. Distributed international efforts working to develop standards and approaches to facilitate FAIR data and software include GO FAIR, the RDA, and the Research Data Framework (NIST, 2021). GO FAIR was established to advance the FAIR Principles, which emphasize the importance of machine readability and reuse of data (Wilkinson et al., 2016). The RDA was started in 2013 and aims to build "the social and technical infrastructure to enable open sharing and re-use of data."12 The Research Data Framework13 was initiated by the National Institute of Standards and Technology in 2019 and is aimed at increasing the supply of trustworthy research data across domains by developing a "strategy for various roles in the research data management ecosystem." Additionally, the FAIR for Research Software Working Group is convened jointly as an RDA Working Group, a FORCE11 Working Group, and a Research Software Alliance Task Force.

MANAGING ISSUES OUTSIDE OF RESEARCH THAT AFFECT ARWS

The March 2020 workshop featured discussion of several important issues outside of the research enterprise that bear on the development and implementation of ARWs. The most obvious issue is the treatment and use of data that are collected from or about individuals. Discussion about the use of ARWs also connects to broader considerations about the use of AI in various societal and decision-making settings.

12 See https://www.rd-alliance.org/about-rda.
13 See https://www.nist.gov/programs-projects/research-data-framework-rdaf.

ARWs promise to open significant new areas of research through the use of data that have not been collected through experiments or simulations but rather concern the health and medical condition of individuals, their social media behaviors, financial transactions, and the like. In response to concerns about the security and use of personal data—exacerbated by well-publicized examples such as the 2017 data breach at the credit reporting firm Equifax and the 2018 exposure of the use of Facebook data for political purposes by Cambridge Analytica—policy makers and public interest groups have pushed to give individuals greater control over the use, storage, and reuse of their data.

The European Union's General Data Protection Regulation, put into effect in 2018, is intended to protect personal data by placing strong regulations on the entities that collect, process, and use data (EU, 2018). The California Consumer Privacy Act, which went into effect in 2020, established privacy rights that businesses operating in California, or providing a product or service to California residents, must take steps to protect (Rothstein and Tovino, 2019). Although the law does not cover research in the public interest, as of this writing there was still some ambiguity about how it will affect the research community, with some clarifying legislation proposed (Moundas and Peloquin, 2020).

Even as a privacy-aware public shares personal information in unprecedented ways through social media and other avenues, many also express reservations about its use in research. Institutions and commercial entities have established their own policies and requirements about data use for research. As one workshop participant pointed out, academia has tended to be more cautious than the private sector about the use of some personal data, such as social media data, in research (Weitzner, 2020).

Some common work-arounds, such as de-identification, can impinge on research, most notably biomedical research. As Bradley Malin, director of the Health Information Privacy Laboratory at Vanderbilt University (and a member of this committee), noted at a 2018 workshop on planning for long-term use of biomedical data, "De-identification results in a loss of data utility; encryption results in a loss of functionality; and secure environments result in a loss of efficiency. However, with no action, the potential outcomes include losses of privacy, money (due to litigation and remuneration), societal trust, and scientific opportunity" (NASEM, 2018a).

An irony is, as one workshop participant noted, that "success can lead to failure." By this he meant that workflows may be able to use data that were not previously being used, whether because an experiment did not have the intended results or simply because of the data's size, unwieldiness, or seeming irrelevance. As the data take on a new purpose (and value), the data owner may take another look and place restrictions on what was heretofore more accessible.

Even areas of research that do not directly work with personal data must consider privacy issues. The goal of making the workflow itself transparent strengthens reproducibility but could impinge on privacy under certain circumstances, for example, by revealing personal information about specific researchers. As one workshop presenter pointed out, tracking and crediting provenance in data generation must address these types of privacy issues (Weitzner, 2020).

Emerging blockchain-based approaches to using and analyzing data hold the potential to alleviate some of these privacy concerns. For example, federated learning is "a new framework for Artificial Intelligence (AI) model development that is distributed over millions of mobile devices [providing] highly personalized models" while protecting privacy (Bhattacharya, 2019).
A collaboration of European companies and academic research institutions is developing MELLODDY (Machine Learning Ledger Orchestration for Drug Discovery), an ML platform using federated learning that allows participating organizations to use proprietary data to speed drug discovery while data owners retain control of those data (IMI, 2021).

The development and implementation of ARWs also involve broader issues raised by the growing use of AI and ML in a variety of policy-making and decision-making contexts. How can transparency and trust in outcomes with significant real-life consequences be maintained when characteristics of a specific algorithm, unrelated to the issue or problem at hand, determine those outcomes? Stoyanovich et al. (2020a) suggest the need for a framework to connect interpretability and trust in algorithmic decisions.

Initiatives have been launched in recent years under the rubric of "responsible AI" and "responsible ML." For example, Fairness, Accountability, and Transparency in Machine Learning (FAT/ML) is a series of workshops aimed at exploring the challenges raised by ML "for ensuring non-discrimination, due process, and understandability in decisionmaking."14 Organizations such as Google have developed principles for responsible AI, and the Institute for Ethical AI & Machine Learning was established in the United Kingdom to carry out "highly-technical research into processes and frameworks that support the responsible development, deployment and operation of machine learning systems" (Google, 2021; Institute for Ethical AI & Machine Learning, 2021). The principles and guidelines espoused by these initiatives overlap to a significant degree, with the need for human review, protection of data privacy and security, uncovering and addressing bias, and support for transparency and reproducibility generally being invoked.

14 See https://www.fatml.org/.
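The core mechanic of federated learning used by platforms such as MELLODDY, in which each participant trains on its own private data and only model parameters are shared, can be sketched in a few lines. This is a deliberately minimal toy: one-parameter least-squares models and plain size-weighted averaging, with invented data. Production systems add secure aggregation, encryption, and far richer models.

```python
def local_fit(xs, ys):
    """Each participant fits y ~ w*x on its own private data
    (least squares for a single weight); raw data never leave the site."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def federated_average(local_weights, sizes):
    """The coordinator combines only the model parameters, weighting
    each site's contribution by the size of its data set."""
    total = sum(sizes)
    return sum(w * n for w, n in zip(local_weights, sizes)) / total

# Two hypothetical sites whose private data happen to follow y = 2x.
site_a = ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
site_b = ([4.0, 5.0], [8.0, 10.0])

weights = [local_fit(*site_a), local_fit(*site_b)]
sizes = [len(site_a[0]), len(site_b[0])]
global_w = federated_average(weights, sizes)
print(global_w)  # both sites fit w = 2.0, so the combined model is 2.0
```

The privacy property rests on what crosses the boundary: only `weights` and `sizes` reach the coordinator, never `site_a` or `site_b` themselves.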

The technical, legal, and policy barriers to implementing ARWs are intertwined in the sense that technological development needs to be informed by policy and legal requirements. Possible approaches to addressing these issues are both computational and policy or legal in nature. New laws or regulations could bring clarity and uniformity to software services as providers work to comply. This would constitute a type of convergence or use-inspired research. An example discussed at the workshop was the set of changes in computer hardware and software related to accessibility for people with disabilities: the Americans with Disabilities Act spurred the kinds of technological change that are now integrated in virtually all computer systems and tools.

It was also suggested that governments can play a role as an honest broker for data use. Here, too, however, some tension was recognized between agencies that mainly fund and undertake research producing presumptively open data and agencies oriented toward work producing more restricted data (some agencies do both). Research, along with its associated data production and use or reuse, is also international, which limits the effectiveness of any single national government in shaping global policy.

Data use agreements (DUAs) can address many of these issues from the start, rather than as an after-the-fact consideration, especially when partnerships are formed across the public and private sectors (O'Hara, 2020). However, DUAs are often difficult and time-consuming to conclude (Mello et al., 2020). Privacy, ethics, and similar socially based topics will likely emerge as more data are added into the workflow cycle and the questions being asked are modified. Building mechanisms into ARWs that recognize and are sensitive to data use issues is new territory to explore and develop, for example, to distinguish between permitted and prohibited queries (Kusnezov, 2020).
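One way to make the distinction between permitted and prohibited queries concrete is to place a machine-checkable data use policy in front of the workflow engine. The sketch below is purely illustrative: the policy vocabulary, purposes, and query fields are invented for this example and are not drawn from any existing standard or DUA.

```python
# Illustrative only: the purposes and field names below are invented,
# standing in for terms that a real data use agreement would define.
PERMITTED_PURPOSES = {"public_health_research", "methods_development"}
PROHIBITED_FIELDS = {"name", "street_address", "social_security_number"}

def query_allowed(query):
    """Return (decision, reason) for a query described as a dict with a
    stated 'purpose' and the list of 'fields' it would touch. An ARW
    could run this check before dispatching the query to the data."""
    if query.get("purpose") not in PERMITTED_PURPOSES:
        return False, "purpose not covered by the data use agreement"
    touched = set(query.get("fields", [])) & PROHIBITED_FIELDS
    if touched:
        return False, f"query touches prohibited fields: {sorted(touched)}"
    return True, "permitted"

ok, why = query_allowed(
    {"purpose": "public_health_research", "fields": ["age", "zip3", "diagnosis"]}
)
bad, why_not = query_allowed(
    {"purpose": "public_health_research", "fields": ["name", "diagnosis"]}
)
```

Returning a reason alongside the decision matters in practice: it creates an auditable record of why the workflow did or did not run a query, which supports the transparency goals discussed elsewhere in this chapter.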
Ideas proposed at the workshop included embedding compliance in the design of the software for open research data services and standards for the architecture of the sharing and access system (Burgelman, 2020).

In addition to privacy concerns, attention has focused on algorithms that perpetuate sexism, racism, and other forms of discrimination and hate. Well-known examples include Microsoft's TayTweets chatbot, which "learned" so much vitriol that it had to be shut off within 16 hours (Hunt, 2016), and the perpetuation of biases in ML (Caliskan et al., 2017). As Oren Etzioni, the CEO of the Allen Institute for Artificial Intelligence, told attendees at a NASEM-convened workshop in 2018, "systems use data from the past to generate models to predict the future, so if society's past was racist and sexist, the models will carry that bias into the future and also, for technical reasons, exacerbate it" (NASEM, 2018c). Preventing this, he noted, requires human attention and intervention.

In the humanities, Hepworth and Church (2014) examined two data visualizations of instances of lynchings and other white supremacist mob violence that depicted two different results about the extent of these acts of terror, particularly in the history of the U.S. West, based on decisions made by the researchers who set the parameters of the data searches. They used this example to propose an "ethical visualization workflow" with three main phases: pre-data collection (defining, reviewing); data collection and curation (collecting, pruning, and describing); and data visualization and argumentation (surveying, pre-visualizing, visualizing, and publishing).
The workflow can be implemented, the authors argue, with a twofold approach that is similar to other areas in which domain and computer specialists must work together:

firstly, by familiarizing themselves with the latest research in the content field and adjacent field; secondly, by including team members familiar with the entirety of the data pipeline from collection to cleaning to presentation…User experience design is particularly important for evaluating the interpretive intervention made by the visualization and mitigating harm caused by the final visualization. (Hepworth and Church, 2014, para. 43)

In lab-based science, an intervention to embed research ethics training was evaluated through a randomized trial conducted by the Center for Open Science (COS) and the University of California, Riverside, with a grant from NSF (Plemmons et al., 2020). In the intervention, training in the responsible conduct of research was integrated into the ongoing projects and circumstances of the laboratory rather than delivered as isolated training in a classroom or online. The training, called the Institutional Re-Engineering of Ethical Discourse in STEM (iREDS), was developed by COS and is available free online. While it goes beyond workflow development and use, the two areas of the training that were the focus of an article by Plemmons et al. (2020) are relevant here: author attribution and data management. The authors concluded, "The iREDS approach shifts the paradigm of research ethics training from merely telling researchers what is and is not ethical, to empowering them to incorporate ethical practices into their research workflow."
