Suggested Citation:"4 Automatic Research Workflows and Implications for Advancing Research Integrity, Reproducibility, and Dissemination." National Academies of Sciences, Engineering, and Medicine. 2022. Automated Research Workflows for Accelerated Discovery: Closing the Knowledge Discovery Loop. Washington, DC: The National Academies Press. doi: 10.17226/26532.

4

Automatic Research Workflows and Implications for Advancing Research Integrity, Reproducibility, and Dissemination

Researchers across and beyond the disciplines described in Chapter 3 share challenges related to how they plan, conduct, and disseminate their work. They face pressures to secure and sustain funding, collaborate with others, and communicate results expeditiously and accurately. They need to do this in structures that range from small, vertically structured labs involving supervision and mentorship to huge dispersed networks across institutions with no formal lines of authority. In this context comes another overlay: the need to conduct experiments in a manner that allows others not only to understand the findings, but also to access and use the data and methods used to arrive at those findings.

During the workshop and in discussions, the committee considered how automated research workflows (ARWs) contribute to these crosscutting research issues. Cyberinfrastructure-enabled research (NSF, 2007) is now essential, but, as was generally agreed in the workshop, its use can also introduce inaccuracies and skewed results if it is not well understood or is allowed to self-perpetuate without human oversight. This chapter covers the relationship between ARWs and issues related to integrity, reproducibility and replicability, and dissemination of research.

INTEGRITY

The increasing complexity of research and the associated data it produces, combined with ever-increasing hypercompetition among researchers in many fields, has given rise to a number of persistent research integrity challenges (Nature, 2017; Bucci, 2018). These challenges range from detrimental research practices, such as authorship misrepresentation and inappropriate use of statistical analysis, to research misconduct in the form of data falsification and fabrication. When exposed, these cases not only have negative repercussions on the individual researcher and/or research group, but also can lead to a lack of trust in scholarly research by the broader community and the public.

The increasing concern across the research ecosystem has led to major reports and recommendations on how to improve the integrity of research (e.g., ALLEA, 2017; NASEM, 2017; WCRI, 2019). Interestingly, a report by the European Commission’s Open Science Policy Platform reveals that while some research enterprise stakeholders such as research institutions and learned societies believe that significant progress has been made in addressing integrity issues, stakeholders such as research funders, libraries, publishers, and other organizations that disseminate research believe that much still needs to be done (EC, 2020).

Technological advances may create new ways for researchers to manipulate results, as well as lead to development of new tools to detect mistakes and misbehavior. As noted in the National Academies’ 2017 report Fostering Integrity in Research:

In theory, if not always in practice, all the data contributing to a research result can now be stored electronically and communicated to interested researchers. However, this trend toward greater transparency has created tasks and responsibilities for researchers and the research enterprise that did not previously exist, such as creating, documenting, storing, and sharing scientific software and immense databases and providing guidance in the use of these new digital objects (NASEM, 2017, pp. 47–48).

This significant increase in both opportunity and complexity can create challenges in ensuring that researchers at all career and seniority levels receive adequate and regular training in good research practices, applying this knowledge not only in their own work but also in undertaking peer review and other evaluation of the work of others. Furthermore, the increased complexity of much research can make it challenging (in time and effort) to adequately capture and report all elements of an experiment, and then imposes similar challenges for peer reviewers in adequately assessing all this detailed information, with few incentives to authors or reviewers to undertake this effort.

ARWs provide a significant opportunity to address these issues and hence enhance research integrity by

Enabling automated capture and retention of data and their associated metadata in cyberinfrastructure deployed across the research life cycle.
Improving documentation and reporting of the details of the methods, increasing the ability of other researchers to scrutinize the work and potentially reducing the possibility of data being falsified or results being selectively reported.
Accounting for limited sample size, and thereby reducing p-hacking, when uncertainty quantification is incorporated into the workflow.
Providing much greater transparency in processes, which can highlight uncertainties.
Supporting the comparison of findings to highlight inconsistencies and outliers and to provide transparency to determine whether these elements have been “cleaned up” for the final presentation of the data.

But workflows do not guarantee transparency or, more broadly, integrity. To achieve and maintain integrity, it is crucial to involve humans in the process. An overreliance on naive machine learning (ML) can in itself introduce p-hacking and other errors, warned Rebecca Nugent at the March 2020 workshop (Nugent, 2020), often because data handlers lack training or are simply overwhelmed by the amount of data available. Simple bugs and errors in software may result in mistakes in reported results. In cases in which people rely on automated techniques to help them identify something that might be useful, the result may be HARKing or p-hacking.¹ Caution needs to be exercised around the biases inherent in algorithms, which can easily self-perpetuate and lead to suggested correlations that make no sense. Closed algorithms make it difficult for peer reviewers to spot where such biases may have occurred and influenced the final results and conclusions. In addition, although data-driven models have the ability to self-learn and adapt, they sometimes do so blindly. An iterative loop is needed, in which algorithms are adapted based on domain knowledge, domain knowledge is refined based on what the algorithms learn, the algorithms are improved to become more robust, and so on. A new generation of artificial intelligence (AI) algorithms, an “AI 3.0” as it has been called, would better integrate domain-based models with data-driven models (Vidal, 2020).
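The way automated screening can produce spurious "discoveries" can be illustrated with a small simulation. The sketch below is not drawn from the workshop; it assumes a hypothetical screen of 1,000 pure-noise features, each tested with a simple two-sided z-test (unit variance assumed), and compares an uncorrected 0.05 threshold with a Bonferroni-corrected one. The function names, sample sizes, and seed are illustrative choices only.

```python
import random
from statistics import NormalDist, fmean


def two_sided_p(sample):
    """Two-sided z-test p-value for H0: mean == 0, assuming unit variance."""
    z = fmean(sample) * len(sample) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def screen_null_features(n_features=1000, n_samples=30, alpha=0.05, seed=0):
    """Test many pure-noise features; every 'hit' is a false positive."""
    rng = random.Random(seed)
    p_values = [
        two_sided_p([rng.gauss(0.0, 1.0) for _ in range(n_samples)])
        for _ in range(n_features)
    ]
    # Uncorrected screening: each test is judged against alpha on its own.
    naive_hits = sum(p < alpha for p in p_values)
    # Bonferroni correction: divide the threshold by the number of tests.
    corrected_hits = sum(p < alpha / n_features for p in p_values)
    return naive_hits, corrected_hits


naive_hits, corrected_hits = screen_null_features()
print(f"uncorrected 'discoveries': {naive_hits}, after Bonferroni: {corrected_hits}")
```

Because every feature here is noise by construction, roughly 5 percent of the uncorrected tests would be expected to look significant, which is the pattern that unsupervised searching for "something useful" can mistake for a finding.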

There is also a need for much higher quality data on which such algorithms might work, to avoid garbage in, garbage out. Data curation becomes an increasingly important task to ensure that small errors in a data set do not get amplified by the use of automated processes that then use that output to inform the next experiment. The task is complex. It involves assessment of which data sets to prioritize for the considerable effort involved in curation, as well as training, incentives to prioritize the effort above other tasks such as conducting further experiments that might lead to further publications, and funding to support the extra effort and cost involved. Several studies have suggested that data stewards can provide research teams with this expertise (e.g., Scholtens et al., 2019), and Barend Mons has called for 5 percent of research funding to be dedicated to ensuring that data are reusable (Mons, 2020).

___________________

¹ HARKing, or “Hypothesizing after the Results are Known,” was so named by psychologist Norman Kerr in 1998. P-hacking refers to manipulating statistical data to show that a result is more significant than it is. In 2016, the American Statistical Association issued a statement to address the proliferation of incorrect use of p values (Wasserstein and Lazar, 2016).

REPRODUCIBILITY AND REPLICABILITY

Reproducibility and replicability are crucial to ensuring the integrity of scientific processes and the trustworthiness and reliability of new discoveries on which the next discoveries are built.² Attempts to reproduce and replicate work require the original researchers to transparently share their underlying data and the associated methods, as well as the estimation, characterization, and reporting of uncertainty (NASEM, 2019b, p. 6). Given the size and complexity of these investigations, manual capture is challenging and sometimes not possible.

The growing ubiquity and complexity of computation in the research process across many disciplines presents additional challenges to independently reproducing results. Examples of these challenges include the use of nonpublic data and code in research, the costs of retrofitting long-standing research projects with tools that automatically capture logs of computational decisions, and incomplete information about the computing environment where the research was originally performed (NASEM, 2019b). Also, research that utilizes high-performance algorithms and parallel processing may produce different numerical outputs from the same input data on different runs, with the output being “understood as an approximation to the correct value within a certain accepted uncertainty” (NASEM, 2019b).
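One common reason parallel runs can differ is that floating-point addition is not associative, so the order in which partial results are combined can change the last digits of a sum. The following minimal sketch, which is an illustration rather than an example from the cited report, groups the same three numbers in two ways and then compares the results within a tolerance. The `agree` helper and its tolerance value are assumptions chosen for the example.

```python
import math

values = [0.1, 0.2, 0.3]

# The same three numbers combined in two different orders, as two
# different parallel reduction schedules might do.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])


def agree(a, b, rel_tol=1e-9):
    """Treat two results as consistent if they match within a relative tolerance."""
    return math.isclose(a, b, rel_tol=rel_tol)


print(left_to_right == right_to_left)        # exact equality fails
print(agree(left_to_right, right_to_left))   # agreement within tolerance holds
```

A reproducibility check for such computations would therefore compare outputs against an accepted uncertainty rather than demand bit-for-bit identity.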

ARWs can enhance research reproducibility and replicability by

Capturing the provenance of the results, and the data and models on which they are built, thereby supporting the more accurate rerunning of processes. This can include capturing more detail of the methods than might be achievable manually but which might have a material impact on the ability to replicate or reproduce a finding.
Providing simpler and more efficient approaches to the sharing of the research processes.
Increasing efficiencies and eliminating a potential source of errors in on-boarding new research team members, and better supporting knowledge transfer as research teams change.
Supporting broader and more stringent review and validation of findings, including through formal peer review during publication as well as by the broader research community when data and associated methods, materials, and code are published.
Supporting full transparency of the research process to reduce the integrity issues mentioned above, whether these arise deliberately or through poor research practice (e.g., removing outliers, selective reporting, image manipulation, etc.).
Capturing validation, seamless integration, and repeatability in team science approaches, which are increasingly the reality in attempting to solve complex problems.³

___________________

² As defined in NASEM (2019b), reproducibility involves “obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis.” Replicability involves “obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.”

Even without reproducibility and replicability set as primary goals, workflows help achieve these goals by virtue of the features they offer users. As the previous National Academies committee on reproducibility and replicability highlighted, workflow systems such as Nextflow, Galaxy, and Snakemake in the life sciences, Chimera in physics, and the Open Science Framework in psychology (as illustrative examples only) can link research results to the computational processes that derived them (NASEM, 2019b). Blockchain can potentially lock in protocols and outputs so that it is clear that nothing has been interfered with, whether deliberately or through poor research practices.
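The idea of locking in protocols and outputs can be sketched with a hash-chained log, in which each record's hash depends on the record before it. This is only the chaining idea that underlies blockchains, not a distributed ledger, and it is not taken from any of the workflow systems named above. The record fields and the function names `append_record` and `verify` are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(payload, prev_hash):
    """Hash a record's content together with the previous record's hash."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + body).hexdigest()


def append_record(log, payload):
    """Append a record whose hash commits to everything before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "prev": prev_hash,
                "hash": _digest(payload, prev_hash)})
    return log


def verify(log):
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(entry["payload"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_record(log, {"step": "protocol", "version": "1.0"})
append_record(log, {"step": "measurement", "values": [1.2, 3.4, 2.2]})
append_record(log, {"step": "analysis", "mean": 2.27})
print(verify(log))                       # chain is intact

log[1]["payload"]["values"][0] = 9.9     # quietly alter a recorded value
print(verify(log))                       # the alteration is detected
```

A local log like this only makes changes detectable to someone who holds a trusted copy of the final hash; publishing or distributing that hash is what a ledger would add.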

Capturing provenance is considered a benefit of most workflows, but workflows are not a panacea. As pointed out in a review of workflows’ role in reproducibility, computational limitations include interoperability gaps, use of third-party services that may have reliability issues, and lack of central repositories (Cohen-Boulakia et al., 2017). Even when code and data are available, it may be difficult or impossible to reproduce work when the particular computational environment in which data have been processed, including specific versions of language and systems libraries, cannot be recreated. Capturing provenance and ensuring reproducibility in research that involves issues such as complex computation, rapid or continuous adjustments to analytical processes, and large amounts of streaming data will continue to raise challenges for designing and implementing ARWs.
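As a rough illustration of recording the computational environment alongside results, the sketch below gathers the interpreter version, platform details, and installed versions of a few named packages using only the Python standard library. The function name `capture_environment` and the package list are assumptions for the example; a real workflow system would likely record far more (system libraries, hardware, container image).

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages=()):
    """Return a small, serializable description of the running environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # record the absence rather than fail
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


# The package names here are illustrative; list whatever the analysis imports.
env = capture_environment(packages=["numpy", "pip"])
print(json.dumps(env, indent=2, sort_keys=True))
```

Stored next to each output, a record like this gives a later reader a starting point for recreating the environment, though it does not by itself guarantee that the environment can be rebuilt.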

In summary, technical concerns need to be addressed to fully benefit from reproducibility and replicability features, in addition to the cultural and educational challenges described more fully in Chapter 5.

DISSEMINATION

The scientific article, published in a prescribed format in a peer-reviewed journal, has constituted the essential ingredient of traditional publishing since the 17th century. Although traditional publishing is still dominant, new models are gaining strength, giving the publishing community the opportunity to rethink its approaches rather than merely tinker with and automate existing practices. One important step toward openness was the memorandum on Increasing Access to the Results of Federally Funded Scientific Research from the Office of Science and Technology Policy (Holdren, 2013), which directed federal agencies with over $100 million in annual conduct of research and development expenditures to develop a plan to support increased public access to the results of research funded by the federal government (NASEM, 2018b). ARWs not only can play a direct role in the processes of how research is disseminated, but also can provide new opportunities (and challenges) in the types of outputs that need effective dissemination and review. They can support the rapid publication of new findings, shown to be critical as the research landscape for COVID-19 barrels forward. The experience and new ways of working can be translated into all fields of research where speed of progress is equally important.

___________________

³ For example, CRediT (Contributor Roles Taxonomy) includes 14 contributor roles that are typically played by contributors to scientific scholarly output. See https://casrai.org/credit.

Publishers and other research disseminators have taken important steps to adapt their systems and requirements to foster transparency and access to nonarticle outputs as information technology advances and other shifts have changed research practices. Notable examples include efforts to facilitate data sharing, bring clarity to author contributions, and enable interdisciplinarity and more rapid utilization of research findings through “convergent” approaches such as We Share Data’s Data Sharing Seminar Series for Societies⁴ (McNutt, 2017; McNutt et al., 2018).

In terms of streamlining publishing workflows, AI-based tools are already being used by most publishers to detect plagiarism (e.g., iThenticate). An increasing number of publishing-related organizations (e.g., Digital Science, Clarivate/Publons, Elsevier) have developed services that use algorithms and text mining and/or natural language processing to support authors, editors, funders, and others in identifying relevant peer reviewers, guest editors, contributors to special issues, and others involved in the process. Some AI-based tools aim to accelerate the publication process as well as streamline the effort required to validate article submissions, which can be especially important to make high-priority areas of research available sooner. For example, StatReviewer aims to check that the statistics and methods in manuscripts are sound, and UNSILO’s Evaluate tool⁵ uses advanced machine intelligence and natural-language understanding to help authors, editors, reviewers, and publishers carry out evaluation and screening of submitted manuscripts.

Many tools use text mining, ML, network analysis, and other methods to filter published research to support researchers in keeping on top of the relevant literature (e.g., Researcher app, most bibliographic reference manager apps). These services identify linked concepts in the literature that may not be obvious to humans and surface emerging trends (e.g., Meta, Euretos). Some reference management tools, such as Sciwheel,⁶ support authors by assessing their written text and suggesting other articles to improve the relevance and completeness of their citation lists.

ARWs can benefit publication through greater adherence to minimum reporting requirements, and they open the possibility of bringing reporting standards to fields where they are not so advanced. This would reduce time and effort for authors, publication editorial teams, and reviewers in checking whether enough detail has been provided to support reproducibility. Nature, for example, has compiled its reporting requirements and resources for contributors, in which the author’s responsibilities related to sharing of data, materials, codes, and protocols are emphasized in bold-faced type (Nature, 2021).

___________________

⁴ See https://wesharedata.org/.

⁵ See https://unsilo.ai/unsilo-evaluate.

⁶ See https://sciwheel.com/?lg.

At the same time, ARWs increase the need for greater agreement on reporting standards so that different workflows and tools capture the same minimum set of information. These requirements also need aligning with the specific depositing requirements of subject-specific repositories that often have minimum metadata requirements; a service called FAIRsharing⁷ is one way to monitor this landscape of ever-changing standards (Sansone et al., 2019). This, together with an increasing use of a standard set of ontologies, could help to realize the vision of the European Open Science Cloud and other such clouds of making data truly interoperable between disciplines, and open significant opportunities of truly collaborative research.

A shift in how research is conducted and produced is allowing the community to rethink publishing approaches, moving beyond simply automating existing practices to making better use of technologies aligned with the way research outputs are being produced. For example, models are now available to rapidly publish (typically within a few days) new findings together with the underlying findable, accessible, interoperable, and reusable (FAIR) data and detailed protocol information, to be versioned and updated as new data come in (e.g., F1000Research).⁸ Other platforms include the Gates Foundation’s Gates Open Research⁹ and Open Research Europe.¹⁰ These models use transparent peer review and support in-article visualization of data and tools such as Code Ocean¹¹ and Whole Tale¹² so that readers and peer reviewers can assess the code, edit it, and reanalyze the data on the fly within the article without the labor-intensive need to set up the relevant computational environment. Some publishers are also exploring publication of electronic notebooks, and a few in chemistry are including these directly in publishing workflows (AGU, 2021).

The advance of container technology—which “allow[s] packaging up all code and dependencies to ensure that analyses run reliably across a range of operating systems and software versions”—has a lot of promise (Wiebels and Moreau, 2021) in this space. Containers can be disseminated along with articles and enable reproducibility and broader sharing. Workflows can thus effectively be distributed with their entire runtime environment and versioning, allowing for preservation of provenance information.

___________________

⁷ See www.fairsharing.org.

⁸ See https://f1000research.com/.

⁹ See https://gatesopenresearch.org.

¹⁰ See https://open-research-europe.ec.europa.eu/.

¹¹ See https://codeocean.com/.

¹² See https://wholetale.org.

COVID-19 has demonstrated the need for such rapid and collaborative research in public health emergencies and the effectiveness and impact of this way of working. For example, in early March 2020 at the request of OSTP, the Allen Institute for AI and partners created CORD-19,¹³ an AI-enabled, open, machine-readable collection of papers and data (Wang et al., 2020). As of early July 2020, it held more than 130,000 articles, obviously more than any group of humans could expect to skim, much less truly take advantage of. As another example, to avoid the lag time of the normal peer review process but minimize the impact of preprints that would not ultimately withstand such review, MIT Press and the University of California, Berkeley, launched RR:C19,¹⁴ which they characterize as an “open access, rapid-review overlay journal that will accelerate peer review of COVID-19-related research” using AI tools. Such efforts are enabling a transformation of research into the virus and demonstrate how progress can be made in drug and vaccine discovery at speeds many times faster than previously achieved (see additional discussion in Chapter 3). This experience and new ways of working now need to be translated into all fields of research where speed of progress is equally important.

As the speed of research updates created by automated research is likely to increase, we need to consider how to adequately review these outputs, especially given that peer reviewers are already overwhelmed. Publishers are developing article transfer mechanisms of various forms to minimize subsequent review as a manuscript passes between journals looking for acceptance. During COVID-19, new initiatives (e.g., Outbreak Science¹⁵) have developed to incorporate pre-submission triage mechanisms on preprints to minimize direct peer review requests from already overworked coronavirus experts. But these efforts will not be enough for the potential volume increase from significant uptake of ARWs. Smarter ways are needed to decide what really warrants full peer review as currently practiced versus different types of peer review. These new forms of review may be conducted by different actors in the system (e.g., data curators as reviewers who are experts in automated workflows, rather than just discipline experts), and in some cases could include AI-based peer review. Not only will this widen the pool of potential reviewers, but it will also bring much needed breadth in reviewers’ perspectives and expertise as research becomes increasingly multidisciplinary and collaborative.

Developments are emerging in automating the process of creating data note publications (short, peer-reviewed publications that describe research data stored in a repository) in XML alongside rapid data production; these publications then go through AI-based peer review to conduct an initial set of checks, followed by community review. A collaboration between Wellcome, the Sanger Institute, and F1000Research produced the first such publications in 2021. However, in the race to keep up with the volume of outputs, it is important not to lose the detail of peer review sufficient to adequately assess workflow systems and hence the impact that they could have on research and trust in the scholarly system.

___________________

¹³ See https://allenai.org/data/cord-19.

¹⁴ See https://rapidreviewscovid19.mitpress.mit.edu/.

¹⁵ See https://outbreakscience.org/.

In addition to articles, data, and software, workflows themselves are important research products that can be shared, evaluated, and reused. Challenges related to ensuring FAIR workflows (discussed in Chapter 5) and using AI and ML in ways that are transparent and reproducible (discussed earlier in this chapter) are relevant to developing methods and platforms for disseminating ARWs as expressions of methods underlying reported work. Specific issues include standardizing formats for unstructured data and accommodating AI black boxes, that is, models created from data by an algorithm that are “inherently uninterpretable and complicated” (L’Heureux et al., 2017; Rudin and Radin, 2019).

The use of automated and AI-based workflows also opens interesting questions about authorship, new types of contributions to the work (e.g., the relationship between the discipline expert and the workflow developers), and the balance in seniority between those roles, as well as credit for work if the automated workflows are generating their own further research questions based on the previous data. Because authorship in traditional publication outlets is a significant factor in promotion, tenure, and funding decisions, the balance between the contributions of the workflow and the human needs to be carefully thought through.
