Potential Disruptors to Forecasting Costs
In this report, a disruptor is anything that may cause radical changes to the ways research is conducted and data are collected, used, archived, or preserved. Disruptors may be positive or negative and may raise or lower the cost of data management and preservation. This chapter considers some of the future developments and disruptors in data technologies and data science that may reduce or increase data costs in the next 5 to 10 years. Recent examples of disruptors affecting biomedical research are the widespread use of high-resolution imaging instruments (e.g., electron microscopes [Courtland, 2018; Guzzinati et al., 2018]); the decreasing cost of sequencing, the rate of which even surpasses Moore’s law (Wetterstrand, 2019); and the advent and accessibility of cloud storage and computing. The use of high-resolution imaging instruments alongside the decreasing cost of sequencing has resulted in the ability to collect huge volumes of data, while cloud computing has resulted in new ways to store, aggregate, and analyze data. Cloud computing, however, has also resulted in new or different costs that many researchers do not yet fully understand. Those costs must be considered in the context of potential gains in capacity or functionality.
There is no way to fully anticipate factors that might radically affect the costs of future data preservation, archiving, and use. This chapter focuses on certain emerging challenges spanning different dimensions, including
- biomedical data volume and variety,
- advances in machine learning and artificial intelligence (AI),
- changes in storage technologies and practices,
- future computing technologies,
- workforce-development challenges,
- legal and policy disruptors, and
- human-subjects research.
This illustrative list gives examples of disruptors likely to affect costs of data management and use in the next 5 to 10 years. Although quantifying the contributions of these disruptors to long-term data-preservation costs is beyond the scope of the study committee’s charge, these issues warrant attention so that associated cost changes can be anticipated and minimized or exploited to some extent.
BIOMEDICAL DATA VOLUME AND VARIETY
The biomedical sciences have generated steady streams of data for decades, but there have been sudden orders-of-magnitude increases in data collection in particular domains. Over the past decade, emerging and evolving data from sources such as next-generation sequencing, correlated light and electron microscopic imaging, and multiscale high-performance computing simulations have led to large increases in the volumes of data that can be collected and have pushed biomedical research into the realm of “big data.” There are many centralized core facilities serving research communities, such as the National Center for Microscopy and Imaging Research,1 that can produce extremely large live data feeds. Many researchers and laboratories have already acquired or will acquire volumes of data that cannot, at present, be completely analyzed.
Imaging tools are undergoing a revolution, and new microscopy technologies produce ever more detailed images, leading to a data-size explosion. Both large centers and conventional research laboratories are exploring imaging regimes that cross fundamental length scales: from tens of centimeters to angstroms. These image data sets are on the order of tens of terabytes per project and accumulate petabytes of data per year per instrument. The scales of such data are critical to advance further understanding of key biological processes. Other areas of biomedical research are experiencing similar growth in data volume, including genomics, where next-generation sequence data are introducing unprecedented challenges in data management, organization, and analysis. Electronic medical records and small data collected at individual laboratories that must be aggregated with existing data sets also present challenges to efficient data analysis in the quest for actionable knowledge.
In the foreseeable future, the biomedical research community will experience spurts in data growth that will tend to either (1) add a dimension to the data space or (2) extend a dimension by an order of magnitude. This growth may be related to, for example, the following:
- Gene sequencing. Moving from sample-level sequencing, to cell-level sequencing (representing a new dimension), to “cell in context” sequencing (representing yet another new dimension—the cell location in terms of both its position in the body and surrounding cell types and other structures). The shift from per-sample sequencing to per-cell sequencing results in 1,800 times as much data per subject (Ameur et al., 2011).
- Population size (extending a dimension). Moving from a sample size of 100 or 1,000 to 1,000,000.
- Time dimension. Images of the same cell or piece of tissue over time, or gene sequencing the same individual cell or tissue at multiple points in time (e.g., to establish a “healthy” baseline and then watch precursors of disease develop).
- Sequencing depth. Going from a coverage depth (i.e., how many times, on average, each location in a sequence of interest is sequenced) of 30 times or so to 100 times and more to find rare transcripts or mutants (such as in RNA-Seq).
- Reanalysis of existing images or reimaging samples. New techniques or methods allowing greater resolution or precision.
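To see how these dimensions compound, the multipliers quoted in the list above can be multiplied together. The following is illustrative arithmetic only, written as a short Python sketch. The function name and the assumption that the factors are independent and simply multiplicative are introduced here for illustration and do not come from the committee's analysis.

```python
from math import prod


def combined_growth(factors):
    """Multiply independent per-dimension growth factors into one overall factor."""
    return prod(factors.values())


# Figures taken from the list above; treating them as independent is an assumption.
factors = {
    "per_cell_sequencing": 1800,            # per-sample -> per-cell sequencing
    "population_size": 1_000_000 / 1_000,   # 1,000 -> 1,000,000 subjects
    "sequencing_depth": 100 / 30,           # 30x -> 100x coverage
}

growth = combined_growth(factors)
print(f"Overall data volume grows by roughly {growth:,.0f}x")
```

Under these assumptions, three modest-sounding changes together imply growth of several million times, which is why a change along a single dimension can overwhelm a repository sized for current volumes.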
The ever-expanding data-collection capability continues to impose challenges to biomedical science applications owing to its volume, velocity, variety, veracity, and variability (e.g., Ristevski and Chen, 2018) but promises transformative advances. However, the size and complexity of those data sets are overwhelming existing repository structures and are pushing the boundaries of the current capabilities of technologies to access, manage, integrate, and analyze them at scale. Increasingly, biomedical data are too voluminous for a single platform, too unstructured for a traditional database system, or too continuous to store for analysis at a later time. More than ever, such challenges or possible cost increases associated with big data must be considered in the context of additional value and novel opportunities for scientific understanding at different scales.
1 The website for the National Center for Microscopy and Imaging Research is https://ncmir.ucsd.edu/, accessed December 13, 2019.
ADVANCES IN MACHINE LEARNING AND ARTIFICIAL INTELLIGENCE
The volumes of data described above have shifted the bottleneck in biomedical sciences from data availability to the generation of insight from data. This shift has resulted in increased use of the newest advances in machine learning and AI in biomedical sciences. A simple search for the term “machine learning” in medical literature on Semantic Scholar2 reflects this increasing use. Continuous automatic annotation of data and metadata generation is one growing use of machine learning. Any biomedical researcher using big data needs to be able to reduce the size and complexity of the data while adding more meaning and value to them with the addition of richer metadata. This trend is already evident in many parts of the field (Shah et al., 2019; Zhu et al., 2019). Automated metadata generation using techniques that allow regular updates to volumes of data increases the need for active and more costly storage approaches. Automated data analysis requires programmatic access to data. Increased use of services that enable data search and access as well as findable, accessible, interoperable, reusable (FAIR), and responsible use will likely result in the need for additional or new human resources and talent development.
There is growing use of deep learning as an approach to AI in biomedical science, although many challenges, including interpretability (Xu and Jackson, 2019), remain. This creates the need for AI-ready solutions and systems in which data curation and storage are no longer independent of their analyses. AI also offers the potential to lower costs by automating ethical and regulatory processes. For instance, the Broad Institute has developed and tested a Data Use Oversight System (DUOS)3 meant to reduce the person-hour costs of data access committees. These committees are staffed by highly trained professionals who consider requests to access data for secondary research purposes in light of the restrictions on data use that are built into consent forms or other governance commitments. DUOS semi-automates this process by building ontologies into both consent forms and secondary data access request forms. Other costs associated with responsible data use and sharing might be lowered through use of automated processes that have the additional benefit of placing more control over data in the hands of participants. For example, an automated process by which participants with revised preferences for secondary use or sharing of their data could unilaterally change the access policies that apply to their data could lower costs associated with manual tracking and updating of participant preferences.
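The idea of encoding consent restrictions and access requests in a shared vocabulary so that they can be compared automatically can be illustrated with a minimal Python sketch. The term labels and the function name below are hypothetical and are not drawn from DUOS. A real ontology carries hierarchy and conditions that flat labels cannot express, and requests that are not cleared automatically would still go to a human committee.

```python
def request_permitted(consent_terms, request_terms):
    """Approve automatically only if every requested use is covered by consent."""
    return set(request_terms) <= set(consent_terms)


# Hypothetical machine-readable terms; real ontologies are richer than flat labels.
consent = {"disease_specific_research:cancer", "non_commercial_use"}
ok_request = {"disease_specific_research:cancer"}
flagged_request = {"disease_specific_research:cancer", "commercial_use"}

print(request_permitted(consent, ok_request))       # covered by consent
print(request_permitted(consent, flagged_request))  # needs human review
```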
On the other hand, AI has the potential to be a negative disruptor that drives up costs associated with ethics and regulation by upending assumptions about which data are nonidentifiable4 or deidentifiable.5 For instance, AI makes it easier to re-identify facial and cranial images. There are also concerns about AI-based hacking, upending assumptions about the security of data.
CHANGES IN STORAGE TECHNOLOGIES AND PRACTICES
Biomedical big data from many sources today place different constraints on data management, from their acquisition and movement to their storage and access. Greater constraints on physical data storage are posed by the need for bandwidth and computing to move and analyze data. Kryder’s law (Walter, 2005) describes the periodic doubling of storage capacity; that doubling has slowed over the past decade owing to different approaches to storage and cloud computing, and the slowdown is likely to increase storage costs over time. These changes could easily undermine the notion of free and infinite storage by adding charges for computing and networking around the data, most of which are not built for multicloud scenarios.
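The cost consequence of a slowdown in Kryder’s law can be sketched with a simple exponential model. The Python example below assumes that cost per terabyte halves whenever capacity per dollar doubles; that mapping from capacity to cost, the starting price, and both halving periods are hypothetical values chosen for illustration, not figures from this report.

```python
def cost_per_tb(initial_cost, years, halving_period_years):
    """Projected cost per terabyte if cost halves every `halving_period_years`."""
    return initial_cost * 0.5 ** (years / halving_period_years)


INITIAL_COST = 20.0  # hypothetical dollars per TB today

for halving in (2.0, 4.0):  # hypothetical fast and slowed halving periods
    projected = cost_per_tb(INITIAL_COST, years=8, halving_period_years=halving)
    print(f"halving every {halving:g} years -> ${projected:.2f} per TB after 8 years")
```

In this toy model, doubling the halving period leaves the projected unit cost after 8 years four times higher, so a budget that assumed the faster rate would fall well short.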
As discussed in the previous section, there will likely be a shift away from merely storing data toward approaches that allow continuous extraction of value from data using machine learning and AI techniques. This shift will affect raw data coming off the sources in active and passive archives and include models and knowledge generated from data. Connections to open knowledge networks that enable the search and access of data generate a new set of requirements around how data are stored and used.
2 The website for Semantic Scholar is https://www.semanticscholar.org/search?q=%22machine%20learning%22&sort=relevance&fos=medicine, accessed December 13, 2019.
3 The website for the DUOS is https://duos.broadinstitute.org/, accessed December 13, 2019.
4 See 45 C.F.R. § 46.102(f)(2).
5 See the Health Insurance Portability and Accountability Act of 1996, Pub. L. No. 104-191, 100 Stat. 2548 (1996).
These factors have influenced a major shift in how data storage has been managed in the past decade, and new technologies will likely continue to stress the capabilities of data storage systems. Today’s storage systems are multifaceted and software driven in a way that optimizes storage performance for various uses. Next-generation storage systems are being built around AI-integrated approaches that enable users to monitor storage performance for its use rather than bitbucket costs. Although the cost implications of these approaches and their use are not trivial and are difficult to estimate for scientific users, the storage systems have already led to major cost-cutting efficiencies for enterprise use. Software-defined and hybrid storage approaches (Donald, 2019) are also potential areas that will disrupt how scientific data are stored and managed as AI-ready entities.
FUTURE COMPUTING TECHNOLOGIES
In addition to the disruption in data generation and intelligent value-driven storage, the analysis and computing costs of data will have a large role in the cost structure over the next 5 to 10 years. Further advances in moving computation to data are likely to continue to shift the cost of data away from download and storage (e.g., from the cloud) and toward computing. New cloud cost models will be the determining factor for the effects of this shift on the overall cost of data. This concept might be encapsulated as a shift away from data sets to data streams. In addition, emerging edge computing6 architectures are expected to shift central repository-driven computation and privacy strategies toward a more distributed mode for the generation of insight from data, especially for biomedical data sensed via “Internet of Things” devices that capture data in real time for health monitoring and alert-generation scenarios. Last, the increasing number of non–von Neumann architectures and machine learning accelerators will require careful consideration regarding co-locating data and computing based on the models that need to make use of co-dependent composable services at the digital continuum.
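The trade-off between moving data to the computation and moving the computation to the data can be made concrete with a small comparison. The Python sketch below is a simplification under stated assumptions: the two cost functions and every price in it are hypothetical placeholders, not quotes from any cloud provider, and real pricing includes further components such as request fees and long-term storage.

```python
def download_cost(data_tb, egress_per_tb, local_storage_per_tb):
    """Cost of pulling data out of the cloud and storing a local copy."""
    return data_tb * (egress_per_tb + local_storage_per_tb)


def in_place_cost(compute_hours, price_per_hour):
    """Cost of running the analysis next to the data, with no transfer."""
    return compute_hours * price_per_hour


# All prices below are hypothetical placeholders.
move_data = download_cost(data_tb=50, egress_per_tb=90.0, local_storage_per_tb=20.0)
move_compute = in_place_cost(compute_hours=400, price_per_hour=3.0)
cheaper = "compute in place" if move_compute < move_data else "download"
print(f"download: ${move_data:,.0f}; in place: ${move_compute:,.0f}; cheaper: {cheaper}")
```

Which option wins depends entirely on the prices and workload plugged in, which is why new cloud cost models matter so much to the overall cost of data.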
DEVELOPMENTS WITH POTENTIAL COST SAVINGS
Of all existing technologies, a few may reduce costs within shorter time frames. These range from new approaches to managing cloud use to analyzing data through services and managing the integrity of data. A few examples include the following:
- Scalable search approaches and libraries that can combine many types of searches on heterogeneous data systems across distributed storage platforms. Use of such searches could have implications for costs associated with metadata. Elasticsearch7 is an example of a service offering this type of capability.
- Blockchain is a chain of timestamped hashes of information that enable a number of applications to preserve data integrity and ownership as well as lead to a new credit-economy discussion around data (e.g., exemplified by LunaDNA8). The potential uses for this technology in the biomedical sciences range from patient-controlled data access to researchers managing who has access to their data and for how long.
- Open Knowledge Networks are community efforts to develop national-scale data infrastructure (see, e.g., OSTP, 2018). Universities, funders, and companies are working on knowledge networks that would provide greater and richer access (e.g., semantic) to data through more accessible interfaces (e.g., natural language). Arguably, such networks have not yet achieved widespread adoption, but they merit some examination for potential impact on costs.
- The National Science Foundation (NSF) CloudBank is a collaborative NSF award (with a $5 million grant)9 to make accessing the cloud easier and less costly. It aims to be helpful for many stakeholders, including NSF program officers, researchers, students, and cloud providers. CloudBank will provide oversight of the cloud ecosystem. Because many challenges associated with cloud computing are related to account management and account monitoring, CloudBank is actively building methods to enable diverse users to manage their cloud credits through business operations functions and services (Norman, 2019; San Diego Supercomputer Center, 2019). This type of resource could greatly assist researchers in understanding the consequences of their choices using cloud-based storage or computing (see Figure 7.1).
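The description of blockchain in the list above, a chain of timestamped hashes, can be illustrated with a minimal hash chain in Python. This sketch covers only the integrity-checking idea: each block's hash depends on its content and on the previous block's hash, so altering an earlier record is detectable. It omits distribution, consensus, and access control, and the record contents, timestamps, and function names are hypothetical.

```python
import hashlib
import json


def make_block(payload, timestamp, prev_hash):
    """Create a block whose hash covers the payload, timestamp, and previous hash."""
    body = {"payload": payload, "timestamp": timestamp, "prev_hash": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "hash": digest}


def verify_chain(chain):
    """Return True if every block's hash is valid and links to its predecessor."""
    prev_hash = "0" * 64  # conventional placeholder for the first block
    for block in chain:
        body = {k: block[k] for k in ("payload", "timestamp", "prev_hash")}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if block["prev_hash"] != prev_hash or block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True


# Hypothetical records and Unix timestamps
genesis = make_block("dataset v1 deposited", 1576195200, "0" * 64)
second = make_block("access granted to lab A", 1576281600, genesis["hash"])
chain = [genesis, second]
print(verify_chain(chain))
```

Editing the payload of the first block without recomputing every later hash causes `verify_chain` to return `False`, which is the property that makes such chains useful for recording data provenance and access decisions.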
6 In edge computing, data are first processed at a center geographically closer to the data sources. The resulting smaller or compressed information is then sent to the cloud for computing. This process reduces latency periods.
7 The website for Elasticsearch is https://www.elastic.co/, accessed December 13, 2019.
8 The website for LunaDNA is https://www.lunadna.com/, accessed December 13, 2019.
9 The website for the CloudBank award is https://www.nsf.gov/awardsearch/showAward?AWD_ID=1925001, accessed December 13, 2019.
WORKFORCE-DEVELOPMENT CHALLENGES
As described in Chapter 2, of all the resources required to make biomedical data useful for science, perhaps the most cost intensive is the human one. A major challenge to the biomedical research community, especially influenced by the above-mentioned disruptors, is the training and education of the current and next generation of biomedical scientists and a workforce that can effectively process and manage data in their different states. For this reason, workforce development is a disruptor to the cost structure in biomedical data archival, preservation, and access in the long term. The biomedical data community must work collectively to address the need for well-trained human talent.
The number of tools and techniques for working with biomedical data is increasing, and open access and cloud computing place those tools, as well as the data and infrastructures, within reach. But human talent, particularly in data science, is difficult to find. So far, biomedical scientists and researchers who are developing applications and models for big data have filled this role, but such individuals represent a small percentage of analytical talent. Data scientists are in high demand by those who can offer higher wages than offered through the public sector and academia. This disparity makes it difficult and expensive to attract and sustain an adequate workforce. Training biomedical data scientists who are well versed to take advantage of the emerging disruptive technologies in scientific applications is of critical importance to the future of biomedical data-driven research and knowledge advancement. However, no single group or strategy will be able to cover the full spectrum of educational requirements to comprehensively train biomedical data researchers. Approaches that take advantage of open, online teaching modules to train in-house experts on emerging data and technology trends may be a means to reduce the cost of talent generation.
LEGAL AND POLICY DISRUPTORS
The legal and policy environments, and the evolution of those environments, are another source of potential disruption that may affect costs. Some challenges to forecasting the costs of data curation, dissemination, and preservation arise as unintended consequences of U.S. science policy. For example, the NSF policy against cost sharing, intended to provide a level playing field for researchers from differently resourced institutions, may obscure cost information about data preservation and access that occurs after or as an adjunct activity to the central research activity. Federal Statistical Research Data Centers (FSRDCs) are an important dissemination mechanism for a set of confidential, high-value, population-level biomedical data produced by the National Center for Health Statistics and other statistical agencies. FSRDCs are supported by universities (“institutional partners”) and member federal statistical agencies. There are currently 30 secure enclaves across the United States, through which thousands of researchers access data. It is reasonable for science and health agencies to ask about costs to provide access via such data centers. NSF has provided funding for most of these enclaves, but that funding is explicitly intended to cover start-up costs and requires ongoing support from institutional partners or individual, externally funded research projects. However, NSF, in pursuit of its goal of fair access to external funding, has decided to not require applicants to demonstrate their plans for financial sustainability (and in fact prohibits prospective research data centers from including those plans in funding proposals). Thus, neither NSF nor the federal statistical agencies have any information about the ongoing costs to the institutional partners of maintaining this data-dissemination infrastructure; in short, no one knows how much it costs to disseminate these data. This situation could become an additional disruptor to sustainability, especially as data and data infrastructure become increasingly large and complex.
Other disruptors generated as a result of changing legislation and policy are related to data privacy. Illustrative examples are developed in the next sections concerning, respectively, when data are considered to be “identifiable” and when and under what circumstances it is permitted or seen as appropriate to collect, store, or share data.
Current regulatory definitions of data and tissue “identifiability” are volatile. The Common Rule is a set of regulations in place since 1991 that applies, directly, to human-subjects research (HSR) conducted or funded by most federal departments and agencies and, indirectly, to virtually all academic (and some industry) HSR by institutional policy. It defines data as “identifiable” when “the identity of the subject is or may readily be ascertained by the investigator or associated with the information” (45 C.F.R. § 46.102(f)(2)). This definition of (non-)“identifiable” under the Common Rule is critical to how data (and tissue) are collected, preserved, and accessed. (The distinct but related concept of “deidentified” under the Health Insurance Portability and Accountability Act [HIPAA] carries similar consequences.) Research with existing data and tissue (whether originally collected for research, clinical, administrative, or other purposes) that meet the Common Rule’s definition of “nonidentifiable” does not involve “human subjects” as the Common Rule defines that term, and therefore such research falls outside of the Common Rule, including its default rules requiring Institutional Review Board (IRB) review and informed consent. The rationale behind this policy was that the main risk of research that involves neither intervention nor interaction but only analysis of existing data is informational privacy; analysis of data that cannot be linked to an individual’s identity does not pose such a risk. Historically, IRBs and other governance and compliance actors have considered genomic data not to constitute “identifiers” in and of themselves, without being linked to additional information.
From 2011 to 2017, federal regulators engaged in public notice-and-comment rulemaking to revise the Common Rule, whose substance had not been significantly changed since 1991. Among the most controversial proposals was altering the definition of “human subject” to include both identifiable and nonidentifiable biospecimens. That change would have defined research using existing tissue samples that were stripped of identifiers as HSR and therefore subject to IRB review and consent.
The rationale behind the proposal was twofold. First, a series of academic reidentification “attacks” demonstrated the possibility, under certain circumstances, of reidentifying genomic and a wide variety of other data (e.g., consumer and geolocation data) that were considered to be nonidentifiable (Narayanan and Shmatikov, 2008; El Emam et al., 2011; Gymrek et al., 2013; De Montjoye et al., 2013; Gambs et al., 2014; De Montjoye et al., 2015). These attacks cast doubt on the assumption that research with data considered to be nonidentifiable under the Common Rule does not implicate participant privacy. Second, some commentators were of the opinion that people have autonomy interests in controlling the use of their data for various projects, even if those data are never associated with them as individuals (Javitt, 2010; Mello and Wolf, 2010; Tarini and Lantos, 2013).
The proposed rule would have applied only prospectively, so that researchers would not have been required to recontact and obtain consent from those whose tissue makes up existing biobanks. Nevertheless, the research community was largely opposed to the proposal; in fact, in no stakeholder category did public comments support it. As a result, the revised Common Rule does not include the proposed redefinition of identifiability.
The current Common Rule does, however, require federal departments and agencies to reconsider, within the first year of the new rule going into effect, and at least once every 4 years thereafter, both the definition of “identifiable” data and biospecimens and whether any “technologies or techniques” applied to biospecimens, such as whole-genome sequencing, should be considered to generate data that are necessarily identifiable. If regulators determine there is a need to alter the definition of “identifiable,” then agencies are to develop interpretive guidance to achieve this goal. Similarly, technologies that regulators determine to necessarily generate identifiable data are to be placed on a public list following a public notice-and-comment period. Under such circumstances, agencies could then issue guidance recommending that limited IRB review and broad consent be required for research involving those technologies without public notice or comment (Lynch and Meyer, 2017).
Agency guidance is not legally binding and, presumably, as with the proposal to change the Common Rule itself, it would only apply prospectively. Still, processes developed or modified, implemented, and relied on for the collection, archiving, and secondary use of nonidentifiable data would likely be deemed no longer fit for purpose in a new regime under which best practice is to consider those data identifiable. New processes would need to be developed and implemented at both the State 1 (researcher) and State 2 (active repository) levels, which are costly endeavors. (State 3 [long term] is not included because the Common Rule applies only to research use of data to contribute to generalizable knowledge, not to the mere act of storing them. To the extent that State 3 does not include access as a key component, changes to the Common Rule’s understanding of identifiability would not apply directly.) And because agency guidance development is not subject to the time schedules of public notice-and-comment rulemaking, regulated entities might have relatively little notice.
Permissible Data Collection, Storage, and Sharing
In reaction to events such as the Cambridge Analytica scandal (see Davies, 2015), regulators sometimes enact data protection laws that impose substantial burdens on regulated entities and do not always consider the impact on research. For instance, the EU General Data Protection Regulation (GDPR),10 which went into effect on May 25, 2018, enacted sweeping changes in how nonanonymous personal data, including data that might immediately or eventually be used in research, may be collected, stored, and disseminated. The GDPR has a global impact, applying to data collected from individuals residing in the EU at the time of data processing.
To date, the United States has had no such comprehensive privacy law. Instead, it has a patchwork of federal and state laws, such as HIPAA, the Common Rule, and the Family Educational Rights and Privacy Act.11 The 2018 California Consumer Privacy Act (CCPA)12 covers individual, identifiable data and went into effect on January 1, 2020. Although the CCPA (unlike the GDPR) obligates only for-profit businesses, and specifically excludes data that are already subject to federal privacy laws (e.g., HIPAA) and information collected for a clinical trial subject to the Common Rule, biomedical researchers increasingly collaborate with for-profit businesses around non-HIPAA data, including consumer wearables, mobile health apps, and genetic data from direct-to-consumer testing companies. The anticipated impacts of the GDPR and the CCPA on research are approximately the same: both contain various exceptions for research (e.g., the right to erasure, or so-called right to be forgotten, has only limited applicability to research data) while still regulating (some) research activity, and in both cases the actual impact on research remains unknown. In April 2019, U.S. Senator Edward J. Markey introduced in Congress the much more sweeping Privacy Bill of Rights Act13 that closely resembles the GDPR and applies to “any person that collects or otherwise obtains personal information.”
10 The website for GDPR is https://gdpr-info.eu/, accessed December 13, 2019.
11 See 20 U.S.C. § 1232g; 34 C.F.R. Part 99.
12 See Assembly Bill No. 375, an act to add Title 1.81.5 (commencing with Section 1798.100) to Part 4 of Division 3 of the Civil Code, relating to privacy.
CHANGING UNDERSTANDING OF HUMAN-SUBJECTS POLICY
When research data are about human beings, a variety of laws, policies, and norms are likely to apply throughout the data life cycle. Compliance with each of these laws, policies, and norms has associated costs, often in the form of administration, tracking, and training. In State 1 (the primary research environment described in Chapter 2), data are captured initially through interaction or intervention with humans for research use, existing data may be collected for new research or nonresearch use (e.g., clinical or administrative purposes), or a research project might include both kinds of data capture.
Researchers may already be familiar with HSR activities and their legal requirements, as researchers bear at least some of the associated burdens. These activities include HSR ethics training for everyone engaged in the research and a variety of prospective reviews of the research protocol. IRBs review a research proposal or determine that it is either exempt HSR or not HSR (e.g., because it involves analysis only of existing data that include no personal information that could lead to the identification of an individual, that is, nonidentifiable data). Other reviews might also apply. Data subject to HIPAA14 might require a review by a Privacy Board (if the IRB is not also acting as the covered entity’s Privacy Board) and a review by an institution’s information security office. If research involves consent from or notice to the human subject, those materials and processes must be developed and, often, pretested for participant comprehension or operational feasibility. In some cases, consent processes will require creating new institutional infrastructure that has its own costs. For example, a biobank (i.e., a type of repository for biological samples) that uses a “front door” opt-in consent will have to train the relevant staff to solicit such consent. That consent process will then have to be integrated into the clinical workflow, and patients’ enrollment status will need to be recorded and perhaps incorporated into the electronic medical record. Successful large-scale research projects such as biobanks often involve potentially extensive and costly participant incentives or engagement activities (e.g., “results” might be provided to participants for engagement or other purposes, or something else of value might be provided). Innovative consent methodologies may also carry costs. For instance, some large research projects (especially those such as biobanks, where data are collected under broad, rather than study-specific, consent) take a more or less “dynamic” approach to consent, in which participants are invited to change the way their data are used in response to changing circumstances. Communicating project status to participants on an ongoing basis and inviting and implementing their evolving consent preferences require a significant investment of time and, to varying degrees, material costs. Another kind of consent, tiered consent, enables different participants to choose the degree of data collection, use, or sharing they authorize. Tiered consent may involve costly tracking to ensure adherence to those heterogeneous preferences. State 1 HSR activity costs are direct costs charged to the funder, while activity costs associated with accessing data in States 2 and 3 are indirect costs that are typically at least offset by grant funding.
The costs of HSR activities associated with State 2 (i.e., active repository) acquisition, aggregation, and support for access are less visible to researchers because they often are externalized onto repositories. As a result, those costs are more difficult for researchers to anticipate. When data are transferred to a repository, they may need to be deidentified (i.e., personally identifying information removed), consistent with HIPAA requirements, or otherwise anonymized or pseudonymized. Participants generally have a right to withdraw their data from a data set, which requires the repository to remove those data and update related resources as necessary. If secondary data use is restricted, for example, by the terms of consent, a data use agreement, prospective review (e.g., by a data access committee), or auditing might be required to enforce those restrictions. Often, some data in a data set are more sensitive than others, such that tiers of data access among users are necessary. The least sensitive data may be openly accessible to any users, while other tiers of data are available only to certain people, under certain circumstances, or for certain purposes. Determining criteria for each tier of access and sorting the data accordingly could be laborious and therefore incur cost. The more sensitive the data, the more repositories might want to develop sandboxes¹⁵ or enclaves where researchers can access and analyze those data but not remove them. The development of sandboxes and enclaves, and their periodic update in light of new research purposes or new technological requirements, imposes costs. A good example of many of these mechanisms is the National Institutes of Health (NIH) All of Us Research Program,¹⁶ which plans to use three tiers of access, identity verification, a researcher code of conduct, voluntary prospective review of sensitive projects by a Resource Access Board, a “data passport” (to allow access to registered or controlled-access data sets) and sandbox, and retrospective data use audits, to preserve data privacy and security and participant trust. Merely developing these governance mechanisms required significant person-hours before actual data access began.
13 U.S. Congress, Senate, Privacy Bill of Rights Act, S.1214, 116th Congress, introduced April 11, 2019.
14 Health Insurance Portability and Accountability Act of 1996, Pub. L. No. 104-191, 110 Stat. 1936 (1996).
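The tiered-access arrangement described above can be sketched as a simple rule that maps a user's credentials to the most sensitive tier that user may see. This Python example is a minimal illustration under stated assumptions: the three tier names, the `User` fields, and the example data fields are hypothetical, and the sketch is not a description of how the All of Us Research Program or any other repository actually implements access control.

```python
from dataclasses import dataclass
from enum import IntEnum


class AccessTier(IntEnum):
    """Hypothetical access tiers, ordered from least to most sensitive."""
    PUBLIC = 0
    REGISTERED = 1
    CONTROLLED = 2


@dataclass(frozen=True)
class User:
    name: str
    identity_verified: bool = False
    controlled_access_approved: bool = False


def highest_tier(user):
    """Map a user's credentials to the most sensitive tier they may access."""
    if user.identity_verified and user.controlled_access_approved:
        return AccessTier.CONTROLLED
    if user.identity_verified:
        return AccessTier.REGISTERED
    return AccessTier.PUBLIC


def visible_fields(user, field_tiers):
    """Return the sorted names of data fields at or below the user's tier."""
    allowed = highest_tier(user)
    return sorted(name for name, tier in field_tiers.items() if tier <= allowed)


# Assigning each field to a tier is the laborious sorting step noted in the text.
field_tiers = {
    "aggregate_counts": AccessTier.PUBLIC,
    "survey_responses": AccessTier.REGISTERED,
    "genomic_data": AccessTier.CONTROLLED,
}

anonymous = User("anonymous")
registered = User("registered researcher", identity_verified=True)

print(visible_fields(anonymous, field_tiers))
print(visible_fields(registered, field_tiers))
```

The rule itself is short; the cost lies in deciding which tier each data element belongs to, verifying identities, and revisiting those decisions as requirements change.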
In State 3 (long-term preservation), the primary HSR activity is ensuring that data placed in a long-term archive continue to adhere to current legal and governance requirements. For instance, data considered or perceived as nonidentifiable when generated may later become identifiable (e.g., because additional information about the data sources becomes available or reidentification techniques are developed) or be redefined as identifiable under new laws or policies (see the section “Legal and Policy Disruptors” in this chapter).
OTHER POTENTIAL DISRUPTORS
There are many other potential disruptors that are not discussed in this report but that could affect long-term costs within the next 5 to 10 years or beyond. Examples include
- open data practices;
- long-term resilience of technology production;
- evolving requirements for cybersecurity (e.g., surreptitious cyberattacks to corrupt data; data misuse and theft that undercut support of repositories);
- influences of the FAIR data principles, open science, and Responsible Data movements, particularly of increasing acceptance and adoption of standards;
- transfer learning;
- investigating connected data that cross spatial and temporal scales and modalities;
- transitioning from needing specialized expertise to providing self-contained tools and resources;
- risks associated with third-party vendors (particularly if they capture a large share of the biomedical data market); and
- natural disasters that disrupt technology production in the long term.
Although the committee did not deliberate on the effects of those disruptors, they may warrant further attention by the biomedical research community.
15 A sandbox is a separate platform on which researchers can use tools to explore and experiment with data.
16 The website for NIH’s All of Us Research Program is https://allofus.nih.gov/, accessed December 5, 2019.
REFERENCES
Ameur, A., J.B. Stewart, C. Freyer, E. Hagström, M. Ingman, N.-G. Larsson, and U. Gyllensten. 2011. Ultra-deep sequencing of mouse mitochondrial DNA: Mutational patterns and their origins. PLoS Genetics 7(3):e1002028.
Courtland, R. 2018. The microscope revolution that’s sweeping through materials science. Nature 563:462-464.
Davies, H. 2015. Ted Cruz campaign using firm that harvested data on millions of unwitting Facebook users. Guardian, December 11.
De Montjoye, Y.-A., C.A. Hidalgo, M. Verleysen, and V.D. Blondel. 2013. Unique in the crowd: The privacy bounds of human mobility. Scientific Reports 3:1376. https://doi.org/10.1038/srep01376.
De Montjoye, Y.-A., L. Radaelli, V.K. Singh, and A.S. Pentland. 2015. Unique in the shopping mall: On the reidentifiability of credit card metadata. Science 347(6221):536-539.
Donald, D. 2019. Major disruptions in data storage technology: What this shake-up means for the Enterprise. insideBIGDATA, April 5. https://insidebigdata.com/2019/04/05/major-disruptions-in-data-storage-technology-what-this-shake-up-means-for-the-enterprise/.
El Emam, K., E. Jonker, L. Arbuckle, and B. Malin. 2011. A systematic review of re-identification attacks on health data. PLoS One 6(12):e28071.
Gambs, S., M.-O. Killijian, and M.N. del Prado Cortez. 2014. De-anonymization attack on geolocated data. Journal of Computer and System Sciences 80(8):1597-1614.
Guzzinati, G., T. Altantzis, M. Batuk, A. De Backer, G. Lumbeeck, V. Samaee, D. Batuk, et al. 2018. Recent advances in transmission electron microscopy for materials science at the EMAT Lab of the University of Antwerp. Materials (Basel) 11(8):1304.
Gymrek, M., A.L. McGuire, D. Golan, E. Halperin, and Y. Erlich. 2013. Identifying personal genomes by surname inference. Science 339(6117):321-324.
Javitt, G. 2010. Why not take all of me: Reflections on The Immortal Life of Henrietta Lacks and the status of participants in research using human specimens. Minnesota Journal of Law, Science, and Technology 11(2):713-755.
Lynch, H.F., and M.N. Meyer. 2017. Regulating research with biospecimens under the revised Common Rule. Hastings Center Report 3(May-June):3-4.
Mello, M.M., and L.E. Wolf. 2010. The Havasupai Indian Tribe case—lessons for research involving stored biologic samples. New England Journal of Medicine 363(3):204-207.
Narayanan, A., and V. Shmatikov. 2008. Robust de-anonymization of large sparse datasets. 29th IEEE Symposium on Security and Privacy 111-125.
Norman, M. 2019. Presentation to the Committee on Forecasting Costs for Preserving and Promoting Access to Biomedical Data, September 12.
OSTP (Office of Science and Technology Policy). 2018. Open Knowledge Network: Summary of the Big Data IWG Workshop. https://www.nitrd.gov/pubs/Open-Knowledge-Network-Workshop-Report-2018.pdf.
Ristevski, B., and M. Chen. 2018. Big data analytics in medicine and healthcare. Journal of Integrative Bioinformatics 15(3):20170030.
San Diego Supercomputer Center. 2019. UC San Diego, UC Berkeley, U Washington Announce ‘CloudBank’ Award. Press Release, August 8. https://www.sdsc.edu/News%20Items/PR20190808_CloudBank.html.
Shah, P., F. Kendall, S. Khozin, R. Goosen, J. Hu, J. Laramie, M. Ringel, and N. Schork. 2019. Artificial intelligence and machine learning in clinical development: A translational perspective. npj Digital Medicine 2:69.
Tarini, B.A., and J.D. Lantos. 2013. Lessons that newborn screening in the USA can teach us about biobanking and large-scale genetic studies. Personalized Medicine 10(1):81-87.
Walter, C. 2005. Kryder’s law. Scientific American, August 1. https://www.scientificamerican.com/article/kryders-law/.
Wetterstrand, K.A. 2019. DNA sequencing costs: Data from the NHGRI Genome Sequencing Program. www.genome.gov/sequencingcostsdata.
Xu, C., and S.A. Jackson. 2019. Machine learning and complex biological data. Genome Biology 20:76.
Zhu, G., B. Jiang, L. Tong, Y. Xie, G. Zaharchuk, and M. Wintermark. 2019. Applications of deep learning to neuro-imaging techniques. Frontiers in Neurology 10:869.