Suggested Citation:"2 Framework Foundation: Data States and Associated Activities." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.

2

Framework Foundation: Data States and Associated Activities

The data life cycle begins when data are collected during the conduct of primary research and continues through data analysis, preservation and curation, reuse, storage, and potentially to deaccession. The data life cycle is not necessarily linear, and data may be reused and repurposed, combined with other data, and analyzed in a variety of ways and for different purposes throughout the existence of the data. How actively data are used during the data life cycle may change: they may be used often when initially collected, then see only periodic use after being placed in a repository. At some point, they may become dormant and be placed in an archive for long-term preservation. They may be rediscovered at any time and once again see active use. The environments in which the data are placed throughout their existence allow for different types of activities, and they may be moved from one environment to another as the need arises. The committee calls these environments “data states” and recognizes that the data may move from one state to another in a nonlinear manner. These data states were conceptualized by the committee to communicate the characteristics of different environments with different purposes, and different data storage and preservation costs. Note that they do not map directly to the data life cycle.

Digital data transition among three states over the research life cycle, as described in Box 2.1.

BOX 2.1
The Three Data States

Digital data transition among three states over the research life cycle. The three states provide the framework for forecasting data storage, preservation, and archiving costs presented in this report. Data take a different form in each state, and each state includes different activities with different personnel, hardware, and management requirements. Transforming data from one state to another requires significant labor and computation.

State 1: The primary research and data management environment where data are captured and analyzed. It is possible that no one working in this environment is focused on standardizing, documenting, sharing, or preserving data and algorithms.
State 2: An active repository and platform where data may be acquired, curated, aggregated, accessed, and analyzed. This is an active information system that usually provides services to a wide range of users. Where data are complex, confidential, or very large, it may be a platform for controlling access and may also provide support for analyzing and processing data.
State 3: A long-term preservation platform in which content is preserved across changes in governance, assessment of data value, and technology. The platform may include an extract of data from a single data set, multiple data sets, or an information system in a system-agnostic format. In this state, data are neither directly analyzable nor easily accessible.

Because research activities related to data may not occur sequentially, data might not transition through the three states sequentially during the life cycle of a research project or the life of a repository. A research laboratory may maintain data in State 1 for analysis, while transforming the data into State 3 for other purposes. An active State 2 repository may coexist with the same data stored on a State 3 long-term preservation platform, or the same data may be stored in more than one State 2 environment. A new research project may require a new State 1 environment with inputs resulting from transformations of multiple State 2 or 3 resources, in addition to capturing data from novel sources.
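The nonsequential relationship among the states can be illustrated with a minimal sketch. It assumes a data set is tracked simply by the environments that hold a copy of it; the class names, method names, and location labels are hypothetical and are not part of the committee's framework.

```python
from dataclasses import dataclass, field
from enum import Enum


class DataState(Enum):
    PRIMARY_RESEARCH = 1        # State 1
    ACTIVE_REPOSITORY = 2       # State 2
    LONG_TERM_PRESERVATION = 3  # State 3


@dataclass
class DataSet:
    """A data set tracked by the environments that currently hold a copy."""
    name: str
    # Each entry is (state, location); location names are hypothetical labels.
    copies: list = field(default_factory=list)

    def place(self, state: DataState, location: str) -> None:
        """Record a copy in an environment; no ordering of states is enforced."""
        self.copies.append((state, location))

    def states(self) -> set:
        """Return the set of states in which the data set currently exists."""
        return {state for state, _ in self.copies}


# A laboratory keeps data in State 1 while also holding a State 3 copy,
# without the data ever passing through State 2.
imaging = DataSet("imaging-study")
imaging.place(DataState.PRIMARY_RESEARCH, "lab-server")
imaging.place(DataState.LONG_TERM_PRESERVATION, "institutional-archive")
```

Because the sketch records copies rather than a single current state, the same data can coexist in several environments, including more than one State 2 environment.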

The three states and the activities involved in each of the states are summarized in Figure 2.1.1. The activities and subactivities in each state will be described in greater detail in later sections. The enumeration of the activities and subactivities has benefited from previous work, including the recent National Academies study Open Science by Design (NASEM, 2018) and the models proposed by “Keeping Research Data Safe” (Beagrie, 2019), The Open Archival Information System (Lavoie, 2014), The LIFE² Final Project Report (Ayris et al., 2008), the National Aeronautics and Space Administration’s Cost Estimation Toolkit (Fontaine et al., 2007), and Palaiologk et al. (2012).

STATE 1: THE PRIMARY RESEARCH AND DATA MANAGEMENT ENVIRONMENT

The first state is the form that the data take in the primary research environment. The data are actively captured in this environment as they are created—for example, as digital sampling of electrical current, image and voice signals, text, or binary data. Computing ahead of storage (e.g., processing as data are generated) is generally fast enough to synchronously capture the data stream and to manage its conversion to data structures for quality assurance and initial analysis. The data management systems in this environment ideally include software features to manage disruptions in logical work units (if, for example, there is a disruption in electrical current as data are being transferred, the data flow needs to be corrected before completing the transfer). Multiple generations of backup may be needed to provide time to detect corruptions resulting from the addition of new data before those new data cascade across older backups.
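The value of keeping several backup generations can be shown with a small sketch. It assumes that each generation is a full snapshot of the records and that a simple validity check exists; the function names and the check itself are hypothetical, and real systems would use checks suited to their data.

```python
from collections import deque


def latest_clean_generation(generations, is_valid):
    """Return the newest backup generation that passes validation, or None."""
    for snapshot in reversed(generations):
        if is_valid(snapshot):
            return snapshot
    return None


def no_missing_values(snapshot):
    """Hypothetical integrity check: every record must have a value."""
    return all(value is not None for value in snapshot)


# Keep at most three generations; the oldest is dropped as new ones arrive.
generations = deque(maxlen=3)
generations.append([1.0, 2.0])              # clean
generations.append([1.0, 2.0, 3.0])         # clean
generations.append([1.0, 2.0, 3.0, None])   # corrupted by the newest addition

restore_point = latest_clean_generation(generations, no_missing_values)
```

With a single generation, the corrupted snapshot would have overwritten the only backup. With three, the corruption must go undetected for three backup cycles before no clean restore point remains.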

Table 2.1 describes State 1 activities and subactivities as well as the types of individuals who carry out those activities. Personnel are specifically noted because personnel costs often account for the largest expenditures in data management activities. Relative salary levels of personnel are discussed in a later section following the discussions of the three data environments.

TABLE 2.1 State 1: Primary Research and Data Management Environment Activities and Personnel

Activity	Subactivities	Personnel
A. Outreach and Training: Guidance on best practices in collecting and archiving data	Obtain support for creating funding proposals and data management plans (DMPs). Obtain support for creating and describing research data. Identify tools available for optimal data sharing.	Researcher, records management specialist, data scientist, data librarian, information technology (IT) systems engineer, education specialist, policy specialist

B. Provocation and Ideation: Activities involved in exploring existing data resources and initiating the research activity	Explore and mine existing data resources for possible use and augmentation. Design project with data sharing in mind. Prepare funding application and explicit DMP (including estimates of costs of data storage and access). Negotiate intellectual property rights. Obtain ethics and regulatory approvals (e.g., Institutional Review Board [IRB], privacy office/Health Insurance Portability and Accountability Act, information security protocols).	Researcher, data scientist, software engineer, research domain project manager, IT security specialist, policy specialist, administrative staff

C. Knowledge Generation and Validation: Activities involved in creating shareable research data	Evaluate and use tools for data collection, curation, and analysis. Generate data and metadata using community-accepted standards. Manage and document project data. Validate data and code (including version). Maintain active DMP/records.	Researcher, metadata librarian, data scientist, research domain project manager, research domain curator, software engineer

D. Dissemination and Preservation: Activities involved in the disposition of the data	Prepare data and algorithms for submission to an active repository or long-term archive. Transform data and algorithms as necessary in line with repository/archive submission requirements.	Researcher, research domain project manager, IT project manager, software engineer, data wrangler, research domain curator

STATE 2: THE ACTIVE REPOSITORY AND PLATFORM

The second state is the active repository and platform. Data are acquired from the primary research environment or from another active repository, or may be revived from archival storage for active use. Acquisition is asynchronous, either in near real time or in a batch form. Data are less volatile during acquisition in this state than they are in the primary research environment. In the ideal case, data may be curated as they are acquired to add metadata describing the data’s provenance (i.e., the context that is implicit in the primary research environment and must be made explicit to accommodate use across research environments). Depending on the depth and quality of the data curation before the data enter State 2 (including adherence to community data standards), the transition to State 2 may require extensive curation. Data sets are merged and aggregated with other data already in the active repository, a process that includes formatting, applying standards, and validating the data. The storage is fast enough to accommodate the search and analysis compute platforms used to make the data accessible. The data management systems in this environment necessarily handle much more data than those of the primary research environment because they aggregate data from multiple research projects. It is important to note that many State 2 activities will need to be repeated each time a new data set is added to the existing system. It is crucial that versioning and its documentation be controlled and curated. Failure to document and curate versions as they are created can lead to scientific errors with significant negative consequences. Costs incurred through activities in this state may reduce the effort required of future users of the data and of those transitioning data to other states or platforms.

Table 2.2 describes State 2 activities and subactivities as well as the types of individuals who carry out those activities.

TABLE 2.2 State 2: Active Repository and Platform

Activity	Subactivities	Personnel
A. Community Leadership: Engagement with the broader community in the development of tools, standards, and best practices	Develop community data standards and best practices and policies. Share lessons from development of repository systems and tools. Identify community needs through community outreach.	Researcher, informatician, records management specialist, data librarian, communication specialist

B. Functional Specifications and Implementation: Processes involved in designing or modifying and implementing the system for access and use	Design or modify and implement the repository infrastructure. Consult with stakeholders on proposed design. Design or modify and implement analytic tools. Design or modify and implement search capabilities. Design or modify and implement visualization tools. Design or modify and implement authentication/authorization methods for secure access. Design and implement user interfaces for data submission and access. Design or modify and implement services for programmatic access to the data. Design or modify and implement a private data enclave for researcher and collaborator use before access by other users of the repository. Address findable, accessible, interoperable, and reusable (FAIR) compliance.	Senior staff, software engineer, informatician, research domain project manager, IT project manager, IT security specialist

C. Validation: Processes involved in supporting the researcher in ensuring compliance with repository requirements	Provide a sandbox for researchers to test data sets for compliance with repository standards. Test compliance with repository submission requirements. Resolve errors. Release data for submission.	Research domain curator, research domain project manager, software engineer

D. Acquisition: Processes involved in acquiring the data	Apply selection policy to incoming data. Provide support for and negotiate submission agreements with depositors. Assess compliance with legal, ethical, and other policies (e.g., determination that secondary use is consistent with consent terms). Revise selection policy as necessary.	Senior staff, data librarian, policy specialist

E. Ingest: Processes involved in receiving and preparing the data for insertion in the repository	Receive submission. Conduct quality assurance of submitted data. Transform data into a format suitable for deposit and access (including possible deidentification). Curate data: generate, validate, or upgrade descriptive metadata and documentation. Assign unique identifiers. Generate administrative metadata.	Research domain curator, research domain project manager, metadata librarian, data wrangler, IT project manager, software engineer

F. Data Aggregation and Linking: Processes involved in merging and aggregating new data with existing data, and processes involved in linking to external databases	Integrate data with existing data in the data repository. Link new data to external repository data, if relevant (e.g., link data to publications). Link data to external data sets through database federation.	Software engineer, informatician, data scientist, research domain curator, research domain project manager

G. Database Management: Services and functions for managing the repository	Maintain the integrity of the database. Generate administrative reports from the database. Back up data at additional storage sites. Plan for potential disaster recovery.	Software engineer, IT project manager, IT security specialist

H. Access: Services and functions for making the data available to users	If applicable, confirm identity or eligibility of user as a qualified user (e.g., IRB approval, Collaborative Institutional Training Initiative training). Determine that the specific proposal for secondary use is consistent with consent terms. Design or modify and deploy search algorithms. Prepare data for dissemination to user. Deliver search results.	Software engineer, IT security specialist, IT project manager, informatician, policy specialist

I. User Support: Services for making the repository useful to users	Develop or modify and implement training materials. Staff a help desk. Publicize the repository.	Software engineer, education specialist, communication specialist

J. Administration: Functions that control the overall operation of the repository	Provide general management and oversight. Develop and review policies and standards. Monitor use. Provide support for security assessment and audit. Provide administrative support including billing for submission and usage, if required.	Senior staff, research domain project manager, IT security specialist, policy specialist, administrative staff

K. Common Services: Shared supporting services	Provide operating system, network, and network security services. Provide and renew software licenses. Provide hardware maintenance. Ensure physical security and disaster management. Supply utilities.	IT systems engineer, IT project manager, facilities manager

L. Data Retention or Replacement: Determining whether the data will be retained, replaced, transferred, or destroyed	Retain data; or replace data; or prepare data for transfer and transfer data and any transformation code to long-term archive; or destroy data.	Senior staff, research domain project manager, software engineer

STATE 3: THE LONG-TERM PRESERVATION PLATFORM

The third state is the long-term preservation platform. Content (e.g., data and code) is preserved in such a platform when it is anticipated that the data will not be actively used for the foreseeable future or if the resources are not available to maintain an active repository. For example, data from an active repository may be transformed into text, delimited strings, images, or other forms that may be viewed or processed without the content of the data management systems of States 1 and 2. This transformation enables preservation over tens to, perhaps, hundreds of years through changes in governance and computational technologies and may include compression (although compression could hinder preservation if the corresponding decompression routines are not also preserved). Storage may be offline. Data may be rehydrated (see Box 2.2) as needed and moved back into an active environment, where they can be accessed and more easily discovered.

BOX 2.2
Data Dehydration and Rehydration

Data dehydration and rehydration are terms used in this report as shorthand for the processes of transitioning data from one state to another. Data are said to be dehydrated when transitioning from an active platform (States 1 or 2) to a less active platform (usually a State 3 platform). A decision to dehydrate may be made, for example, when a sophisticated, high-function software platform reaches the end of its funding cycle and additional funds cannot be found to sustain it. Commonly, data are moved to some file repository as a series of flat files accompanied by metadata descriptions.

The following should be considered for data in a given State 2 (active repository) resource that are to be dehydrated:

What data should be preserved? Not all data can realistically be preserved in perpetuity. Decisions will need to be made about the potential value of data and the criteria that warrant their preservation.
Granularity: How should data be mapped to files? A general rule would be that any data object with an externally referenced identifier that might be found in the literature should be recoverable from the files and metadata exported from the State 2 platform in a reasonably straightforward way (e.g., the identifier corresponds to a file or a group of files).
What metadata should be exported to accompany the files? Note that the response leaves room for curatorial decision making. Some State 2 platform metadata about data objects may not export meaningfully or usefully (e.g., detailed editing histories attributed to specific users of the platform).
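The granularity and metadata considerations above can be sketched as an export of flat files plus a manifest. This is only an illustration under stated assumptions: each data object is a non-empty list of uniform rows, files are written as CSV, and the manifest layout, the file naming, the function names `dehydrate` and `recover`, and the identifier `EX:0001` are all hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def dehydrate(objects, out_dir):
    """Write each data object to a flat CSV file and record a manifest.

    `objects` maps an externally referenced identifier to a dict with
    'rows' (a non-empty list of dicts) and 'metadata' (chosen by a curator).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for number, (identifier, obj) in enumerate(sorted(objects.items()), start=1):
        file_name = f"object_{number:04d}.csv"
        rows = obj["rows"]
        with open(out_dir / file_name, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        # The identifier maps directly to a file or group of files.
        manifest[identifier] = {"files": [file_name], "metadata": obj["metadata"]}
    manifest_text = json.dumps(manifest, indent=2)
    (out_dir / "manifest.json").write_text(manifest_text, encoding="utf-8")
    return manifest


def recover(identifier, out_dir):
    """Recover one object's rows from the exported files via the manifest."""
    out_dir = Path(out_dir)
    manifest = json.loads((out_dir / "manifest.json").read_text(encoding="utf-8"))
    rows = []
    for file_name in manifest[identifier]["files"]:
        with open(out_dir / file_name, newline="", encoding="utf-8") as handle:
            rows.extend(dict(row) for row in csv.DictReader(handle))
    return rows


example = {
    "EX:0001": {
        "rows": [{"subject": "s1", "value": "4.2"}],
        "metadata": {"title": "Example assay"},
    }
}
with tempfile.TemporaryDirectory() as archive_dir:
    exported = dehydrate(example, archive_dir)
    recovered = recover("EX:0001", archive_dir)
```

The sketch exports only the metadata a curator supplies, which mirrors the point that some platform metadata, such as per-user editing histories, may be deliberately left behind.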

Rehydration is not a tidy inverse of dehydration. It could be viewed as building a new State 2 hosting platform (or adapting an existing one) and importing existing data sets from one or more State 3 repositories. More broadly, it could be viewed as the process that a researcher (or group of researchers) needs to go through to move data from a State 3 environment to one that is directly useful and usable. This description is necessarily vague. To give one example of the kinds of issues involved: in a State 2 platform, computational or analytic tools might be part of the platform and thus provide a set of readily available capabilities for researchers reusing data on the platform. When the data transition to State 3, these tools are no longer there. A reuser rehydrating data from State 3 may require a specific subset of these analytic capabilities for the intended use of the data; the reuser may need to rebuild these capabilities or may be able to obtain them from existing tools. Note also that standards are an important enabler: to the extent that there are standardized file formats for classes of biomedical data (and tools that understand those standards), the barriers to some kinds of rehydration may be considerably reduced.

It will become important to collect and understand best practices about dehydration of data in State 2 platforms as they are decommissioned. These will evolve, continuously informed by subsequent attempts to reuse and rehydrate data from State 3 repositories. Importantly, best practices will also be informed by decisions not to reuse data given that the costs of rehydration are too high (and borne by the reuser). It will be valuable to understand these decisions, perhaps through reviews of research proposals that choose to collect new data or to ignore certain existing State 3 data for these reasons.

There will naturally be overlap in some activities in all the data states. The distinction between States 2 and 3 helps focus on the different issues that arise as one moves from facilitating active use to long-term retention. Those managing a State 2 information resource may make decisions related to a State 3 resource, and the movement from State 2 to State 3 could potentially be seamless. Following good archival practice, State 2 resource managers may automatically create preservation copies of the data as they are accessioned, or those data may be stored in a preservation format. Drawing a boundary between States 2 and 3 helps to ensure that decision-making processes also consider the challenges of long-term data preservation and their associated costs.

Table 2.3 describes State 3 activities and subactivities as well as the types of individuals who carry out those activities.

TABLE 2.3 State 3: Long-Term Preservation Platform

Activity	Subactivities	Personnel
A. Preservation Planning: Services and functions for ensuring that the archive remains accessible over the long term	Develop preservation policies, strategies, and standards with particular attention to possible future data rehydration. Develop preservation-metadata specifications. Engage with and monitor the designated user community. Monitor technology. Develop migration plans.	Senior staff, records management specialist, curator, IT project manager, software engineer

B. Ingest and Data Transformation: Processes involved in receiving and preparing the data for insertion in the archive	Receive data for long-term storage. Check for errors in data transfer. Transform data into a format suitable for deposit. Generate administrative metadata.	IT project manager, records management specialist, curator, software engineer, data wrangler, data scientist

C. Archive Storage: Services and functions for long-term data storage	Store data. Replace media as needed.	Software engineer, IT project manager, IT security specialist

D. Common Services: Shared supporting services	Provide hardware maintenance. Ensure physical security and disaster management.	IT systems engineer, facilities manager

E. Data Export or Deaccession: Functions involved in transferring custody of or deaccessioning data	Prepare data for transfer of custody, or deaccession data.	Senior staff, software engineer, research domain curator
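The State 3 subactivity "Check for errors in data transfer" is commonly handled by comparing checksums of the file before and after transfer. The sketch below is one such approach and is not drawn from the report: it assumes both copies are readable as local files, uses SHA-256 from the Python standard library, and the function and file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(source, received):
    """Return True when the received copy has the same digest as the source."""
    return sha256_of(source) == sha256_of(received)


with tempfile.TemporaryDirectory() as work_dir:
    source = Path(work_dir) / "source.dat"
    good_copy = Path(work_dir) / "received_ok.dat"
    bad_copy = Path(work_dir) / "received_bad.dat"
    source.write_bytes(b"sample payload")
    good_copy.write_bytes(b"sample payload")
    bad_copy.write_bytes(b"sample pay1oad")  # one byte altered in transit
    intact = transfer_is_intact(source, good_copy)
    damaged = not transfer_is_intact(source, bad_copy)
```

Storing the digest alongside the data as administrative metadata would also allow the same check to be repeated later, for example after media are replaced.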

PERSONNEL AND THEIR RELATIVE SALARY LEVELS

Based on published case studies (e.g., Palaiologk et al., 2012) and experience of individual committee members, personnel salaries often account for the largest expenditures in data preservation, curation, and access. Appendix C provides data drawn from occupational employment statistics for the relative salary levels shown in Table 2.4. Table 2.4 defines the roles of the personnel shown in Tables 2.1-2.3 and indicates a relative salary level (VH, very high; H, high; M, medium) for each of them based on information from Appendix C.

REFERENCES

Ayris, P., R. Davies, R. McLeod, R. Miao, H. Shenton, P. Wheatley, S. Grace, et al. 2008. LIFE² Final Project Report. http://discovery.ucl.ac.uk/11758/1/11758.pdf.

Beagrie, C. 2019. Keeping research data safe: Cost-benefit studies, tools, and methodologies focussing on long-lived data. https://beagrie.com/krds.php.

Fontaine, K., G. Hunolt, A. Booth, and M. Banks. 2007. Observations on cost modeling and performance measurement of long-term archives. NASA research paper in PV2007 Conference Proceedings. http://www.pv2007.dlr.de/Papers/Fontaine_CostModelObservations.pdf.

Lavoie, B. 2014. The Open Archival Information System (OAIS) Reference Model: Introductory Guide, 2nd ed. (Charles Beagrie, Ltd, eds.). Digital Preservation Coalition. https://www.dpconline.org/docs/technology-watch-reports/1359-dpctw14-02/file.

NASEM (National Academies of Sciences, Engineering, and Medicine). 2018. Open Science by Design: Realizing a Vision for 21st Century Research. Washington, DC: The National Academies Press.

Palaiologk, A., A. Economides, H. Tjalsma, and L. Sesink. 2012. An activity-based costing model for long-term preservation and dissemination of digital research data: The case of DANS. International Journal on Digital Libraries 12:195-214.

TABLE 2.4 Personnel Categories with Definitions and Relative Salary Levels

Personnel	Definition	Relative Salary Level
Administrative staff	Provides a variety of support functions for a project or program	M
Communication specialist	Trained in effective methods for publicizing and disseminating information to a broad audience	M
Curator	Often an archivist, trained in methods to describe and add value to data	M
Data librarian	Trained in the technical aspects of data management	M
Data scientist	Trained in quantitative methods for collecting, analyzing, and interpreting data	H
Data wrangler	Trained in methods for transforming data from one format into another and data cleansing for improved data interpretation	H
Education specialist	Trained in design, modification, and implementation of training materials relevant to data management and use	M
Facilities manager	Oversees and handles matters relating to the physical environment	M
Informatician	Trained in biology, medicine, or other health-related field and in quantitative methods for collecting, analyzing, and interpreting data in those fields	VH
IT project manager	Responsible for planning, executing, and overseeing a project; trained IT specialist	H
IT security specialist	Trained in methods to protect IT systems against inadvertent or malicious attacks	VH
IT systems engineer	Trained in implementing, monitoring, and maintaining IT systems	VH
Metadata librarian	Trained in the technical aspects of data standards	M
Policy specialist	Trained in relevant ethical, legal, and regulatory requirements	H
Project manager	Responsible for planning, executing, and overseeing a project	M
Records management specialist	Often an archivist, trained in managing data throughout the data life cycle	M
Research domain curator	Domain expert trained in methods to describe and add value to data	H
Research domain project manager	Domain expert responsible for planning, executing, and overseeing a project	H
Researcher	An individual who generates potentially shareable data while conducting research	H
Senior staff	Has a supervisory and decision-making role within an organization or program	VH
Software engineer	Trained in the design, implementation, testing, evaluation, operation, and maintenance of computer programs or databases	VH

NOTE: H, high; M, medium; VH, very high.
