Framework Foundation: Data States and Associated Activities
The data life cycle begins when data are collected during the conduct of primary research and continues through data analysis, preservation and curation, reuse, storage, and potentially to deaccession. The data life cycle is not necessarily linear, and data may be reused and repurposed, combined with other data, and analyzed in a variety of ways and for different purposes throughout the existence of the data. How actively data are used during the data life cycle may change: they may be used often when initially collected, then see only periodic use after being placed in a repository. At some point, they may become dormant and be placed in an archive for long-term preservation. They may be rediscovered at any time and once again see active use. The environments in which the data are placed throughout their existence allow for different types of activities, and they may be moved from one environment to another as the need arises. The committee calls these environments “data states” and recognizes that the data may move from one state to another in a nonlinear manner. These data states were conceptualized by the committee to communicate the characteristics of different environments with different purposes, and different data storage and preservation costs. Note that they do not map directly to the data life cycle.
The transition of digital data among the three states over the research life cycle is described in Box 2.1.
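The nonlinear movement among data states can be pictured as a small transition map. The sketch below is illustrative only: the names `DataState`, `ALLOWED_MOVES`, `can_move`, and `is_valid_path` are hypothetical, and the map lists only the moves that the state descriptions in this chapter name explicitly. A given project might reasonably permit others.

```python
from enum import Enum


class DataState(Enum):
    PRIMARY_RESEARCH = 1        # State 1
    ACTIVE_REPOSITORY = 2       # State 2
    LONG_TERM_PRESERVATION = 3  # State 3


# Assumption: only moves named explicitly in the state descriptions are listed.
ALLOWED_MOVES = {
    DataState.PRIMARY_RESEARCH: {DataState.ACTIVE_REPOSITORY},
    DataState.ACTIVE_REPOSITORY: {
        DataState.ACTIVE_REPOSITORY,       # acquired by another active repository
        DataState.LONG_TERM_PRESERVATION,  # placed in long-term preservation
    },
    DataState.LONG_TERM_PRESERVATION: {
        DataState.ACTIVE_REPOSITORY,       # revived for active use
    },
}


def can_move(src, dst):
    """Return True if the map above allows a direct move from src to dst."""
    return dst in ALLOWED_MOVES.get(src, set())


def is_valid_path(path):
    """Check every consecutive pair of states in a history of moves."""
    return all(can_move(src, dst) for src, dst in zip(path, path[1:]))
```

A history such as State 1, State 2, State 3, and back to State 2 passes `is_valid_path`, which reflects the point that data may become dormant and later return to active use.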
STATE 1: THE PRIMARY RESEARCH AND DATA MANAGEMENT ENVIRONMENT
The first state is the form that the data take in the primary research environment. The data are actively captured in this environment as they are created—for example, as digital sampling of electrical current, image and voice signals, text, or binary data. Computing ahead of storage (e.g., processing as data are generated) is generally fast enough to synchronously capture the data stream and to manage its conversion to data structures for quality assurance and initial analysis. The data management systems in this environment ideally include software features to manage disruptions in logical work units (if, for example, there is a disruption in electrical current as data are being transferred, the data flow needs to be corrected before completing the transfer). Multiple generations of backup may be needed to provide time to detect corruptions resulting from the addition of new data before those new data cascade across older backups.
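The value of keeping multiple backup generations can be shown with a minimal sketch. It assumes snapshots are simple lists of numeric readings and that corruption can be detected by a caller-supplied check. The class `BackupGenerations` and the helper `has_no_nan` are hypothetical names used only for illustration.

```python
import math
from collections import deque


class BackupGenerations:
    """Keep the N most recent snapshots so an older, clean copy can survive."""

    def __init__(self, generations=3):
        # deque(maxlen=N) silently drops the oldest snapshot once N are held.
        self._snapshots = deque(maxlen=generations)

    def add(self, snapshot):
        # Store an independent copy so later edits do not alter the backup.
        self._snapshots.append(list(snapshot))

    def latest_valid(self, is_valid):
        """Return the newest snapshot that passes is_valid, or None."""
        for snapshot in reversed(self._snapshots):
            if is_valid(snapshot):
                return snapshot
        return None


def has_no_nan(readings):
    """Illustrative corruption check: every reading must be a real number."""
    return not any(math.isnan(x) for x in readings)


backups = BackupGenerations(generations=3)
backups.add([1.0, 1.1])
backups.add([1.0, 1.1, 1.2])
backups.add([1.0, 1.1, 1.2, float("nan")])  # newly added data are corrupt

recovered = backups.latest_valid(has_no_nan)  # falls back one generation
```

With `generations=1`, the corrupt snapshot would have replaced the only backup and `latest_valid` would return `None`. This is the cascade that additional generations are meant to prevent.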
Table 2.1 describes State 1 activities and subactivities as well as the types of individuals who carry out those activities. Personnel are specifically noted because personnel costs often account for the largest expenditures in data management activities. Relative salary levels of personnel are discussed in a later section, following the discussions of the three data environments.
TABLE 2.1 State 1: Primary Research and Data Management Environment Activities and Personnel
|Activity||Personnel|
|A. Outreach and Training: Guidance on best practices in collecting and archiving data||Researcher, records management specialist, data scientist, data librarian, information technology (IT) systems engineer, education specialist, policy specialist|
|B. Provocation and Ideation: Activities involved in exploring existing data resources and initiating the research activity||Researcher, data scientist, software engineer, research domain project manager, IT security specialist, policy specialist, administrative staff|
|C. Knowledge Generation and Validation: Activities involved in creating shareable research data||Researcher, metadata librarian, data scientist, research domain project manager, research domain curator, software engineer|
|D. Dissemination and Preservation: Activities involved in the disposition of the data||Researcher, research domain project manager, IT project manager, software engineer, data wrangler, research domain curator|
STATE 2: THE ACTIVE REPOSITORY AND PLATFORM
The second state is the active repository and platform. Data are acquired from the primary research environment or from another active repository, or may be revived from archival storage for active use. Acquisition is asynchronous, either in near real time or in a batch form. Data are less volatile during acquisition in this state than they are in the primary research environment. In the ideal case, data may be curated as they are acquired to add metadata describing the data’s provenance (i.e., the context that is implicit in the primary research environment and must be made explicit to accommodate use across research environments). Depending on the depth and quality of the curation performed before the data enter State 2 (including adherence to community data standards), the transition to State 2 may require extensive curation. Data sets are merged and aggregated with other data already in the active repository, which includes formatting, applying standards, and validating the data. The storage is fast enough to accommodate the search and analysis compute platforms used to make the data accessible. The data management systems in this environment necessarily handle much more data than the primary research environment because they aggregate data from multiple research projects. It is important to note that many State 2 activities will need to be repeated each time a new data set is added to the existing system. It is crucial that versioning and its documentation be controlled and curated. Failure to document and curate versions as they are created can lead to scientific errors with significant negative consequences. Costs incurred through activities in this state may reduce the effort required of future users of the data and of those transitioning data to other states or platforms.
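One way to picture controlled, documented versioning with explicit provenance is a version log that refuses undocumented changes. This is a minimal sketch under stated assumptions, not a description of any particular repository system. The names `DatasetVersion` and `VersionLog`, and the choice of fields, are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatasetVersion:
    version: int
    checksum: str           # SHA-256 of the content, used to detect silent changes
    source: str             # provenance: where the data came from
    change_note: str        # documentation of what changed and why
    parent: Optional[int]   # version number this one was derived from


class VersionLog:
    """Append-only record of data set versions (illustrative only)."""

    def __init__(self) -> None:
        self._versions: List[DatasetVersion] = []

    def register(self, content: bytes, source: str, change_note: str) -> DatasetVersion:
        # Refuse undocumented versions: the note is part of the record.
        if not change_note.strip():
            raise ValueError("every version needs a documented change note")
        parent = self._versions[-1].version if self._versions else None
        record = DatasetVersion(
            version=len(self._versions) + 1,
            checksum=hashlib.sha256(content).hexdigest(),
            source=source,
            change_note=change_note,
            parent=parent,
        )
        self._versions.append(record)
        return record

    def history(self) -> List[DatasetVersion]:
        # Return a copy so callers cannot rewrite the log.
        return list(self._versions)
```

Each record ties a checksum to a stated source and change note, so a later user can tell which version an analysis used and how it differs from its parent.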
Table 2.2 describes State 2 activities and subactivities as well as the types of individuals who carry out those activities.
TABLE 2.2 State 2: Active Repository and Platform
|Activity||Personnel|
|A. Community Leadership: Engagement with the broader community in the development of tools, standards, and best practices||Researcher, informatician, records management specialist, data librarian, communication specialist|
|B. Functional Specifications and Implementation: Processes involved in designing or modifying and implementing the system for access and use||Senior staff, software engineer, informatician, research domain project manager, IT project manager, IT security specialist|
|Processes involved in supporting the researcher in ensuring compliance with repository requirements||Research domain curator, research domain project manager, software engineer|
|Processes involved in acquiring the data||Senior staff, data librarian, policy specialist|
|Processes involved in receiving and preparing the data for insertion in the repository||Research domain curator, research domain project manager, metadata librarian, data wrangler, IT project manager, software engineer|
|F. Data Aggregation and Linking: Processes involved in merging and aggregating new data with existing data, and processes involved in linking to external databases||Software engineer, informatician, data scientist, research domain curator, research domain project manager|
|G. Database Management: Services and functions for managing the repository||Software engineer, IT project manager, IT security specialist|
|Services and functions for making the data available to users||Software engineer, IT security specialist, IT project manager, informatician, policy specialist|
|I. User Support: Services for making the repository useful to users||Software engineer, education specialist, communication specialist|
|Functions that control the overall operation of the repository||Senior staff, research domain project manager, IT security specialist, policy specialist, administrative staff|
|K. Common Services: Shared supporting services||IT systems engineer, IT project manager, facilities manager|
|L. Data Retention or Replacement: Determining whether the data will be retained, replaced, transferred, or destroyed||Senior staff, research domain project manager, software engineer|
STATE 3: THE LONG-TERM PRESERVATION PLATFORM
The third state is the long-term preservation platform. Content (e.g., data and code) is preserved in such a platform when it is anticipated that the data will not be actively used for the foreseeable future or if the resources are not available to maintain an active repository. For example, data from an active repository may be transformed into text, delimited strings, images, or other forms that may be viewed or processed without the context of the data management systems of States 1 and 2. This transformation enables preservation over tens to, perhaps, hundreds of years through changes in governance and computational technologies and may include compression (although compression could hinder preservation if the corresponding decompression routines are not also preserved). Storage may be offline. Data may be rehydrated (see Box 2.2) as needed and moved back into an active environment, where they can be accessed and more easily discovered.
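The transformation to a system-independent form, and the caution about compression, can be sketched as follows. The example assumes tabular records, delimited text (CSV) as the preservation form, and gzip as the compression method. The function names `export_for_preservation` and `rehydrate` and the manifest fields are hypothetical. The manifest records how the bytes were encoded and compressed so that the information needed for decompression is kept with the data.

```python
import csv
import gzip
import hashlib
import io
import json


def export_for_preservation(rows, fieldnames):
    """Return (compressed_bytes, manifest_json) for a list of dict records."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    plain = buffer.getvalue().encode("utf-8")

    # The manifest documents format, encoding, and compression explicitly,
    # so a future reader need not guess how to unpack the content.
    manifest = {
        "format": "text/csv",
        "encoding": "utf-8",
        "compression": "gzip (RFC 1952)",
        "fields": list(fieldnames),
        "sha256_uncompressed": hashlib.sha256(plain).hexdigest(),
    }
    return gzip.compress(plain), json.dumps(manifest, indent=2)


def rehydrate(compressed, manifest_json):
    """Decompress, verify integrity, and parse the preserved records."""
    manifest = json.loads(manifest_json)
    plain = gzip.decompress(compressed)
    if hashlib.sha256(plain).hexdigest() != manifest["sha256_uncompressed"]:
        raise ValueError("checksum mismatch: preserved copy is damaged")
    reader = csv.DictReader(io.StringIO(plain.decode(manifest["encoding"])))
    return list(reader)
```

Values come back as strings, because delimited text carries no type information. Any typing rules would also need to be documented alongside the data.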
There will naturally be overlap in some activities in all the data states. The distinction between States 2 and 3 helps focus on the different issues that arise as one moves from facilitating active use to long-term retention. Those managing a State 2 information resource may make decisions related to a State 3 resource, and the movement from State 2 to State 3 could potentially be seamless. Following good archival practice, State 2 resource managers may automatically create preservation copies of the data as they are accessioned, or those data may be stored in a preservation format. Drawing a boundary between States 2 and 3 helps to ensure that decision-making processes also consider the challenges of long-term data preservation and their associated costs.
Table 2.3 describes State 3 activities and subactivities as well as the types of individuals who carry out those activities.
TABLE 2.3 State 3: Long-Term Preservation Platform
|Activity||Personnel|
|A. Preservation Planning: Services and functions for ensuring that the archive remains accessible over the long term||Senior staff, records management specialist, curator, IT project manager, software engineer|
|B. Ingest and Data Transformation: Processes involved in receiving and preparing the data for insertion in the archive||IT project manager, records management specialist, curator, software engineer, data wrangler, data scientist|
|C. Archive Storage: Services and functions for long-term data storage||Software engineer, IT project manager, IT security specialist|
|D. Common Services: Shared supporting services||IT systems engineer, facilities manager|
|E. Data Export or Deaccession: Functions involved in transferring custody of or deaccessioning data||Senior staff, software engineer, research domain curator|
PERSONNEL AND THEIR RELATIVE SALARY LEVELS
Based on published case studies (e.g., Palaiologk et al., 2012) and experience of individual committee members, personnel salaries often account for the largest expenditures in data preservation, curation, and access. Appendix C provides data drawn from occupational employment statistics for the relative salary levels shown in Table 2.4. Table 2.4 defines the roles of the personnel shown in Tables 2.1-2.3 and indicates a relative salary level (VH, very high; H, high; M, medium) for each of them based on information from Appendix C.
TABLE 2.4 Personnel Categories with Definitions and Relative Salary Levels
|Personnel||Definition||Relative Salary Level|
|Administrative staff||Provides a variety of support functions for a project or program||M|
|Communication specialist||Trained in effective methods for publicizing and disseminating information to a broad audience||M|
|Curator||Often an archivist, trained in methods to describe and add value to data||M|
|Data librarian||Trained in the technical aspects of data management||M|
|Data scientist||Trained in quantitative methods for collecting, analyzing, and interpreting data||H|
|Data wrangler||Trained in methods for transforming data from one format into another and data cleansing for improved data interpretation||H|
|Education specialist||Trained in design, modification, and implementation of training materials relevant to data management and use||M|
|Facilities manager||Oversees and handles matters relating to the physical environment||M|
|Informatician||Trained in biology, medicine, or other health-related field and in quantitative methods for collecting, analyzing, and interpreting data in those fields||VH|
|IT project manager||Responsible for planning, executing, and overseeing a project; trained IT specialist||H|
|IT security specialist||Trained in methods to protect IT systems against inadvertent or malicious attacks||VH|
|IT systems engineer||Trained in implementing, monitoring, and maintaining IT systems||VH|
|Metadata librarian||Trained in the technical aspects of data standards||M|
|Policy specialist||Trained in relevant ethical, legal, and regulatory requirements||H|
|Project manager||Responsible for planning, executing, and overseeing a project||M|
|Records management specialist||Often an archivist, trained in managing data throughout the data life cycle||M|
|Research domain curator||Domain expert trained in methods to describe and add value to data||H|
|Research domain project manager||Domain expert responsible for planning, executing, and overseeing a project||H|
|Researcher||An individual who generates potentially shareable data while conducting research||H|
|Senior staff||Has a supervisory and decision-making role within an organization or program||VH|
|Software engineer||Trained in the design, implementation, testing, evaluation, operation, and maintenance of computer programs or databases||VH|
NOTE: H, high; M, medium; VH, very high.
REFERENCES
Ayris, P., R. Davies, R. McLeod, R. Miao, H. Shenton, P. Wheatley, S. Grace, et al. 2008. LIFE2 Final Project Report. http://discovery.ucl.ac.uk/11758/1/11758.pdf.
Beagrie, C. 2019. Keeping research data safe: Cost-benefit studies, tools, and methodologies focussing on long-lived data. https://beagrie.com/krds.php.
Fontaine, K., G. Hunolt, A. Booth, and M. Banks. 2007. Observations on cost modeling and performance measurement of long-term archives. NASA research paper in PV2007 Conference Proceedings. http://www.pv2007.dlr.de/Papers/Fontaine_CostModelObservations.pdf.
Lavoie, B. 2014. The Open Archival Information System (OAIS) Reference Model: Introductory Guide, 2nd ed. (Charles Beagrie, Ltd, eds.). Digital Preservation Coalition. https://www.dpconline.org/docs/technology-watch-reports/1359-dpctw14-02/file.
NASEM (National Academies of Sciences, Engineering, and Medicine). 2018. Open Science by Design: Realizing a Vision for 21st Century Research. Washington, D.C.: The National Academies Press.
Palaiologk, A., A. Economides, H. Tjalsma, and L. Sesink. 2012. An activity-based costing model for long-term preservation and dissemination of digital research data: The case of DANS. International Journal on Digital Libraries 12:195-214.