The Cost-Forecasting Framework: Identifying Cost Drivers in the Biomedical Data Life Cycle
Thus far, this report has provided foundational information. This chapter organizes that information into a framework for identifying the major cost drivers for any biomedical research information resource. It can be applied by anyone who generates, collects, or manages data at some point in the data life cycle, or it may be applied by a funding or institutional official. The framework walks the cost forecaster through the various characteristics of data and information resources to determine which of those are likely to represent major cost drivers in the short and long terms. Cost forecasters will likely need to consult with multiple individuals with varied expertise to minimize uncertainty in the forecast.
The framework presented herein should be considered the basis of a cost forecast rather than a one-size-fits-all analytical tool. How it is applied in any situation depends on the circumstances, needs, and resources available to those involved. The activities, decisions, and cost drivers will be situationally dependent, and the framework will need to be modified to suit the specific purpose. In whatever application, however, the forecaster is encouraged to think beyond the costs associated with the specific data state being developed or managed. In the long term, it is more efficient to think early about how decisions may affect the costs of data management and access in future data states, the transitions to those states, and the future value of data to the scientific enterprise.
Making the right decisions about infrastructure can help to minimize many costs, as can taking advantage of economies of scale. But decisions need to be weighed against each other to understand their short- and long-term cost implications and their effects on data value. Many costs incurred over time are related to the curation, management, and preservation of different types of data, from different sources, that are generated by different evolving technologies and that are to be aggregated, accessed, and used in new ways. The cost-forecasting framework will help the forecaster identify the decisions to be made about the variables that can impact costs and the value of data in the short and long terms. The forecaster will necessarily focus on costs associated with the resource under development or being managed but will need to be aware of how decisions made in the earliest planning stages might affect long-term costs of data curation and use. Decisions made early in the planning process might increase the efficiency of future data curation and use or make future data curation prohibitively expensive.
Table 4.1 provides a framework for conducting a cost forecast. Subsequent sections of this chapter describe how to accomplish the steps. It should be noted that although these are presented as “steps,” they are actually activities that may occur concurrently, or iteratively as new information is gathered.
TABLE 4.1 Steps for Forecasting Costs of a Biomedical Information Resource
CONSULTING WIDELY TO CONDUCT A COST FORECAST
The cost forecaster may be a researcher but could also be a funding or institutional official. The steps outlined in Table 4.1 require the cost forecaster to develop a narrative regarding the biomedical information resource and the data contained within it that considers the entire life cycle of the data. To envision the data life cycle, which likely extends beyond any single funded performance period, the forecaster may need to consult an array of experts to understand how decisions made about data and the information resource affect the data life cycle and costs. When identifying how the information resource is to be used both in the present and future (step 1 in Table 4.1), the forecaster may refer to the request for application (RFA) from a funding agency, consider the goals and objectives of relevant research, and consult experts within the institution that will host the information resource about available assets. An RFA may require or provide guidance regarding specific data curatorial or preservation activities that will define the type of resource to be created or managed, or the researcher or research community may have specific needs or standards to be met. Aligning those activities with the activities associated with States 1, 2, or 3 (see Tables 2.1, 2.2, and 2.3) will help the forecaster determine the data state of the information resource. Because the scientific enterprise benefits from preserving the long-term value of the data and from increasing the efficiency and effectiveness of long-term data curation and use, activities related to eventual transfer of data to other states need to be considered. To identify data characteristics, data contributors, and data users (step 2 in Table 4.1), the cost forecaster will need to work with her institution, project funders, and perhaps the broader research community to identify or develop appropriate metrics to better understand and manage costs.
Although metrics for determining the costs and value of biomedical research data are immature, the framework explicitly makes identification of the current and potential value of the data an integral part of the cost-forecasting process (step 3 of Table 4.1). The long-term value of data comes with their being discovered, aggregated, and reused. Repositories that commit to archiving all data submitted for some designated minimum period (e.g., to satisfy a research funding agency’s data archiving requirements), regardless of the current or future value of those data, risk using valuable resources for little return on investment. Chapter 3 provides information regarding the economics of cost forecasting and data valuation.
Identifying the personnel and infrastructure necessary in the short and long terms (step 4 of Table 4.1) requires an identification of the activities and subactivities associated with the desired data state. The forecaster will refer to Tables 2.1, 2.2, and 2.3 that describe the high-level activities and subactivities associated with each of the data states as well as the personnel who might be required for each activity to develop and maintain an efficient and sustainable data resource. The forecaster will need to consider the goals for the information resource under consideration to identify the appropriate activities. Table 2.4 describes many of the categories of personnel required for all the activities and subactivities, as well as their relative salary levels. The forecaster can work with the institution to determine how personnel resources might be acquired and then with those staff to help determine how physical infrastructure needs may be met. Some design or modification and implementation costs related to the information resource capabilities (e.g., persistent identifiers, citation management, and search) might be avoided if open-source software can be used. However, use of open-source software still incurs costs, such as those related to integrating with other repository components, updating the software as new versions are available, and harmonizing the user interface with the overall look and feel of the repository. Consulting with IT professionals, metadata librarians, software engineers, and many others may be necessary to compile the information necessary to identify the major cost drivers (step 5 in Table 4.1).
MAPPING COST DRIVERS TO ACTIVITIES IN EACH DATA STATE
The fifth step in the forecast is identifying the cost drivers and decision points associated with each anticipated activity and how those decisions might affect the ways data may be used, as well as the cost of those uses. If one were to forecast the costs of manufacturing a physical product (e.g., a digital camera), one would want to know how it will be used and distributed, its specific features (e.g., megapixels, memory capacity), and desired characteristics (e.g., long battery life, small form factor). It is also desirable to understand the properties of the components that go into the product (e.g., microprocessor power consumption, defect rate on lenses). Similarly, in costing a biomedical information resource,1 its intended content, capabilities, and context, as well as the properties of data that will populate it, must be understood. In this section, those dimensions of a biomedical information resource likely to have the greatest effects on cost are considered. In each case, choices are laid out along the dimensions or range of variation in the data and the manner in which they may influence costs. Box 4.1 provides some examples in the biomedical research field regarding decisions and actions that affect costs in the short and long terms.
Table 4.2 is a generalized matrix developed by the committee that shows which cost drivers, identified by the committee and described later in this chapter, are most likely to affect the costs of specific activities in each of the data states described in Chapter 2. Although individual research activities, databases, and archives may generate costs differently than depicted, based on requirements for particular data sets or research platforms, Table 4.2 provides useful information when conceptualizing costs into the future. In most cases, the cost of long-term data preservation will not be accrued by a single individual or institution; rather, responsibility at different stages may be transferred from, for example, a researcher to a data platform host. Understanding where costs will be accrued and who has managerial responsibility for them will inform decision makers for all data states. Box 4.2 provides guidance on how to use Table 4.2. The individual cost drivers and decision points that can affect those cost drivers will be discussed in later sections.
Table 4.2 is a high-level summary of the main cost drivers influencing each activity. Some of the activities, listed in the columns in Table 4.2, are affected by a large number of cost drivers. People engaged in those activities (or costing of them) need to be sure not to focus too narrowly on one or two of the cost drivers (rows in Table 4.2) in decision making and planning. Activities affected by the most cost drivers are listed below. Definitions for all of these, and questions to guide decisions around them, are provided later in the chapter.
- I.C Knowledge Generation and Validation. This item encompasses subactivities for creating shareable research data. These activities are critical for promoting use and controlling preservation costs at later states in the data life cycle. Many of the cost drivers, such as metadata requirements, persistent identifiers, and quality control, reflect up-front work that benefits downstream use.
- II.B Functional Specification and Implementation. It is not surprising that this activity is influenced by many cost drivers, as it includes a large number of subactivities involving design or modification and implementation of all of the main repository components.
- II.F Data Aggregation and Linking. The large number of major cost drivers for this activity may indicate two general conclusions. One is that the nature, quality, and amount of data in a repository strongly influence the effort required for successful aggregation and linkage. The second is that data linkage, especially to external sources, creates dependencies that must be managed whenever data at either end of a link change.
- II.L Data Retention or Replacement. The number of cost drivers influencing these activities perhaps highlights the complexity of decisions about data retention, encompassing characteristics of the data, users and uses of the data, and constraints that regulate the data.
- III.B Ingest and Data Transformation. This activity has many cost drivers in common with Knowledge Generation and Validation. That similarity is not surprising; it reflects that decisions made during the generation of data have a major influence on activities at the end of the data life cycle, and on where costs are borne. For example, rigorous metadata requirements on the initial collectors of data require more effort of those people but simplify the job of those charged with archiving the data later.
It also is evident that certain cost drivers (rows in Table 4.2) affect many activities (columns in Table 4.2). When specifying and scoping a biomedical information resource, special attention should be given to these cost drivers because the ramifications of decisions related to them will strongly influence costs. The cost drivers that affect the most activities are listed below.
1 “Biomedical information resource” is used in this chapter as a generic term for a system for storing and accessing biomedical information, across all the states introduced in Chapter 2. It might be a group workspace in a single laboratory (State 1), a public repository (State 2), or a cold-storage archive (State 3), among other possibilities.
- A. Content. It is not unexpected that the content facets of size, complexity and diversity of data, and metadata requirements affect so many activities. The effort required in many activities scales directly with these aspects.
- H. Confidentiality, Ownership, and Security. The prevalence of these cost drivers across many activities, especially confidentiality and security, derives in large part from the prevalence of human-subjects and animal-model data in the biomedical domain.
- J. Standards, Regulatory, and Governance Concerns. The applicable standards cost driver influences a number of activities, which reflects the two sides of standards: there is effort required to conform to them, but dealing with data that conform to standards often facilitates other activities. The regulatory and legislative cost driver also impinges on many activities, which, again, possibly arises from the extensive use of human and animal data in biomedicine.
INDIVIDUAL COST DRIVERS IN THE DEVELOPMENT AND OPERATION OF A BIOMEDICAL INFORMATION RESOURCE
There is a wide variety of biomedical information that is worth preserving and sharing, from genomic sequences to clinical outcomes. Because of the variation in content and other aspects, the costs of constructing, maintaining, and accessing such information can differ greatly. This section describes the main ways biomedical information resources may vary and why each variation is likely to affect costs or utility. The variations are grouped into more general categories in the next subsections, which are numbered to correspond with the categories provided in Table 4.2. When considering costs of alternatives, total costs related to managing, accessing, and using data need to be considered: both those borne by the operators of the resource and those borne by its users. Decisions regarding quality, delivery, or stewardship of data that will populate the resource can all drive up costs (or limit the value of the resource) as well. The more ambitious the plans for a biomedical information resource, the more personnel and financial resources will be required to support it, but the greater the potential benefit to users of the information resource and to scientific discovery. Thus, understanding the properties of the data is essential for estimating the costs involved with a biomedical information resource so that the forecaster can understand the short- and long-term trade-offs related to each of the cost drivers. Cost drivers related to issues with input data and their disposition are called out in some of the subsections below.
Table 4.2 will help the cost forecaster understand which information-resource-related activities are likely to be important drivers of short- and long-term costs. Questions to help the cost forecaster identify key decision points for each cost driver are provided. The questions are written at a high level and are intended to help the forecaster identify areas where more detailed lines of inquiry are warranted. When forecasting costs, these are the types of questions a cost forecaster needs to ask about each cost driver. How the forecaster answers these questions affects not only the cost of managing a given data state but also future costs for users or future data state managers.
The questions have been compiled into a blank table in Appendix E that could be used as a template when considering long-term costs. The template could help the forecaster organize the detailed narrative necessary to realistically assess activities that promote efficient and effective data preservation and use. The narrative can then drive a detailed quantitative analysis of the costs based on the resources available to the forecaster. Examples of how the template can be applied are provided in Chapter 5. Appendix F compares cost drivers for three hypothetical biomedical information resources (one for each data state).
TABLE 4.2 Drivers Affecting Cost of Data-Related Activities in the Three Data States
|State 1: Primary Research and Data Management Environment||State 2: Active Repository and Platform||State 3: Long-term Preservation Platform|
|I.A Outreach & Training||I.B Provocation & Ideation||I.C Knowledge Generation & Validation||I.D Dissemination & Preservation||II.A Community Leadership||II.B Functional Specifications & Implementation||II.C Validation||II.D Acquisition||II.E Ingest||II.F Data Aggregation & Linking||II.G Database Management||II.H Access||II.I User Support||II.J Administration||II.K Common Services||II.L Data Retention or Replacement||III.A Preservation Planning||III.B Ingest & Data Transformation||III.C Archive Storage||III.D Common Services||III.E Data Export or Deaccession|
|A.2 Complexity and diversity of data types||✓||✓||✓||✓||✓||✓||✓||✓||✓||✓||✓||✓||✓|
|A.3 Metadata requirements||✓||✓||✓||✓||✓||✓||✓||✓||✓||✓|
|A.4 Depth versus breadth||✓||✓||✓||✓|
|A.5 Processing level and fidelity||✓||✓||✓||✓|
|A.6 Replaceability of data||✓||✓||✓|
|B.1 User annotation||✓||✓||✓||✓|
|B.2 Persistent identifiers||✓||✓||✓||✓||✓||✓||✓|
|B.4 Search capabilities||✓||✓||✓||✓|
|B.5 Data linking and merging||✓||✓||✓|
|B.6 Use tracking||✓||✓||✓||✓||✓||✓|
|B.7 Data analysis and visualization||✓||✓||✓|
|C.1 Content control||✓||✓||✓||✓|
|C.2 Quality control||✓||✓||✓||✓||✓||✓||✓|
|C.3 Access control||✓||✓||✓||✓|
|C.4 Platform control||✓||✓||✓|
|D. External Context|
|D.1 Resource replication||✓||✓|
|D.2 External information dependencies||✓||✓||✓||✓|
|E. Data Life Cycle|
|E.1 Anticipated growth||✓|
|E.2 Update and versions||✓||✓||✓||✓|
|E.3 Useful lifetime||✓||✓|
|E.4 Offline and deep storage||✓||✓||✓|
|F. Contributors and Users|
|F.1 Contributor base||✓||✓||✓||✓||✓||✓||✓|
|F.2 User base and usage scenarios||✓||✓||✓||✓||✓|
|F.3 Training and support requirements||✓||✓||✓||✓||✓||✓|
|G.1 Tolerance for outages||✓||✓||✓||✓||✓||✓||✓|
|G.3 Response time||✓||✓||✓||✓||✓||✓|
|G.4 Local versus remote access||✓||✓||✓|
|H. Confidentiality, Ownership, and Security|
|I. Maintenance and Operations|
|I.1 Periodic integrity checking||✓||✓||✓||✓||✓|
|I.2 Data-transfer capacity||✓||✓||✓||✓||✓||✓|
|I.3 Risk management||✓||✓||✓||✓||✓|
|I.4 System-reporting requirements||✓||✓||✓|
|I.5 Billing and collections||✓||✓||✓||✓||✓||✓|
|J. Standards, Regulatory, and Governance Concerns|
|J.1 Applicable standards||✓||✓||✓||✓||✓||✓||✓||✓|
|J.2 Regulatory and legislative environment||✓||✓||✓||✓||✓||✓||✓||✓||✓||✓||✓|
|J.4 External consultation||✓||✓||✓||✓|
A. Content
The aspects covered in this section deal with the amount, kinds, and qualities of data that a biomedical information resource is expected to host.
A.1 Size
There are at least two facets to size: overall size (e.g., volume of data in bytes) and the number of identifiable items. The overall size affects media costs, time required to replicate and transfer data, and perhaps time to verify or index data. The number of identifiable items can affect the sizes of indexes and the amount of metadata (e.g., the descriptive, structural, administrative, reference, or statistical information about data found in a database), as well as the time to curate the data. Identifying data at a finer granularity can help make searches more specific and might help a user avoid downloading large amounts of extraneous information.
Example decision points related to size:
- How many files will be in a single data submission?
- How large is an average data submission in total?
- Are the data sizes likely to stay stable over the life of the resource?
- What is the total amount of data expected?
- In what kind of medium will data be captured in the short and long terms?
A.2 Complexity and Diversity of Data Types
Data in some biomedical information resources, such as The Cancer Genome Atlas,2 were collected expressly for the resource. In such a situation, the resource managers have strong influence over the specific formats, standards, required fields, and other elements. The resulting homogeneity in the data makes them easier to process. In other situations, the data that end up in the resource are originally collected for other purposes, such as a specific research project or patient care. In that case, one expects that the data will be more heterogeneous and not necessarily conform to the conventions of the resource, thus requiring more effort to ingest and curate.
The items in an information resource might be structurally simple (e.g., deoxyribonucleic acid [DNA] sequences) or complex (e.g., patient medical histories). The resource might contain a single kind of data or several. The more data types and the greater their complexity, the greater the cost to design and maintain the storage schema, as well as the number and complexity of load scripts, quality control routines, query interfaces, and documentation. Cost-efficiently integrating multiple data types in a high-quality manner requires expertise in each of those data types.
The organization of data to be included in the resource will affect the effort required to assess whether the data should be included in the resource and the order in which they should be processed. For example, data items might be in separate files, organized into hierarchical collections, or perhaps grouped by species, chromosome number, patient identification, or phenotype. Such an arrangement can make it easier to select appropriate subsets for inclusion in the resource. In contrast, data items in a data set might all be in a single large file (e.g., the result of a backup), requiring a scan of the entire data set to extract the subsets appropriate for inclusion in the information resource.
Example decision points related to complexity and diversity of data types:
- How complex is the underlying structure of the data?
- How are the included data to be organized?
- How complex is the experimental paradigm that produced the data?
2 The website for The Cancer Genome Atlas is https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga, accessed December 5, 2019.
- What sort of additional files might be necessary to upload with the data to properly understand them?
- How many different data types are being produced?
- What are the relationships among these data types (e.g., are the data correlated)?
A.3 Metadata Requirements
An information repository might contain more or less metadata per identifiable item. Possible metadata elements include contributor, provenance (source and record of possession), lineage (derivation history), data uncertainty or quality, and search attributes. It may be possible to derive some of those metadata, such as summarizations, from the data themselves. Other parts could depend on external context and need to be supplied by the producers or curators of the data. Still other metadata might be in a form intended for human rather than machine understanding, in which case human effort will be necessary to interpret them. What portion of the metadata falls in each case affects costs, as the former case can be automated, while the latter two incur labor costs. Those labor costs might be borne by those that produce the data or by those who curate the information resource. The metadata (or some part of it) may need to be uploaded to a clearinghouse—a platform dedicated for publishing metadata for the purpose of discoverability.3 Doing so will entail extra steps when adding data but will make them locatable by more people. Even small amounts of metadata can help with decisions about storage, archiving, or removal of data, thus saving long-term costs. One example is knowing if a data set represented is difficult or impossible to reobtain (i.e., “base” data)—either collected internally or obtained from an outside source—versus data that can be derived from base data. Another is whether the data have or were influenced by protected health information in any way. If the community has agreed on standards and tools, or if the targeted repository has data submission tools with standard data formats, the initial cost of managing data in the research environment may increase, but downstream costs may be reduced.
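The distinction above between metadata that can be derived from the data themselves and metadata that must be supplied by people is what separates automatable (cheap) from labor-intensive (costly) curation. The following sketch, a minimal illustration not tied to any particular repository, computes a few derivable elements for a submitted file; the field names are hypothetical.

```python
import hashlib
from pathlib import Path

def derive_metadata(path: str) -> dict:
    """Compute metadata elements that can be derived from the data themselves.

    Contextual elements (contributor, provenance, lineage) cannot be derived
    this way and must still be supplied by the data producer or curator.
    """
    p = Path(path)
    data = p.read_bytes()
    return {
        "filename": p.name,
        "size_bytes": len(data),
        # A fixity checksum supports later integrity checking (driver I.1).
        "sha256": hashlib.sha256(data).hexdigest(),
        # Record count assumes one record per line, a common but not
        # universal convention for tabular submissions.
        "record_count": data.count(b"\n"),
    }
```

Even a small derivable record like this can support the storage, archiving, and removal decisions discussed above at essentially no labor cost, whereas each human-supplied element adds per-item effort.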
A data schema is a description of the data structure. At a minimum, the presence of a schema helps resource developers to understand the precise structure of the data they need to process. The absence of a schema is likely to require creation of special data processing software. The schema might be quite specific, such as a relational database schema, which lists the column names for each table, along with their data types, which columns must hold unique values, and other constraints, or it might include only column names or define possible structures. An Extensible Markup Language (XML) schema is an example of the latter. The schema might be held separately from the data, as in the case of relational or XML schemas, or embedded with the data in the form of, for example, column names in a spreadsheet or the header in a hierarchical data format (HDF) file. Some schemas can be supplied in machine-readable form, allowing for automated or semi-automated processing. The presence of a data schema can help biomedical-information-resource developers and managers in multiple ways. It might support an automated means to upload the data into the resource’s data store or simplify scripts to do so. If data are guaranteed to conform to the schema, then less checking is required during data ingest.
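The cost benefit of a machine-readable schema can be made concrete with a small sketch. The schema format below is hypothetical (a simple mapping of column name to expected type, with invented column names); a real repository might instead use relational DDL or an XML Schema, but the principle is the same: data guaranteed to conform need less checking at ingest, and nonconforming data can be flagged automatically rather than by hand.

```python
# Hypothetical machine-readable schema: column name -> expected type.
SCHEMA = {"sample_id": str, "chromosome": str, "position": int}

def validate_row(row: dict, schema: dict = SCHEMA) -> list:
    """Return a list of problems found; an empty list means the row conforms."""
    problems = []
    for col, typ in schema.items():
        if col not in row:
            problems.append(f"missing column: {col}")
        elif not isinstance(row[col], typ):
            problems.append(f"wrong type for {col}: expected {typ.__name__}")
    return problems
```

A script like this replaces per-row human inspection with an automated check, shifting ingest cost from labor to a one-time schema-definition effort.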
The provenance of a data set is an account of how the data came to be in their current state. It might indicate who collected or generated the data, where, and when. Having such information with the data in the repository can improve trust in the data, thereby increasing the value of a biomedical information resource. The provenance could also include the processing history of the data, which might reveal biases in the data or indicate the appropriateness of particular further processing. For example, indicating if outliers were cleaned from the data might affect the suitability of certain statistical analyses of the data. The provenance could also contain parameters relevant to the collection or generation of the data, such as settings of an instrument used to collect the data, or the configuration file for a computational model. That type of information supports correct interpretation of the data by the biomedical-information-resource developers. Without such information, those managing the resource may need to reverse-engineer the values of the relevant parameters (or seek them elsewhere). If the parameters cannot be recovered, it might not be possible to determine if the data meet the conditions for inclusion in the resource, necessitating their rejection.
3 An example of a metadata clearinghouse is the National Biological Information Infrastructure Metadata Clearinghouse hosted by the U.S. Geological Survey (see https://www.sciencebase.gov/catalog/item/4f4e48bee4b07f02db53a643).
Example decision points related to metadata requirements:
- How much metadata must be stored with each data object to make them findable, accessible, interoperable, and reusable (FAIR)?
- Will metadata be entered manually by the submitter/curator?
- Will the data to be deposited include a data schema, or will one be generated?
- Is the provenance of a data set sufficiently described, or does it need to be?
- How much metadata can be extracted computationally?
A.4 Depth Versus Breadth
A biomedical information resource might be directed at a certain class of data (e.g., DNA sequences or cell images), regardless of the kind of study that generated them. Alternatively, the resource might target all types of data arising out of a particular domain of study or from a specific community. For example, a resource that hosts data from brain-damage studies might include functional magnetic resonance imaging images, optical images of brain slices, genomic and proteomic analyses, cognitive function tests, and clinician reports. A resource with responsibility to collect a wide range of data will likely be more expensive per data unit, as schemas, search capabilities, curation procedures, and other aspects of the resource design and management will need to be replicated on a per-data-type basis. Furthermore, as a field evolves and new experimental and computational techniques are developed, the resource potentially needs to extend its capabilities to handle data generated by those techniques. Decisions related to depth versus breadth of data will need to be informed by user needs and expectations. While it may be less expensive per data unit to have a resource focus on a single class of data, integration across multiple resources can be time intensive and costly for users. If researchers frequently use multiple data modalities, it may be more cost effective in the long term to design integration solutions early.
Example decision point related to depth versus breadth:
- Will the repository be restricted to certain data classes, or must it support all data types arising from a particular domain or community?
A.5 Processing Level and Fidelity
As data proceed from their raw form, through calibration, cleaning, and processing, to analysis, their volume generally shrinks. Thus, the point in this spectrum that a biomedical information resource targets for the data it collects can have a large effect on data volume. For example, storing all raw DNA reads from a sequencing run will require much more space than storing just a consensus sequence for the reads. Storing more-detailed versions of data incurs cost for the larger space and also for the effort of uploading and curating more data. However, the more-detailed data might support a larger range of uses. There are also potential space savings from storing approximate data rather than exact values. For example, DNA sequencing data can come with a quality score at each base position. Most current scoring schemes feature more than 40 different scores, which means that a quality score takes more space than the base it annotates. Replacing exact quality scores by 2-bit approximations could reduce storage by one-third.4 Whether such an approximation is acceptable must be decided based on the intended use of the data.
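The quality-score example can be sketched as follows. This is an illustrative binning scheme, not any vendor's actual method: Phred-scaled scores (commonly 0 to about 41) are mapped to four representative levels, so each score fits in 2 bits rather than the 6 or more bits needed to distinguish the full range. The bin boundaries here are assumptions chosen for clarity.

```python
def bin_quality(q: int) -> int:
    """Map a Phred quality score to one of four bins (representable in 2 bits).

    Boundaries are illustrative; production schemes choose boundaries to
    minimize the effect on downstream analyses such as variant calling.
    """
    if q < 10:
        return 0  # low confidence
    elif q < 20:
        return 1
    elif q < 30:
        return 2
    else:
        return 3  # high confidence
```

Whether the resulting loss of resolution is acceptable must, as noted above, be decided based on the intended uses of the data.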
Structure of the data also affects the processing level. Data intended for a biomedical information resource might be more or less explicitly structured and therefore may require different levels of processing to be incorporated into the resource. Highly structured data residing in a relational database management system, or in comma-separated-value files, may have easily discernible structures. Data might be in a semi-structured representation, such as an XML data format, in which case more analysis and more complex ingest scripts may be required. Data in simple text formats, or scans of text, may require still more extensive processing to ingest.

4 e.g., see Illumina, 2014, “Reducing Whole-Genome Data Storage Footprint,” Pub. No. 970-2012-013, Illumina, Inc., April 17, https://www.illumina.com/Documents/products/whitepapers/whitepaper_datacompression.pdf. Accessed May 27, 2020.
Data intended for a biomedical resource might use a character or numeric encoding for values that differs from those used in the resource, in which case the data will need to be re-encoded after ingestion. A data source might also be compressed, which can have advantages and disadvantages. The smaller size of compressed data might reduce network-transfer times or intermediate storage of the raw data. On the other hand, the data will likely need to be decompressed, in whole or in part, to perform checks and manipulations at the resource.
Example decision points related to processing level and fidelity:
- Do the raw data need to be stored?
- Do processed data need to be stored?
- Are there compression algorithms that can reduce the file size without compromising fidelity?
- What kind of data structure requirements will the resource have?
- Is the data contributor or the repository responsible for any restructuring necessary?
- How is the data structure verified?
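The decision point on compression without compromising fidelity can be checked mechanically for any lossless scheme: compress, decompress, and verify the round trip. A minimal sketch using Python's standard-library gzip module:

```python
import gzip

# Sketch: verifying that a compression scheme is lossless by round-tripping
# the data. Repetitive sequence-like data compresses especially well.
raw = b"ACGT" * 1000
compressed = gzip.compress(raw)
restored = gzip.decompress(compressed)

assert restored == raw              # fidelity preserved (lossless)
print(len(raw), len(compressed))    # compressed size is far smaller here
```

Lossy schemes (such as the quality-score approximation discussed above) fail this test by design, which is why their acceptability must instead be judged against intended uses.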
A.6 Replaceability of Data
A biomedical information resource might be the “official” home for certain data sets or could simply be a replica of data whose master copy resides elsewhere. Even if the resource is the official home, it might be possible to regather the data in it through repeating experiments or calculations. The cost of replacing data (or the impossibility of doing so) informs what are reasonable expenditures on redundancy and other means for loss prevention.
Example decision points related to replaceability of data:
- Is the archive the primary steward of the data, or do copies exist elsewhere?
- Can the data be easily recreated?
B. User Capabilities
The previous section covered properties of the data themselves. This section covers aspects of a biomedical information resource that describe what information resource users are able to do with the data in the resource (i.e., without extracting the data into another environment).
B.1 User Annotation
A biomedical information resource might support comments, annotations, and corrections on data items beyond those originally submitted by the contributors of the data. If so, there is a cost for developing such a facility and for overseeing its appropriate use. Human or machine interventions in a data resource also need to be documented, authenticated, and retained as part of the metadata.
Example decision points related to user annotation:
- Will the repository have to provide user annotation capabilities?
- What is the nature of these annotations?
- Are they provided by humans or machines, and how will they be authenticated?
- Are permissions required to annotate the data?
B.2 Persistent Identifiers
A biomedical information resource might want or have to support persistent identifiers (PIDs) for data sets or data items, such as Digital Object Identifiers (DOIs).5 The host of the resource may have to pay directly for the ability to assign such identifiers or indirectly in participation in an organization that has that capability. Support of such identifiers also carries a requirement to maintain a mapping from identifiers to data entities even as those entities are modified or moved.
Example decision points related to persistent identifiers:
- What PID scheme will be used by the archive?
- Is there a cost associated with using the PID?
- How many objects need to be identified?
- Who will be responsible for keeping the PIDs resolvable?
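The core maintenance obligation behind PIDs is a mapping from permanent identifiers to current locations, updated whenever entities move. A minimal sketch (the PID strings and URLs below are hypothetical examples, not real registrations):

```python
# Sketch of the identifier-to-location mapping a resource must maintain
# so that PIDs stay resolvable as data entities are modified or moved.

class PidRegistry:
    def __init__(self):
        self._locations = {}

    def register(self, pid, location):
        self._locations[pid] = location

    def move(self, pid, new_location):
        # The PID itself never changes; only the mapping is updated.
        if pid not in self._locations:
            raise KeyError(f"unknown PID: {pid}")
        self._locations[pid] = new_location

    def resolve(self, pid):
        return self._locations[pid]

registry = PidRegistry()
registry.register("10.9999/example.dataset1", "https://archive.example.org/v1/ds1")
registry.move("10.9999/example.dataset1", "https://archive.example.org/v2/ds1")
print(registry.resolve("10.9999/example.dataset1"))
```

In practice this mapping is maintained through a registration agency (e.g., for DOIs), but the resource remains responsible for keeping its entries current.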
B.3 Citation
A biomedical information resource might support citation of data items (or sets of items) at a granularity smaller than entire data sets. If so, there might need to be a facility that, given an item or sets of items, generates a citation and, conversely, given a citation, locates the corresponding items.
Example decision points related to citation:
- Will users be able to create arbitrary subsets of data files and mint a PID for citation?
- Will the repository provide machine-readable metadata for supporting data citation?
- Will the repository provide export of data citations for use in reference managers?
B.4 Search Capabilities
Efficient search for items in a biomedical information resource, beyond simple look-up by a reference or accession number, usually requires construction and maintenance of indexes over the data. Those indexes take space beyond what is required for the base data, plus time to construct and maintain. If there are many indexes, they can slow updating of the base data. Different kinds of searches require different kinds of indexes. For example, an index that supports look-up of items by exact match to a value might not support searching on a range of values. Full-text searches and approximate matching (such as for genomic sequences) require specialized index structures to execute efficiently.
Example decision points related to search capabilities:
- Will the repository provide a search capability for data sets?
- How much of the metadata will be included in search?
- How complex are the queries that will be supported?
- What types of features for search will be provided?
- Will the repository deploy services to search the data directly?
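The distinction above between exact-match and range-capable indexes can be illustrated with a small Python sketch: a hash index answers only exact-match queries, while a sorted index also supports range queries via binary search. The record identifiers and sizes are hypothetical.

```python
import bisect

# Sketch: two index types over the same records (id -> size in MB).
records = {"ds1": 150, "ds2": 900, "ds3": 450, "ds4": 300}

# Hash index: value -> ids (supports exact match only).
exact_index = {}
for rid, size in records.items():
    exact_index.setdefault(size, []).append(rid)

# Sorted index: supports range queries via binary search.
sorted_index = sorted((size, rid) for rid, size in records.items())

def range_query(lo, hi):
    """Return ids of records with lo <= size <= hi."""
    left = bisect.bisect_left(sorted_index, (lo, ""))
    right = bisect.bisect_right(sorted_index, (hi, "\uffff"))
    return [rid for _, rid in sorted_index[left:right]]

print(exact_index[450])        # ['ds3']
print(range_query(200, 500))   # ['ds4', 'ds3']
```

Both structures cost space and must be maintained on every update, which is the source of the index-related costs noted above.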
5 The website for the DOI System is https://www.doi.org/, accessed December 4, 2019.
B.5 Data Linking and Merging
A biomedical information resource might supply users with the ability to navigate from a data item to related items, such as from a DNA sequence region to a protein coded by a region of that sequence, or from a medication to a list of clinical studies for that medication. The resource might combine information from multiple contributors into a single entry, such as a functional annotation from one contributor on a gene sequence from a different contributor. Such capabilities mean that the resource will need to create appropriate links and perform merging operations when new data are added.6 Supporting such data interoperability is aided by the use of standards (e.g., ontologies and common data elements) but can be time consuming and expensive depending on the complexity of the data and their initial level of compliance.
Example decision points related to data linking and merging:
- Will the data require/benefit from linkages to other related items?
- Will the resource provide the ability to combine data across records based on common entities/standards?
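Merging contributions from different sources into a single entry, as described above, amounts to joining records on a shared identifier. A minimal sketch, keyed on a hypothetical gene symbol (real resources would key on ontology terms or common data elements):

```python
# Sketch: combining a sequence from one contributor with a functional
# annotation from another, keyed on a shared identifier. All data are
# illustrative placeholders.

sequences = {"BRCA1": {"sequence": "ATGGAT..."}}      # contributor A
annotations = {"BRCA1": {"function": "DNA repair"}}   # contributor B

merged = {}
for key in sequences.keys() | annotations.keys():
    entry = {}
    entry.update(sequences.get(key, {}))
    entry.update(annotations.get(key, {}))
    merged[key] = entry

print(merged["BRCA1"])  # {'sequence': 'ATGGAT...', 'function': 'DNA repair'}
```

The expensive part in practice is not the merge itself but normalizing identifiers and vocabularies so that such joins are valid, which is why initial compliance with standards drives cost.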
B.6 Use Tracking
A biomedical information resource might track uploads, access, and downloads of data items to inform contributors and resource operators about their use. Statistics of such operations may incentivize researchers to contribute data by providing evidence of data use. They could inform life-cycle decisions such as when data could transition to another state, and they might be used to assess the long-term value of the data. In addition, the information could support billing and cost recovery. Tracking would likely have a minor effect on overall costs of operating the resource.
Example decision points related to use tracking:
- Will the resource provide the ability to track uploads, views, and downloads?
- If so, and if made available to users, how will this information be made available?
- Will the resource track data citations to its data?
B.7 Data Analysis and Visualization
Generally, users of a biomedical information resource want to do more than view data. They might want to reformat data for use with a particular tool, visualize them in the context of other data, and conduct statistical analyses upon them, for example. The resource might provide services to perform such operations locally on the data. There might even be a requirement to provide general-purpose co-located computing, where a user can run arbitrary programs on the data. Providing such capabilities will incur costs for provisioning the computing cycles to support those operations. However, such costs might be offset by reductions in costs to resource operators by avoiding bulk downloads by users to perform computations locally. In considering costs to the wider scientific enterprise, supporting data analysis at the resource could avoid user costs of downloading and maintaining local copies, and could increase the value of the data by expanding the audience of researchers who could work with them. The operators of a resource might employ a credit- or token-based system to limit and track the use of computational resources. Such an approach can help control computation costs, although there will be costs associated with implementing and administering such a mechanism.
6 In this case, the committee is talking about linking and merging at the resource rather than before deposit.
Example decision points related to data analysis and visualization:
- What types of data analyses and visualizations will the repository support?
- What types of other data operations will the repository support (e.g., file conversions, sequence comparison)?
- Do these services require significant computational resources?
- Who will pay for computational resources?
C. Control and Oversight
This section covers aspects of a biomedical information resource that deal with control and oversight of the resource.
C.1 Content Control
A biomedical information resource can be more permissive or more restrictive in what it chooses to include in its contents. At the permissive end, a resource might allow open posting of any data of the appropriate type. At the other end, there may be a review process to determine whether submissions are to be included in the resource. That review could be minimal—for example, an automated check that the submission is properly formatted—or more extensive—for example, a human review of metadata to assess suitability for inclusion. A more intensive review process increases labor costs but provides a means to limit the amount of data that is hosted.
Example decision points related to content control:
- Will all appropriate data be accepted, or will there be a review process?
- Will the review process be automated, or will it require human oversight?
C.2 Quality Control
A biomedical information resource may exercise more or less rigorous control on the quality of the information within it. At one extreme, it might leave all quality control to be the responsibility of the data contributors. At the other, it might manually or automatically vet all incoming data to detect quality issues. There could also be quality assessments on derived products that are generated internally to the resource. More intensive quality control incurs higher costs, but it can increase the value of the resource for scientific and clinical use. Problems with quality, when encountered, entail increased review and (where possible) repair of data by someone. If repair is not possible, the value of the resource may be compromised. Quality-related properties that need to be verified are correctness, completeness, currency, and duplication.
Correctness is related to how accurate the data are; values in input data may be inaccurate for a variety of reasons (e.g., errors in data processing or transcription, noise in the instrumentation or method used to collect them, mislabeling of samples). There might be internal inconsistencies in the data or incompatibilities with external sources. Cross-links between records might be incorrect. The greater the number of types and instances of correctness problems in the input data, the more effort is required by resource managers to address them. To the extent that such problems are not identified or addressed, the value of the resource can be compromised.
Validating the completeness of data means identifying missing items at the record or field level. Such gaps can entail costs for the resource because of added complexity in the processing to ingest the data and possible additional complexity in the data representations in the resource to cope with missing elements—for example, special flags for missing values.
The time sensitivity of data can impact the currency of data. Some kinds of biomedical data are relatively time insensitive (e.g., the amino-acid sequence of a particular protein). However, other data may have more value the more current they are (e.g., disease incidence). In that case, badly out-of-date data may limit the value of a biomedical information resource.
Duplicated information within or between data sets will require more review or processing to remove duplication. A particular instance of duplication is when a contributed data set is revised periodically with corrections and additions. If those changes are not explicitly flagged, then the managers of the resource will need to compare each new version of the data set with previously submitted versions to avoid loading duplicate data.
The biomedical information resource may promulgate guidelines or validation routines that indicate issues with data or conformance with resource expectations and standards. While prevalidation of data quality by the data contributor shifts costs to the data contributor, it could result in lower overall effort, as the providers may be able to incorporate checks into their normal data processing practices. Also, detecting a problem with the data on the provider side avoids back-and-forth communication between contributor and resource managers to point out problems and get corrected data.
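Automated checks for two of the quality properties discussed above, completeness and duplication, can be sketched in a few lines of Python. The required field names and sample records are hypothetical; a real resource would derive them from its schema and could distribute such routines to contributors for prevalidation.

```python
# Sketch: automated completeness and duplication checks at ingestion.
# REQUIRED_FIELDS and the records are illustrative assumptions.

REQUIRED_FIELDS = {"sample_id", "collection_date", "value"}

def check_completeness(record):
    """Return the set of required fields missing from a record."""
    return REQUIRED_FIELDS - record.keys()

def find_duplicates(records):
    """Return sample_ids that appear more than once."""
    seen, dups = set(), set()
    for r in records:
        sid = r.get("sample_id")
        if sid in seen:
            dups.add(sid)
        seen.add(sid)
    return dups

batch = [
    {"sample_id": "S1", "collection_date": "2020-01-05", "value": 3.2},
    {"sample_id": "S2", "value": 1.8},                               # incomplete
    {"sample_id": "S1", "collection_date": "2020-01-06", "value": 3.3},  # duplicate
]

print(check_completeness(batch[1]))  # {'collection_date'}
print(find_duplicates(batch))        # {'S1'}
```

Correctness and currency checks are typically harder to automate, since they require domain knowledge or comparison against external reference sources.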
Example decision points related to quality control:
- What quality control process will the repository support?
- Will these be automated or require human oversight?
- What level of data correctness will be required, and how will it be validated?
- What gaps in the data at the record or field level will be tolerable?
- Will any of the data be time sensitive, and how will data currency be ensured?
- How will duplication within or between data sets be addressed?
- Will prevalidation guidelines or routines be distributed by the resource to the data contributors?
- Will human curation be necessary?
C.3 Access Control
A biomedical information resource might place restrictions on which users can see which data (for example, if data are embargoed from general release for a certain length of time), or the resource might provide private workspaces for individual users or groups. The data may also be consented for particular uses, in which case consent information will need to be linked to particular data items and consulted when deciding access permissions. Such control means having a mechanism to identify users (authentication) and to track which users are allowed access to which data (authorization). This capability adds costs both for managing user identifications and for developing access-control mechanisms. Supporting collaborative workspaces, blind review, and mandatory release schedules all complicate those mechanisms.
Example decision points related to access control:
- What types of access control are required for the repository (e.g., will there be an embargo period)?
- At what level are they instituted (e.g., individual users, individual data sets)?
- Does use of the data require approval by a data access committee?
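The interaction between per-user authorization and an embargo period can be sketched as a single access check. The user names, data set identifier, and dates below are hypothetical; a production system would also handle consent restrictions and audit logging.

```python
from datetime import date

# Sketch: authorization check combining per-user permissions with an
# embargo date. All identifiers and dates are illustrative assumptions.

permissions = {"dataset42": {"alice", "bob"}}   # users authorized pre-release
embargoes = {"dataset42": date(2021, 6, 1)}     # public after this date

def can_access(user, dataset, today):
    if today >= embargoes.get(dataset, date.min):
        return True                  # embargo lifted: open to all users
    return user in permissions.get(dataset, set())

print(can_access("alice", "dataset42", date(2021, 1, 15)))  # True (authorized)
print(can_access("carol", "dataset42", date(2021, 1, 15)))  # False (embargoed)
print(can_access("carol", "dataset42", date(2021, 7, 1)))   # True (embargo lifted)
```

Each additional dimension of control (consent categories, workspaces, release schedules) multiplies the cases this check must handle, which is the source of the added cost noted above.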
C.4 Platform Control
There might be limitations on what computing platforms are allowed for running a biomedical information resource. Third-party hosting (e.g., commercial cloud providers) might be prohibited, permitted, or required, or there may be restrictions on hosting or mirroring data overseas (e.g., if overseas data privacy laws regarding human-subjects research data may not be aligned with domestic policies). Such restrictions constrain implementation alternatives, which in turn can influence costs.
Example decision point related to platform control:
- Are there restrictions on the type of platform that may or must be used?
D. External Context
This section considers the context of a biomedical information resource in relationship to other, external resources.
D.1 Resource Replication
There might be a requirement to replicate a biomedical information resource at other sites, with other groups operating “mirror” versions of the resource. Mirroring might be required, for example, to provide more convenient access to collaborators at a distant location. If there is such a requirement, then the original site will need to coordinate software updates and data releases with the mirror sites.
Example decision point related to resource replication:
- Is there a requirement to replicate the information resource at multiple sites (i.e., mirroring)?
D.2 External Information Dependencies
A biomedical information resource might have dependencies on other information sources. For example, a resource containing DNA sequences for an organism might depend on a reference sequence to provide position numbers to locate those samples. In another example, metadata records might require certain fields to come from a controlled vocabulary, such as the Medical Subject Headings (MeSH).7 Such a dependence might engender maintenance costs when the external source is updated.
Example decision point related to external information dependencies:
- Will the resource be dependent on information maintained by an outside source?
D.3 Distinctiveness
There might be other biomedical information resources with similar content that support some of the same tasks. In such a case, those resources might substitute for the resource in question, albeit with some "degradation" in results. The type and amount of such degradation can help calibrate the soft cost of the risk of loss (see Appendix D).
Example decision point related to distinctiveness:
- Are there existing resources available that provide similar types of data and services?
E. Data Life Cycle
This section deals with aspects of a biomedical information resource that concern how it is expected to evolve over time.
E.1 Anticipated Growth
The ultimate size of a biomedical information resource, and the rate at which it grows to reach that size, influence annual maintenance and expansion costs. Will the resource reach "maturity," where no new data are expected because of the end of a project or program that supplies the data, or is it expected to continue growing throughout the lifetime of the resource?
7 The website for MeSH is https://www.nlm.nih.gov/mesh/meshhome.html, accessed December 12, 2019.
Example decision points related to anticipated growth:
- Is the repository expected to continuously grow over its lifetime?
- Is the likely rate of growth in data and services known?
- Is the use of the repository likely to grow over time?
- Is the likely growth of the user base known?
E.2 Update and Versions
The frequency of updates and the need to retain past versions for a biomedical information resource affect operating costs. Some resources provide periodic releases, which batch updates and apply them all at once, whereas other resources are revised incrementally as updates come in. In the case of the periodic-release model, past releases (versions) might be maintained, for example, to support replicability of a study that used a particular release. Retaining past versions obviously incurs storage costs over just providing the most recent version, and decisions need to be made about if and how prior versions will be made available. In the case of the incremental model, the frequency of update might be a cost driver, if updates entail manual review or curation activities.
Example decision points related to updates and versions:
- Will the deposited data require updates (e.g., in response to new data or error corrections)?
- Will prior versions of the data need to be retained and made available locally or in a different resource?
- How frequently will individual data sets be updated?
E.3 Useful Lifetime
Some data in a biomedical information resource might have a limited period of usefulness, or their utility might decline with time. For example, a collection of cell images might be superseded by later images from a higher-resolution technology. Deaccessioning or archiving such data will reduce operating costs. If there is a predictable end date for a resource as a whole, that knowledge is useful in predicting lifetime costs. Useful lifetime of data can be difficult to predict because even data collected decades ago can still be used for analysis if properly documented (see Box 4.3).
Example decision points related to useful lifetime:
- Are the data to be housed likely to have a limited period of usefulness?
- Does the resource have a defined period of time for which it will operate?
- Does the resource have to provide a guarantee that the data will be available for a finite period of time (e.g., 10 years)?
E.4 Offline and Deep Storage
If it is possible that some data in a biomedical information resource do not need to be available online but still need to be retained, then they might be migrated to a less expensive form of storage. This report distinguishes between offline and deep storage. Data in offline storage can be brought back online in the resource, albeit with some delay. Data in deep storage (typically State 3 data) are not intended to be brought back online in the same resource. Rather, they are preserved in the event that someone wants to "rehydrate" them in the future, either for individual use or as part of another information resource. In the case of offline storage, there can be costs for offline data beyond basic storage costs, regardless of who is managing the offline storage.8 For example, several commercial cloud platforms have lower-cost archival storage services or tiers that assume only a small fraction of the stored data will be accessed during any period. Access in excess of that fraction incurs additional cost.
Rehydration costs can appropriately be ascribed to future users of data that are in deep storage.
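The archival-tier trade-off described above can be made concrete with back-of-the-envelope arithmetic. All prices and the free-retrieval fraction below are illustrative assumptions, not any provider's actual rates; the point is that the savings from archival storage depend on how much of the data is actually accessed.

```python
# Sketch: monthly cost of archival vs. online storage under an assumed
# pricing model. All figures are illustrative assumptions.

tb_stored = 100
online_price = 20.0       # $/TB-month, online storage (assumed)
archive_price = 1.0       # $/TB-month, archival tier (assumed)
retrieval_price = 90.0    # $/TB retrieved beyond the free fraction (assumed)
free_fraction = 0.05      # fraction retrievable per month at no extra cost

def monthly_archive_cost(tb_retrieved):
    billable = max(0.0, tb_retrieved - free_fraction * tb_stored)
    return tb_stored * archive_price + billable * retrieval_price

print(tb_stored * online_price)   # 2000.0  (all data kept online)
print(monthly_archive_cost(2))    # 100.0   (light access: within free fraction)
print(monthly_archive_cost(30))   # 2350.0  (heavy access erases the savings)
```

This is one reason use tracking (Section B.6) is valuable: access statistics indicate which data can move to cheaper tiers without triggering retrieval charges.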
Example decision points related to offline and deep storage:
- Can the resource take advantage of offline storage for data that are not heavily used?
- Does the resource have a plan for moving unused data to deep storage (i.e., State 3)?
F. Contributors and Users
This section covers aspects of a biomedical information resource associated with user characteristics and numbers that might influence costs.
F.1 Contributor Base
The number of individuals or sources that generate information to be hosted can affect development and operating costs for a biomedical information resource. If data originate from the same source (e.g., a single instrument, as with sky surveys and particle-physics experiments, or a single organization or community), then less effort is required to coordinate with data providers than in situations where data originate from many communities and organizations. In addition, if all contributors are internal to the same organization that hosts the information resource, good compliance with data formats and standards may be more likely, and costs for review and curation might be lower. Alternatively, data originating from a wide range of individual autonomous investigators spread across multiple disciplines may require more interactions between the resource managers and those investigators to collect them, and there will likely be more variation in the data to address.
8 Note that cloud providers of archival storage are known to use tape to support that storage (e.g., Lantz, 2018).
How data are transferred to the resource will also affect costs. They may arrive periodically in large batches or incrementally in smaller amounts. In the extreme case, data might stream in continuously—for example, directly from wearable devices. Data transfer can be initiated by the data contributor (data push) or by those managing the resource (data pull). To the extent that data ingestion by the information resource has a manual component, more frequent arrivals mean that a person must manage that task more often. In the continuous case, automated processing will probably be a necessity, which will entail development costs.
The time and network resources required to transfer a data set from a provider to a data resource scale (although not necessarily linearly) with the size of the data set. However, in some cases a data set may be so large that network transfer is not feasible (or will not complete in the required time frame). In such cases, physical transfer of storage media may be needed, which entails costs for purchasing the media, loading them, shipping them, and extracting the data at the resource.
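The crossover point between network transfer and shipping physical media follows from simple arithmetic. The sustained bandwidth and courier time below are illustrative assumptions; at some data-set size, shipping media is faster than any feasible network transfer.

```python
# Sketch: network transfer time vs. an assumed courier time for physical
# media. Bandwidth and shipping figures are illustrative assumptions.

def network_days(size_tb, gbps):
    """Days to transfer size_tb terabytes at a sustained rate of gbps gigabits/s."""
    seconds = (size_tb * 8e12) / (gbps * 1e9)
    return seconds / 86400

shipping_days = 2.0   # assumed courier time, independent of data size

for size_tb in (1, 50, 500):
    net = network_days(size_tb, gbps=1.0)   # sustained 1 Gb/s link (assumed)
    better = "network" if net < shipping_days else "shipping"
    print(f"{size_tb} TB: {net:.1f} days over the network -> {better}")
```

Shipping adds the media, loading, and extraction costs noted above, so the comparison in practice weighs labor and materials as well as elapsed time.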
Also affecting the cost of integrating contributed data into a resource are, for example, whether there are direct charges (e.g., purchasing costs, licensing fees) or indirect charges (e.g., membership fees to access) associated with acquiring the data and whether the data contributor is willing to be responsible for the data and serve as their steward. The steward is the point of contact regarding the data who responds to questions about them, addresses errors or other problems associated with them, and tracks their current location(s). Data can be effectively “orphaned” if the data collector is no longer affiliated with the organization where the data were collected. Trying to locate, obtain, and understand data with no identifiable steward can require significant effort.
Example decision points related to the contributor base:
- Is the number of contributors known? If not, can it be estimated?
- Are all the data originating from the same source (e.g., a single instrument or a single organization)?
- How will data be transferred into the data resource (e.g., periodic large batches, more frequent smaller data sets, constantly streamed, by physical transfer)?
- Will the data be pushed by the contributor or pulled by the resource?
- Are there direct or indirect fees associated with acquiring the data from a source?
- Will a data steward be available from among the contributors to assist with any data integration into the data resource?
F.2 User Base and Usage Scenarios
The number of people accessing an information resource and the frequency and kinds of access can all influence costs for a biomedical information resource. A resource that serves an entire research community will likely see much more use than, say, an internal project repository for a single research group. While actual storage costs will probably not depend on the number of users (unless data must be replicated to serve high access rates), computation and network costs will rise with increased use. The kinds of access can also affect those costs. For example, retrieving single items will require less network bandwidth than a bulk download of a whole data set.
Example decision points related to the user base and usage scenarios:
- How many users will likely access the data?
- What will be the frequency of access?
- How will users access the data?
- Will the resource be building analysis tools?
- Will the resource support individual file download or bulk download?
- Will there be any fees for downloading/accessing the data?
- How many different types of users must be supported?
F.3 Training and Support Requirements
There may be expectations, or it may be found beneficial, that operators of a biomedical information resource provide training for resource users. That training could be more or less labor intensive and involve conducting tutorials, preparing training materials, or maintaining help pages on a website. A “help desk” function might be required that provides either live consultation or message-based responses, both of which require training and staffing. On the other hand, investing in training of and consulting with users may result in easier data integration and lower future data-collection and curation costs.
Example decision points related to training and support requirements:
- Will training for resource use be offered?
- What form will the training take?
- Will a “help desk” be provided?
- When does live help need to be available?
- What is the expected skill level of the user base?
F.4 Outreach
If the existence and features of a biomedical information resource need to be publicized, then there may be associated labor, travel, and media costs for preparing articles, giving conference presentations, producing newsletters, conducting print or e-mail campaigns, reaching out on social media, and so forth. In some communities, there may be reluctance or even resistance toward using shared information resources, which might require extensive outreach efforts to overcome.
Example decision points related to outreach:
- Does the existence of the repository need to be advertised?
- How many conferences per year should resource representatives attend?
- Will the resource have a booth at the conference for live demos or conduct hands-on tutorials?
- Are users required by funders or journals to deposit data in the repository?
G. Data Availability
The aspects in this section relate to expectations about how available the data in a biomedical information resource will be. Data availability encompasses the reliability of the resource hosting the data, how quickly new data appear, how fast requests for data are serviced, and from where the data can be accessed.
G.1 Tolerance for Outages
Different biomedical information resources have different tolerances for system outages. While an outage of hours or days might be tolerable for a resource that supports, for example, retrospective analysis, a similar loss of availability might be highly undesirable for a resource that is used continuously every day in support of clinical decision making. A low tolerance for outages often entails data replication, which will incur storage and other costs, possibly including the cost of network bandwidth for transferring data to a backup site. Guarantees of high availability (for example, 99 percent up-time) require support staff to be on call around the clock, which is a large labor expense.
Example decision points related to tolerance for outages:
- What is the tolerance for outages of the resource?
- What measures will be taken to avoid and mitigate outages?
- How quickly and completely does the resource need to recover from an outage?
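Availability guarantees like the "99 percent up-time" mentioned above translate directly into tolerated downtime, and the arithmetic shows why each additional "nine" is expensive: a seemingly high percentage can still mean days of outage per year.

```python
# Sketch: converting an availability guarantee into hours of tolerated
# downtime per year.

HOURS_PER_YEAR = 365 * 24  # 8760

for availability in (0.99, 0.999, 0.9999):
    downtime = HOURS_PER_YEAR * (1 - availability)
    print(f"{availability:.2%} up-time -> {downtime:.1f} hours of downtime/year")
```

A 99 percent guarantee tolerates roughly 88 hours of downtime per year, while 99.99 percent tolerates under an hour, which is why tighter guarantees entail replication and round-the-clock staffing.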
G.2 Currency
Data submitted to a biomedical information resource may need to be available to users within a fixed time frame. It might be acceptable that new items appear in data resources in monthly or quarterly releases, but other types of data resources may need to be updated daily (e.g., outputs for flu forecasting models). This requirement might affect labor costs, since in the latter case there is no opportunity to amortize effort over all the items in an update batch.
Example decision points related to currency:
- How often will the data be released?
- How soon do data need to be made available after they are received?
G.3 Response Time
A biomedical information resource may have a target or requirement for how quickly requests are serviced, either by a computer system or by a human agent. Interactive response times (a few seconds) might require replicating the data and additional computing services, so that multiple requests for popular data can be handled at the same time. Interactive response times also limit what data can be held in lower-cost near-line or offline storage. There can also be a human element in response time, such as approvals for access or review of submitted data sets. In general, lower response times correlate with higher labor costs.
Example decision points related to response time:
- Are there requirements for response time for service?
- Are there requirements for responses from humans?
G.4 Local Versus Remote Access
While most biomedical information resources of which the committee is aware support remote access over the Internet, there are examples in other domains (e.g., film archives, defense-personnel information) where users must physically come to the resource to access it. Such a scenario generates space, staffing, and equipment costs for hosting users. Some resources may be accessed remotely over a network, but large data transfers may still entail shipping physical media (e.g., tapes, disks), which incurs preparation and shipping costs.
Example decision points related to remote access:
- Does the resource require that any data be shipped via physical media?
- Will the resource be built using commercial clouds?
- Do users have to travel to the resource to use the data?
H. Confidentiality, Ownership, and Security
This section covers aspects of a biomedical information resource related to protecting the data and the rights of those associated with the data. These issues are complex subjects and warrant more attention than can be given in this report, but the questions provided here will allow the cost forecaster to identify the relevant cost drivers.
A biomedical information resource may need to protect the confidentiality of the data it holds, because those data contain either personally identifiable information or sensitive intellectual property. In the case of personally identifiable information, there may be a need to deidentify information or restrict access. There may be additional requirements to track and audit use. If so, credentials and permissions will need to be assigned to users; systems and storage to maintain use records will be required; and additional system complexity for tracking access to and use of items will be introduced. All of these items entail added costs. Inclusion of machine-actionable metadata that capture restrictions on use could reduce cost.
Example decision points related to confidentiality:
- Will any of the data require special protections?
- Will any of the data have embargo periods or embargo-related limitations that may entail costs?
- Are there any audit requirements for who has accessed or downloaded the data?
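As a hedged illustration of the machine-actionable use-restriction metadata discussed above, a resource might record restrictions in a structured form that systems can evaluate automatically rather than by manual review. The field names and values below are invented for illustration and do not represent any existing standard:

```python
# Hypothetical machine-actionable use-restriction metadata for one data set.
# All field names and values are invented for illustration.
restrictions = {
    "dataset_id": "ds-000123",
    "contains_phi": True,             # personally identifiable health information
    "deidentification": "safe-harbor",
    "permitted_uses": ["research"],
    "embargo_until": "2026-01-01",    # ISO date; no release before this day
    "audit_required": True,
}

def access_allowed(requested_use, today):
    """Evaluate the restrictions automatically for an access request.

    ISO-formatted date strings compare correctly as plain strings.
    """
    return (requested_use in restrictions["permitted_uses"]
            and today >= restrictions["embargo_until"])
```

Because the restrictions are machine-readable, each access decision and audit record can be produced without per-request human review, which is one way such metadata can reduce cost.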
If the data have been managed by a variety of entities (e.g., companies, laboratories, public repositories, or individual investigators and their staff), different custodians may have spent more or less time to locate and appropriately format data for forwarding to the resource, even given data-sharing requirements. Their data release processes might also be cumbersome for those wanting to use their data. In contrast, some data are maintained on behalf of patient collectives or disease organizations that actively promote and facilitate their use, possibly making such data easier and less expensive to use.
If a biomedical information resource contains proprietary information, then there may be requirements to track ownership of particular data sets and to ensure that data use conforms to any licensing conditions and to the preferences of the participants from whom the data came. Support for tracking and conforming use may have costs beyond those paid to license the data sets.
Hospitals or clinics can have specific release forms for allowing the transfer of patient data. Collecting patient data from a large number of such establishments might mean obtaining and executing a different release form for each patient—a time-consuming and labor-intensive process. If these release forms only consent to certain uses, then use must be audited and tracked, which may incur additional costs. Inclusion of machine-actionable metadata that capture any ownership characteristics could reduce cost.
Example decision points related to ownership:
- If data are contributed from multiple sources, will there be a need to process multiple kinds of release forms?
- Will all the data be released by the data resource under the same license, or will different permissions be assigned to different data sets?
- Will data submission agreements be necessary?
Similar to confidentiality, security for a biomedical information resource implies preventing unauthorized access, but it also implies protection against loss or corruption of its data, intentional or otherwise. Measures taken by a biomedical information resource likely include internal or external security audits, special operator and user training, active monitoring of the resource, applying security patches promptly, or using specially protected computing, storage, and networking platforms, all of which incur costs.
Security might also encompass offering services such as ensuring a resource complies with Health Insurance Portability and Accountability Act (HIPAA) and Health Information Technology for Economic and Clinical Health (HITECH) Act requirements. Sensitive data produced by or under the auspices of federal agencies have distinct security requirements. For example, if a Federal Information Security Management Act (FISMA)-certified environment is necessary to comply with the National Institute of Standards and Technology guidance on protecting controlled unclassified information in nonfederal systems and organizations (Ross et al., 2020), additional costs will be entailed. At a minimum there are costs to documenting FISMA compliance. Those costs increase if it is determined that a higher level of FISMA certification is required. If the data are kept in a cloud environment, certification associated with the Federal Risk and Authorization Management Program (FedRAMP)9 may also be necessary, entailing additional costs.
Example decision points related to security:
- What measures need to be taken to ensure the integrity and availability of the data?
- Do these measures require using protected computing, storage, or networking platforms?
I. Maintenance and Operations
This section covers aspects of a biomedical information resource related to obligations for maintenance and operation of the resource.
I.1 Periodic Integrity Checking
As part of ongoing maintenance, operators of a biomedical information resource will need to assess the integrity of its hardware, software, and data. The frequency and detail of such assessments will affect operating costs. Processes put in place will flow from an understanding of error and failure rates and the tolerance for data corruption and loss.
Example decision points related to periodic integrity checking:
- What processes will be put in place for checking the integrity of the hardware, software, and data?
- How frequently will these checks be performed?
I.2 Data-Transfer Capacity
Insufficient data-transfer capacity of the facility that hosts a biomedical resource can place constraints on the operation of the resource. For example, limited connectivity can constrain the amount of data that can be downloaded from a resource, the ability to replicate the contents, or the ability to perform off-site backups.
Example decision point related to data-transfer capacity:
- Will the bandwidth available to the resource be sufficient for the data sizes and rates required?
9 The website for FedRAMP is https://www.fedramp.gov/, accessed April 6, 2020.
I.3 Risk Management
With any biomedical information resource, there is a risk of corruption or loss of content. Who assumes that risk (and hence must take steps to ameliorate it) will influence where certain costs fall. If a resource is a data-sharing portal but not the repository of record for the data it holds, then the risk may fall largely on the contributors, who will bear the cost of maintaining backup copies of their data elsewhere. If, on the other hand, the resource is the “official” repository for the data it holds, the operators of the resource will be responsible for risk-mitigation measures in line with the perceived value of the data and hence bear the concomitant costs. For sensitive information, there is also risk of leakage (i.e., unauthorized export of data to external recipients), either through unintentional or malicious action. Even if an information resource is not the repository of record for its data, it must bear the costs of mitigating this type of risk. In addition, a response plan might be necessary to address circumstances (e.g., unexpected loss of funding or dissolution of the organization hosting the resource) that force the early termination of the information resource.
Example decision points related to risk management:
- Will the repository be solely responsible for risk mitigation?
- Is a response plan for unexpected termination required?
I.4 System-Reporting Requirements
The overseers and operators of a biomedical information resource may require regular reports on the status of the system, in terms of both content (e.g., number of items, storage space used) and computer-resource usage (e.g., central processing unit hours, network usage). Setting up such reports will likely be a one-time cost, with perhaps a small amount of recurring labor cost if the reports have to be invoked manually.
Example decision point related to system-reporting requirements:
- What types of system reporting will the resource be required to do?
I.5 Billing and Collections
If the biomedical information resource charges for upload, access, and download of data, then there will need to be an operational function responsible for billing for and collection of those charges.
Example decision point related to billing and collections:
- Will there be charges for use of the resource?
J. Standards, Regulatory, and Governance Concerns
Standards for the interchange of biomedical data, for the description of various kinds of biomedical data objects, and for other data practices are important enablers for the research data ecosystem. This section considers community conventions, rules, policies, laws, and stakeholder concerns with which the operators of a biomedical information resource may have or want to comply.
J.1 Applicable Standards
A biomedical information resource might have to conform to one or more standards for the content and format of the data hosted. Some standards are created and maintained by formal national or international standards-development organizations; in other cases, they are developed and managed by ad hoc community mechanisms,
particularly when the scale of the community and the scope of the standard’s application are limited. Where well-established and domain-based standards exist, and especially where tools also exist that automate the use of those standards at the time the data are generated, conformance to them may significantly lower the cost of later data ingest, curation, dissemination, and preservation. If tools do not yet exist, software routines to parse and extract needed data from the proprietary structures will likely need to be developed, possibly at significant expense. Even if the data as a whole do not match a standard, particular fields might be standardized—say, taken from a controlled vocabulary or a reference list of codes—in a way that matches the assumptions of the resource, thus avoiding the overhead associated with converting those fields. Highly structured data may also be more or less difficult to process, depending on whether they conform to a widely used standard (e.g., FASTQ file format for genomic-sequence data [see, e.g., Cock et al., 2010] or HDF for array data10) or are in some system-specific format (e.g., for a particular manufacturer’s microscope or as used by a particular medical-records system).
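To illustrate why widely used standards lower processing costs, consider that a FASTQ record always consists of four lines (identifier, sequence, separator, quality), so one minimal parser suffices for any conforming file; a proprietary instrument format would instead require custom routines. The sketch below is illustrative only:

```python
# Minimal sketch: the four-line FASTQ record layout means a single generic
# parser can handle data from any conforming source.
def parse_fastq(lines):
    """Yield (identifier, sequence, quality) tuples from FASTQ-formatted lines."""
    it = iter(lines)
    for header in it:
        seq = next(it)
        next(it)              # '+' separator line, ignored
        qual = next(it)
        yield header.lstrip("@").strip(), seq.strip(), qual.strip()

# Two hypothetical reads, formatted per the standard.
records = list(parse_fastq([
    "@read1", "ACGT", "+", "IIII",
    "@read2", "GGCA", "+", "FFFF",
]))
```

A resource ingesting only standard-conformant files can reuse such a routine indefinitely; each nonstandard format encountered adds a new development and maintenance cost.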
If no standards exist, then they may need to be developed to increase the quality of the data sets and the efficiency of data ingest and use. This process can be costly up front, as it involves bringing together groups of experts, often repeatedly over time, to achieve and document consensus (see Box 4.4). Formal national and international standards development is slow, expensive, and highly structured: there are complex bureaucratic and procedural issues and elaborate governance formalities. Typically, most of the work is volunteer labor by experts, perhaps facilitated by a paid editor. Funding of formal standards bodies is beyond the scope of this report, but it is worth noting that research grants have been used to good effect to accelerate the creation of new, less formal community standards.
A key aspect is whether the standard will evolve during the lifetime of the resource and whether the resource must conform to updated versions of the standard. If so, there will be associated costs for modifying the resource and possibly for restructuring or augmenting existing data holdings. A particularly challenging case is one in which a resource developed prior to the development of standards must be “retrofitted” to accommodate a standard that later emerges. The development of different national standards (e.g., the European Union’s General Data Protection Regulation)11 is another version of this challenge. Likewise, there could be standards-related costs associated with transforming data and metadata from one data state to another.
Example decision points related to applicable standards:
- How many different standards will the resource have to support?
- Do these standards exist?
- If so, are the standards mature (i.e., how much are they expected to evolve)?
- If not, is the resource expected to lead their development?
- What is the plan for accepting data while standards are in development?
- Are the data validators and converters available for the standards, or do they have to be developed?
- What is the plan for “retrofitting” data that have been uploaded without the standards in place?
- How frequently will the standards update?
- Do the standards require spatial transformations (e.g., will they need to be aligned to a common coordinate system)?
- How many file formats will be supported?
- Is there an open file format available?
J.2 Regulatory and Legislative Environment
A biomedical information resource may be bound by laws and government regulations, particularly if it maintains information on individuals. Those requirements may entail additional record keeping or notification of those about whom information is maintained. The resource might be covered by an open-records act if it is maintained by a government agency. Obtaining compliance certification could involve costs associated with precisely documenting policies and procedures.
10 See the HDF Group at https://portal.hdfgroup.org/display/HDF5/Introduction+to+HDF5, accessed on May 12, 2020.
11 Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), 2016 O.J. (L 119), 1-88.
Example decision points related to the regulatory and legislative environment:
- What laws and regulations cover the data and operation of the resource?
- Is the resource covered by an open-records act?
J.3 Governance
A biomedical information resource may have a policy-setting body for itself or as part of a larger organization. Policies may be set either initially or on an ongoing basis. Having such a governing body incurs some personnel costs to engage with it and possibly for convening it. Following the guidance and directives of the governing body may entail changes or extensions to the resource, which will likely come with costs.
Example decision points related to governance:
- Does the resource need to maintain an external advisory board?
- Does the resource set policy for itself, or is it part of a larger organization?
J.4 External Consultation
The developers of a biomedical information resource may need or want to consult with external stakeholders—data contributors, potential users, and funding agencies—about the initial resource design and about later updates. Such consultation can extend the timeline for resource development and maintenance, which might increase some costs, such as labor.
Example decision points related to external consultation:
- Will external stakeholders be consulted for initial design?
- Will external stakeholders be consulted on an ongoing basis?
ATTACHING DOLLARS TO THE COST FORECAST
The committee applied its cost-forecasting framework to scenarios that exemplify numerous decisions about the treatment of data at different points in the data life cycle and in their various data states (see Box 2.1). The committee did not attempt to quantify the costs of data as an investigator or a data platform manager would do because there are too many variables related to the myriad factors unique to the data, the institutional platform host, and the data contributors and user communities to come up with meaningful numbers. However, now that the investigator or data platform manager has considered where the data are coming from and how they will be used, it will be necessary to begin to quantify the costs.
Forecasting for a State 1 (Primary Research) Resource
A State 1 (i.e., the primary research environment) researcher creating a cost forecast will necessarily focus on costs that must be budgeted. Costs excluded from the budget include those reported only in tabulations for the public record (e.g., the “cost” of a discovery), sunk costs (e.g., equipment from the researcher’s prior research project), and costs borne by the host institution that are not reimbursed in paid overhead rates (e.g., those that underwrite subsidies for IT services). The researcher will use the prices paid for services in the cost forecast, whether the underlying cost of those services is less than the price paid (e.g., when prices include institutional “taxes”) or more (e.g., when services are subsidized by her institution).
The State 1 researcher may face many choices throughout the course of the research, with alternatives that imply different time profiles for costs. One course of action may entail low up-front investment but higher long-term operating costs, another the reverse. While the researcher may be bound by rules of the financing entity in selecting the preferred alternative, good decision making argues that the present discounted value of the alternatives be calculated, with the economically preferred alternative having the lowest present discounted value. However, if the funding entity pays expenses for only an arbitrary fixed number of years, it may be reluctant to pay the immediate cost of an up-front investment that cuts long-term operating costs, even if a present discounted value calculation would argue it should do so. This sort of shortsightedness is not confined to physical investments. It might involve a situation in which the individual researcher may not value what the larger community needs, such as the additional costs of preparing data to meet a long-term repository standard. In this case, both costs and benefits differ, requiring that both be weighed in making a decision. Although the funding entity may not sympathize with taking the long view, the researcher needs to understand what might be the better course of action in designing a long-run strategy.
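The present-discounted-value comparison described above can be sketched briefly; the cost streams and discount rate below are invented for the example, not drawn from this report:

```python
def present_discounted_value(costs, rate):
    """Sum a stream of annual costs, discounting year t by (1 + rate)**t."""
    return sum(c / (1 + rate) ** t for t, c in enumerate(costs))

# Alternative A: low up-front investment, higher long-term operating costs.
# Alternative B: higher up-front investment, lower long-term operating costs.
# All figures are hypothetical.
alt_a = [10_000] + [30_000] * 9    # year-0 investment, then 9 years of operations
alt_b = [100_000] + [15_000] * 9
rate = 0.05                        # assumed real discount rate

pdv_a = present_discounted_value(alt_a, rate)
pdv_b = present_discounted_value(alt_b, rate)
# The economically preferred alternative is the one with the lower PDV.
```

In this invented case the higher up-front investment wins on a discounted basis, illustrating how a funder that pays only for a fixed number of years might still decline it.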
The State 1 researcher can use the data set characteristics, activities, and cost drivers described in this chapter and in the template in Appendix E. Many of the activities and cost drivers in the template in Appendix E may not be directly applicable to a State 1 information resource, but the forecaster needs to remain aware of potential future cost drivers so that decisions might be made that could keep life-cycle costs low. In most circumstances, labor costs will be the largest single element of her cost forecast. The activity list for State 1 (Table 2.1) can serve as a guide to the data steps that need to be considered. In the best of circumstances, the data set characteristics
provide a way to estimate the amounts of each labor type that will be required, based on past experiences of the researcher or of those at her institution.
Since labor drives much of the cost, a rough first estimate might be informed by using data from any pilots of the proposed or current project: regressing the labor hours required for the development, population, and support of the pilot resource on the pilot’s characteristics, and then applying the fitted relationship to the characteristics of the current or proposed resource. Labor costs will not necessarily increase linearly as the size and complexity of the resource increase, given that some efficiency of labor will be gained (i.e., the “learning curve”). If data from a pilot are not available, data from a similar research project might inform the estimate.
In many cases, that tabulated experience may not be available. The researcher could then resort to estimating the relative amount of each labor type, both within and potentially across activities (i.e., a set of ratios, based on experience and judgment). If one type of labor for one activity (perhaps based on a pilot) can be reasonably estimated, the ratios implied by such estimates will provide a way to forecast the quantity of the others. The institution may specify the rates that should be used for labor prices, but the tables in this report give her a backup resource. Since the State 1 researcher is typically looking at a relatively short time horizon, disruptors of the sort discussed in this report (see Chapter 7) are unlikely to play an important role, although to the extent they create trends that affect the near future she may want to modify the estimated prices involved to reflect that reality.
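The ratio technique described above can be sketched as follows; the labor categories, the ratios, and the anchoring pilot estimate are all invented for illustration:

```python
# Hypothetical ratios of each labor type relative to curation effort, based on
# experience and judgment (none of these figures come from the report).
ratios = {"curation": 1.0, "software": 0.6, "systems": 0.3, "outreach": 0.15}

# Suppose curation hours for one activity can be reasonably estimated,
# perhaps from a pilot:
curation_hours = 500

# The ratios then imply forecasts for the quantity of the other labor types.
forecast = {labor: curation_hours * r for labor, r in ratios.items()}
```

Labor prices (institution-specified rates or the tables in this report) would then be applied to each forecast quantity to reach dollar figures.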
Forecasting for a State 2 (Active Repository) Resource
Cost forecasts for a State 2 (active) repository may have to address many of the same issues as discussed for a State 1 resource but with additional complexities. To the extent that the resource requires external funding, the first cost estimate will be one that focuses on the proposed budget. Sunk costs will be omitted from consideration (unless the funding entity allows for their cost recovery), and, as for the case of the State 1 researcher, costs subsumed in the overhead rate will be omitted. In preparing a proposed budget, the host institution will use the prices it faces. But for its strategic planning, especially for periods beyond that covered by any near-term financing, the State 2 repository host would be well advised to prepare an estimate that includes all resources necessary to sustain the repository over the long run, even if some of those resources are currently “free” or significantly subsidized. The State 2 resource host will be in the repository business for many years, and any subsidy structure from which it now benefits could change. It should understand not only the costs for which it must budget today, but also the total cost of its repository responsibilities should it eventually have to cover that entire forecast cost.
Since this forecast will likewise be forward looking, sunk costs would still be omitted, but refreshment or replacement of investments must be included. For its “all-resource” forecast, the State 2 resource host should only use prices in those cases where they actually reflect what is required to produce the necessary input. Otherwise, the actual resources consumed should be the basis for the cost forecast. In situations where the social cost differs from the market price (e.g., environmental effects of generating power), the State 2 resource host will at least want to understand the approximate magnitude of that difference, given the interest of stakeholder communities.
Considering the cost implications of alternative courses of action will be especially important for State 2 resource hosts. Again, calculating the present discounted values of various options and courses of action will give the State 2 resource host a method to weigh the costs of one course of action against another. The present discounted value calculation will be particularly helpful given that a State 2 resource host must necessarily look a long way into the future, providing a way meaningfully to sum up the long stream of operating costs that will be encountered, as well as required periodic reinvestments.
While labor costs may generally be the largest single budget item, other costs are likely to be more significant for the State 2 resource than for the State 1 resource. Physical facilities may be important, and licenses may constitute a significant element of expense. Software costs may be significant if proprietary software is used and licenses required, if existing software needs to be customized, or if entirely new code needs to be created. Purchased services are likely to be important, raising critical “build” versus “buy” choices for the State 2 resource host. An obvious such choice is the use of an in-house data environment versus that of a cloud provider (essentially, outsourcing). Note that cloud services may be provided by a commercial vendor, or through a research community cloud built on open-source software, such as that offered by the European Organization for Nuclear Research
(CERN).12 Service providers of any type offer no guarantee of price stability, making it particularly challenging to forecast costs, especially given the substantial expense entailed in transferring data from one provider to another.
One of the several variables in selecting a service provider is whether the degree of security chosen will prove adequate over the long run and what the cost of upgrading security might be. Using a cloud provider does not relieve the State 2 resource host of security responsibilities, although it does change and, perhaps, reduce them. The cloud provider may have advantages related to, for example, economies of scale and ability to attract top-tier experts, but it may also represent a more attractive target for attackers. The cost of security across solutions needs to be compared.
The State 2 resource host will likely have more experience than the State 1 researcher on which to base its estimates of the amount of labor required for each State 2 activity (see Table 2.2), and it may also be able to solicit advice from sister institutions. It may have the ability to pilot many of the repository activities for which it will be responsible. But to the extent it cannot construct labor forecasts based on experience or pilots, it can fall back on the technique sketched above for the State 1 researcher. Unlike the State 1 researcher, however, the State 2 institution may be free to set wage rates—and because it will be operating the repository for many years, it will need to think about how real wage rates (i.e., adjusted for inflation) will change in the future. For professional activities, real wage rates have been increasing steadily for many years, and the State 2 institution will need to take into consideration that likely trajectory. (Taking into account fringe benefits, especially health care, real costs for other classes of labor have also been rising, although much more gently.) For those classes of labor it currently employs, the State 2 resource host could start with its current rates, modified to reflect its contemporary recruiting and retention experience, and use its own recent trends as the basis for at least the intermediate-term trajectory of what will be necessary to attract and keep the labor force it will need.
Changes in the labor market are only one of the several disruptors the State 2 institution must consider. The full range of disruptors raised by this report could affect its cost forecast, creating a situation in which complexity could overwhelm the institution’s understanding of what its long-term commitment might require. For that reason, it may be useful to forecast first on the (unrealistic) assumption of no change, and then to consider which disruptors might significantly affect the forecast and which might have more modest effects.
Forecasting for a State 3 (Long-Term Preservation) Resource
However daunting it might be to forecast costs for State 1 and 2 resources, it could be more difficult to forecast costs for a State 3 (long-term preservation) resource. The forecaster may be making decisions about the format in which the data should be preserved and the nature of access to be supported years in advance of the actual transfer of data to a State 3 environment. Community guidance and standards may be helpful in making such decisions, with due allowances for how such guidance, standards, and access might evolve. Above all, decisions should be documented since they constitute the assumptions on which the forecast rests. If they are clearly stated, it will be easier to adjust for changed circumstances as the actual transition to a State 3 environment becomes likely. Once again, the characteristics of the data sets will probably be important predictors of storage costs and IT services; these will likely dominate the State 3 forecast. Labor costs may not be especially important once the data set is formatted for long-term retention, and facilities costs may be negligible. It is probably wise to use estimates of underlying costs for storage and IT services rather than current prices—planning for a State 3 environment should assume that the State 3 resource managers will bear the actual costs (e.g., it will not enjoy subsidies). This approach also facilitates embedding appropriate trends in the forecast. Because the State 3 environment will extend over many years of costs, it is essential to calculate present discounted values when comparing alternative courses of action.
In a sense, the State 3 resource investment could be viewed as an option on the future availability of the data set. While there is no market for such options, that intellectual construct could help guide decisions regarding the State 3 resource. The more that preservation choices make a data set discoverable and easily reconstructed and used, the more valuable that option becomes; conversely, choices that make data harder to discover, reconstruct, and use make that preservation option less valuable.
12 The website for CERN is https://home.cern/, accessed March 27, 2020.
Reliability of cost forecasts is a critical issue, especially for State 3 environments with their high degrees of uncertainty. While distributions for cost parameters may not be available, the forecaster should nonetheless attempt to establish ranges for parameter values that capture central tendencies. These can be used to estimate how much “reserve” for various contingencies should be established or at least guide managers regarding the “what ifs” to which they should pay attention.
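One way to turn such parameter ranges into a contingency reserve is a simple simulation over assumed distributions; the sketch below uses triangular (low, mode, high) ranges, and every figure is hypothetical:

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible

# Hypothetical (low, mode, high) ranges for annual cost parameters, in dollars.
parameters = {
    "storage":     (20_000, 30_000, 60_000),
    "it_services": (40_000, 50_000, 90_000),
    "labor":       (10_000, 15_000, 40_000),
}

def simulate_total():
    """Draw one total annual cost, sampling each parameter's triangular range."""
    # random.triangular takes (low, high, mode).
    return sum(random.triangular(lo, hi, mode) for lo, mode, hi in parameters.values())

draws = [simulate_total() for _ in range(10_000)]
expected = statistics.mean(draws)
p90 = statistics.quantiles(draws, n=10)[-1]  # approximate 90th percentile
reserve = p90 - expected  # reserve sized to cover adverse draws up to the 90th percentile
```

The choice of the 90th percentile is arbitrary here; the point is that even rough ranges let managers size a reserve and see which “what ifs” move the total most.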
Data for Forecasting
As this discussion implies, the life-cycle forecasts of dollar values for each activity in the data states outlined in this report will depend on the specifics of the research project in the State 1 environment, the nature of the State 2 (active) repository, and the data preservation and access ambitions of the State 3 (long-term preservation) resource. Those differing dependencies make each life-cycle forecast a unique undertaking. Is it possible to move beyond the qualitative observations on relative cost magnitudes of this report, perhaps based on a few top-level parametric forecasting equations, using just a handful of the key activities and data characteristics listed in Tables 2.1-2.3? In other cost-forecasting domains, such tools have been the result of a multiyear sustained effort by dedicated professional staffs (e.g., for military weapon systems), capitalizing on detailed—and often proprietary—cost data. No such cadre now exists for the biomedical data challenge. Equally important, the committee could not discover any organized data-collection effort of the kind such a cadre would need to create top-level forecasting tools. With the explosion of life science research and clinical data, and the hunger for good cost forecasts, establishing such a data-collection effort would be the first step to a better understanding of what will be needed, whether for the State 1 researcher, the State 2 active repository, or State 3 long-term preservation.
INFRASTRUCTURAL ELEMENTS NOT CONSIDERED IN THE COST MODEL
There are many infrastructural or data environment systems, standards, services, and activities that are essential to data preservation and access broadly, and to biomedical data in particular, but where it does not make sense to try to allocate costs to specific sources or collections of data. Much of this is general infrastructure that supports many other activities of the university or other data platform host institution. Other costs are more specific to the work in the biomedical sciences and the communication of scholarship in those disciplines. Here, components that are particularly important to preservation and access for biomedical data are addressed.
The organizations, governance, standards, systems, and common knowledge structures described here are often viewed as “community” problems rather than research problems; thus, funders are reluctant to support their solutions as part of an individual research project. At the funding program manager level, it is worth considering investments that reflect these standards and knowledge structures in common tools serving the relevant research community. At the level of funding bodies and stewardship institutions, consideration needs to be given to how to support all parts of this infrastructure, particularly operations and maintenance.
Object Identifier Standards, Systems, and Governance
Identifiers are mechanisms for unambiguously referencing people, organizations, data objects, and things (e.g., genes, molecules, proteins, species). Typically, a sustainable organization needs to be established to oversee and govern the assignment and use of identifiers, but funding is often not available for such oversight and governance. There will also be systems that look up information associated with an identifier. Perhaps the most critical identifier operationally is the DOI, most commonly assigned through DataCite13 (see Box 4.5). It is important to recognize, however, that many large, important State 2 (active repository) data aggregations also assign identifiers outside of the DOI system; an example would be GenBank14 sequence identifiers, which are widely used in the scientific literature.
13 The website for DataCite is https://datacite.org/, accessed December 5, 2019.
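To make the DOI machinery concrete, the sketch below shows how a dataset DOI maps onto a global resolution URL and onto a metadata lookup against DataCite's public REST API. The DOI used here is a made-up illustration, and the API endpoint, while DataCite's documented public one, should be confirmed against current DataCite documentation before use.

```python
# Sketch: building resolution and metadata URLs for a dataset DOI.
# The DOI below is a hypothetical example, not a registered identifier.
from urllib.parse import quote

def doi_resolution_url(doi: str) -> str:
    """Return the global handle-resolution URL for a DOI."""
    return f"https://doi.org/{quote(doi)}"

def datacite_metadata_url(doi: str) -> str:
    """Return the DataCite REST API URL for a DOI's metadata record."""
    return f"https://api.datacite.org/dois/{quote(doi)}"

doi = "10.1234/example.dataset"   # hypothetical identifier
print(doi_resolution_url(doi))    # https://doi.org/10.1234/example.dataset
print(datacite_metadata_url(doi))
```

Note that the resolution URL and the metadata URL are distinct services: the former redirects to wherever the data steward currently hosts the object, while the latter returns the registered descriptive metadata, a separation that matters for long-term preservation planning.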
Personal Identifier Systems, Standards, and Governance
Support for the many personal identifier systems is important for the production of good metadata and for accurate discovery by searching those metadata. For example, productive reuse of data, once identified, often depends on being able to identify a point of contact associated with those data if the data are not compliant with current standards, if the metadata are not complete, or if there is some other query about the data. Assigning persistent identifiers (PIDs) to researchers and including that information in data sets therefore becomes important. An example of an organization that governs and assigns such PIDs is ORCID (Open Researcher and Contributor ID).15 Repositories will need to understand the costs of using PIDs to identify contributors and contact people, and researchers will need to be trained in proper maintenance of their PIDs so that they may continue to be tracked if, for example, they change institutions. Using unambiguous PIDs, rather than normalizing personal names and dealing with variant name forms, will more efficiently provide better results when describing and searching.
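One reason unambiguous PIDs beat name normalization is that an identifier can be mechanically validated. The sketch below checks an ORCID iD's check digit using the ISO 7064 MOD 11-2 algorithm that ORCID's own support documentation describes; the iD shown is ORCID's published example record, not a real researcher's credential introduced here.

```python
# Sketch: validating the check digit of an ORCID iD via ISO 7064 MOD 11-2,
# the checksum documented by ORCID for its 16-character identifiers.
def orcid_check_digit(base_digits: str) -> str:
    """Compute the MOD 11-2 check character for the first 15 digits."""
    total = 0
    for d in base_digits:
        total = (total + int(d)) * 2
    remainder = total % 11
    result = (12 - remainder) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid: str) -> bool:
    """Check structure and checksum of a hyphenated ORCID iD."""
    digits = orcid.replace("-", "")
    if len(digits) != 16 or not digits[:15].isdigit():
        return False
    return orcid_check_digit(digits[:15]) == digits[15]

# ORCID's documented example iD:
print(is_valid_orcid("0000-0002-1825-0097"))  # True
```

A repository ingest pipeline can run a check like this at submission time, catching transcription errors before a bad contributor identifier is baked into preserved metadata.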
Discovery Systems
It is important to recognize that, just as the formulation of preserving and storing data sets and related metadata is often oversimplified, oversimplification pervades the discussion of data discovery. The problem of aggregating and searching metadata for data sets held in a collection of repositories (e.g., the National Science Foundation Data Observation Network for Earth project for ecological and environmental data)16 is complex. As the number of information resources multiplies, discovery systems that allow people to find relevant resources will be needed and will need to be supported. It remains unclear who will build, operate, and sustain the key discovery systems. As indicated, DataCite operates a registry, but its searching capabilities are somewhat limited. Google has built Google Cloud Public Datasets17 and Amazon Web Services has built the Registry of Open Data,18 although these are still best viewed as experimental. Many State 2 systems offer some kind of searching over the information that they host, but those capabilities do not extend to other repositories. The literature offers an important pathway to resource discovery, and the National Center for Biotechnology Information has invested heavily over the years in interconnecting PubMed19 with some major State 2 platforms. As platforms for State 2 data aggregations multiply, there is also going to be a growing need for discovery tools, training, and outreach related to these resources.
14 The website for GenBank is https://www.ncbi.nlm.nih.gov/genbank/, accessed December 5, 2019.
15 The website for ORCID is https://orcid.org/, accessed December 5, 2019.
16 The website for the Data Observation Network for Earth project is https://www.dataone.org/, accessed December 5, 2019.
17 The website for Google Cloud Public Datasets is https://cloud.google.com/public-datasets/, accessed December 5, 2019.
18 The website for the Registry of Open Data is https://registry.opendata.aws/, accessed February 12, 2020.
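The basic pattern behind the cross-repository discovery systems discussed above, harvesting minimal metadata records from many sources into one searchable index, can be illustrated with a deliberately toy sketch. The repository names, DOIs, and titles below are invented for illustration; real aggregators face the hard parts this sketch omits (heterogeneous schemas, incomplete metadata, and scale).

```python
# Toy sketch of metadata aggregation for discovery: records harvested
# from several (invented) repositories, searched by keyword.
records = [
    {"repo": "repo-a", "doi": "10.1234/a1", "title": "Mouse cortex imaging data"},
    {"repo": "repo-b", "doi": "10.1234/b7", "title": "Human genome variant calls"},
    {"repo": "repo-c", "doi": "10.1234/c3", "title": "Cortex electrophysiology recordings"},
]

def search(index, keyword):
    """Return (repo, doi) pairs whose title mentions the keyword."""
    kw = keyword.lower()
    return [(r["repo"], r["doi"]) for r in index if kw in r["title"].lower()]

print(search(records, "cortex"))  # [('repo-a', '10.1234/a1'), ('repo-c', '10.1234/c3')]
```

Even this toy version makes one cost driver visible: the index is only as searchable as the harvested metadata are complete and consistent, which is why metadata quality recurs throughout this chapter as a long-term cost and value issue.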
Managed Vocabularies and Ontologies
Standards and best practices for describing biomedical data objects rely not only on the use of identifiers, as previously discussed, but also on tools such as managed vocabularies and ontologies (i.e., knowledge structures). Many of these are highly specific to particular biomedical applications, and they often need regular maintenance to reflect new scientific developments. For example, NLM designed and maintains the MeSH thesaurus,20 which serves to index the data in multiple NLM databases, including PubMed. NLM also provides a resource, the Unified Medical Language System (UMLS),21 that interlinks more than 200 terminologies in the biomedical domain. Example terminologies include MeSH for the literature, SNOMED International22 for clinical applications, and the Gene Ontology Resource23 for genetic data.
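Terminologies such as MeSH are exposed through programmatic interfaces, which is part of what makes their ongoing maintenance valuable. As a minimal sketch, the code below composes a query URL against NCBI's E-utilities `esearch` endpoint for the MeSH database; E-utilities is NCBI's documented public API, but parameter details should be verified against the current E-utilities documentation, and no network request is actually made here.

```python
# Sketch: composing an NCBI E-utilities esearch URL to look up a term
# in the MeSH vocabulary. The endpoint is NCBI's documented public API;
# confirm current parameters in the E-utilities documentation.
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def mesh_search_url(term: str) -> str:
    """Build an esearch URL against the MeSH database for one term."""
    return EUTILS + "?" + urlencode({"db": "mesh", "term": term, "retmode": "json"})

print(mesh_search_url("neoplasms"))
```

For a cost forecaster, the relevant point is that such interfaces, and the curation behind the vocabularies they expose, are recurring operations-and-maintenance costs borne by stewardship institutions such as NLM rather than by individual projects.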
19 The website for PubMed is https://pubmed.ncbi.nlm.nih.gov/, accessed December 5, 2019.
20 The website for the MeSH thesaurus is https://www.nlm.nih.gov/mesh/meshhome.html, accessed December 5, 2019.
21 The website for the UMLS is https://www.nlm.nih.gov/research/umls/index.html, accessed December 5, 2019.
22 The website for SNOMED International is http://www.snomed.org/, accessed December 5, 2019.
23 The website for the Gene Ontology Resource is http://geneontology.org/, accessed December 5, 2019.