
Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs (2020)


Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.

5

Applying the Framework to a New State 2 Data Resource

In its statement of task (see Box 1.1), the committee was asked to apply the cost-forecasting framework to two case studies relevant to the National Library of Medicine’s (NLM’s) data resources. The case studies presented are based on hypothetical examples provided by NLM to the committee (personal communication, E. Kittrie, January 4, 2019). This chapter presents the first use case, which describes decisions by a policy maker, program officer, or research group to estimate costs for a new repository hosting a large amount of data. Although the scenarios presented in this chapter and Chapter 6 represent traditional research environments, the framework could be applied to different research scenarios.

These case studies are not quantitative cost analyses. A quantitative forecast of a real-life scenario would require greater resources and time than the committee had available, and values obtained for the hypothetical cases would be meaningless given the number of variables presented by the data, the institutions involved, the resources available, the requirements of the funding entity, and so on. Instead, the committee provides high-level examples of how an investigator, a data-resource developer, or a resource manager could use the framework to systematically identify all the cost components, which they can then use to develop their own meaningful forecast. In a true quantitative forecast, many more details, as dictated by circumstance, would need to be considered.

Cost forecasters creating or managing a State 1 (primary research) or State 2 (active) platform are unlikely to be able to quantify the costs of the data beyond the duration of their respective research projects. However, as stated in Chapter 4, decisions made in early data states may affect the effectiveness and efficiency of future data preservation, curation, and use, so understanding the potential future value of the data matters. Considering the life cycle of data beyond the current data state, and the resources necessary to transition between states, will become increasingly important as data sets become bigger and more complex. Future cost ramifications could inform near-term decisions.

The cost forecasts take advantage of the cost-driver template in Appendix E. This template, based on the cost drivers in each state outlined in Table 4.2, assists in developing a narrative regarding the data life cycle. The approach to forecasting costs for the use cases follows the basic steps described in Table 4.1.


USE CASE 1: ESTIMATING COSTS ASSOCIATED WITH SETTING UP A NEW DATA REPOSITORY FOR THE U.S. BRAIN INITIATIVE

The cost-forecasting framework is applied to a new State 2 (active repository) platform. The study committee applied the framework as a likely cost forecaster, in this case a neuroscientist (see Box 5.1), would, and provides some information about existing platforms for context (see Box 5.2).


Applying the Framework to Use Case 1

Using the forecasting steps provided in Table 4.1, the researcher begins to construct the cost forecast.

Step 1. Determine the type of data resource environment, its data state(s), and how data might transition between those states during the data life cycle.

The first two columns in Table 5.1 list the data archive requirements as specified in the request for application (RFA). After considering the activities associated with each of the data states described in Tables 2.1, 2.2, and 2.3, the researcher concludes


TABLE 5.1 Specific Services Specified in the Request for Application Mapped to Data States, Activities, and Subactivities

Research Objective Number Archive Requirements as Specified in the BRAIN Initiative RFA-MH-17-255 States, Activities, and Subactivitiesa
1 The data archive is expected to use relevant standards that describe BRAIN Initiative experiments. Such standards may be developed under RFA-MH-17-256 or may already exist. II.A.1, II.B.1
2 A data archive will develop a data submission pipeline ensuring appropriate quality control standards for laboratories that are trying to upload data. For example, if an experimental standard defines an allowable range of values for a particular data element, the submission pipeline should make sure that uploaded data respect the current data standard. II.B.1, II.B.7, II.E.2
3 Ideally, the data archive will create both a submission pipeline and a related validation tool to allow researchers to check the quality of their data even if they are not trying to upload data. . . . Data submission pipelines that originate with the data-collection instrument in the depositor’s laboratory and require minimal manual intervention would be ideal but are not required. II.B.1, II.B.7, II.C
4 A data archive will work closely with BRAIN Initiative awardees and others to collect and archive relevant data sets. II.A.3, II.D.1, II.D.2
5 Each data archive should plan for a help desk to work with those who are trying to upload data. II.I.2
6 Each data archive must develop plans to make the data readily available to the broad research community and to citizen scientists, as appropriate. II.B, II.I
7 Depending on the type of data, data submission agreements and data access agreements may be necessary. II.D.2, II.D.3
8 In many cases, processed data may be as useful to the research community as the raw data produced in the laboratory. Each data archive should consider storing and curating the appropriate data (either raw or processed) and make them available to the community. II.E, II.H
9 A data archive may propose evaluating deposited data and scoring them to allow the research community to have some guidance about data quality. II.E.2, II.E.3
10 Each data archive should plan to assign persistent identifiers to deposited data and to processed data to allow the research community a very easy way to cite the data sets that are being used. II.E.5
11 A data archive should allow researchers to have a space where they can share data privately to facilitate collaboration prior to publication. Such private enclaves must last for only a defined period of time before that data set is shared with the rest of the research community. II.B.9
12 A data archive may help users deposit data into other sustainable databases, such as those supported by the National Center for Biotechnology Information, but this is not a requirement. II.I.2, II.L.3
13 There may be cases where data are stored in more than one data archive. In those cases, a data archive funded under this funding opportunity announcement will ensure that the user community can find all relevant data using appropriate linkages or database federation strategies no matter where the data are actually stored. II.F.2
14 Furthermore, each data archive will provide an interface that is accessible to anyone with a web browser. II.B.7, II.H.3, II.H.4
15 A data archive will make appropriate query tools and summary data easily available to allow the research community to check whether data of interest are held in the archive. II.B.3, II.B.4, II.B.8, II.B.10, II.H
16 The user interface should make the maximum amount of information available to the research community while considering user friendliness and ease of interpretation. II.B.7, II.H
17 The website is expected to have a broad user base that will include both naïve users and experienced bioinformaticians, and should provide an interface that will accommodate both types of users. II.B.7, II.H
18 In many cases, users will want to analyze or use visualization tools to interact with the data without downloading any data. Those interactions should be anticipated by the data archive. II.B.3, II.B.5
19 Expensive computations could result from some analysis activities, and the data archive should explain plans to deal with such eventualities. II.B.3, II.B.4, II.B.5, II.I
20 A data archive may, but is not required to, use cloud storage and computing capabilities to enable the research community to analyze data without downloading them. A data archive should (but is not required to) allow users to bring their own analysis tools to the data. II.A.3, II.B.1, II.K.1
21 Each data archive will be expected to have staff who are knowledgeable about informatics and the experimental data being collected. The informaticists will be responsible for coordination with other relevant informatics efforts. II.B
22 In particular, a data archive will be expected to identify and federate the archive with other data repositories and knowledge bases, as appropriate. II.F.2, II.F.3
23 This data archive integration should create ways for users to query all relevant data repositories for relevant information. Funded data archives will be members of a larger BRAIN Initiative Data Network that will work across BRAIN Initiative activities to promote integration of a variety of data types. II.F.2, II.F.3
24 In addition, the data archive will interact, as appropriate, with informatics activities outside the BRAIN Initiative such as the NIH Big Data to Knowledge effort and the work of the International Neuroinformatics Coordinating Facility (INCF). II.A
25 When possible, a data archive is expected to use existing infrastructures and standards. These could include persistent identifiers such as Digital Object Identifiers (DOIs) or Resource Identifiers. II.B.10, II.E.4, II.E.5

a Activities are defined in Tables 2.1, 2.2, and 2.3 of this report. The Roman numeral refers to the data state, the capital letter refers to the major activity, and the Arabic numeral refers to the subactivity. The cost forecaster can use this information to consult Table 4.2 to identify likely cost drivers for each activity.

that the proposed data resource will be an active repository and platform and therefore a State 2 resource. The researcher begins to match activities associated with a State 2 (Table 2.2) resource with the specific research objectives described in the RFA and lists which State 2 activities found in Table 2.2 would be necessary to accomplish each of the research objectives (the third column in Table 5.1). The costs and cost drivers associated with the activities will be revisited later in the cost forecast by consulting Table 4.2.
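The activity codes in the third column of Table 5.1 follow the state.activity.subactivity convention described in the table note (Roman numeral, capital letter, Arabic numeral). A forecaster compiling the mapping could decompose the codes programmatically; a minimal sketch, in which the function name and the example code list are illustrative rather than taken from the report:

```python
# Decompose activity codes such as "II.B.7" into (state, activity, subactivity),
# following the convention described in the note to Table 5.1. Illustrative sketch.
ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4}

def parse_activity_code(code):
    """Split a code like 'II.B.7' or 'II.B' into its components.

    Missing components (e.g., no subactivity) are returned as None.
    """
    parts = code.split(".")
    state = ROMAN[parts[0]]                            # data state (Roman numeral)
    activity = parts[1] if len(parts) > 1 else None    # major activity (letter)
    sub = int(parts[2]) if len(parts) > 2 else None    # subactivity (number)
    return state, activity, sub

# Codes mapped to research objective 2 in Table 5.1
objective_2 = ["II.B.1", "II.B.7", "II.E.2"]
parsed = [parse_activity_code(c) for c in objective_2]
# parsed → [(2, 'B', 1), (2, 'B', 7), (2, 'E', 2)]
```

Parsed this way, the codes can be grouped by activity to look up the corresponding cost drivers in Table 4.2.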

Because the researcher is interested in preserving the long-term value of the data and increasing the efficiency and effectiveness of their long-term curation and use, the researcher also considers activities related to eventual transfer of data to another State 2 resource or long-term State 3 archive. These latter considerations were not activities specified in the RFA.

Step 2. Identify the characteristics of the data (Chapter 4), data contributors, and users.

The next sections summarize a high-level consideration of this step, although, in reality, this step would be revisited several times as resources are characterized; choices about the repository are refined; and characteristics of the data, data platform, and contributors and users are better defined through use of the template in Appendix E.


Data Characteristics (Sections A and E of the Cost-Driver Template in Appendix E)

  • The files are large: terabytes (TB) per individual data set.
  • There are many files, and individual files are large. Raw data may be contained in thousands of individual files. For example, a single serial-section electron microscopy data set covering less than 0.5 mm³ of cortex by Bock et al. (2011) comprised 36 TB of raw data and 10 TB after processing to stitch the individual tiles together and reconstruct the volume.
  • Sizes are likely to increase over the life span of the resource.
  • There are multiple modalities.
  • The data are complex: two-, three-, and four-dimensional images.
  • There are significant metadata requirements.

Because of the rapid development in algorithms for processing and reconstructing the data, both raw and processed data will likely need to be stored, and compression algorithms for high-resolution scientific imaging data are likely to interfere with the reuse of the data for many applications. Imaging will likely be from animal subjects, minimizing costs associated with security and confidentiality (Section H in Table 2.2). The repository has decided that all data will be offered under the same license, minimizing any costs associated with enforcing multiple permissions.

Contributors/User Community (Section F of the Cost-Driver Template in Appendix E)

Assuming 200 BRAIN-funded users submit data twice a year, 400 independent submissions per year could be expected. Contributor support needs will likely be high, given data complexities and size, particularly in the early years when data validation and upload pipelines may not be fully mature. Data contributors will likely have a sense of urgency to upload backlogs of data before their grant funding runs out. If standards and best practices are not fully in place when the resource begins to acquire data, then a backlog of data will need curation. The resource has to decide whether to devote extra staff and funding to re-curating those data when standards and tools are in place.
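The submission assumptions above translate into a rough annual ingest estimate. A sketch, in which the 5-TB average submission size is an illustrative assumption (the text says only that submissions run to multiple TB):

```python
# Rough annual ingest estimate from the assumptions in the text.
# The 5-TB average submission size is an illustrative assumption.
contributors = 200          # BRAIN-funded submitting groups
submissions_per_year = 2    # submissions per contributor per year
avg_submission_tb = 5       # assumed average submission size (multiple TB)

annual_submissions = contributors * submissions_per_year   # 400 per year
annual_ingest_tb = annual_submissions * avg_submission_tb  # 2,000 TB = 2 PB

print(f"{annual_submissions} submissions/yr, ~{annual_ingest_tb / 1000:.0f} PB/yr ingest")
```

Even at a conservative per-submission size, ingest on the order of petabytes per year follows directly from the contributor assumptions, which is why storage and transfer costs dominate later steps of the forecast.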

The user community is also expected to be diverse, with scientists working in different environments. The RFA requires the resource to work closely with contributors (Table 5.1, research objective 4), maintain a help desk (research objective 5), and make data available to the broad research community, including citizen scientists as appropriate (research objective 6). Given the range of user skills to be accommodated, and the cost of developing intuitive user interfaces to do so, the need for general help, training, and outreach materials is likely to be high. Frequent updating of help materials may be necessary during early phases, when the technology is changing on a regular basis.

Step 3. Identify the current and potential value of the data and how the data value might be maintained or increased with time.

The perceived and long-term value of data can be informed by answers to Sections A, D, and E of the cost-driver template in Appendix E and through consultation with experts and colleagues. The perceived long-term value of the data in the proposed resource will depend on hard-to-estimate factors. Some data will derive from new and rapidly developing techniques. Colleagues and experts think that some data may be superseded as technologies improve. On the other hand, data that are the result of a complex experimental paradigm—for example, the carefully correlated light and electron microscopy work of Bock and others (2011)—may be quite valuable even if of lower quality. Well-annotated imaging data tend to be interpretable and usable for a long time in different contexts. The long-term value cannot be estimated at the time of proposal preparation, making decisions about the level of replication and access (e.g., transfer to a less expensive form of storage) difficult in the early stages.


Step 4. Identify the personnel and infrastructure likely necessary in the short and long terms.

The forecaster considers more deeply which of the activities described in Table 2.2 are relevant to the RFA requirements (Table 5.1) and the proposed database. This resource will not handle sensitive information; thus, some activities will not be necessary. The forecaster next considers how these activities might be accomplished (and by whom), again referring to the list of expertise included for each activity in Table 2.2. To estimate personnel costs, the forecaster needs to consider how long each task will take, the skill levels necessary, the availability of people with the skills, and the tool support or training necessary for the people to perform the tasks. In reality, many of the positions listed in Table 2.2 are likely not to be included or consulted when setting up a typical researcher-led scientific infrastructure, but it is worth considering the value of including them and how their involvement influences overall cost. For example, many new resources struggle with metadata. Consulting a data librarian or records specialist early may help to reduce cost, improve quality, and increase FAIRness by providing advice on community standards for high-level metadata and specialized metadata schemas.

Because the repository infrastructure already exists in this fictional use case, many setup costs might be reduced, but significant customization will likely be necessary. If the infrastructure had to be developed from scratch, the forecasters might consider whether instances of existing infrastructure could be set up or whether they could partner with an existing repository to provide the back-end infrastructure.

Step 5. Identify the major cost drivers associated with each activity based on the steps above, including how decisions might affect future data use and its cost.

Table 4.2 is consulted to identify the cost drivers often associated with a State 2 resource, and the cost-driver template found in Appendix E is completed (the template is based on the cost-driver questions found in Chapter 4). The completed template is presented as Table 5.2, following the discussion of Use Case 1, below. The completed template will help the forecaster determine which decision points will likely control costs now and in the future, and it will help the forecaster understand when specific costs will be borne and by whom.

The responses to the cost-driver questions shown in Table 5.2 allowed the forecaster to create a narrative identifying exactly what will be involved in establishing the State 2 (active repository) resource. From that narrative, the forecaster could determine how influential each of the respective costs is likely to be in the overall cost (listed below). In a quantitative cost forecast, the costs for the activities could then be quantified and each of the major cost components (e.g., Box 3.2) worked out.

  • A: Content → Likely high
  • B: Capabilities → Likely medium-high
  • C: Control → Likely medium
  • D: External Context → Likely low
  • E: Data Life Cycle → Likely high
  • F: Contributors and Users → Likely high
  • G: Availability → Likely medium-high
  • H: Confidentiality, etc. → Likely low
  • I: Maintenance and Operations → Likely low
  • J: Standards, etc. → Likely medium

Step 6. Estimate the costs for relevant cost components based on the characteristics of the data and information resource.

As noted elsewhere in the report, the ability to estimate actual costs depends on so many factors that the committee elected not to attempt this exercise. How data size can influence costs can be illustrated by using cost estimators provided by commercial cloud services (in this case, the Amazon Simple Storage Service1 cost tools). Beyond a certain threshold, absolute size may not impose many additional per-unit costs for cloud storage. For example, as of this writing, the cost to store up to 50 TB is $0.023 per gigabyte (GB) per month, whereas storing more than 500 TB brings the cost down to $0.021 per GB, according to that cost estimator. However, given the anticipated growth of the data, the storage cost is not insignificant in absolute terms. For one petabyte (PB) of data, the cost, absent any institutional discounts, would be about $21,000 per month, with additional charges depending on level of access. Costs over time will also need to be considered: cloud service prices may change, or circumstances may warrant a change to a different provider with different cost structures, services, and data-formatting requirements. The size of the data may also impose costs for functions such as external backup, replication, and data transfers (G.4), depending on what infrastructure is available to the resource. The forecaster will want to compare the full costs of storage from multiple service providers, including the fully loaded costs of local computer resources.
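The per-gigabyte arithmetic can be reproduced in a few lines; a sketch using the rates quoted above (actual S3 pricing has more tiers, adds request and egress charges, and changes over time):

```python
# Reproduce the storage-cost arithmetic in the text using the quoted
# S3 standard rates: $0.023/GB-month up to 50 TB, $0.021/GB-month
# beyond 500 TB. Intermediate tiers are omitted for simplicity.
def monthly_storage_cost(total_gb, rate_per_gb_month):
    """Flat-rate monthly storage cost for a data set of total_gb gigabytes."""
    return total_gb * rate_per_gb_month

one_pb_gb = 1_000_000  # 1 PB in GB (decimal units)
cost = monthly_storage_cost(one_pb_gb, 0.021)
print(f"1 PB at $0.021/GB-month: ${cost:,.0f}/month")  # $21,000/month
```

The same function applied at the 50-TB rate ($0.023) shows how little the per-unit discount matters next to raw volume growth: each additional petabyte adds roughly $21,000 per month regardless of tier.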

The complexity of the data can also impose significant costs. The capabilities related to functional specification and implementation (Activity II.B) will need to be developed or modified and maintained, and standards for multiple data types and paradigms will need to be developed or implemented. These functions may need to be multiplied by the number of data types and modalities to be supported, depending on how well the tool set generalizes.

Once the resource is mature and data access and use patterns emerge, significant cost savings may be realized by moving unused or obsolete data to cold storage. Again, commercial cloud provider cost tools illustrate how storage costs are affected by access and responsiveness requirements. Under the Amazon Web Services S3 Intelligent-Tiering pricing model, designed for data whose access patterns are infrequent or unknown, storage that is accessed at high frequency costs $0.021-$0.023 per GB per month (for volumes in the 50-500 TB range) but only $0.0125 per GB if infrequently accessed. If the data are known to be infrequently accessed and users can tolerate slow retrieval times (minutes to hours), the storage cost drops to $0.004 per GB. For a 500-TB data set, the monthly storage cost would drop from $10,500 to $2,000. Cold-storage options are best considered during the first funding period and in consultation with the community served so that expectations are clear.
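The tier comparison amounts to multiplying the data-set size by each quoted per-GB monthly rate; a sketch with the rates as quoted in the text (as of the report's writing):

```python
# Compare the quoted per-GB monthly rates for a 500-TB data set:
# high-frequency access ($0.021), infrequent access ($0.0125), and
# cold storage with slow retrieval ($0.004). Rates as quoted in the text.
dataset_gb = 500_000  # 500 TB in GB (decimal units)

rates = {"frequent": 0.021, "infrequent": 0.0125, "cold": 0.004}
for tier, rate in rates.items():
    print(f"{tier:>10}: ${dataset_gb * rate:,.0f}/month")
# frequent: $10,500/month; cold: $2,000/month (the savings described above)
```

The roughly fivefold difference between the high-frequency and cold tiers is why access-pattern monitoring is worth building into the resource from the start.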

The data and the user community characteristics will also be a major determinant of decisions about infrastructure for hosting and accessing the data, as well as the necessary user support levels. The RFA does not require that the resource use the cloud; however, the large size of the data and the unknown growth characteristics of both the data and the user community make the cloud attractive, because it can scale with increasing demand. Costs associated with data transfers, search, computational services, and downloads will need to be carefully monitored. Costs might be driven by unexpected demand surges (e.g., a data set is posted on social media and is heavily accessed). This fictional use case will protect itself from unexpected and uncontrollable charges by passing the cost for download to the end user. Many cloud providers now provide tools and safeguards for monitoring and limiting costs. Taking advantage of local or government programs (e.g., Cloudbank and the Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability) to make informed decisions and gain access to expertise about building platforms in the cloud could also lower costs.

Last, although the RFA does not specify that the resource should develop an exit strategy, thinking about the long-term data and resource viability is good stewardship. Bilder and others (2015), in their principles of open scholarly infrastructures, recommend that every resource have a “living will” that describes how a resource would wind down. In the proposed large (multiple PB) hypothetical BRAIN repository, the costs of transferring data to another State 2 active repository or to a long-term archive could be significant.

REFERENCES

Bilder, G., J. Lin, and C. Neylon. 2015. “Principles for Open Scholarly Infrastructure-v1.” https://doi.org/10.6084/m9.figshare.1314859.

Bock, D.D., W.-C. Allen Lee, A.M. Kerlin, M.L. Andermann, G. Hood, A.W. Wetzel, S. Yurgenson, et al. 2011. Network anatomy and in vivo physiology of visual cortical neurons. Nature 471(7337):177-182.

___________________

1 See Amazon Web Services, “Amazon S3 Pricing,” https://aws.amazon.com/s3/pricing, accessed December 16, 2019.


TABLE 5.2 Completed Cost-Driver Template for Use Case 1: Setting up a BRAIN Archive

Category Cost Driver Decision Points/Issues Relative Cost Potential (Low, Medium, High)
A. Content
A.1 Size (volume and number of items)

Greater size = higher costs
  1. How many files will be in a single data submission?
    Varies, likely from 10 to 10,000.
  2. How large is an average data submission in total?
    Multiple TB.
  3. Are the data sizes likely to stay stable over the life of the resource?
    No, file sizes will likely increase as technologies are developed.
  4. What is the total amount of data expected?
    PBs.
  5. In what kind of medium will data be captured in the short and long terms?
    Data will be captured in cloud storage in both the short and long terms.
H
A.2 Complexity and Diversity of Data Types

Greater complexity + diversity = higher cost
  1. How complex is the underlying structure of the data?
    Complex-image data.
  2. How are the included data to be organized?
    To be determined after interviewing funded investigators. Likely individual data sets that include the raw and processed data, but need to determine whether the data should be organized according to studies or projects.
  3. How complex is the experimental paradigm that produced the data?
    Varies—some simple acquisitions; some associated with complex behavioral paradigms.
  4. What sort of additional files might be necessary to upload with the data to properly understand them?
    Experimental protocols, fiducial maps.
  5. How many different data types are being produced?
    Multiple types of imaging data (multimodal data: light and electron microscopy, with multiple microscopy types within each; correlated physiology and genomics).
  6. What are the relationships among these data types (e.g., are the data correlated)?
    Some correlated data sets; related data sets will be deposited in the appropriate repository.
H
A.3 Metadata Requirements

Greater metadata amounts + types = higher cost
  1. How much metadata must be stored with each data object to make them FAIR?
    Basic descriptive metadata, imaging parameters, experimental metadata, processing metadata, anatomical metadata.
  2. Will metadata be entered manually by the submitter/curator?
    Yes, by both submitters and curators.
  3. Will the data to be deposited include a data schema, or will one be generated?
    Eventually, a common schema will be created for all data based on standards such as Open Microscopy Environment and those deriving from BRAIN.
  4. Is the provenance of a data set sufficiently described, or will it need to be?
    It will need to be described, as the processing pipelines are not standardized.
  5. How much metadata can be extracted computationally?
    Imaging parameters may be able to be extracted from image headers.
M
A.4 Depth Versus Breadth

Greater breadth = higher cost
Will the repository be restricted to certain data classes or types that the repository must support?
It will primarily focus on imaging data, but there will be multiple types of imaging data from multiple domains.
M
A.5 Processing Level and Fidelity

Greater compression = lower cost
  1. Do the raw data need to be stored?
    Yes, as the reconstruction algorithms are rapidly developing.
  2. Do processed data need to be stored?
    Yes, it would be computationally intractable to reconstruct two-, three-, and four-dimensional data sets each time a user accesses them. Some data may be mapped to a common coordinate framework, and both the raw and aligned data will likely be stored.
  3. Are there compression algorithms that can reduce the file size without compromising fidelity?
    Some, but generally they result in signal loss, so it is not advised.
  4. What kind of data structure requirements will the resource have?
    The goal is a common data structure to organize the large numbers of raw files and any derived data, e.g., reconstructions.
  5. Is the data contributor or the repository responsible for any restructuring necessary?
    Data contributor.
  6. How is the data structure verified?
    A validator will be developed as per the RFA. It may not be ready in year 1 because of testing of data set structures based on data likely to be received.
H
A.6 Replaceability of Data

Greater replaceability = lower cost
  1. Is the archive the primary steward of the data, or do copies exist elsewhere?
    The resource is expected to assume stewardship of the data.
  2. Can the data be easily recreated?
    Not currently, but possibly in the future.
H
B. Capabilities
B.1 User Annotation

More user annotation functions = higher cost
  1. Will the repository have to provide user annotation capabilities?
    Given the size of the data, it would be ideal to have annotation capability available to the user base.
  2. What is the nature of these annotations?
    Anatomical delineations, molecular distributions.
  3. Are they provided by humans or machines, and how will they be authenticated?
    Mainly through machine-based segmentation. Users will have an account to annotate; annotations will be tied to their Open Researcher and Contributor Identifiers.
  4. Are permissions required to annotate the data?
    No, the data will be in the public domain and they are free to annotate, although stored annotations will have to be attributed to the individual researcher.
Relative cost potential: High
B.2 Persistent Identifiers

type of identifier = potential costs
  1. What persistent identifier (PID) scheme will be used by the archive?
    DOIs for data sets and for reconstructed images/volumes.
  2. Is there a cost associated with issuing the PID?
    Yes, but covered by institutional membership to DataCite.
  3. How many objects need to be identified?
    DOIs will be issued for data sets and reconstructions, but not for individual files; therefore, two to five identifiers per data set.
  4. Who will be responsible for keeping the PIDs resolvable?
    The database administrator will be responsible for notifying DataCite of any changes in data object location.
Relative cost potential: Low
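Because registration goes through DataCite (B.2, items 1-2), each data set or reconstruction DOI would be created via the DataCite REST API. The sketch below only assembles the JSON payload; the prefix, names, and URL are placeholder values, and the actual POST (shown in a comment) would require the repository's DataCite credentials.

```python
def build_doi_payload(prefix: str, title: str, creators: list[str],
                      publisher: str, year: int, url: str) -> dict:
    """Assemble a DataCite JSON:API payload for registering a data set DOI.

    event="publish" requests a findable DOI; omitting an explicit "doi"
    attribute lets DataCite assign a random suffix under the prefix.
    """
    return {
        "data": {
            "type": "dois",
            "attributes": {
                "prefix": prefix,
                "event": "publish",
                "titles": [{"title": title}],
                "creators": [{"name": n} for n in creators],
                "publisher": publisher,
                "publicationYear": year,
                "types": {"resourceTypeGeneral": "Dataset"},
                "url": url,
            },
        }
    }

# Registration itself is an authenticated POST, e.g. with the requests library:
#   requests.post("https://api.datacite.org/dois", json=payload,
#                 auth=(repository_id, password),
#                 headers={"Content-Type": "application/vnd.api+json"})
payload = build_doi_payload("10.5072", "Example EM volume, mouse cortex",
                            ["Doe, Jane"], "Example BRAIN Archive", 2020,
                            "https://example.org/datasets/1")
```

The database administrator's obligation to keep PIDs resolvable (item 4) then amounts to updating the url attribute whenever a data object moves.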
Category Cost Driver Decision Points/Issues Relative Cost Potential (Low, Medium, High)
B.3 Citation

> citation functions = increased cost
  1. Will users be able to create arbitrary subsets of data files and mint a PID for citation?
    Yes.
  2. Will the repository provide machine-readable metadata for supporting data citation?
    No.
  3. Will the repository provide export of data citations for use in reference managers?
    No.
Relative cost potential: Low
B.4 Search Capabilities

> search capabilities = increased cost
  1. Will the repository provide a search capability for data sets?
    Yes. The repository is also required to provide means to search other BRAIN repositories through the BRAIN Initiative Data Network.
  2. How much of the metadata will be included in search?
    Initially, the repository will support search via database-level metadata. More detailed fields may be added in response to user requirements.
  3. How complex are the queries that will be supported?
    Keyword search, structured search on basic metadata fields.
  4. What type of features for search will be provided?
    Synonym expansion using the Neuroscience Information Framework’s vocabulary services.
  5. Will the repository deploy services to search the data directly?
    Data-feature search capability is planned.
Relative cost potential: High
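The keyword search with synonym expansion described in B.4 (items 3-4) can be prototyped without the Neuroscience Information Framework services themselves: expand each query term through a synonym table, then require every term (or one of its synonyms) to appear in a data set's metadata. The two-entry synonym table below is a made-up stand-in for the NIF vocabularies.

```python
# Toy synonym table standing in for the NIF vocabulary services.
SYNONYMS = {
    "neuron": {"neuron", "nerve cell"},
    "em": {"em", "electron microscopy"},
}

def expand(term: str) -> set[str]:
    """Return the lowercased term together with any registered synonyms."""
    t = term.lower()
    return SYNONYMS.get(t, {t})

def search(query: str, datasets: dict[str, str]) -> list[str]:
    """Naive keyword search over data set descriptions with synonym expansion.

    A data set matches if every query term, or one of its synonyms, occurs
    in its description. (Substring matching is deliberately crude; a real
    service would tokenize and index the metadata.)
    """
    term_sets = [expand(t) for t in query.split()]
    return [ds_id for ds_id, text in datasets.items()
            if all(any(s in text.lower() for s in alts) for alts in term_sets)]
```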
B.5 Data Linking and Merging

> linking and merging = increased cost
  1. Will the data require/benefit from linkages to other related items?
    Required to link to data in other BRAIN repositories.
  2. Will the resource provide the ability to combine data across records based on common entities/standards?
    Will build a knowledge graph on top of our data records so that we can combine across data records.
Relative cost potential: High
B.6 Use Tracking

> tracking = increased cost
  1. Will the resource provide the ability to track uploads, views, and downloads?
    Yes.
  2. If so, and if made available to users, how will this information be made available?
    Tracked and displayed per data set.
  3. Will the resource track data citations to its data?
    No.
Relative cost potential: Low
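The tracking in B.6 needs little more than a counter keyed by data set and event type. A minimal in-memory sketch (a production resource would persist the counts and render them on each data set's landing page, as item 2 describes):

```python
from collections import defaultdict

class UsageTracker:
    """Tally uploads, views, and downloads per data set (B.6)."""

    EVENTS = {"upload", "view", "download"}

    def __init__(self):
        # One counter dictionary per data set, created on first use.
        self._counts = defaultdict(lambda: {e: 0 for e in self.EVENTS})

    def record(self, dataset_id: str, event: str) -> None:
        if event not in self.EVENTS:
            raise ValueError(f"unknown event: {event}")
        self._counts[dataset_id][event] += 1

    def report(self, dataset_id: str) -> dict:
        """Return the counts displayed on a data set's page."""
        return dict(self._counts[dataset_id])
```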
B.7 Data Analysis and Visualization

> services = higher cost
  1. What type of data visualization will the repository support?
    Interactive viewing of images and three-dimensional volumes using image services.
  2. What types of other data operations will the repository support (e.g., file conversions, sequence comparison)?
    Will develop an image-feature search capability.
  3. Do these services require significant computational resources?
    Yes.
  4. Who will pay for these computations?
    We will assume the costs of the search algorithms.
Relative cost potential: High
C. Control
C.1 Content Control

> review processes = increased cost
  1. Will all appropriate data be accepted, or will there be a review process?
    All relevant data from BRAIN investigators will be accepted; data from outside BRAIN will be accepted after the infrastructure is built.
  2. Will the review process be automated, or will it require human oversight?
    Data must pass our automated validation checks, but human curators will oversee the project and provide additional curation of metadata.
Relative cost potential: High
C.2 Quality Control

> quality control = increased cost
  1. What quality control processes will the repository support?
    Format and metadata review; quality control of the data and reconstructions themselves will be up to the submitter.
  2. Will these be automated or require human oversight?
    See C.1.
  3. What level of data correctness will be required, and how will it be validated?
    Data are expected to pass our validation checks with no errors.
  4. What gaps in the data at the record or field level will be tolerable?
    Difficult to estimate at this time. Most likely applicable at raw-data level—missing or corrupted files may impact the quality of the reconstruction.
  5. Will any of the data be time sensitive, and how will data currency be ensured?
    Not beyond ensuring that data referred to in publications are released after the agreed-upon embargo period.
  6. How will duplication within or between data sets be addressed?
    Given the size of the data, if any data are cross-referenced across data sets or resources, it will be in the form of a link and not a duplication of the data.
  7. Will prevalidation guidelines or routines be distributed by the resource to the data contributors?
    Yes, as per the RFA. Researchers should be able to validate their data as they are acquired.
  8. Will human curation be necessary?
    Data submissions will be monitored in the early phase to determine whether human curators will be necessary to improve the quality of the data. While the hope is that automated tools may be sufficient, some human curation likely will be necessary.
Relative cost potential: Low
C.3 Access Control

> controls = increased cost
  1. What types of access control are required for the repository (e.g., will there be an embargo period)?
    Data are public; embargo period will be provided.
  2. At what level are they instituted (e.g., individual users, individual data sets)?
    Embargo periods will be instituted for individual data sets where only specific users, including reviewers if required, will have access to them. After the embargo period, all data are public.
  3. Does use of the data require approval by a data access committee?
    No.
Relative cost potential: Medium
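The rules in C.3 reduce to a single check: during the embargo, only users named on the data set (submitters, reviewers) may read it; after the embargo passes, everyone may. A minimal sketch, with the field names assumed:

```python
from datetime import date

def can_access(dataset: dict, user: str, today: date) -> bool:
    """Embargo-based access check (C.3).

    `dataset` is assumed to carry an `embargo_until` date and the set of
    users granted access during the embargo (submitters, reviewers).
    Once the embargo passes, all data are public.
    """
    if today >= dataset["embargo_until"]:
        return True
    return user in dataset["allowed_during_embargo"]
```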
C.4 Platform Control

> platform restrictions = increased cost
Are there restrictions on the type of platform that may or must be used?
No, free to use the cloud if desired, and there are no requirements to use a specific cloud provider. Data will not be mirrored overseas.
Relative cost potential: Low
D. External Context
D.1 Resource Replication

> replication = increased cost
Is there a requirement to replicate the information resource at multiple sites (i.e., mirroring)?
No.
Relative cost potential: Low
D.2 External Information Dependencies

> external dependencies may or may not = increased cost
Will the resource be dependent on information maintained by an outside source?
Will use community ontologies for certain metadata.
Relative cost potential: Low
D.3 Distinctiveness

> distinctiveness = increased cost
Are there existing resources available that provide similar types of data and services?
Yes, the Cell Image Library. The EU also has a Bioimaging Database.a
Relative cost potential: Low
E. Data Life Cycle
E.1 Anticipated Growth

> growth = increased costs
  1. Is the repository expected to continuously grow over its lifetime?
    Yes.
  2. Is the likely rate of growth in data and services known?
    Not entirely.
  3. Is the use of the repository likely to grow over time?
    Yes.
  4. Is the likely growth of the user base known?
    No.
Relative cost potential: High
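Even with the growth rate unknown (E.1, item 2), a cost forecast can still be parameterized rather than abandoned: pick a starting volume, an assumed annual growth rate, and a per-terabyte price, then project the curve under several scenarios. All numbers below are placeholders, not estimates for this resource.

```python
def project_storage_costs(start_tb: float, annual_growth: float,
                          usd_per_tb_year: float, years: int) -> list[float]:
    """Project yearly storage cost for a repository growing at a fixed rate.

    annual_growth is fractional (0.5 = 50% more data each year). Returns one
    cost figure per year. Illustrative only: a real forecast would also model
    declining storage prices and tiering to cheaper storage (see E.4).
    """
    costs = []
    volume = start_tb
    for _ in range(years):
        costs.append(volume * usd_per_tb_year)
        volume *= 1 + annual_growth
    return costs
```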
E.2 Update and Versions

> updates + multiple versions = increased cost
  1. Will the deposited data require updates (e.g., in response to new data or error corrections)?
    Yes, some data will be submitted in batch mode, as the data need to be deposited at regular intervals. A policy on error correction will be developed (i.e., related to when corrections trigger a new DOI).
  2. Will prior versions of the data need to be retained and be made available locally or in a different resource?
    Yes, if the data are in the public domain; no, if they are in the embargo phase.
  3. How frequently will individual data sets be updated?
    Unknown.
Relative cost potential: High
E.3 Useful Lifetime

limited lifetime = decreased cost
  1. Are the data to be housed likely to have a limited period of usefulness?
    Hard to predict; later acquisitions likely to have longer periods of usefulness.
  2. Does the resource have a defined period of time for which it will operate?
    No.
  3. Does the resource have to provide a guarantee that the data will be available for a finite period of time (e.g., 10 years)?
    No, there is no set period specified.
Relative cost potential: Low
E.4 Offline and Deep Storage

> offline/deep storage = decreased costs

> transfers = increased cost
  1. Can the resource take advantage of offline storage for data that are not heavily used?
    As the resource grows, access to data sets will be monitored. Those not accessed heavily will be moved to less expensive storage.
  2. Does the resource have a plan for moving unused data to deep storage (i.e., State 3)?
    At the end of the life span of the project, data will be moved to a suitable State 3 archive; however, the specific archive has not yet been identified.
Relative cost potential: High
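The monitor-then-demote policy in E.4 (item 1) can be expressed as a rule over last-access dates. The 180-day idle threshold below is an assumption; in practice it would be tuned against observed access patterns, storage prices, and retrieval latency.

```python
from datetime import date, timedelta

def select_for_cold_storage(last_access: dict[str, date], today: date,
                            idle_days: int = 180) -> list[str]:
    """Return data set IDs whose last access is older than `idle_days`.

    These are candidates for demotion to less expensive (offline or deep)
    storage per E.4.
    """
    cutoff = today - timedelta(days=idle_days)
    return sorted(ds for ds, accessed in last_access.items() if accessed < cutoff)
```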
F. Contributors and Users
F.1 Contributor Base

> number and diversity of contributors = increased cost
  1. Is the number of contributors known? If not, can it be estimated?
    The precise number is unknown, but it is assumed that one-third to one-half of the 700 BRAIN grant awardees generate imaging data that would be appropriate for this resource. Assuming ~200-250 contributors.
  2. Are all data originating from the same source (e.g., a single instrument or a single organization)?
    No, the data will be coming from all different laboratories and therefore different environments and instruments.
  3. How will data be transferred into the data resource (e.g., periodic large batches, more frequent smaller data sets, constantly streamed, by physical transfer)?
    Periodic large batches as specified by the data-sharing policy.
  4. Will the data be pushed by the contributor or pulled by the resource?
    Pushed by the contributor.
  5. Are there direct or indirect fees associated with acquiring the data from a source?
    No.
  6. Will a data steward be available from among the contributors to assist with any data integration into the data resource?
    Unknown, but unlikely.
Relative cost potential: High
F.2 User Base and Usage Scenarios

> access and diversity of users = increased cost
  1. How many users will likely access the data?
    Unknown, but assuming that it will be around 5,000 to 10,000 per month based on analytic data from similar resources.
  2. What will be the frequency of access?
    Difficult to estimate and likely depends on whether we get the image-analysis services running.
  3. How will users access the data?
    As these are large image data sets, most researchers will likely interact with the data using our image services and computational platform rather than downloading them. For some operations and tool sets, e.g., manual segmentation, it may be necessary to download the data or transfer them to another cloud.
  4. Will the resource be building analysis tools?
    No, beyond basic functions, the RFA states that the resource does not have to build analysis pipelines or tools.
  5. Will the resource support individual file download or bulk download?
    Download will be at the level of data sets (all relevant files) and individual files. No bulk download will be provided. However, it is anticipated that researchers will not download data but bring their compute needs to the data.
  6. Will there be any fees for downloading/accessing the data?
    If cloud provider used, users will pay for downloads and for the deployment of their algorithms.
  7. How many different types of users must be supported?
    Three different types of users are anticipated: (1) neuroscience researchers with domain expertise, (2) computational researchers with little domain expertise, and (3) citizen scientists, as per the RFA.
Relative cost potential: High
F.3 Training and Support Requirements

> training + services = increased cost
  1. Will training for resource use be offered?
    Yes.
  2. What form will the training take?
    Online tutorials, hackathons, webinars, live demos at conferences.
  3. Will a “help desk” be provided?
    Yes, as per the RFA.
  4. When does live help need to be available?
    During normal business hours.
  5. What is the expected skill level of the user base?
    As indicated in F.2., the resource will need to support a broad user base with a range of skills.
Relative cost potential: High
F.4 Outreach

> outreach = increased costs
  1. Does the existence of the repository need to be advertised?
    Yes, to attract outside users. It is assumed that BRAIN awardees will know of our existence.
  2. How many conferences per year should resource representatives attend?
    At least two.
  3. Will the resource have a booth at the conference for live demos or to conduct hands-on tutorials?
    A booth will be provided at a minimum of one conference per year for live demos.
  4. Are users required by funders or journals to deposit data in the repository?
    Yes.
Relative cost potential: Medium
G. Availability
G.1 Tolerance for Outages

< tolerance for outages = increased costs
  1. What is the tolerance for outages for the resource?
    As users from around the world are expected, the resource will be up 24/7, except for scheduled maintenance. However, it is difficult to predict the amount of usage for the resource at this time. If the resource is heavily used, the tolerance for outages will be less and we will aim for > 99 percent availability.
  2. What measures will be taken to avoid and mitigate outages?
    A fully redundant system with high fault tolerance will be implemented as the resource scales up; such redundancy would roughly double the cost of maintaining the system. However, as this level of tolerance will not be needed in the first few years, the databases and system architecture will be designed so that the system can be fully replicated in the future if the level of usage demands it.
  3. How quickly and completely does the resource need to recover from an outage?
    As researchers are required to submit their data to the archive to meet their requirements, the data will likely be used by third parties. Outages not due to regular maintenance will be kept to under 30 minutes. However, even the major commercial cloud providers can experience outages that last for several hours. Given the size of the data, replicating the full resource at multiple sites, including our local site, would be cost prohibitive. So although rare, outages that last longer may occur.
Relative cost potential: Medium
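The availability target in G.1 translates directly into a downtime budget, which is useful when costing redundancy: 1 percent of a year is about 88 hours, so a >99 percent target still tolerates far more cumulative downtime than the 30-minute cap on individual outages might suggest. The arithmetic:

```python
def downtime_budget_hours(availability: float,
                          hours_per_year: float = 8760.0) -> float:
    """Hours of allowable downtime per year at a given availability level."""
    return (1.0 - availability) * hours_per_year

# 99% availability leaves ~87.6 hours/year; 99.9% ("three nines") only ~8.76.
```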
G.2 Currency

> currency = increased cost
  1. How often will the data be released?
    Data sets will be placed under embargo as soon as they are uploaded; when the embargo period passes, they will be automatically released to the public with a DOI and the appropriate license.
  2. How soon do data need to be made available after they are received?
    Will have to be negotiated with individual users, but we expect within 1 week on average.
Relative cost potential: Medium
G.3 Response Time

> responsiveness = increased cost
  1. Are there requirements for response time for service?
    For interactive browsing of the two- and three-dimensional image data, very responsive image services needed.
  2. Are there requirements for responses from humans?
    Users should receive an automated response for any help request immediately and a human follow-up within 1 business day.
Relative cost potential: Medium
G.4 Local Versus Remote Access

> cloud could lead to increased costs
  1. Does the resource require that any data be shipped via physical media?
    Yes. Depending on the size of the data set and the bandwidth available, data may need to be shipped back and forth via physical media.
  2. Will the resource be built using commercial clouds?
    Yes, commercial clouds for storage and computation will be used.
  3. Do users have to travel to the resource to use the data?
    No. All access is through the web.
Relative cost potential: High
H. Confidentiality, Ownership, and Security
H.1 Confidentiality

> confidentiality = increased cost
  1. Will any of the data require special protections?
    No; identifiable human data will not be hosted.
  2. Will any of the data have embargo periods or embargo-related limitations that may entail costs?
    There are initial costs for implementing embargo features; ongoing costs will be minor.
  3. Are there any audit requirements for who has accessed or downloaded the data?
    No.
Relative cost potential: Low
H.2 Ownership

> ownership = increased costs
  1. If data are contributed from multiple sources, will there be a need to process multiple kinds of release forms?
    No; all data will be released under the same open license, per the requirements of the BRAIN Initiative for public data.
  2. Will all the data be released under the same license, or will different permissions be assigned to different data sets?
    All data will be released under the same license, CC BY 4.0, as required by the funder.
  3. Will data submission agreements be necessary?
    Not anticipated. Data acquired under BRAIN are required to be submitted to an archive and made available.
Relative cost potential: Low
H.3 Security

> security = increased cost
  1. What measures need to be taken to ensure the integrity and availability of the data?
    Standard practices will be used.
  2. Do these measures require using protected computing, storage, or networking platforms?
    No.
Relative cost potential: Low
I. Maintenance and Operations
I.1 Periodic Integrity Checking

> integrity checking = increased cost
  1. What processes will be put in place for checking the integrity of the hardware, software, and data?
    A hashing function will be implemented to ensure data integrity. Checksums will be used for each data upload and download.
  2. How frequently will these checks be performed?
    Every 3 to 6 months for system checks; checksums at every upload and download.
Relative cost potential: Medium
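The checksum scheme in I.1 can be implemented with any standard cryptographic hash: compute a digest when a file arrives, store it, and recompute on download to confirm the bytes are unchanged. SHA-256 is assumed here; the case study specifies only "a hashing function."

```python
import hashlib

def file_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 digest of a file, read in chunks so large images fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str, recorded_digest: str) -> bool:
    """Recompute the digest (e.g., at download) and compare to the stored one."""
    return file_checksum(path) == recorded_digest
```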
I.2 Data-Transfer Capacity

> data-transfer upgrades = increased cost
Will the bandwidth available to the resource be sufficient for the data sizes and rates required?
Campus connectivity was recently upgraded, so no internal problems are anticipated, but there is no control over the connectivity of submitters and users. See G.4.
Relative cost potential: Low
I.3 Risk Management

> risk mitigation = increased cost
  1. Will the repository be solely responsible for risk mitigation?
    As the repository of record, the resource assumes responsibility for the data; therefore, appropriate backup strategies will be implemented.
  2. Is a response plan for unexpected termination required?
    No requirements were given to have an exit plan; it is assumed that NIH would provide funding to terminate the resource and transfer the data to an archive of its choosing.
Relative cost potential: High
I.4 System-Reporting Requirements

> system-reporting requirements = increased costs
What types of system reporting will the resource be required to do?
No specific information has been requested in the RFA; monthly reports on acquisitions, total size, and amount of use will be generated for internal purposes.
Relative cost potential: Low
I.5 Billing and Collections

Will there be charges for use of the resource?
There will not be a charge for accessing data within our resource, nor for invoking services we provide. However, users will be required to bear costs associated with download and any custom computations they want to perform.
J. Standards, Regulatory, and Governance Concerns
J.1 Applicable Standards

> mature standards = decreased costs
  1. How many different standards will the resource have to support?
    Descriptive metadata standards, data standards for different types of light microscopy and electron microscopy data, ontologies or controlled vocabularies for anatomy, imaging, cellular components, gene/protein names.
  2. Do these standards exist?
    Some do.
    1. If not, is the resource expected to lead their development?
      The RFA specifies that the resource is to use relevant standards, but that it is not responsible for the creation of the standards.
    2. What is the plan for accepting data while standards are in development?
      Data will be accepted as soon as the infrastructure is ready, regardless of the state of the standards. Human curators will review all metadata to avoid common problems like cryptic abbreviations and nonstandard usage of terms.
    3. If so, are the standards mature?
      With the exception of descriptive metadata.
  3. Are the data validators and converters available for the standards, or do they have to be developed?
    No, they have to be developed as per the RFA.
  4. What is the plan for “retrofitting” data that have been uploaded without the standards in place?
    Data sets will be tagged accordingly, but unless specifically requested to do so, data will not be re-curated absent automated tools to do so.
  5. How frequently will the standards update?
    The first release of a standard will be subject to extensive revision and so the standards will not be implemented until vetted by the community. The INCF standards review and endorsement process will be helpful here.
  6. Do the standards require spatial transformations?
    Some data may be aligned to a common coordinate system to spatially align it with other data. Transformation coordinates and perhaps aligned files (depending on the volume of this type of data) will be stored.
  7. How many file formats will be supported?
    Resource will be built around open file formats for large images using the Bio-Formats recommendation. The user will ensure that their data are in the required format.
  8. Is there an open file format available?
    Yes. See previous item.
Relative cost potential: High
J.2 Regulatory and Legislative Environment

> regulation = increased cost
  1. What laws and regulations cover the data and operation of the resource?
    The resource is expected to have a large user base in Europe, so the resource will be General Data Protection Regulation compliant. The website will meet accessibility requirements of the Americans with Disabilities Act and the institution. The resource will not include human-subjects data.
  2. Is the resource covered by an open-records act?
    No.
Relative cost potential: Low
J.3 Governance

> outside governance = increased costs
  1. Does the resource need to maintain an external advisory board (EAB)?
    There is no requirement for an EAB.
  2. Does the resource set policy for itself, or is it part of a larger organization?
    Subject to BRAIN Initiative polices; otherwise, policy set by the resource.
Relative cost potential: Low
J.4 External Consultation

> consultations = increased time = increased costs
  1. Will external stakeholders be consulted for initial design?
    Yes, outreach ahead of designing our website and services will be conducted.
  2. Will external stakeholders be consulted on an ongoing basis?
    Yes, agile user testing for all new features will be employed.
Relative cost potential: Medium

a The website for Euro Bioimaging is https://www.eurobioimaging.eu/, accessed January 11, 2020.

Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 78
Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 79
Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 80
Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 81
Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 82
Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 83
Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 84
Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 85
Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 86
Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 87
Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 88
Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 89
Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 90
Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 91
Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 92
Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 93
Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 94
Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 95
Suggested Citation:"5 Applying the Framework to a New State 2 Data Resource." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.
×
Page 96
Biomedical research results in the collection and storage of increasingly large and complex data sets. Preserving those data so that they are discoverable, accessible, and interpretable accelerates scientific discovery and improves health outcomes, but requires that researchers, data curators, and data archivists consider the long-term disposition of data and the costs of preserving, archiving, and promoting access to them.

Life-Cycle Decisions for Biomedical Data examines and assesses approaches and considerations for forecasting the costs of preserving, archiving, and promoting access to biomedical research data. The report provides a comprehensive conceptual framework for cost-effective decision making that encourages data accessibility and reuse. It is intended for researchers, data managers, data archivists, data scientists, and institutions that support platforms enabling biomedical research data preservation, discoverability, and use.
