National Academies Press: OpenBook

Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs (2020)

Chapter: Appendix E: Template to Map Cost Drivers to Data Resource Properties

Suggested Citation:"Appendix E: Template to Map Cost Drivers to Data Resource Properties." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.

E

Template to Map Cost Drivers to Data Resource Properties

Table E.1 is a template that compiles, in a single place, all the questions about the cost drivers described in Chapter 4. The terms used in the template are defined in Chapter 4.

TABLE E.1 Cost-Driver Template

Columns: Category; Cost Driver; Decision Points/Issues; Relative Cost Potential (Low, Medium, High)
A. Content
A.1 Size (volume and number of items)
  1. How many files will be in a single data submission?
  2. How large is an average data submission in total?
  3. Are the data sizes likely to stay stable over the life of the resource?
  4. What is the total amount of data expected?
  5. In what kind of medium will data be captured in the short and long terms?
A.2 Complexity and Diversity of Data Types
  1. How complex is the underlying structure of the data?
  2. How are the included data to be organized?
  3. How complex is the experimental paradigm that produced the data?
  4. What sort of additional files might be necessary to upload with the data to properly understand them?
  5. How many different data types are being produced?
  6. What are the relationships among these data types (e.g., are the data correlated)?
A.3 Metadata Requirements
  1. How much metadata must be stored with each data object to make them findable, accessible, interoperable, and reusable (FAIR)?
  2. Will metadata be entered manually by the submitter/curator?
  3. Will the data to be deposited include a data schema, or will one be generated?
  4. Is the provenance of a data set sufficiently described, or will it need to be?
  5. How much metadata can be extracted computationally?
A.4 Depth Versus Breadth
  1. Will the repository be restricted to certain data classes or types that the repository must support?
A.5 Processing Level and Fidelity
  1. Do the raw data need to be stored?
  2. Do processed data need to be stored?
  3. Are there compression algorithms that can reduce the file size without compromising fidelity?
  4. What kind of data structure requirements will the resource have?
  5. Is the data contributor or the repository responsible for any restructuring necessary?
  6. How is the data structure verified?
A.6 Replaceability of Data
  1. Is the archive the primary steward of the data, or do copies exist elsewhere?
  2. Can the data be easily recreated?
B. Capabilities
B.1 User Annotation
  1. Will the repository have to provide user annotation capabilities?
  2. What is the nature of these annotations?
  3. Are they provided by humans or machines, and how will they be authenticated?
  4. Are permissions required to annotate the data?
B.2 Persistent Identifiers
  1. What persistent identifier (PID) scheme will be used by the archive?
  2. Is there a cost associated with using the PID?
  3. How many objects need to be identified?
  4. Who will be responsible for keeping the PIDs resolvable?
B.3 Citation
  1. Will users be able to create arbitrary subsets of data files and mint a PID for citation?
  2. Will the repository provide machine-readable metadata for supporting data citation?
  3. Will the repository provide export of data citations for use in reference managers?
B.4 Search Capabilities
  1. Will the repository provide a search capability for data sets?
  2. How much of the metadata will be included in search?
  3. How complex are the queries that will be supported?
  4. What types of features for search will be provided?
  5. Will the repository deploy services to search the data directly?
B.5 Data Linking and Merging
  1. Will the data require/benefit from linkages to other related items?
  2. Will the resource provide the ability to combine data across records based on common entities/standards?
B.6 Use Tracking
  1. Will the resource provide the ability to track uploads, views, and downloads?
  2. If so, and if made available to users, how will this information be made available?
  3. Will the resource track data citations to its data?
B.7 Data Analysis and Visualization
  1. What types of data analyses and visualizations will the repository support?
  2. What types of other data operations will the repository support (e.g., file conversions, sequence comparison)?
  3. Do these services require significant computational resources?
  4. Who will pay for computational resources?
C. Control
C.1 Content Control
  1. Will all appropriate data be accepted or will there be a review process?
  2. Will the review process be automated or will it require human oversight?
C.2 Quality Control
  1. What quality control process will the repository support?
  2. Will these be automated or require human oversight?
  3. What level of data correctness will be required, and how will it be validated?
  4. What gaps in the data at the record or field level will be tolerable?
  5. Will any of the data be time sensitive, and how will data currency be ensured?
  6. How will duplication within or between data sets be addressed?
  7. Will prevalidation guidelines or routines be distributed by the resource to the data contributors?
  8. Will human curation be necessary?
C.3 Access Control
  1. What types of access control are required for the repository (e.g., will there be an embargo period)?
  2. At what level are they instituted (e.g., individual users, individual data sets)?
  3. Does use of the data require approval by a data access committee?
C.4 Platform Control
  1. Are there restrictions on the type of platform that may or must be used?
D. External Context
D.1 Resource Replication
  1. Is there a requirement to replicate the information resource at multiple sites (i.e., mirroring)?
D.2 External Information Dependencies
  1. Will the resource be dependent on information maintained by an outside source?
D.3 Distinctiveness
  1. Are there existing resources available that provide similar types of data and services?
E. Data Life Cycle
E.1 Anticipated Growth
  1. Is the repository expected to continuously grow over its lifetime?
  2. Is the likely rate of growth in data and services known?
  3. Is the use of the repository likely to grow over time?
  4. Is the likely growth of the user base known?
E.2 Update and Versions
  1. Will the deposited data require updates (e.g., in response to new data or error corrections)?
  2. Will prior versions of the data need to be retained and made available locally or in a different resource?
  3. How frequently will individual data sets be updated?
E.3 Useful Lifetime
  1. Are the data to be housed likely to have a limited period of usefulness?
  2. Does the resource have a defined period of time for which it will operate?
  3. Does the resource have to provide a guarantee that the data will be available for a finite period of time (e.g., 10 years)?
E.4 Offline and Deep Storage
  1. Can the resource take advantage of offline storage for data that are not heavily used?
  2. Does the resource have a plan for moving unused data to deep storage (i.e., State 3)?
F. Contributors and Users
F.1 Contributor Base
  1. Is the number of contributors known? If not, can it be estimated?
  2. Are all the data originating from the same source (e.g., a single instrument or a single organization)?
  3. How will data be transferred into the data resource (e.g., periodic large batches, more frequent smaller data sets, constantly streamed, by physical transfer)?
  4. Will the data be pushed by the contributor or pulled by the resource?
  5. Are there direct or indirect fees associated with acquiring the data from a source?
  6. Will a data steward be available from among the contributors to assist with any data integration into the data resource?
F.2 User Base and Usage Scenarios
  1. How many users will likely access the data?
  2. What will be the frequency of access?
  3. How will users access the data?
  4. Will the resource be building analysis tools?
  5. Will the resource support individual file download or bulk download?
  6. Will there be any fees for downloading/accessing the data?
  7. How many different types of users must be supported?
F.3 Training and Support Requirements
  1. Will training for resource use be offered?
  2. What form will the training take?
  3. Will a “help desk” be provided?
  4. When does live help need to be available?
  5. What is the expected skill level of the user base?
F.4 Outreach
  1. Does the existence of the repository need to be advertised?
  2. How many conferences per year should resource representatives attend?
  3. Will the resource have a booth at the conference for live demos or conduct hands-on tutorials?
  4. Are users required by funders or journals to deposit data in the repository?
G. Availability
G.1 Tolerance for Outages
  1. What is the tolerance for outages of the resource?
  2. What measures will be taken to avoid and mitigate outages?
  3. How quickly and completely does the resource need to recover from an outage?
G.2 Currency
  1. How often will the data be released?
  2. How soon do data need to be made available after they are received?
G.3 Response Time
  1. Are there requirements for response time for service?
  2. Are there requirements for responses from humans?
G.4 Local Versus Remote Access
  1. Does the resource require that any data be shipped via physical media?
  2. Will the resource be built using commercial clouds?
  3. Do users have to travel to the resource to use the data?
H. Confidentiality, Ownership, and Security
H.1 Confidentiality
  1. Will any of the data require special protections?
  2. Will any of the data have embargo periods or embargo-related limitations that may entail costs?
  3. Are there any audit requirements for who has accessed or downloaded the data?
H.2 Ownership
  1. If data are contributed from multiple sources, will there be a need to process multiple kinds of release forms?
  2. Will all the data be released by the data resource under the same license, or will different permissions be assigned to different data sets?
  3. Will data submission agreements be necessary?
H.3 Security
  1. What measures need to be taken to ensure the integrity and availability of the data?
  2. Do these measures require using protected computing, storage, or networking platforms?
I. Maintenance and Operations
I.1 Periodic Integrity Checking
  1. What processes will be put in place for checking the integrity of the hardware, software, and data?
  2. How frequently will these checks be performed?
I.2 Data-Transfer Capacity
  1. Will the bandwidth available to the resource be sufficient for the data sizes and rates required?
I.3 Risk Management
  1. Will the repository be solely responsible for risk mitigation?
  2. Is a response plan for unexpected termination required?
I.4 System-Reporting Requirements
  1. What types of system reporting will the resource be required to do?
I.5 Billing and Collections
  1. Will there be charges for use of the resource?
J. Standards, Regulatory, and Governance Concerns
J.1 Applicable Standards
  1. How many different standards will the resource have to support?
  2. Do these standards exist?
    1. If not, is the resource expected to lead their development?
    2. What is the plan for accepting data while standards are in development?
    3. If so, are the standards mature (i.e., how much are they expected to evolve)?
  3. Are the data validators and converters available for the standards, or do they have to be developed?
  4. What is the plan for “retrofitting” data that have been uploaded without the standards in place?
  5. How frequently will the standards update?
  6. Do the standards require spatial transformations (e.g., will they need to be aligned to a common coordinate system)?
  7. How many file formats will be supported?
  8. Is there an open file format available?
J.2 Regulatory and Legislative Environment
  1. What laws and regulations cover the data and operation of the resource?
  2. Is the resource covered by an open-records act?
J.3 Governance
  1. Does the resource need to maintain an external advisory board?
  2. Does the resource set policy for itself, or is it part of a larger organization?
J.4 External Consultation
  1. Will external stakeholders be consulted for initial design?
  2. Will external stakeholders be consulted on an ongoing basis?
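
One way a resource team might record its answers is to hold each cost driver and its rating in a small data structure and then summarize the ratings by category. The sketch below is illustrative only and is not part of the report: the class and function names (CostPotential, CostDriver, tally_by_category, unrated) are hypothetical, and the ratings assigned in the example are made up to show the mechanics.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class CostPotential(Enum):
    """The three rating levels named in the table's last column."""
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"


@dataclass
class CostDriver:
    """One row of the template (hypothetical representation)."""
    code: str                                  # e.g., "A.1"
    name: str                                  # e.g., "Replaceability of Data"
    questions: List[str] = field(default_factory=list)
    rating: Optional[CostPotential] = None     # None until the team assesses it


def tally_by_category(drivers: List[CostDriver]) -> Dict[str, Counter]:
    """Count Low/Medium/High ratings per category letter; skip unrated drivers."""
    tallies: Dict[str, Counter] = {}
    for driver in drivers:
        if driver.rating is None:
            continue
        category = driver.code.split(".")[0]   # "A.1" -> "A"
        tallies.setdefault(category, Counter())[driver.rating.value] += 1
    return tallies


def unrated(drivers: List[CostDriver]) -> List[str]:
    """Return the codes of drivers that still need a rating."""
    return [d.code for d in drivers if d.rating is None]


# Example ratings are invented for illustration; they are not from the report.
drivers = [
    CostDriver("A.1", "Size (volume and number of items)",
               ["What is the total amount of data expected?"],
               CostPotential.HIGH),
    CostDriver("A.6", "Replaceability of Data",
               ["Can the data be easily recreated?"],
               CostPotential.LOW),
    CostDriver("B.2", "Persistent Identifiers",
               ["How many objects need to be identified?"],
               CostPotential.MEDIUM),
    CostDriver("I.1", "Periodic Integrity Checking",
               ["How frequently will these checks be performed?"]),
]

for category, counts in sorted(tally_by_category(drivers).items()):
    print(category, dict(counts))
print("Still to assess:", unrated(drivers))
```

Keeping the rating optional makes it easy to see which drivers have not yet been assessed, which may be useful when the template is filled in over several planning sessions.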
Biomedical research results in the collection and storage of increasingly large and complex data sets. Preserving those data so that they are discoverable, accessible, and interpretable accelerates scientific discovery and improves health outcomes, but requires that researchers, data curators, and data archivists consider the long-term disposition of data and the costs of preserving, archiving, and promoting access to them.

Life-Cycle Decisions for Biomedical Data examines and assesses approaches and considerations for forecasting costs for preserving, archiving, and promoting access to biomedical research data. This report provides a comprehensive conceptual framework for cost-effective decision making that encourages data accessibility and reuse for researchers, data managers, data archivists, data scientists, and institutions that support platforms that enable biomedical research data preservation, discoverability, and use.