Suggested Citation:"6 Applying the Framework to a New Data Set." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.

6

Applying the Framework to a New Data Set

Per the statement of task, the cost-forecasting framework was applied to a second scenario, in this case, to the development of a new data set in a State 1 (primary research) platform.

USE CASE 2: ESTIMATING COSTS ASSOCIATED WITH A PRIMARY RESEARCH DATA SET

The cost-forecasting framework is applied to a proposed State 1 (primary research) data platform. The study committee applied the framework as might a young investigator (see Box 6.1). Box 6.2 demonstrates the logic introduced by the forecaster who, although enthusiastic, might be less experienced and unaware of available resources.

Applying the Framework to Use Case 2

Using the forecasting steps provided in Table 4.1, the forecaster (in this case, the researcher) begins to construct the cost forecast.

Step 1. Determine the type of data resource environment, its data state(s), and how data might transition between those states during the data life cycle.

The forecaster examines the request for application (RFA) for requirements related to data management. Comparing the RFA requirements with the descriptions of the data states in Chapter 2, the forecaster determines this will be a State 1 (primary research) platform for her laboratory’s use. However, the forecaster also plans to transfer the data to a State 2 active repository. Funding for transfer activities between platforms will also be considered.

Step 2. Identify the characteristics of the data (Chapter 4), data contributors, and users.

In light of needs, goals, and RFA requirements, the following preliminary assumptions about the data are made; they will be refined throughout the conduct of the cost forecast.

BOX 6.1
The Use Case 2 Forecaster

A young investigator is exploring functional magnetic resonance imaging (fMRI) as a measuring technique for determining anatomical correlates for cognitive decline in patients diagnosed with Alzheimer’s disease. The investigator is deciding whether to use conventional methods or to try new multiband imaging techniques in development. The researcher/forecaster explores several funding options, finding that the National Institute on Aging (NIA) has announced funding. NIA requires that all genomic data on Alzheimer’s disease be deposited in one of its databases, but she is not sure whether the policies cover her neuroimaging data. She is aware, however, that changes to the National Institutes of Health (NIH) data management and sharing policies are coming. If implemented, details regarding data stewardship and how and when data will be made available will need to be provided.

BOX 6.2
A Demonstration of Information Gathering to Inform Use Case 2

A cost forecaster, in this case a researcher in a primary research environment, gathers information prior to preparing a proposal (see Box 6.1). She first tries to identify the state of the art in fMRI in Alzheimer’s disease through the literature but wonders if there are data sets in the public domain that might be used for exploratory work. Not knowing where to look, she conducts an online search using the keywords “Alzheimer’s disease + data + fMRI”, which returns mostly articles. She has heard of and searches the Alzheimer’s Disease Neuroimaging Initiative (ADNI),^a but it does not have fMRI data. OpenNeuro has no publicly available fMRI data. She is not aware of the NeuroImaging Tools and Resources Collaboratory (NITRC)^b or the Neuroscience Information Framework (NIF),^c two NIH Blueprint for Neuroscience Research^d initiatives that could point toward potential resources (e.g., NeuroVault,^e the Open Access Series of Imaging Studies (OASIS),^f or the 1000 Functional Connectomes Project).^g A Google data set search lists only epidemiology studies. The Human Connectome Project^h has a resting state fMRI longitudinal study under way, the Alzheimer’s Disease Connectome Project, but the data will not be available for several months.ⁱ She reads Alzheimer’s studies in the literature and hopes that a few made data available or referenced public data sets. She concludes that no suitable public data are available. Had the researcher/forecaster had some way to know about available resources, she might have found data sets to inform her research direction, strategies, and her data management plan. This knowledge might have yielded cost savings and informed her, for example, of existing data sets that might have been aggregated or even that her study might not need to be conducted at all.

When considering how to share her neuroimaging data when the study is complete, the forecaster/researcher wonders if the NIH pilot program with figshare^j is an option. She consults a data librarian at her institution for help designing her data management plan. The librarian searches the list of repositories on the National Library of Medicine’s website^k and sees that the OpenfMRI database (now OpenNeuro) takes fMRI data. Figshare has no specific requirements for formats or metadata, while OpenNeuro requires her data to be in Brain Imaging Data Structure (BIDS) format.^l In either case, she will have to deidentify her data to publish them. The data librarian notes that while it may be easier and perhaps less costly in the short term to publish in figshare, data shared through a more specific repository such as OpenNeuro are aggregated with similar data and generally formatted to a common standard, likely giving the data greater value and visibility. Further, domain-specific repositories tend to have supporting software for upload and analysis. In fact, the researcher finds that the BIDS community is building a series of applications that significantly lower the cost of use.

The researcher/forecaster decides to deposit data in OpenNeuro and prepares her data management accordingly. OpenNeuro runs a validator that ensures format compliance, so it will be the researcher’s responsibility to ensure data conform to the BIDS format. She may or may not be aware that there is a metadata specification designed for neuroimaging (Maumet et al., 2016) that will help guide the description of data and that harmonizes specific variables with other data. Although not required by OpenNeuro, the researcher/forecaster is aware that rich metadata are critical for data reuse, not only by others but in her own laboratory as well.

When data are submitted to a State 2 repository, data stewardship is transferred to the repository. The researcher can continue to benefit from the data, although they have been processed to comply with the Health Insurance Portability and Accountability Act (HIPAA) requirements. The researcher must decide whether to preserve the original data for the long term and what to do with other unpublished data related to the study. Decisions must also be made regarding how preserved data will be stored long term (i.e., by herself, or through institutional or commercial cloud resources). After consulting with a data librarian and her information technology (IT) department, the researcher decides against storing preserved data long term herself because institutional or cloud services can ensure the data are backed up appropriately and migrated to new platforms as necessary. Her institution has contracts with a cloud provider that provides generous storage allowances, but the cloud is not HIPAA-compliant. She therefore contracts with institutional IT services for long-term data management. The data librarian provides the researcher with a metadata template to ensure that the data can be retrieved reliably and that critical information about privacy and data ownership is documented.

__________________

^a The website for the ADNI is http://adni.loni.usc.edu/, accessed April 15, 2020.

^b The website for NITRC is https://www.nitrc.org/, accessed April 15, 2020.

^c The website for NIF is https://neuinfo.org/, accessed April 15, 2020.

^d The website for the NIH Blueprint for Neuroscience Research is https://neuroscienceblueprint.nih.gov/, accessed April 15, 2020.

^e The website for NeuroVault is https://neurovault.org/, accessed April 15, 2020.

^f The website for OASIS is https://www.oasis-brains.org/, accessed April 15, 2020.

^g The website for the 1000 Functional Connectomes Project is https://www.nitrc.org/projects/fcon_1000/, accessed April 15, 2020.

^h The website for The Human Connectome Project is http://www.humanconnectomeproject.org/, accessed April 15, 2020.

ⁱ The website for The Alzheimer’s Disease Connectome Project is https://humanconnectome.org/study/alzheimers-disease-connectome-project, accessed April 15, 2020.

^j The website for Figshare is https://datascience.nih.gov/news/nih-funded-researchers-invited-use-nih-figshare#:~:targetText=NIH%20Figshare%20is%20a%20one,.com%2Ff%2Ffaq%20, accessed April 15, 2020.

^k The website for the National Library of Medicine is https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html, accessed April 15, 2020.

^l The website for the BIDS is https://bids.neuroimaging.io/, accessed April 15, 2020.

Data Characteristics (Section A, Appendix E)

The data are moderate in size: gigabytes (GB) per individual data set (several mature packages currently support fMRI).
The number of files and the size of individual files are moderate.
Sizes of data sets will be stable over the life of the project.
There are multiple neuroimaging modalities.
The data are complex.
There are significant metadata requirements.
Data will come from a single contributor.

Data acquisition costs can be estimated because the number of subjects will be known ahead of time through institutional approvals. If the researcher decides to use a newer technology (e.g., multiband imaging), data sizes will increase fourfold to fivefold and the computational methods for processing and analyzing the data are less well known. In that case, the raw k-space data¹ will be kept available for reprocessing as new algorithms and approaches emerge.
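
The size assumptions above lend themselves to a quick back-of-the-envelope calculation. The following minimal Python sketch is illustrative only. The per-subject size (about 10 GB) and the total (about 400 GB) come from Table 6.2; the cohort of 40 subjects is inferred from those two figures and is not stated in the report, and the function name is hypothetical. The sketch does not include any extra volume from retaining raw k-space data, which the report does not quantify.

```python
def estimate_total_gb(n_subjects, gb_per_subject, size_factor=1.0):
    """Return the expected data volume in GB for a cohort of known size.

    size_factor scales the per-subject size, e.g. 4 to 5 for the
    fourfold-to-fivefold increase expected with multiband imaging.
    """
    if n_subjects < 0 or gb_per_subject < 0 or size_factor <= 0:
        raise ValueError("counts and sizes must be non-negative, factor positive")
    return n_subjects * gb_per_subject * size_factor


# Assumption: 40 subjects, inferred from ~400 GB total / ~10 GB per subject.
N_SUBJECTS = 40
GB_PER_SUBJECT = 10

conventional = estimate_total_gb(N_SUBJECTS, GB_PER_SUBJECT)
multiband_low = estimate_total_gb(N_SUBJECTS, GB_PER_SUBJECT, size_factor=4)
multiband_high = estimate_total_gb(N_SUBJECTS, GB_PER_SUBJECT, size_factor=5)

print(f"Conventional fMRI: about {conventional:.0f} GB")
print(f"Multiband imaging: about {multiband_low:.0f} to {multiband_high:.0f} GB")
```

Under these assumptions, choosing multiband imaging moves the project from hundreds of gigabytes toward the low terabyte range, which is the kind of shift a forecaster would want to flag early.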

As the forecaster, at this point, is only estimating the costs for her own use of the data, she skips the questions regarding the user community (Section F, Appendix E) but does keep in mind that the data may be of value to others in the future.

Step 3. Identify the current and potential value of the data and how the data value might be maintained or increased with time.

Perceived value is difficult to predict. However, all data sets underlying the results of a study will be made public so that the data can be inspected and reanalyzed. The availability of public data sets may also encourage technology development if she chooses to use more advanced techniques. As outlined in Box 6.2, if the data are well annotated and prepared according to community standards, they might be an important source of information and data for designing future studies.

Step 4. Identify the personnel and infrastructure likely necessary in the short and long terms.

Based on consideration of State 1 (primary research) and activities necessary to prepare data for State 2 (active) as described in Tables 2.1 and 2.2, respectively, the forecaster identifies the relevant major activities. The project objectives, informed by the RFA, the relevant activities, and personnel necessary (based on Table 2.1) are listed in Table 6.1.

Step 5. Identify the major cost drivers associated with each activity based on the steps above, including how decisions might affect future data use and its cost.

Table 4.2 is consulted to identify the likely important cost drivers for a State 1 resource, and the cost-driver template in Appendix E is also filled in (see Table 6.2, shown after the discussion of the use case). In this application of the framework, not all of the guiding questions in Chapter 4 and in the cost-driver template are applicable, so the forecaster revises the template to help delineate costs and decision points as completely as possible.

The relative costs related to data acquisition for this use case are straightforward to predict using the cost-forecasting framework. Relative costs associated with cost drivers identified in Table 4.2 are provided below based on the assessment made while filling out Table 6.2. In a real-world application of the cost-forecasting framework, these costs would be quantified with the help of State 2 (active) repository resources.

___________________

¹ K-space data are arrays of numbers that represent different spatial frequencies of the image.

A: Content → Likely low-medium
B: Capabilities → Likely low
C: Control → Likely medium
D: External Context → Likely low
E: Data Life Cycle → Likely low-medium
F: Contributors and Users → Likely low-medium
G: Availability → Likely low-medium
H: Confidentiality, etc. → Likely medium
I: Maintenance and Operations → Likely low
J: Standards, etc. → Likely medium-high

TABLE 6.1 Map of the Use Case 2 Scenario to Data States, Activities, and Subactivities

Project Objectives and Tasks	States, Activities, and Subactivities^a	Personnel
Review of the literature and publicly available resources leads to a proposal to assess the feasibility of fMRI measurement techniques for this purpose.	I.B.1	Researcher, data scientist, software engineer, research domain project manager, policy specialist, administrative staff
Consider various funding sources and determine that potential funders expect collected data to be publicly shared.	I.A.1, I.B.2	Researcher, records management specialist, data scientist, data librarian, education specialist, policy specialist, software engineer, research domain project manager, administrative staff
Assess suitability of existing repositories for the ultimate data deposit. Outline in the data management plan the management and sharing approaches and cost estimates while data are under her stewardship. Describe consent methods for sharing data.	I.B.3, I.B.4, I.B.5	Researcher, data scientist, software engineer, research domain project manager, policy specialist, administrative staff
Consider available tools for collecting, processing, and validating data using community-accepted standards. Consider documentation and curation levels required.	I.A.2, I.A.3, I.C	Researcher, records management specialist, data scientist, data librarian, metadata librarian, education specialist, policy specialist, research domain project manager, research domain curator, software engineer
Data management processes are in place that maintain primary and derived data (given evolving technologies). Derived data may include data in deidentified form.	I.C.3	Researcher, metadata librarian, data scientist, research domain project manager, research domain curator, software engineer
Deposit data in chosen repository on a regular schedule or when all data collection and analysis are complete.	I.D	Researcher, research domain project manager, IT project manager, software engineer, data wrangler

^a The activity numerals correspond with labels in columns of Table 2.1.

TABLE 6.2 Decision Points for Use Case 2

Category	Cost Driver	Decision Points/Issues	Relative Cost Potential (Low, Medium, High)
A. Content
A.1	Size (volume and number of items) > size = higher costs	What is the order of magnitude of data that will be produced? GB. How large is an average data set? Per subject ~ 10 GB (multiple scans over time). Are the data sizes likely to stay stable over the life of the project? Yes. What is the total amount of data expected? ~400 GB. How many individual files in a typical data set? Hundreds. If the data are to be transferred to a repository for long-term management, is there a cost depending on size? No. Data will be submitted to OpenNeuro, which currently does not have costs associated with these data. Are there publicly available data that can be used to augment these data or perform preliminary analyses? No relevant data were found.	L-M
A.2	Complexity and Diversity of Data Types > complexity + diversity = higher cost	How complex is the underlying structure of the data? Complex-image data. How complex is the experimental paradigm that produced the data? Standard fMRI block design. What sort of additional data are acquired along with the primary data? Cognitive assessments, statistical maps, demographic data. How many different data types are being produced? Multiple modalities. What are the relationships among these data types—for example, are the data correlated? Not applicable.	M
A.3	Metadata Requirements > metadata amounts + type = higher cost	How much metadata must be stored with the data to make them findable, accessible, interoperable, and reusable? Basic descriptive metadata, imaging parameters, experimental metadata, processing metadata, anatomical metadata. How are metadata recorded? In data file headers, in Neuro Imaging Data Model (NIDM), in laboratory notebooks, in BIDS manifests.	M
A.4	Depth Versus Breadth > breadth = higher cost	Is this study part of a multicenter study? No. How many institutions/collaborators are involved? Not applicable.	L

A.5	Processing Level and Fidelity > compression = lower cost	Do the raw data need to be stored? K-space data not stored for standard fMRI. Will likely store k-space data if multiband imaging used. Do processed data need to be stored? Yes. Analyses are performed on the reconstructed data. Are there compression algorithms that can reduce the file size without compromising fidelity? Data files are not that large, so compression not typically used. What kind of data structure requirements will the resource have? No particular structure enforced by imaging center. Data submitted to OpenNeuro must be organized according to the BIDS standard. Is the data contributor or the repository responsible for any restructuring necessary? Researcher is responsible for restructuring data transferred to OpenNeuro. How is the data structure verified? With the BIDS validator; we will likely implement it within our imaging pipeline.	H
A.6	Replaceability of Data > replaceability = lower cost	Are there existing data sets that might be used instead of gathering primary data? Not to our knowledge. Are the data managed by an institutional repository? Our imaging center provides primary storage. Are there copies of the data elsewhere? Local copy of data kept on a workstation in laboratory. Can the data be easily recreated? No. It would be expensive to retest subjects. Disease progression information would be lost.	L
B. Capabilities
B.1	User Annotation > user annotation functions = higher cost	How long does it take to annotate/segment a data set? Processing does not take very long. Is the process largely manual or automated? Analysis data annotation is fully automated; experimental and descriptive metadata is added manually. Are these annotations stored with the data? They are in a separate file. Is the relationship (provenance) between the data file and the annotations recorded in the metadata? No, the association is captured through file-naming conventions.	L
B.2	Persistent Identifiers > type of identifier = potential costs	What persistent identifiers are used when annotating these data (e.g., Open Researcher and Contributor Identifiers, Ontology IDs)? None. How are these persistent identifiers accessed? Not applicable.	L
B.3	Citation > citation functions = increased cost	Are the contributors to the production of a data set recorded in the metadata? No. Is there a plan to submit the data to a repository that supports data citation? Yes.	L

B.4	Search Capabilities > advanced search may lead to decreased cost	Does the platform where the data are stored provide any search functions? Just the native functions of the storage system (search on file name, creation date, owner, etc.). Was a search performed to locate data sets that might be relevant to this study? Yes. What tools were used? OpenNeuro; PubMed.	L
B.7	Data Analysis and Visualization > services = higher cost	What type of data visualization tools are required? Interactive viewing of images and 3D volumes; visualization of statistical maps. Freely available open-source tools used. What types of other data operations need to be supported? Processing pipelines for the data; signal-extraction tools. Do these services require significant computational resources? Moderate. Is there an explicit cost associated with compute resources? Basic compute time is included with the fee paid to imaging center; many operations run locally on workstation.	L
C. Control
C.2	Quality Control > quality control = increased cost	What quality control processes are used? Some automated and manual inspection of the data for issues such as motion artifacts. Does the public data repository have any quality control requirements? OpenNeuro requires the data to be in BIDS format, so BIDS validator run.	L
C.3	Access Control > controls = increased cost	What types of access control are required for the data? Human-subjects data—institutional requirements for handling human-subjects data followed. Only qualified laboratory personnel can access the data. How is access to data managed, e.g., data access committees? The principal investigator is responsible for managing access to the data.	L
C.4	Platform Control > platform restrictions = increased cost	Are there restrictions on the type of platform that must be used for storing or analyzing the data? Yes. Data infrastructure must adhere to our institution’s security requirements for storing human-subjects data.	M
D. External Context
D.1	Resource Replication > replication = increased cost	Is there a requirement to replicate the information resource at multiple sites (i.e., mirroring)? The imaging center backs up primary data to a local private cloud. Costs associated with replication are included in our fee to the imaging center.	L
D.2	External Information Dependencies > external dependencies may or may not = increased cost	Will the resource be dependent on information maintained by an outside source? No.	L

E. Data Life Cycle
E.1	Anticipated Growth > growth = increased costs	Is the total amount of data to be generated over the course of the project known? Yes. Are there any factors that might affect the amount of data? Not likely. The possibility that the techniques used could increase data sizes has been accounted for; approval has been obtained to collect data from a specified number of subjects, and the processing pipelines are well established.	L
E.2	Update and Versions > updates + multiple versions = increased cost	Are multiple versions of the data created? Yes, sometimes we have to reprocess individual subjects. If so, how are they managed locally? Through the file names.	M
E.3	Useful Lifetime > limited lifetime = decreased cost	Are the data likely to have a limited period of usefulness? Hard to predict; it will depend on the rate at which imaging technology evolves and whether new processing approaches are developed to compare our data to data collected by new instruments. Are there specific data retention institutional or regulatory requirements for these data? Copies of all study data generally kept for at least 5 years after the study is completed.	L
E.4	Offline and Deep Storage > offline/deep storage = decreased costs > transfers = increased cost	For long-term storage of laboratory data, are there offline/deep storage resources available? Yes, the institution runs a data archive for faculty research. Is there a plan for migrating laboratory data to a State 3 archive for long-term preservation? Yes, data will be placed in the institutional archive after the study is completed.	M
F. Contributors and Users
F.1	Contributor Base > number and diversity of contributors = increased cost	Is the number of contributors known? If not, can it be estimated? Just our laboratory members. Are all the data originating from the same source (e.g., a single instrument or a single organization)? Yes.	L
F.2	User Base and Usage Scenarios > access and diversity of users = increased cost	How many users will likely access the data? Laboratory members (currently six). What will be the frequency of access? Data accessed daily during the study and processing phase. How will users access the data? Necessary compute infrastructure is available—the data will be on local machines. Will the resource be building analysis tools? Yes, customized pipelines for processing our data, based on open-source toolkits, are built. How many different types of users must be supported? Not applicable.	L

F.3	Training and Support Requirements > training + services = increased cost	Is special training required for data upload to the repository? Yes. What form will the training take? Online tutorials and workshops. How long will this training take? We will attend a training workshop on BIDS. What is the skill level required for data wrangling? Moderate knowledge of neuroimaging and computer skills.	M
G. Availability
G.1	Tolerance for Outages > reliability = increased costs	What is the tolerance for outages of the resource? Access to the data reliably is necessary. Will maintain adequate backups and system performance; scheduled outages for system patches and upgrades are tolerable.	M
G.4	Local Versus Remote Access > cloud could lead to increased costs	Does the resource require that any data be shipped via physical media? No, that is not likely. We have adequate bandwidth to transmit our data where required. Will commercial clouds be used? No, not for primary storage.	L
H. Confidentiality, Ownership, and Security
H.1	Confidentiality > confidentiality = increased cost	Will any of the data require special protections? Yes, they are human-subjects data. Are there any audit requirements for those who have accessed or downloaded the data? No, we expect no users outside of laboratory staff.	M
H.2	Ownership > ownership = increased costs	Do rights to use the data have to be negotiated with collaborators, institutions, commercial entities, or funders? No. Will all data be released under the same license, or will different permissions be assigned to different data sets? Data will be released under the license used by OpenNeuro. Will data submission agreements be necessary? No.	L
H.3	Security > security = increased cost	What types of security measures must be taken to protect against loss or corruption of data? Standard practices will be used. Do these measures require using protected computing, storage, or networking platforms? Yes.	L
I. Maintenance and Operations
I.1	Periodic Integrity Checking > integrity checking = increased cost	What processes will be put in place for checking the integrity of the hardware, software, and data? We do not have any specific processes for this. How frequently will these checks be performed? Not applicable.	L

I.2	Data-Transfer Capacity > data-transfer upgrades = increased cost	Will the bandwidth available be sufficient for the data sizes and rates required for transfer/access? Yes. Campus connectivity recently upgraded. No problems anticipated.	L
I.3	Risk Management > risk mitigation = increased cost	Will the researcher be solely responsible for risk mitigation? Yes. Is a response plan for unexpected termination required? No.	H
I.4	System-Reporting Requirement > system reporting-requirements = increased costs	What types of system reporting will the resource be required to do? None.	L
I.5	Billing and Collections	Will there be charges for use of the resource? No. All laboratory members have free access.
J. Standards, Regulatory, and Governance Concerns
J.1	Applicable Standards > mature standards = decreased costs	How many different standards will be needed for the data? Will use BIDS and NIDM along with standard registration tools to a common coordinate space. Do these standards exist? Yes. Has the researcher worked with the standards before? Yes. Are the standards mature? Yes. Are tools (e.g., data validators and converters) available for the standards, or do they have to be developed? Yes. How frequently will the standards update? BIDS is a fairly mature standard. It is currently on version 1.2.1. Do the standards require spatial transformations? Yes. How many file formats will be supported? Digital Imaging and Communications in Medicine used. Is there an open file format available? Yes. Neuroimaging Informatics Technology Initiative.	H
J.2	Regulatory and Legislative Environment > regulation = increased cost	What laws and regulations cover the data and operation of the resource? HIPAA. Is the resource covered by an open-records act? Not applicable.	L
J.3	Governance > outside governance = increased costs	How are decisions regarding data use managed? Not applicable, no use outside the laboratory (i.e., no collaborators). Is a formal data-sharing agreement in place among the collaborators? Not applicable.	L
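
The relative cost ratings in the rows above can be tallied to show where a forecaster might focus attention first. The following is a minimal sketch, not part of the committee's framework: the ratings are copied from the rows above, while the function name and data layout are illustrative assumptions. Row I.5 is recorded as unrated because no rating appears for it.

```python
from collections import Counter

# Ratings copied from the table rows above (L = low, H = high).
# None marks a row for which the table shows no rating.
DRIVER_RATINGS = [
    ("I.2", "Data-Transfer Capacity", "L"),
    ("I.3", "Risk Management", "H"),
    ("I.4", "System-Reporting Requirement", "L"),
    ("I.5", "Billing and Collections", None),
    ("J.1", "Applicable Standards", "H"),
    ("J.2", "Regulatory and Legislative Environment", "L"),
    ("J.3", "Governance", "L"),
]


def summarize_ratings(rows):
    """Tally ratings and list the drivers rated high or left unrated."""
    counts = Counter(rating for _, _, rating in rows if rating is not None)
    high = [f"{code} {name}" for code, name, rating in rows if rating == "H"]
    unrated = [f"{code} {name}" for code, name, rating in rows if rating is None]
    return counts, high, unrated


counts, high, unrated = summarize_ratings(DRIVER_RATINGS)
print("Counts by rating:", dict(counts))
print("Rated high:", high)
print("Unrated:", unrated)
```

A tally of this kind only orders the qualitative ratings; it does not convert them into monetary values.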

Decisions made in the project planning stage, and the information resources available to the researcher during that planning, can influence the overall project costs, the study outcomes, and future data curation and preservation. For example, if the data might be transferred to a repository that has submission requirements, additional data preparation costs may be incurred. If the forecaster/researcher uses no formal data management software in the laboratory, the budget can include additional costs to cover that effort. Funds could be requested for a data manager or wrangler to manage the data and set up the infrastructure needed to adhere to data formatting standards. Automated pipelines could also assist with regular transfers to a State 2 active repository. Those pipelines may cost more to implement up front but could save many human hours over the duration of the project.
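
The trade-off between up-front pipeline costs and recurring manual effort can be framed as a break-even calculation. The sketch below is illustrative only and rests on stated assumptions: the function name, its parameters, and the hour values in the usage lines are all hypothetical and do not come from the report.

```python
import math
from typing import Optional


def break_even_transfers(
    setup_hours: float,
    manual_hours_per_transfer: float,
    automated_hours_per_transfer: float,
) -> Optional[int]:
    """Return the smallest number of transfers at which an automated
    pipeline's total hours no longer exceed the manual total.

    Returns None when automation saves no time per transfer, in which
    case the setup effort is never recovered.
    """
    saved_per_transfer = manual_hours_per_transfer - automated_hours_per_transfer
    if saved_per_transfer <= 0:
        return None
    return math.ceil(setup_hours / saved_per_transfer)


# Hypothetical inputs: 40 hours to build the pipeline, 3 hours per manual
# transfer, and 0.5 hours of oversight per automated transfer.
example = break_even_transfers(40, 3, 0.5)
print("Transfers needed to recover setup effort:", example)

# Hypothetical case in which automation is slower per transfer.
no_payoff = break_even_transfers(10, 1, 2)
print("Break-even when automation saves nothing:", no_payoff)
```

In an actual forecast, hours would be converted to costs using local salary and overhead rates, and the expected number of transfers over the project would be compared with the break-even count.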

Because an individual forecaster, in this case a researcher in a primary research environment, cannot be responsible for estimating all costs for data management in perpetuity, the goal in applying the forecasting framework should be to estimate the costs incurred during data acquisition and stewardship while the data are in the researcher’s control (i.e., while the data are in State 1). However, the forecaster needs to be aware of the requirements for long-term stewardship and be ready with the resources required (e.g., time, money, personnel) to prepare the data for transfer to a State 2 (active) repository if they are to be shared or, if not, to a State 3 repository for long-term preservation.

Step 6. Estimate the costs for relevant cost components based on the characteristics of the data and information resource.

In a quantitative cost forecast, the costs for the activities in the previous section would be quantified for each of the major cost components (e.g., Box 3.2). As noted previously in the report, quantifying costs depends on numerous case-specific factors, such as the objectives for the information resource, the personnel and infrastructural resources available to the forecaster, and host institution requirements. In a real cost forecast, all of these factors would be considered to arrive at monetary values.

REFERENCE

Maumet, C., T. Auer, A. Bowring, G. Chen, S. Das, G. Flandin, S. Ghosh, et al. 2016. Sharing brain mapping statistical results with the neuroimaging data model. Scientific Data 3:160102. https://doi.org/10.1038/sdata.2016.102.
