6
Applying the Framework to a New Data Set
Per the statement of task, the cost-forecasting framework was applied to a second scenario, in this case, to the development of a new data set in a State 1 (primary research) platform.
USE CASE 2: ESTIMATING COSTS ASSOCIATED WITH A PRIMARY RESEARCH DATA SET
The cost-forecasting framework is applied to a proposed State 1 (primary research) data platform. The study committee applied the framework as might a young investigator (see Box 6.1). Box 6.2 demonstrates the logic introduced by the forecaster who, although enthusiastic, might be less experienced and unaware of available resources.
Applying the Framework to Use Case 2
Using the forecasting steps provided in Table 4.1, the forecaster (in this case, the researcher) begins to construct the cost forecast.
Step 1. Determine the type of data resource environment, its data state(s), and how data might transition between those states during the data life cycle.
The forecaster examines the request for application (RFA) for requirements related to data management. Comparing the RFA requirements with the descriptions of the data states in Chapter 2, the forecaster determines this will be a State 1 (primary research) platform for her laboratory’s use. However, the forecaster also plans to transfer the data to a State 2 active repository. Funding for transfer activities between platforms will also be considered.
Step 2. Identify the characteristics of the data (Chapter 4), data contributors, and users.
In light of needs, goals, and RFA requirements, the forecaster makes the following preliminary assumptions about the data, which will be refined as the cost forecast proceeds.
Data Characteristics (Section A, Appendix E)
- The data are moderate in size: gigabytes (GB) per individual data set (several mature packages currently support fMRI).
- There are a moderate number of files, and individual files are moderate in size.
- Sizes of data sets will be stable over the life of the project.
- There are multiple neuroimaging modalities.
- The data are complex.
- There are significant metadata requirements.
- Data will come from a single contributor.
Data acquisition costs can be estimated because the number of subjects will be known ahead of time through institutional approvals. If the researcher decides to use a newer technology (e.g., multiband imaging), data sizes will increase fourfold to fivefold and the computational methods for processing and analyzing the data are less well known. In that case, the raw k-space data¹ will be kept available for reprocessing as new algorithms and approaches emerge.
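The effect of the acquisition choice on storage needs can be roughed out with simple arithmetic. The sketch below is illustrative only: the subject count, number of sessions, and per-data-set size are hypothetical placeholders, not values from the use case, and the function name is invented for this example. Only the fourfold-to-fivefold multiplier comes from the text above. The additional space needed to retain raw k-space data is not modeled.

```python
def storage_range_gb(n_subjects, sessions_per_subject, gb_per_dataset,
                     size_multiplier=(1.0, 1.0), copies=1):
    """Return a (low, high) estimate of total storage in GB.

    size_multiplier is a (low, high) pair applied to the baseline size,
    for example (4.0, 5.0) for the fourfold-to-fivefold increase noted
    for multiband imaging.
    """
    baseline = n_subjects * sessions_per_subject * gb_per_dataset * copies
    low, high = size_multiplier
    return baseline * low, baseline * high


# Hypothetical planning inputs (assumptions, not figures from the report).
N_SUBJECTS = 100        # known in advance from institutional approvals
SESSIONS = 2            # imaging sessions per subject
GB_PER_DATASET = 2.0    # "gigabytes per individual data set"

conventional = storage_range_gb(N_SUBJECTS, SESSIONS, GB_PER_DATASET)
multiband = storage_range_gb(N_SUBJECTS, SESSIONS, GB_PER_DATASET,
                             size_multiplier=(4.0, 5.0))

print(f"Conventional acquisition: {conventional[0]:.0f} GB")
print(f"Multiband acquisition: {multiband[0]:.0f} to {multiband[1]:.0f} GB")
```

Reporting a range rather than a single figure keeps the uncertainty in the multiplier visible when the estimate is carried into later steps.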
As the forecaster, at this point, is only estimating the costs for her own use of the data, she skips the questions regarding the user community (Section F, Appendix E) but does keep in mind that the data may be of value to others in the future.
Step 3. Identify the current and potential value of the data and how the data value might be maintained or increased with time.
Perceived value is difficult to predict. However, all data sets underlying the results of a study will be made public so that the data can be inspected and reanalyzed. The availability of public data sets may also encourage technology development if she chooses to use more advanced techniques. As outlined in Box 6.2, if the data are well annotated and prepared according to community standards, they might be an important source of information and data for designing future studies.
Step 4. Identify the personnel and infrastructure likely necessary in the short and long terms.
Based on consideration of State 1 (primary research) and activities necessary to prepare data for State 2 (active) as described in Tables 2.1 and 2.2, respectively, the forecaster identifies the relevant major activities. The project objectives, informed by the RFA, the relevant activities, and personnel necessary (based on Table 2.1) are listed in Table 6.1.
Step 5. Identify the major cost drivers associated with each activity based on the steps above, including how decisions might affect future data use and its cost.
Table 4.2 is consulted to understand the likely important cost drivers for a State 1 resource, and the cost-driver template in Appendix E is also filled in (see Table 6.2 shown after the discussion of the use case). In this application of the framework, not all of the guiding questions in Chapter 4 and the cost-driver template are applicable, so the forecaster revises the template to help delineate costs and decision points as completely as possible.
The relative costs related to data acquisition for this use case are straightforward to predict using the cost-forecasting framework. Relative costs associated with cost drivers identified in Table 4.2 are provided below based on the assessment made while filling out Table 6.2. In a real-world application of the cost-forecasting framework, these costs would be quantified with the help of State 2 (active) repository resources.
- A: Content → Likely low-medium
- B: Capabilities → Likely low
- C: Control → Likely medium
- D: External Context → Likely low
- E: Data Life Cycle → Likely low-medium
- F: Contributors and Users → Likely low-medium
- G: Availability → Likely low-medium
- H: Confidentiality, etc. → Likely medium
- I: Maintenance and Operations → Likely low
- J: Standards, etc. → Likely medium-high
___________________
¹ K-space data are arrays of numbers that represent different spatial frequencies of the image.
TABLE 6.1 Map of the Use Case 2 Scenario to Data States, Activities, and Subactivities
| Project Objectives and Tasks | States, Activities, and Subactivitiesᵃ | Personnel |
|---|---|---|
| | I.B.1 | Researcher, data scientist, software engineer, research domain project manager, policy specialist, administrative staff |
| | I.A.1, I.B.2 | Researcher, records management specialist, data scientist, data librarian, education specialist, policy specialist, software engineer, research domain project manager, administrative staff |
| | I.B.3, I.B.4, I.B.5 | Researcher, data scientist, software engineer, research domain project manager, policy specialist, administrative staff |
| | I.A.2, I.A.3, I.C | Researcher, records management specialist, data scientist, data librarian, metadata librarian, education specialist, policy specialist, research domain project manager, research domain curator, software engineer |
| | I.C.3 | Researcher, metadata librarian, data scientist, research domain project manager, research domain curator, software engineer |
| | I.D | Researcher, research domain project manager, IT project manager, software engineer, data wrangler |
ᵃ The activity numerals correspond with labels in columns of Table 2.1.
TABLE 6.2 Decision Points for Use Case 2
Category | Cost Driver | Decision Points/Issues | Relative Cost Potential (Low, Medium, High) |
---|---|---|---|
A. Content | | | |
A.1 | Size (volume and number of items): > size = higher costs | | L-M |
A.2 | Complexity and Diversity of Data Types: > complexity + diversity = higher cost | | M |
A.3 | Metadata Requirements: > metadata amounts + type = higher cost | | M |
A.4 | Depth Versus Breadth: > breadth = higher cost | | L |
A.5 | Processing Level and Fidelity: > compression = lower cost | | H |
A.6 | Replaceability of Data: > replaceability = lower cost | | L |
B. Capabilities | | | |
B.1 | User Annotation: > user annotation functions = higher cost | | L |
B.2 | Persistent Identifiers: type of identifier = potential costs | | L |
B.3 | Citation: > citation functions = increased cost | | L |
B.4 | Search Capabilities: > advanced search may lead to decreased cost | | L |
B.7 | Data Analysis and Visualization: > services = higher cost | | L |
C. Control | | | |
C.2 | Quality Control: > quality control = increased cost | | L |
C.3 | Access Control: > controls = increased cost | | L |
C.4 | Platform Control: > platform restrictions = increased cost | Are there restrictions on the type of platform that must be used for storing or analyzing the data? Yes. Data infrastructure must adhere to our institution’s security requirements for storing human-subjects data. | M |
D. External Context | | | |
D.1 | Resource Replication: > replication = increased cost | Is there a requirement to replicate the information resource at multiple sites (i.e., mirroring)? The imaging center backs up primary data to a local private cloud. Costs associated with replication are included in our fee to the imaging center. | L |
D.2 | External Information Dependencies: > external dependencies may or may not = increased cost | Will the resource be dependent on information maintained by an outside source? No. | L |
E. Data Life Cycle | | | |
E.1 | Anticipated Growth: > growth = increased costs | | L |
E.2 | Update and Versions: > updates + multiple versions = increased cost | | M |
E.3 | Useful Lifetime: limited lifetime = decreased cost | | L |
E.4 | Offline and Deep Storage: > offline/deep storage = decreased costs; > transfers = increased cost | | M |
F. Contributors and Users | | | |
F.1 | Contributor Base: > number and diversity of contributors = increased cost | | L |
F.2 | User Base and Usage Scenarios: > access and diversity of users = increased cost | | L |
F.3 | Training and Support Requirements: > training + services = increased cost | | M |
G. Availability | | | |
G.1 | Tolerance for Outages: > reliability = increased costs | What is the tolerance for outages of the resource? Access to the data reliably is necessary. Will maintain adequate backups and system performance; scheduled outages for system patches and upgrades are tolerable. | M |
G.4 | Local Versus Remote Access: > cloud could lead to increased costs | | L |
H. Confidentiality, Ownership, and Security | | | |
H.1 | Confidentiality: > confidentiality = increased cost | | M |
H.2 | Ownership: > ownership = increased costs | | L |
H.3 | Security: > security = increased cost | | L |
I. Maintenance and Operations | | | |
I.1 | Periodic Integrity Checking: > integrity checking = increased cost | | L |
I.2 | Data-Transfer Capacity: > data-transfer upgrades = increased cost | Will the bandwidth available be sufficient for the data sizes and rates required for transfer/access? Yes. Campus connectivity recently upgraded. No problems anticipated. | L |
I.3 | Risk Management: > risk mitigation = increased cost | | H |
I.4 | System-Reporting Requirement: > system-reporting requirements = increased costs | What types of system reporting will the resource be required to do? None. | L |
I.5 | Billing and Collections | Will there be charges for use of the resource? No. All laboratory members have free access. | |
J. Standards, Regulatory, and Governance Concerns | | | |
J.1 | Applicable Standards: > mature standards = decreased costs | | H |
J.2 | Regulatory and Legislative Environment: > regulation = increased cost | | L |
J.3 | Governance: > outside governance = increased costs | | L |
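
A forecaster who wants to see at a glance where the template points to the greatest cost potential could tally the ratings in Table 6.2. The sketch below transcribes the ratings in the Relative Cost Potential column as given (driver I.5 is omitted because the table lists no rating for it). The variable names and the tallying approach are assumptions made for illustration and are not part of the framework.

```python
from collections import Counter

# Ratings transcribed from the Relative Cost Potential column of Table 6.2.
# I.5 (Billing and Collections) is left out: the table gives it no rating.
RATINGS = {
    "A.1": "L-M", "A.2": "M", "A.3": "M", "A.4": "L", "A.5": "H", "A.6": "L",
    "B.1": "L", "B.2": "L", "B.3": "L", "B.4": "L", "B.7": "L",
    "C.2": "L", "C.3": "L", "C.4": "M",
    "D.1": "L", "D.2": "L",
    "E.1": "L", "E.2": "M", "E.3": "L", "E.4": "M",
    "F.1": "L", "F.2": "L", "F.3": "M",
    "G.1": "M", "G.4": "L",
    "H.1": "M", "H.2": "L", "H.3": "L",
    "I.1": "L", "I.2": "L", "I.3": "H", "I.4": "L",
    "J.1": "H", "J.2": "L", "J.3": "L",
}

# How many drivers fall under each rating.
counts = Counter(RATINGS.values())

# Drivers rated high are the first candidates for closer, quantitative study.
high_drivers = sorted(d for d, r in RATINGS.items() if r == "H")

print(dict(counts))
print("Rated high:", ", ".join(high_drivers))
```

A tally of this kind only summarizes the qualitative judgments already recorded in the table; it does not replace the quantitative estimate described in Step 6.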
Decisions made in the project planning stage, and the information resources available to the researcher during that planning, can influence the overall project costs, the study outcomes, and future data curation and preservation. For example, given that data might be transferred to a repository that has submission requirements, additional data preparation costs may be incurred. If the forecaster/researcher uses no formal data management software in the laboratory, a decision can be made to include additional costs in the budget to account for the effort. Funds could be requested for a data manager or wrangler to manage the data and set up the necessary infrastructure to adhere to data formatting standards. Automated pipelines could also assist transfer to a State 2 active repository on a regular basis. Cost to implement those pipelines may be greater up front but could also save many human hours over the duration of the project.
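The trade-off between up-front pipeline costs and saved staff time can be framed as a break-even calculation. The following is a minimal sketch under stated assumptions: the function name and all hour figures are hypothetical and chosen only to illustrate the arithmetic, and a real forecast would substitute local estimates and convert hours to salary costs.

```python
import math


def break_even_transfers(pipeline_setup_hours, manual_hours_per_transfer,
                         automated_hours_per_transfer):
    """Return the smallest number of transfers at which the automated
    pipeline's cumulative effort no longer exceeds the manual effort.

    Returns None when automation saves no time per transfer, in which
    case the set-up effort is never recovered.
    """
    saved_per_transfer = manual_hours_per_transfer - automated_hours_per_transfer
    if saved_per_transfer <= 0:
        return None
    return math.ceil(pipeline_setup_hours / saved_per_transfer)


# Hypothetical figures (assumptions, not values from the report).
SETUP_HOURS = 80       # one-time effort to build and test the pipeline
MANUAL_HOURS = 6       # staff hours to prepare and transfer one batch by hand
AUTOMATED_HOURS = 1    # staff hours to supervise one automated transfer

n = break_even_transfers(SETUP_HOURS, MANUAL_HOURS, AUTOMATED_HOURS)
print(f"Automation breaks even after {n} transfers")
```

If the number of transfers planned over the project exceeds the break-even count, the larger up-front request is easier to justify in the budget.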
Because an individual forecaster (in this case, a researcher in a primary research environment) cannot be responsible for estimating all data management costs in perpetuity, the goal in applying the forecasting framework should be to estimate the costs incurred during data acquisition and stewardship while the data are in the researcher’s control (i.e., while the data are in State 1). However, the forecaster needs to be aware of requirements for long-term stewardship and be ready with the resources required (e.g., time, money, personnel) to prepare the data for transfer to a State 2 (active) repository if they are to be shared or, if not, to a State 3 repository for long-term preservation.
Step 6. Estimate the costs for relevant cost components based on the characteristics of the data and information resource.
In a quantitative cost forecast, the costs for the activities in the previous section would be quantified for each of the major cost components (e.g., Box 3.2). As noted previously in the report, quantifying costs is dependent on numerous case-specific factors such as the objectives for the information resource, the personnel and infrastructural resources available to the forecaster, and host institution requirements. In a real cost forecast, all of these would be considered to arrive at monetary values.