5 Applying the Framework to a New State 2 Data Resource
In its statement of task (see Box 1.1), the committee was asked to apply the cost-forecasting framework to two case studies relevant to the National Library of Medicine’s (NLM’s) data resources. The case studies presented are based on hypothetical examples provided by NLM to the committee (personal communication, E. Kittrie, January 4, 2019). This chapter presents the first use case, which describes decisions by a policy maker, program officer, or research group to estimate costs for a new repository hosting a large amount of data. Although the scenarios presented in this chapter and Chapter 6 represent traditional research environments, the framework could be applied to other research scenarios.
These case studies are not quantitative cost analyses. A quantitative forecast of a real-life scenario would require greater resources and time than the committee had to accomplish the task, and values obtained for the hypothetical cases would be meaningless given the number of variables presented by the data, the institutions involved, the resources available, the requirements of the funding entity, and so on. Instead, the committee provides high-level examples of how an investigator, a data-resource developer, or a resource manager could use the framework to systematically identify all the cost components which they then can use to develop their own meaningful forecast. In a true quantitative forecast, many more details—as dictated by circumstance—would need to be considered.
Cost forecasters creating or managing a State 1 (primary research) or State 2 (active) platform are unlikely to be able to quantify the costs of the data beyond the period of performance of their respective research projects. However, as stated in Chapter 4, understanding the potential future value of the data is important because decisions made in the early data states may affect the effectiveness and efficiency of future data preservation, curation, and use. Considering the life cycle of data beyond the current data state, and the resources necessary to transition between states, will be increasingly important as data sets become bigger and more complex. Future cost ramifications could inform near-term decisions.
The cost forecasts take advantage of the cost-driver template in Appendix E. This template, based on the cost drivers in each state outlined in Table 4.2, assists in developing a narrative regarding the data life cycle. The approach to forecasting costs for the use cases follows the basic steps described in Table 4.1.
USE CASE 1: ESTIMATING COSTS ASSOCIATED WITH SETTING UP A NEW DATA REPOSITORY FOR THE U.S. BRAIN INITIATIVE
The cost-forecasting framework is applied to a new State 2 (active repository) platform. The study committee applied the framework as a likely cost forecaster would, in this case a neuroscientist (see Box 5.1), and provides some information about existing platforms for context (see Box 5.2).
Applying the Framework to Use Case 1
Using the forecasting steps provided in Table 4.1, the researcher begins to construct the cost forecast.
Step 1. Determine the type of data resource environment, its data state(s), and how data might transition between those states during the data life cycle.
The first two columns in Table 5.1 list the data archive requirements as specified in the RFA. After considering the activities associated with each of the data states described in Tables 2.1, 2.2, and 2.3, the researcher concludes that the proposed data resource will be an active repository and platform and therefore a State 2 resource. The researcher then matches the activities associated with a State 2 resource (Table 2.2) with the specific research objectives described in the RFA and lists which State 2 activities would be necessary to accomplish each of the research objectives (the third column in Table 5.1). The costs and cost drivers associated with the activities will be revisited later in the cost forecast by consulting Table 4.2.
TABLE 5.1 Specific Services Specified in the Request for Application Mapped to Data States, Activities, and Subactivities
Research Objective Number | Archive Requirements as Specified in the BRAIN Initiative RFA-MH-17-255 | States, Activities, and Subactivitiesa |
---|---|---|
1 | The data archive is expected to use relevant standards that describe BRAIN Initiative experiments. Such standards may be developed under RFA-MH-17-256 or may already exist. | II.A.1, II.B.1 |
2 | A data archive will develop a data submission pipeline ensuring appropriate quality control standards for laboratories that are trying to upload data. For example, if an experimental standard defines an allowable range of values for a particular data element, the submission pipeline should make sure that uploaded data respect the current data standard. | II.B.1, II.B.7, II.E.2 |
3 | Ideally, the data archive will create both a submission pipeline and a related validation tool to allow researchers to check the quality of their data even if they are not trying to upload data. . . . Data submission pipelines that originate with the data-collection instrument in the depositor’s laboratory and require minimal manual intervention would be ideal but are not required. | II.B.1, II.B.7, II.C |
4 | A data archive will work closely with BRAIN Initiative awardees and others to collect and archive relevant data sets. | II.A.3, II.D.1, II.D.2 |
5 | Each data archive should plan for a help desk to work with those who are trying to upload data. | II.I.2 |
6 | Each data archive must develop plans to make the data readily available to the broad research community and to citizen scientists, as appropriate. | II.B, II.I |
7 | Depending on the type of data, data submission agreements and data access agreements may be necessary. | II.D.2, II.D.3 |
8 | In many cases, processed data may be as useful to the research community as the raw data produced in the laboratory. Each data archive should consider storing and curating the appropriate data (either raw or processed) and make them available to the community. | II.E, II.H |
9 | A data archive may propose evaluating deposited data and scoring them to allow the research community to have some guidance about data quality. | II.E.2, II.E.3 |
10 | Each data archive should plan to assign persistent identifiers to deposited data and to processed data to allow the research community a very easy way to cite the data sets that are being used. | II.E.5 |
11 | A data archive should allow researchers to have a space where they can share data privately to facilitate collaboration prior to publication. Such private enclaves must last for only a defined period of time before that data set is shared with the rest of the research community. | II.B.9 |
12 | A data archive may help users deposit data into other sustainable databases, such as those supported by the National Center for Biotechnology Information, but this is not a requirement. | II.I.2, II.L.3 |
13 | There may be cases where data are stored in more than one data archive. In those cases, a data archive funded under this funding opportunity announcement will ensure that the user community can find all relevant data using appropriate linkages or database federation strategies no matter where the data are actually stored. | II.F.2 |
14 | Furthermore, each data archive will provide an interface that is accessible to anyone with a web browser. | II.B.7, II.H.3, II.H.4 |
15 | A data archive will make appropriate query tools and summary data easily available to allow the research community to check whether data of interest are held in the archive. | II.B.3, II.B.4, II.B.8, II.B.10, II.H |
16 | The user interface should make the maximum amount of information available to the research community while considering user friendliness and ease of interpretation. | II.B.7, II.H |
17 | The website is expected to have a broad user base that will include both naïve users and experienced bioinformaticians, and should provide an interface that will accommodate both types of users. | II.B.7, II.H |
18 | In many cases, users will want to analyze or use visualization tools to interact with the data without downloading any data. Those interactions should be anticipated by the data archive. | II.B.3, II.B.5 |
19 | Expensive computations could result from some analysis activities, and the data archive should explain plans to deal with such eventualities. | II.B.3, II.B.4, II.B.5, II.I |
20 | A data archive may, but is not required to, use cloud storage and computing capabilities to enable the research community to analyze data without downloading them. A data archive should (but is not required to) allow users to bring their own analysis tools to the data. | II.A.3, II.B.1, II.K.1 |
21 | Each data archive will be expected to have staff who are knowledgeable about informatics and the experimental data being collected. The informaticists will be responsible for coordination with other relevant informatics efforts. | II.B |
22 | In particular, a data archive will be expected to identify and federate the archive with other data repositories and knowledge bases, as appropriate. | II.F.2, II.F.3 |
23 | This data archive integration should create ways for users to query all relevant data repositories for relevant information. Funded data archives will be members of a larger BRAIN Initiative Data Network that will work across BRAIN Initiative activities to promote integration of a variety of data types. | II.F.2, II.F.3 |
24 | In addition, the data archive will interact, as appropriate, with informatics activities outside the BRAIN Initiative such as the NIH Big Data to Knowledge effort and the work of the International Neuroinformatics Coordinating Facility (INCF). | II.A |
25 | When possible, a data archive is expected to use existing infrastructures and standards. These could include persistent identifiers such as Digital Object Identifiers (DOIs) or Resource Identifiers. | II.B.10, II.E.4, II.E.5 |
a Activities are defined in Tables 2.1, 2.2, and 2.3 of this report. The Roman numeral refers to the data state, the capital letter refers to the major activity, and the Arabic numeral refers to the subactivity. The cost forecaster can use this information to consult Table 4.2 to identify likely cost drivers for each activity.
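To illustrate how the mapping in Table 5.1 can be used mechanically, a minimal sketch follows; the objective-to-activity codes are taken from Table 5.1, while the activity-to-cost-driver lookup is a hypothetical stand-in for consulting Table 4.2 rather than its actual content.

```python
# Sketch: use the Table 5.1 mapping to collect, for each RFA research
# objective, the State 2 activities involved and the cost drivers to review.
# Objective-to-activity codes are from Table 5.1; the activity-to-driver
# lookup below is a hypothetical stand-in for consulting Table 4.2.

rfa_objective_activities = {
    1: ["II.A.1", "II.B.1"],            # relevant experiment standards
    2: ["II.B.1", "II.B.7", "II.E.2"],  # submission pipeline with quality control
    5: ["II.I.2"],                      # help desk for depositors
    10: ["II.E.5"],                     # persistent identifiers
    # ... remaining objectives from Table 5.1
}

# Hypothetical stand-in for Table 4.2 (activity -> cost drivers to review).
activity_cost_drivers = {
    "II.B.1": ["A.2 Complexity and Diversity of Data Types", "C.2 Quality Control"],
    "II.E.5": ["B.2 Persistent Identifiers"],
}

def drivers_for_objective(objective):
    """Collect the cost drivers implied by one research objective."""
    drivers = []
    for activity in rfa_objective_activities.get(objective, []):
        drivers.extend(activity_cost_drivers.get(activity, []))
    return drivers

print(drivers_for_objective(2))
```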
Because the researcher is interested in preserving the long-term value of the data and increasing the efficiency and effectiveness of their long-term curation and use, the researcher also considers activities related to eventual transfer of data to another State 2 resource or long-term State 3 archive. These latter considerations were not activities specified in the RFA.
Step 2. Identify the characteristics of the data (Chapter 4), data contributors, and users.
The next sections summarize a high-level consideration of this step, although, in reality, this step would be revisited several times as resources are characterized; choices about the repository are refined; and characteristics of the data, data platform, and contributors and users are better defined through use of the template in Appendix E.
Data Characteristics (Sections A and E of the Cost-Driver Template in Appendix E)
- The files are large: terabytes (TB) per individual data set.
- There are many files, and individual files are large. Raw data may be contained in thousands of individual files. For example, a single serial section electron microscopy data set covering less than 0.5 mm³ of cortex by Bock et al. (2011) comprised 36 TB of raw data and 10 TB after processing to stitch the individual tiles together and reconstruct the volume.
- Sizes are likely to increase over the life span of the resource.
- There are multiple modalities.
- The data are complex: two-, three-, and four-dimensional images.
- There are significant metadata requirements.
Because of the rapid development in algorithms for processing and reconstructing the data, both raw and processed data will likely need to be stored, and compression algorithms for high-resolution scientific imaging data are likely to interfere with the reuse of the data for many applications. Imaging will likely be from animal subjects, minimizing costs associated with security and confidentiality (Section H in Table 2.2). The repository has decided that all data will be offered under the same license, minimizing any costs associated with enforcing multiple permissions.
Contributors/User Community (Section F of the Cost-Driver Template in Appendix E)
Assuming 200 BRAIN-funded users submit data twice a year, 400 independent submissions per year could be expected. Contributor support needs will likely be high, given data complexities and size, particularly in the early years when data validation and upload pipelines may not be fully mature. Data contributors will likely have a sense of urgency to upload backlogs of data before their grant funding runs out. If standards and best practices are not fully in place when the resource begins to acquire data, then a backlog of data will need curation. The resource has to decide whether to devote extra staff and funding to re-curating those data when standards and tools are in place.
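As a rough illustration of how these contributor assumptions translate into a support workload, consider the following minimal sketch; the contributor count and submission rate come from the paragraph above, while the support hours per submission and productive hours per full-time equivalent (FTE) are purely hypothetical placeholders.

```python
# Rough contributor-support workload from the assumptions above.
# Contributor count and submission rate are from the use case; the support
# effort per submission and productive hours per FTE are hypothetical.

contributors = 200                # BRAIN-funded contributors
submissions_per_contributor = 2   # submissions per contributor per year
hours_per_submission = 8.0        # assumed help-desk and curation effort (hypothetical)
fte_hours_per_year = 1800.0       # assumed productive hours per FTE (hypothetical)

submissions_per_year = contributors * submissions_per_contributor   # 400
support_hours = submissions_per_year * hours_per_submission
support_ftes = support_hours / fte_hours_per_year

print(f"{submissions_per_year} submissions/year, "
      f"~{support_hours:.0f} support hours (~{support_ftes:.1f} FTE)")
```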
The user community is also expected to be diverse, with scientists working in different environments. The RFA requires the resource to work closely with contributors (Table 5.1, research objective 4), maintain a help desk (research objective 5), and make data available to the broad research community, including citizen scientists as appropriate (research objective 6). Given the range of user skills to be accommodated, and the cost of developing intuitive user interfaces to do so, the need for general help, training, and outreach materials is likely to increase. Frequent updating of help materials may be necessary during early phases, when the technology is changing regularly.
Step 3. Identify the current and potential value of the data and how the data value might be maintained or increased with time.
The perceived and long-term value of data can be informed by answers to Sections A, D, and E of the cost-driver template in Appendix E and through consultation with experts and colleagues. The perceived long-term value of the data in the proposed resource will depend on hard-to-estimate factors. Some data will derive from new and rapidly developing techniques. Colleagues and experts think that some data may be superseded as technologies improve. On the other hand, data that are the result of a complex experimental paradigm—for example, the carefully correlated light and electron microscopy work of Bock and others (2011)—may be quite valuable even if of lower quality. Well-annotated imaging data tend to be interpretable and usable for a long time in different contexts. The long-term value cannot be estimated at the time of proposal preparation, making decisions about the level of replication and access (e.g., transfer to a less expensive form of storage) difficult in the early stages.
Step 4. Identify the personnel and infrastructure likely necessary in the short and long terms.
The forecaster considers more deeply which of the activities described in Table 2.2 are relevant to the RFA requirements (Table 5.1) and the proposed database. This resource will not handle sensitive information; thus, some activities will not be necessary. The forecaster next considers how these activities might be accomplished (and by whom), again referring to the list of expertise included for each activity in Table 2.2. To estimate personnel costs, the forecaster needs to consider how long each task will take, the skill levels necessary, the availability of people with those skills, and the tool support or training necessary for the people to perform the tasks. In reality, many of the positions listed in Table 2.2 are unlikely to be included or consulted when setting up a typical researcher-led scientific infrastructure, but it is worth considering the value of including them and how their involvement influences overall cost. For example, many new resources struggle with metadata. Consulting a data librarian or records specialist early may help to reduce cost, improve quality, and increase FAIRness by providing advice on community standards for high-level metadata and specialized metadata schemas.
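A minimal sketch of how these personnel considerations might be rolled up into a first-pass annual estimate follows; every task, role, effort level, and loaded rate shown is a hypothetical placeholder to be replaced with figures appropriate to the institution.

```python
# First-pass personnel estimate: effort (FTE-months per year) times loaded
# monthly cost, summed across roles. Every task, role, effort level, and
# rate below is a hypothetical placeholder, not a committee estimate.

tasks = [
    # (task, role, FTE-months per year, loaded cost per FTE-month in USD)
    ("submission pipeline development", "software engineer", 9, 14_000),
    ("metadata and standards support",  "data librarian",    2, 11_000),
    ("help desk and data curation",     "curator",           6, 10_000),
    ("system administration",           "devops engineer",   4, 13_000),
]

annual_personnel_cost = sum(months * rate for _, _, months, rate in tasks)
for task, role, months, rate in tasks:
    print(f"{task:<34} {role:<18} {months:>2} FTE-months  ${months * rate:>8,}")
print(f"Estimated annual personnel cost: ${annual_personnel_cost:,}")
```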
Because the repository infrastructure already exists in this fictional use case, many setup costs might be reduced, but significant customization will likely be necessary. If the infrastructure had to be developed from scratch, the forecasters might consider whether instances of existing infrastructure could be set up or whether they could partner with an existing repository to provide the back-end infrastructure.
Step 5. Identify the major cost drivers associated with each activity based on the steps above, including how decisions might affect future data use and its cost.
Table 4.2 is consulted to identify the cost drivers often associated with a State 2 resource, and the cost-driver template found in Appendix E is completed (the template is based on the cost-driver questions found in Chapter 4). The completed template is presented as Table 5.2, following the discussion of Use Case 1, below. The completed template will help the forecaster determine which decision points will likely control costs now and in the future, and it will help the forecaster understand when specific costs will be borne and by whom.
The responses to the cost-driver questions shown in Table 5.2 allow the forecaster to create a narrative identifying exactly what will be involved in establishing the State 2 (active repository) resource. From that narrative, the forecaster can determine how influential each category of cost is likely to be in the overall costs (listed below). In a quantitative cost forecast, the costs for the activities could then be quantified and each of the major cost components (e.g., Box 3.2) worked out.
- A: Content → Likely high
- B: Capabilities → Likely medium-high
- C: Control → Likely medium
- D: External Context → Likely low
- E: Data Life Cycle → Likely high
- F: Contributors and Users → Likely high
- G: Availability → Likely medium-high
- H: Confidentiality, etc. → Likely low
- I: Maintenance and Operations → Likely low
- J: Standards, etc. → Likely medium
Step 6. Estimate the costs for relevant cost components based on the characteristics of the data and information resource.
As noted elsewhere in the report, the ability to estimate actual costs depends on so many factors that the committee elected not to attempt this exercise. How data size can influence costs can, however, be illustrated using the cost estimators provided by commercial cloud services (in this case, the Amazon Simple Storage Service¹ cost tools). Beyond a certain threshold, absolute size may not impose many additional per-unit costs for cloud storage. For example, as of this writing, the cost to store up to 50 TB is $0.023 per gigabyte (GB) per month, whereas storing more than 500 TB brings the cost down to $0.021 per GB per month, according to that cost estimator. However, given the anticipated growth of the data, the storage cost is not insignificant in absolute terms: for one petabyte (PB) of data, the cost, absent any institutional discounts, would be roughly $21,000 per month, depending on the level of access. Cost over time will also need to be considered; cloud service prices may change, or circumstances may warrant a move to a different provider with different cost structures, services, and data formatting requirements. The size of the data may also impose costs for functions such as external backup, replication, and data transfers (G.4), depending on what infrastructure is available to the resource. The forecaster will want to compare the full costs of storage from multiple service providers, including the fully loaded costs of local computing resources.
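The back-of-envelope storage arithmetic above can be captured in a minimal sketch, using the per-GB rates quoted in this paragraph; actual provider pricing is tiered by volume and access pattern and changes over time.

```python
# Back-of-envelope monthly storage cost at a flat per-GB rate.
# Rates below are those quoted above (as of this writing) and will change;
# cloud providers bill in decimal units (1 TB = 1,000 GB).

GB_PER_TB = 1000

def monthly_storage_cost(size_tb, rate_per_gb_month):
    """Approximate monthly cost for size_tb terabytes at a flat rate."""
    return size_tb * GB_PER_TB * rate_per_gb_month

print(monthly_storage_cost(50, 0.023))    # 50 TB at the entry rate: ~$1,150/month
print(monthly_storage_cost(1000, 0.021))  # 1 PB at the >500 TB rate: ~$21,000/month
```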
The complexity of the data can also impose significant costs. The capabilities related to functional specification and implementation (Activity II.B) will need to be developed or modified and then maintained, and standards for multiple data types and paradigms will need to be developed or implemented. These efforts may need to be multiplied by the number of data types and modalities to be supported, depending on how well the tool set generalizes.
Once the resource is mature and data access and use patterns emerge, significant cost savings may be realized by moving unused or obsolete data to cold storage. Again, commercial cloud provider cost tools illustrate how storage costs are affected by access and responsiveness requirements. Under the Amazon Web Services S3 Intelligent-Tiering pricing model, designed for data whose access patterns are infrequent or unknown, storage that is accessed frequently costs $0.021-$0.023 per GB per month (depending on volume tiers between 50 and 500 TB) but only $0.0125 per GB per month if infrequently accessed. If it is known that the data are infrequently accessed and users can tolerate slow retrieval times (minutes to hours), the cost drops further, to $0.004 per GB per month. For a 500-TB data set, the monthly cost of storage would thus drop from $10,500 to $2,000. Cold-storage options are best considered during the first funding period and in consultation with the community served so that expectations are clear.
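The same arithmetic applied to the storage-tier comparison above, as a minimal sketch; the per-GB-per-month rates are those quoted in the text, and retrieval or data-transfer fees are not modeled.

```python
# Comparing the storage tiers cited above for a 500-TB data set.
# Rates are as quoted in the text (per GB per month) and will differ by
# provider and over time; retrieval and transfer fees are not included.

GB_PER_TB = 1000
size_tb = 500

tiers = {
    "frequent access": 0.021,
    "infrequent access": 0.0125,
    "archive (slow retrieval)": 0.004,
}

for name, rate_per_gb in tiers.items():
    monthly_cost = size_tb * GB_PER_TB * rate_per_gb
    print(f"{name:>25}: ${monthly_cost:>9,.2f}/month")
# frequent access ~ $10,500 and archive ~ $2,000, matching the figures above
```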
The characteristics of the data and the user community will also be a major determinant of decisions about infrastructure for hosting and accessing the data, as well as the necessary levels of user support. The RFA does not require that the resource use the cloud; however, the large size of the data and the unknown growth of both the data and the user community make the cloud attractive because it can scale with increasing demand. Costs associated with data transfers, search, computational services, and downloads will need to be carefully monitored. Costs might be driven by unexpected demand surges (e.g., a data set is posted on social media and is heavily accessed). This fictional use case will protect itself from unexpected and uncontrollable charges by passing the costs of downloads to the end user. Many cloud providers now provide tools and safeguards for monitoring and limiting costs. Taking advantage of local or government programs (e.g., CloudBank and the Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability) to make informed decisions and gain access to expertise about building platforms in the cloud could also lower costs.
Last, although the RFA does not specify that the resource should develop an exit strategy, thinking about the long-term data and resource viability is good stewardship. Bilder and others (2015), in their principles of open scholarly infrastructures, recommend that every resource have a “living will” that describes how a resource would wind down. In the proposed large (multiple PB) hypothetical BRAIN repository, the costs of transferring data to another State 2 active repository or to a long-term archive could be significant.
REFERENCES
Bilder, G., J. Lin, and C. Neylon. 2015. “Principles for Open Scholarly Infrastructure-v1.” https://doi.org/10.6084/m9.figshare.1314859.
Bock, D.D., W.-C. Allen Lee, A.M. Kerlin, M.L. Andermann, G. Hood, A.W. Wetzel, S. Yurgenson, et al. 2011. Network anatomy and in vivo physiology of visual cortical neurons. Nature 471(7337):177-182.
1 See Amazon Web Services, “Amazon S3 Pricing,” https://aws.amazon.com/s3/pricing, accessed December 16, 2019.
TABLE 5.2 Completed Cost-Driver Template for Use Case 1: Setting up a BRAIN Archive
Category | Cost Driver | Decision Points/Issues | Relative Cost Potential (Low, Medium, High) |
---|---|---|---|
A. Content | | | |
A.1 | Size (volume and number of items); > size = higher costs | | H |
A.2 | Complexity and Diversity of Data Types; > complexity + diversity = higher cost | | H |
A.3 | Metadata Requirements; > metadata amounts + type = higher cost | | M |
A.4 | Depth Versus Breadth; > breadth = higher cost | Will the repository be restricted to certain data classes or types that the repository must support? It will primarily focus on imaging data, but there will be multiple types of imaging data from multiple domains. | M |
A.5 | Processing Level and Fidelity; > compression = lower cost | | H |
A.6 | Replaceability of Data; > replaceability = lower cost | | H |
B. Capabilities | | | |
B.1 | User Annotation; > user annotation functions = higher cost | | H |
B.2 | Persistent Identifiers; type of identifier = potential costs | | L |
B.3 | Citation; > citation functions = increased cost | | L |
B.4 | Search Capabilities; > search capabilities = increased cost | | H |
B.5 | Data Linking and Merging; > linking and merging = increased cost | | H |
B.6 | Use Tracking; > tracking = increased cost | | L |
B.7 | Data Analysis and Visualization; > services = higher cost | | H |
C. Control | | | |
C.1 | Content Control; > review processes = increased cost | | H |
C.2 | Quality Control; > quality control = increased cost | | L |
C.3 | Access Control; > controls = increased cost | | M |
C.4 | Platform Control; > platform restrictions = increased cost | Are there restrictions on the type of platform that may or must be used? No, free to use the cloud if desired, and there are no requirements to use a specific cloud provider. Data will not be mirrored overseas. | L |
D. External Context | | | |
D.1 | Resource Replication; > replication = increased cost | Is there a requirement to replicate the information resource at multiple sites (i.e., mirroring)? No. | L |
D.2 | External Information Dependencies; > external dependencies may or may not = increased cost | Will the resource be dependent on information maintained by an outside source? Will use community ontologies for certain metadata. | L |
D.3 | Distinctiveness; > distinctiveness = increased cost | Are there existing resources available that provide similar types of data and services? Yes, the Cell Image Library. The EU also has a Bioimaging Database.a | L |
E. Data Life Cycle | | | |
E.1 | Anticipated Growth; > growth = increased costs | | H |
E.2 | Update and Versions; > updates + multiple versions = increased cost | | H |
E.3 | Useful Lifetime; limited lifetime = decreased cost | | L |
E.4 | Offline and Deep Storage; > offline/deep storage = decreased costs; > transfers = increased cost | | H |
F. Contributors and Users | | | |
F.1 | Contributor Base; > number and diversity of contributors = increased cost | | H |
F.2 | User Base and Usage Scenarios; > access and diversity of users = increased cost | | H |
F.3 | Training and Support Requirements; > training + services = increased cost | | H |
F.4 | Outreach; > outreach = increased costs | | M |
G. Availability | | | |
G.1 | Tolerance for Outages; < tolerance for outages = increased costs | | M |
G.2 | Currency; > currency = increased cost | | M |
G.3 | Response Time; > responsiveness = increased cost | | M |
G.4 | Local Versus Remote Access; > cloud could lead to increased costs | | H |
H. Confidentiality, Ownership, and Security | | | |
H.1 | Confidentiality; > confidentiality = increased cost | | L |
H.2 | Ownership; > ownership = increased costs | | L |
H.3 | Security; > security = increased cost | | L |
I. Maintenance and Operations | | | |
I.1 | Periodic Integrity Checking; > integrity checking = increased cost | | M |
I.2 | Data-Transfer Capacity; > data-transfer upgrades = increased cost | Will the bandwidth available to the resource be sufficient for the data sizes and rates required? Campus connectivity was recently upgraded, so no internal problems are anticipated, but there is no control over our submitters and users. See G.4. | L |
I.3 | Risk Management; > risk mitigation = increased cost | | H |
I.4 | System-Reporting Requirements; > system-reporting requirements = increased costs | What types of system reporting will the resource be required to do? No specific information has been requested in the RFA; monthly reports on acquisitions, total size, and amount of use will be generated for internal purposes. | L |
I.5 | Billing and Collections | Will there be charges for use of the resource? There will not be a charge for accessing data within our resource, nor for invoking services we provide. However, users will be required to bear costs associated with download and any custom computations they want to perform. | |
J. Standards, Regulatory, and Governance Concerns | | | |
J.1 | Applicable Standards; > mature standards = decreased costs | | H |
J.2 | Regulatory and Legislative Environment; > regulation = increased cost | | L |
J.3 | Governance; > outside governance = increased costs | | L |
J.4 | External Consultation; > consultations = increased time = increased costs | | M |
a The website for Euro Bioimaging is https://www.eurobioimaging.eu/, accessed January 11, 2020.