
3

Cost and the Value of Data

Cost refers to the resources necessary to accomplish an objective. A cost forecast (i.e., a prediction of costs) is usually expressed in monetary terms and is based on the quantity and nature of the items and services needed. Cost forecasts inform decision makers about the resources needed to transform an idea into a reality. Costs and cost forecasts come in a variety of forms, and there is a rich set of issues to consider in thinking about them. A cost forecast can inform actions needed to assemble necessary resources or to resolve any issues that would impede success. That may include identifying less costly solutions if the forecast suggests that original plans will face financing difficulties. Indeed, cost forecasts are vital in comparisons of alternative courses of action. Cost is not a measure of value (i.e., the benefit the objective produces), and cost forecasts may be unwelcome to the extent that they raise questions about the merits of an undertaking relative to the resources required. Decision makers focused on the best use of available resources will nonetheless want good cost forecasts as the basis for their choices. A good cost forecast keeps attention on the fact that resources expended in one use are unavailable for other potentially high-value uses. This “opportunity cost” is an important element of data management: expending resources on keeping existing data sets available means fewer resources for funding new research activities. There is thus a trade-off between allocating resources to maintain the by-products of past research and allocating them toward new research.

Controversies created by cost forecasts help explain why so many governments in the United States and abroad use cash budgets (i.e., finance 1 year of costs at a time) rather than acknowledging the total commitment implied by today’s decisions through full funding of projects. A variant of this difficulty arises when a fixed horizon (e.g., a “5-year plan”) is used to forecast costs: large costs may be encountered just beyond that fixed horizon. As described later, this problem arises in planning for the long-term preservation and use of biomedical research data (States 2 and 3, in the terminology of this report), since State 1 primary research grants have typically provided funds only for the performance period of the grant (a policy within at least the National Institutes of Health [NIH] that may change [DHHS, 2019]). To the extent that the biomedical research enterprise wants to ensure that good decisions about data management and data access are made at research project inception, or at any point in the data life cycle (see Chapter 2), it is critical to address all cost components across the full life cycle of the data system.

Because many researchers and data scientists have had no formal education in economics, this chapter begins with a primer that introduces basic economics terms and concepts that may be encountered by the cost forecaster. The text delineates the principal economic issues in creating cost forecasts and the significant variables (i.e., “cost drivers”) affecting the forecast in the biomedical research data life cycle. It then assesses which are most significant in each element of States 1, 2, and 3 and follows with a review of how the properties of these drivers influence costs. The chapter concludes with an illustration of how the cost drivers might affect forecasts for States 1, 2, and 3.

ECONOMIC ISSUES IN FORECASTING COSTS

When developing a cost forecast, it is important to understand that (1) all goods and services will incur costs to someone—even if a cost seems “free” (e.g., services provided by a university to a researcher, or the “free” access to data repositories managed by a government institution), (2) many costs are not incurred immediately, (3) many costs are not easily anticipated, and (4) cost burdens may shift. The first step when considering costs is to define precisely what one is trying to accomplish—in other words, identify what one is “buying.”

Whose Costs?

From the perspective of the individual or organization that makes management decisions (e.g., a researcher, research institution, or repository host), the costs that matter most are those that must be financed from its budget. Those costs may be less than the total costs of a project or responsibility, distorting comparisons among competing courses of action. For example, a government agency that manages a data repository may underfund pension costs and omit overhead items such as facilities costs that a private organization must include. A public agency or other organization that manages research or a repository may benefit from services such as a computing environment that is financed outside of its budget.

A parent institution, such as a university that provides services to component units such as research departments, may insist that the costs of those services be incorporated into decision making. From the parent institution’s perspective, the decision to proceed will trigger payment of those costs by the institution even if the immediate project manager (e.g., the researcher) does not have to finance them.

Sunk Versus Marginal Costs

It is important to distinguish between sunk costs (i.e., costs that have already been incurred and cannot be recovered, such as previously purchased computer equipment) and marginal costs (i.e., future costs, including costs for the next increment of effort, such as additional servers for data storage). Some sunk costs might be derived from reusing or redeploying previously developed software or infrastructure paid for by others. For example, existing open-source software might be incorporated as a component of a new data information resource, so some amount of development costs for that software would not be included in the present forecast. However, there will still be marginal costs for adapting, maintaining, and integrating that existing software that would need to be incorporated into the cost forecast. Marginal costs tally the costs that will be incurred if a project proceeds, beyond the fixed overhead that is already financed and the investments that do not need to be repeated. Marginal costs might also change if there are savings derived from greater efficiencies incorporated into later project stages, for example, as a result of experience gained managing data or improvements in hardware technologies. Decisions are best informed by marginal costs because they capture the incremental resources a project will actually require. This dictum is true even if an institution requires a budget to be prepared otherwise (e.g., to amortize a building that has already been constructed).

Because marginal costs may be difficult to calculate, many institutions rely on an average of past costs for their forecasts. If there are significant fixed costs for an activity (e.g., to create the data typology), average costs could decline as additional data sets are acquired. However, in that situation, the average will exceed the marginal cost, thus potentially overstating resource needs and unduly discouraging the next increment of activity. It is also possible that marginal costs could exceed the historical average, for example, if a new facility is required to accommodate a major expansion, and the historical average is based on a building for which the construction cost has not been adjusted for inflation. Presumably, an appropriate adjustment will be made, but the best safeguard against misforecasting is to invest the additional effort to understand costs at the margin.
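To make the distinction concrete, the sketch below uses hypothetical figures (a one-time fixed cost and a constant per-data-set cost) to show how an average-cost forecast can overstate the marginal cost of acquiring the next data set; the numbers are illustrative only, not drawn from this report.

```python
# Hypothetical figures for illustration only.
FIXED_COST = 100_000   # one-time setup cost (e.g., creating the data typology), assumed
MARGINAL_COST = 2_000  # assumed cost of ingesting one additional data set

def average_cost(n_datasets: int) -> float:
    """Average cost per data set, spreading the fixed cost across all data sets held."""
    return (FIXED_COST + MARGINAL_COST * n_datasets) / n_datasets

for n in (10, 50, 200):
    print(f"{n:>4} data sets: average = ${average_cost(n):,.0f} per data set, "
          f"marginal = ${MARGINAL_COST:,.0f} for the next data set")
```

In this situation the average declines as more data sets are acquired but always exceeds the marginal cost, which is why budgeting from historical averages can overstate the resources needed for the next increment of activity.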

Cost Versus Price

Institutions or individuals often base their cost forecasts on the prices charged for goods and services. Prices will vary with the extent of services provided. Prices for in-house services (e.g., on-premises data storage) may not cover all the elements needed for the data function (e.g., power, data center overhead), whereas a cloud provider’s price is more likely to bundle these elements (along with a certain level of data security). A price reflects the amount of money needed to purchase a product or service, but it may not always accurately reflect the true cost of providing that product or service (i.e., all the resources that society must use). If, for example, another part of the institution or a different institution subsidizes a research program or data repository by providing “free” or discounted computer services or by lending staff to a project, the prices may understate production costs. This situation also arises when larger social effects that should be reflected in prices are not (e.g., environmental effects of electricity generation and use required by data repositories). Again, some of these may not be incurred directly as monetary costs, or they may not be incurred immediately. They may result in future costs (e.g., power companies may need to charge higher rates as they become responsible for mitigating environmental impacts). On the other hand, market forces might lead to prices that overstate actual costs for goods and services (e.g., “excessive” download charges from a cloud service provider).

The issue of cost versus price is especially important to consider when projecting the cost of commercial services. Service providers may benefit from much greater economies of scale and thus lower costs than an individual institution or researcher, but their lower costs will not necessarily translate into lower prices for the science community. Even if prices accurately reflect past (marginal) costs, there is no guarantee that they will do so in the future. For example, the widespread adoption of new data practices (or even adoption by a single large enterprise) could shift demand sufficiently to affect future prices (e.g., for a particular skill) in a way that is not captured by studying past pricing history. Recent increases in salaries for data scientists are an example (see Box 3.1).

Institutions acting in the public interest may be instructed to include the full cost of producing a result or to avoid practices that impose social costs not reflected in market prices. They may be instructed to finance some of the subsidies (e.g., in the form of scholarships). They may be directed to reduce the impact of imperfections through how they procure an item (e.g., the statutory direction that the Department of Defense use Veterans Administration preferential drug prices).

Investment Versus Operating Costs and Their Time Profiles

The time profile of costs matters when comparing courses of action. That is certainly the case if funds for the immediate budget period are more difficult to obtain than those further in the future, when fewer commitments are perceived to be fixed and there is more discretion about how funds might be employed. Some one-time costs (i.e., investments) may be necessary to begin a project or a project’s next phase. Sometimes, such costs must be expended periodically (e.g., the cost of hardware or software refreshment). These expenditures are often followed by a period in which operating costs require a lower level of continued resources.

Two projects may have the same (total) forecast costs but very different time profiles. The standard solution to comparing these different cost streams is discounting (e.g., Mankiw, 2017)—for example, using a discount (interest) rate to price everything as a single payment made immediately (i.e., the “present value”). This present value can be thought of as a corpus that pays not just for first-period costs but for future expenses as well, using a combination of the principal and the interest theoretically (or actually) earned in the meantime. A “discounted value” thus recalculates future payments as the equivalent payment made today: a principal that, together with earnings at the discount (i.e., interest) rate, would be exactly enough to cover the future obligations. Controversy arises among stakeholders about the choice of discount rate, which affects how courses of action rank. A high discount rate diminishes the present value of future costs; a low one increases it. Discount rates for U.S. federal agencies are usually mandated by the Office of Management and Budget (typically the rate on Treasury obligations).1
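A minimal sketch of this calculation appears below; the two cost streams and the discount rates are hypothetical, chosen only to show how the choice of rate changes the weight given to later-year costs.

```python
def present_value(costs, rate):
    """Discount a stream of annual costs (year 0 first) to a single present value."""
    return sum(c / (1 + rate) ** t for t, c in enumerate(costs))

# Two hypothetical projects with the same undiscounted total ($500,000)
# but different time profiles (figures assumed for illustration).
front_loaded = [300_000, 50_000, 50_000, 50_000, 50_000]
back_loaded  = [50_000, 50_000, 50_000, 50_000, 300_000]

for rate in (0.02, 0.07):
    print(f"rate {rate:.0%}: front-loaded PV = ${present_value(front_loaded, rate):,.0f}, "
          f"back-loaded PV = ${present_value(back_loaded, rate):,.0f}")
```

At the higher rate, the back-loaded stream appears markedly cheaper in present-value terms even though the two projects cost the same in nominal dollars, which is why the mandated choice of rate can change how alternatives rank.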

Buildings and other physical facilities present a special problem: they represent large expenditures in short periods, and there is an issue of cost recovery. If the institution already owns them and no refurbishment is needed, their costs may be viewed as sunk and thus omitted from forecast marginal costs. If another entity either owns or is renting the building on the institution’s behalf (e.g., for a federal agency, the General Services Administration), the rent becomes part of the forecast cost. If the institution constructs the facilities for the purpose of the project under consideration, those costs become part of the project’s early-period expenses.

Forecasting costs may be reasonably straightforward when investment costs occur early in the life of an individual project. Forecasters might draw on experience with similar recent projects, or forecasts might even be based on bids for the specific project (with due allowance for how the contracting strategy and other factors might affect the actual price eventually paid). If periodic future investments are needed to sustain the project, forecasting could be more difficult, given changes in the marketplace that affect costs. Some costs may increase (e.g., owing to suppliers leaving the business), while others may shrink (e.g., from technological improvements such as those that have characterized computing power). As a result, there may be substantial uncertainty about future costs.

In reality, investment costs are often underestimated at inception, in part owing to the cost of developing necessary new technology (e.g., new software and hardware), the procurement costs of which may thus be greater than anticipated. Such inaccurate cost forecasts may reflect excessive optimism about what can be achieved, a lack of clarity or precision regarding what is to be accomplished, or deliberate “lowballing” on the part of a proposer seeking to win approval for an initiative.

Principal Elements of Operating Costs

For most public and private enterprises, the principal elements of operating costs are consumable inputs (e.g., power, vendor services) and direct labor (i.e., personnel). Both present interesting forecasting challenges. Box 3.2 lists the major costs to establish and operate a biomedical information resource. Uncertainty may arise from potential changes in the marketplace for non-direct-labor inputs. For example, what is the likelihood of changes in the cost of materials-based inputs (e.g., reductions in energy costs as a result of hydraulic fracturing versus any increase that a carbon tax would impose)? Are vendor prices for services likely to be stable? If not, how plausible are the mechanisms that drive change?

Estimating the cost of direct labor (i.e., for the personnel employed by the organization) may appear to be straightforward, but the institution must allow for the reality that wages are likely to increase over the longer term, driven by general inflation and productivity growth. Moreover, fringe benefits (e.g., health care) account for a major part of direct labor costs, and their costs may be driven by factors outside the institution’s control. Also challenging is forecasting how much direct labor will be required for new or different projects (e.g., for activities such as curating a new data set, and for data security, integrity checking, and addressing the impacts of disruptions to access to those data sets). Collecting early data on what specific activities will be required in any of the data states will help to improve estimates.
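A simple multiyear projection of the kind described here might look like the sketch below; the headcount, salary, wage-growth, and fringe-rate figures are all assumptions for illustration, not estimates from this report.

```python
# All parameters are hypothetical, for illustration only.
HEADCOUNT = 4          # full-time staff assumed to support the data resource
BASE_SALARY = 90_000   # assumed average annual salary today
WAGE_GROWTH = 0.03     # assumed annual growth from inflation and productivity
FRINGE_RATE = 0.30     # assumed fringe benefits as a fraction of salary

def direct_labor_cost(year: int) -> float:
    """Projected direct-labor cost (salaries plus fringe) in a given future year."""
    salary = BASE_SALARY * (1 + WAGE_GROWTH) ** year
    return HEADCOUNT * salary * (1 + FRINGE_RATE)

for year in range(6):
    print(f"year {year}: ${direct_labor_cost(year):,.0f}")
```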

The institution will also need to consider changes in the composition of its direct labor force in its forecasts. A higher proportion of specialized skills would likely increase costs (e.g., more data scientists), whereas new data processing approaches (e.g., the application of artificial intelligence) or the ability to employ a more junior workforce might decrease them, after allowing for any required initial software or process investments. Since changes in the experience composition of the workforce are a product of both organizational and individual decisions, forecasts need to look at a distribution of potential developments for those elements not under direct control (e.g., retirement rates).

___________________

1 See OMB Circular A-94, December 18, 2018.

Relative Costs of Storage Media and Hardware

A difficult issue in forecasting costs for a data-intensive enterprise is how to deal with the information infrastructure (i.e., the storage media and hardware). First, this infrastructure may be provided by others (e.g., a university or other host institution), and the repository may be charged in such a way that prices and costs diverge substantially. Indeed, the repository may see only an operating cost—that is, a charge for services—because the providing entity is making the actual investments. It will nonetheless be useful for repository managers to understand those underlying costs, if only to judge their reasonableness, and especially in deciding whether relying on another provider or investing in some or all of the infrastructure itself is the better course of action. Second, as is widely appreciated, IT changes rapidly, with implications both for the nature of the services the repository is providing (e.g., users wanting the latest level of capability) and for the costs the repository faces, as has been true for storage.

The choice of data storage media may also affect the short- and long-term costs of data storage. Data might be sitting in the potentially volatile main memory of the computer system (e.g., if they just came off an instrument or were output by a simulation model) or in online storage (e.g., solid-state or mechanical disk drives), from where they can be readily transferred to the information resource. They might be held in an offline storage medium, such as removable disk drives, compact discs, or tapes, in which case there will be costs in bringing the data back online by either automated or manual means. In some cases, the data may be stored in a deprecated medium (e.g., a Zip-drive disk), where finding a device to access the data could be a challenge. The data might even be in nondigital format, such as paper or photographs, which entails high costs for scanning or manual transcription (see, e.g., Nielson et al., 2015). The repository may face a one-time cost to shift to contemporary storage media, which may be less expensive on a life-cycle basis, or an ongoing challenge with cost implications for maintaining access to data using storage approaches for which commercial and technical support are shrinking. The relative costs of hardware and storage media for long-term preservation need to be compared in a systematic way, especially in light of how quickly options evolve. Example approaches to such comparison can be found in the literature (e.g., Merrill, 2017; Rosenthal, 2017).
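The kind of systematic comparison called for above might start with a life-cycle tally like the sketch below, which contrasts staying on an aging medium (with support costs assumed to rise as vendors exit) against a one-time migration to contemporary storage (with unit costs assumed to fall). Every figure is hypothetical; a real comparison would draw on analyses such as Merrill (2017) or Rosenthal (2017).

```python
# All figures below are assumptions chosen only to illustrate the comparison.
YEARS = 10      # planning horizon
DATA_TB = 50    # size of the holdings in terabytes

def keep_legacy_medium() -> float:
    """Total cost of staying on an aging medium whose support costs rise over time."""
    annual, total = 3_000.0, 0.0          # assumed current annual support cost
    for _ in range(YEARS):
        total += annual
        annual *= 1.15                    # assumed 15%/year rise as vendor support shrinks
    return total

def migrate_then_store() -> float:
    """Total cost of a one-time migration followed by storage on contemporary media."""
    per_tb_year, total = 100.0, 10_000.0  # assumed $/TB-year and one-time migration cost
    for _ in range(YEARS):
        total += per_tb_year * DATA_TB
        per_tb_year *= 0.85               # assumed 15%/year decline in unit storage cost
    return total

print(f"Keep legacy medium over {YEARS} years: ${keep_legacy_medium():,.0f}")
print(f"Migrate, then store over {YEARS} years: ${migrate_then_store():,.0f}")
```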

Forecast Reliability

The reliability of a cost forecast is an important consideration. Procuring something new typically involves substantial uncertainty, which should be communicated (Manski, 2019). For example, during its information gathering for this report, the committee heard how switching service providers resulted in unexpected costs (see Box 3.3). Throughout the data life cycle, there will be a distribution of estimates for what is needed to sustain the activities in each data state. The Department of Defense, for example, recognized this reality in the call to budget to the “most likely cost” in 1981 (Greene, 1981) and in the later development and evolution of its “Better Buying Power Initiatives,” which acknowledge that there is a distribution of potential cost outcomes (Kendall, 2017). Characterizing that distribution quantitatively may be difficult, owing to a lack of data, and may require substantial additional effort. But at a minimum, the cost forecaster owes decision makers a warning and discussion about the existence of those uncertainties—even if they cannot be precisely characterized.
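One simple way to communicate such a distribution is sketched below: it samples a right-skewed (triangular) cost distribution built from hypothetical optimistic, most-likely, and pessimistic estimates and reports the mode, the mean, and an upper percentile. The figures are assumptions for illustration only.

```python
import random

# Hypothetical optimistic, most-likely, and pessimistic estimates for one activity.
LOW, MODE, HIGH = 80_000, 100_000, 180_000

random.seed(0)  # reproducible illustration
samples = sorted(random.triangular(LOW, HIGH, MODE) for _ in range(100_000))

mean = sum(samples) / len(samples)
p80 = samples[int(0.8 * len(samples))]

print(f"most likely (mode): ${MODE:,.0f}")
print(f"expected (mean)   : ${mean:,.0f}")  # exceeds the mode when the distribution is right-skewed
print(f"80th percentile   : ${p80:,.0f}")   # one way to express the risk of a higher outcome
```

Budgeting to the most likely cost while also reporting the mean and an upper percentile gives decision makers a sense of both the central estimate and the downside exposure.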

ASSESSING THE VALUE OF DATA

Data constitute a different type of asset than physical infrastructure, and biomedical research data constitute a different type of data than those that are readily monetized in the commercial sector. The biomedical research community will want data valuation models that are able to attach value to the public good that a data resource can generate, and that can recognize the value society places on the institutions that support the data resources. The value of a single data set reflects factors such as its uniqueness, the number of times it is used, the cost per use, and the impact of each use (e.g., the change in prior probability of a hypothesis). Value differs from cost—it reflects worth in terms beyond the monetary. The value of data might vary for different purposes (e.g., initial discovery versus repeatability). And it may change with time, even if the data themselves do not change (e.g., a data set may lose value if it is superseded by another that is more precise or accurate, or it may gain value if better analysis techniques allow new knowledge to be obtained). Value may accrue in different states to different actors for different reasons. A data set may be considered valuable while in regular use in a State 1 (primary research) or State 2 (active repository) environment (see Box 2.1), especially if seen to contribute to advancing science. However, it is difficult to forecast the value of data into the future because it is difficult to know how the data may be used or aggregated and repurposed.

A larger aggregated data resource has the potential to increase the value of a data set by increasing the number of uses and by increasing the benefit per use through linkage to other data sets. A data set increases the value of the aggregate data resource by contributing to its breadth (e.g., variety of data types added by the data set) and depth (e.g., number of instances of a data type or granularity of data in an instance). The degree to which the aggregate resource delivers on the potential to increase the value of a data set depends on how well it handles factors such as accessibility, discoverability, and analysis. The value of a data resource compounds if it sparks connections among diverse users. This compound value reflects factors such as the distribution of user backgrounds, geographic origins, and purposes, including research and nonresearch purposes. In the long term, the greatest value may be realized through the multiplier effect as heterogeneous data sets are aggregated and linked on novel computational platforms in ways that are impossible to predict at the time a data set is created. Uniqueness of a data set may be the best long-term predictor of value.

When determining the value of data from small, individual studies, an important factor is the extent to which they can be combined with other similar studies to increase statistical power. Studies that yield small sample sizes by the end of the study may be considered exploratory. If rigorous community standards and good data management practices have already been implemented by a laboratory, then submitting the data to a specialist repository will require less effort. The inherent value of the data may have increased, as it is more likely that they can be used with other similar data. On the other hand, if the data require significant formatting to meet community standards, the laboratory would have to expend significant resources preparing the data for submission. In this case, the data might be considered only moderately valuable and the researcher may choose to submit the data to a repository with less onerous requirements.

The lack of statistical power in smaller data sets is a key factor in current reproducibility problems (e.g., Ioannidis, 2005). Large efforts like the Human Connectome Project, the Alzheimer’s Disease Neuroimaging Initiative, and the Adolescent Brain Cognitive Development (ABCD) initiative are producing large, well-aligned data sets, but these types of projects are not able to sufficiently sample the phenotype space either within or between conditions. Promising results are starting to emerge, however. It is possible and perhaps even advantageous at times to aggregate data from smaller studies to increase statistical power and to train new machine learning algorithms to take advantage of heterogeneous data. Aggregating heterogeneous data allows a more complete and robust model of preclinical research to emerge, as each individual laboratory samples a small slice of a larger, multidimensional picture (Ferguson, 2019; Williams, 2019). The work of Alan Evans (Moradi et al., 2017) in neuroimaging with multicenter data from the Autism Brain Imaging Data Exchange (ABIDE) database2 also shows the importance of making available multiple independently acquired data sets. The ABIDE initiative includes two large-scale neuroimaging data collections, ABIDE I and ABIDE II, created through the aggregation of data sets independently collected across more than 24 international brain imaging laboratories studying autism. Moradi et al. (2017) showed that machine learning algorithms that are trained across independently acquired data are more robust and generalizable than when trained on data from a single site. This work is consistent with that discussed at the Workshop on Forecasting Costs for Preserving and Promoting Access to Biomedical Data by Ferguson (2019) and Williams (2019), in which investigations using data from multiple laboratories or multiple genetic strains lead to more robust clinical predictions than investigations using more limited data. These results suggest that while a small individual data set on its own may be of limited value, when aggregated with other data, it can potentially increase the value of the pool of data. So to the extent that data are “multiplicatively integrative” through adherence to the findable, accessible, interoperable, and reusable (FAIR) principles and exposure through platforms that make them FAIR, their value increases. If, however, the data are shared through a platform where their discoverability is limited and where standards and curation are not enforced, then their value will be diminished.

___________________

2 The website for ABIDE is https://fcon_1000.projects.nitrc.org/indi/abide/, accessed December 12, 2019.

To retain data value, reanalysis of data by either a researcher or a repository may be necessary to make them compatible with new data and to ensure that results derived from them remain valid. For example, a widely discussed and controversial paper claimed that the statistical methods used in functional magnetic resonance imaging (fMRI) data analyses were leading to an overinflation of false positives in neuroimaging studies (Eklund et al., 2016). Detailed comparisons across major software packages in structural MRI find major differences in the way that these packages calculate parameters such as cortical thickness (summarized in Kennedy et al., 2019). Software bugs or system dependencies may be uncovered that invalidate results drawn from older studies (Kennedy et al., 2019). Such reanalysis will entail costs that cannot be estimated in early cost forecasts because these methodological developments cannot always be predicted.

While technological volatility can make data obsolete as higher-quality or higher-resolution data become available, it can also increase the value of data in the long term, provided that the underlying data are valid. Their long-term availability ensures that these data may be reanalyzed with newer algorithms and approaches. As Eklund and colleagues note, “Due to lamentable archiving and data-sharing practices, it is unlikely that problematic analyses can be redone” (PNAS, 2016). In an analysis reported in 2019, Eklund and colleagues estimated that, of the total of 23,000 studies published in neuroimaging, up to 2,500 were likely affected by the misapplication of statistics. If the average cost of a neuroimaging NIH Research Project Grant Program (R01) award to an investigator is $400,000 (Kennedy, 2014), then the total value of these 2,500 studies that must be discarded or reperformed is $1 billion. Thus, many in the neuroimaging community have called for more long-term storage of primary neuroimaging data (Eklund et al., 2016, 2019; Kennedy et al., 2019).

Data value will also be related to the quality of the data. Higher-quality data are likely to be more valuable than lower-quality data, although quality control metrics for data are not always known, and algorithms in the future may be able to account for suboptimal data characteristics (e.g., motion artifacts) and “rescue” these data for future use. The quality of the data is likely to be affected by the platform and standards used and how well they are supported by automated and human curation. Dr. Greg Farber (committee site visit to NIH, September 18, 2019) and Dr. Russ Poldrack (personal communication with M. Martone, September 18, 2019) noted that automated pipelines catch many errors, such as inconsistently named files, that are difficult for human curators to identify, improving the overall quality of data and metadata submitted. However, human curators, using their knowledge and insight, can catch discrepancies that software misses. So the same data set might gain significantly more value when submitted to a repository with both automated and human curation than when submitted to a generalist repository with minimal curation support. However, if the researcher carefully documents her data and adheres to community standards and best practices independently, data deposited in such repositories can be quite FAIR.

The perceived value of data influences preservation, access, and archiving decisions as well as decisions made regarding transition of data from state to state. Characterizing value for decision making might be related to the number of different tasks or decisions that the data support, and it might be possible to compare values of data sets without quantifying them. For example, if data set “A” supports all tasks that data set “B” supports, then one could assert that data set B’s value is no greater than that of data set A. Some may attempt to relate the value of data to the cost of obtaining them, but data value does not necessarily correlate with the financial investment made to collect them. Decisions made about the disposition of data may be based on the cost to replace the data, but those decisions should also be informed by the data’s quality and use, alongside replacement costs. A data set can have many anticipated uses and thus be viewed as high value, but if those uses do not occur, the value is not realized. The cost to replace the data may change. The day may come, for example, when technology advancements make it less expensive to resequence an organism than to download its genome. On the other hand, some data may be irreplaceable (e.g., surveys done in the past with time-dependent results) or take a long time to re-create (e.g., the Framingham Heart Study [see, e.g., Tsao and Vasan, 2015]).
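The task-coverage comparison described above amounts to a partial order on data sets by set inclusion. The sketch below illustrates the idea with hypothetical data set names and task lists; it is an illustration of the reasoning, not a valuation method proposed in this report.

```python
# Hypothetical task lists; the comparison follows the set-inclusion argument above.
tasks_a = {"cohort replication", "meta-analysis", "method benchmarking", "teaching"}
tasks_b = {"meta-analysis", "teaching"}

def compare_by_task_coverage(a: set, b: set) -> str:
    """Compare two data sets on task coverage alone (a partial order, so ties and
    incomparable pairs are possible)."""
    if a == b:
        return "A and B support the same tasks"
    if a >= b:
        return "B's value is no greater than A's"
    if b >= a:
        return "A's value is no greater than B's"
    return "A and B are not comparable on task coverage alone"

print(compare_by_task_coverage(tasks_a, tasks_b))  # B's value is no greater than A's
```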

Identifying data as high value in any of the data states will have cost ramifications. For data value to be realized, the data need to be discoverable and usable. Metadata are required to make the data more easily discoverable. Without at least some minimal standard for metadata tags, the data are destined to become “dark data”—data that are undiscoverable and, hence, unused (see Box 1.3). Services around the data may be required for the data to be usable, and significant labor will likely be necessary to implement and provide those services, including maintaining data standards. From a scientific point of view, data have no value without proper standardization and documentation. Preserving the value of data in any state, particularly in the State 3 environment, requires using accessible formats and keeping high-level context information (e.g., “all these data came from the same clinic”) so that the data are discoverable at reasonable levels of effort that make the search time worthwhile.

APPROACHES TO DATA VALUATION

Before considering the consequences of designating data as high value, it helps to consider how the different facets of value might be assessed for data. It may be useful to look at some commercial approaches to data valuation (see Box 3.4) to see what insights they offer. Information valuation is still a nascent discipline in the commercial world, in part because standard accounting principles do not permit data to be listed as an asset on a company’s balance sheet (Laney, 2017). Nevertheless, commercial information clearly has a value, as witnessed by the stock market valuations and sales prices of information-intensive companies relative to their balance sheets. In particular, the committee finds the taxonomy of valuation approaches set forth by the Gartner Group to be informative (Laney, 2017). Some of the approaches are not well suited to biomedical research data. For example, the “market value of information” approach is of limited use, as biomedical data sets are generally not bought and sold in public marketplaces. However, at least three of the approaches seem relevant in the biomedical-information setting.

  1. Cost value of information (CVI): This approach equates the value of data with the expense of obtaining them. While this approach ignores a multitude of factors about a data set, it is useful in setting a target for how much it makes sense to invest in preserving a data set. Spending more than the replacement cost of the data should raise questions, although there are a few caveats. The replacement cost of the data set may differ from the original cost of obtaining it—perhaps dropping as technology improves or rising with labor costs. Also, replacing data takes time, so it may make sense to spend more than the CVI of a data set to preserve it so as to avoid gaps in availability.
  2. Intrinsic value of information (IVI): IVI is a nonmonetary metric based on the quality (i.e., correctness and completeness), scarcity, and expected lifetime of a data set. While IVI does not determine appropriate costs for preserving data sets, it can be used to prioritize expenditures among data sets when funds are limited.
  3. Business value of information (BVI): BVI is another nonmonetary metric based on the goodness and relevance of a data set for a specific purpose. In a biomedical research or clinical setting, BVI might be interpreted as the range of tasks or investigations that a data set enables and its adequacy to those ends.

With all of these approaches, it is important to recognize that the value of data need not stay fixed over time. As noted with CVI, the replacement cost may change with developments in technology or trends in labor costs. For IVI, quality assurance activities or the retirement of similar data sets can increase value. BVI is especially amenable to change. A data set may become useful in a broader range of tasks (or better suited to current uses) if it is combined with other data sets or new analysis methods are developed that can work with it.
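A hedged sketch of how these three approaches might sit side by side in practice is shown below: each data set record carries a CVI-style replacement cost (a cap on sensible preservation spending, per the caveats above), an IVI-style nonmonetary score used to prioritize limited funds, and a BVI-style count of supported tasks. The field names, structure, and figures are all hypothetical illustrations, not metrics defined by Gartner or by this report.

```python
from dataclasses import dataclass

@dataclass
class DataSetValuation:
    name: str
    cvi_replacement_cost: float  # CVI-style: estimated cost to re-collect the data (monetary)
    ivi_score: float             # IVI-style: nonmonetary quality/scarcity/lifetime score in [0, 1]
    bvi_task_count: int          # BVI-style proxy: number of tasks the data set currently supports

    def preservation_spending_cap(self) -> float:
        """Spending above the replacement cost should prompt questions (with the caveats noted above)."""
        return self.cvi_replacement_cost

# Hypothetical portfolio; when funds are limited, prioritize by the IVI-style score.
portfolio = [
    DataSetValuation("imaging cohort", cvi_replacement_cost=2_000_000, ivi_score=0.9, bvi_task_count=12),
    DataSetValuation("pilot survey", cvi_replacement_cost=50_000, ivi_score=0.4, bvi_task_count=2),
]

for d in sorted(portfolio, key=lambda d: d.ivi_score, reverse=True):
    print(f"{d.name}: cap ~ ${d.preservation_spending_cap():,.0f}, "
          f"IVI score = {d.ivi_score}, tasks supported = {d.bvi_task_count}")
```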

REFERENCES

Burtch, L. 2019. The Burtch Works Study: Salaries of Data Scientists and Predictive Analytics Professionals. Evanston, Ill.: Burtch Works Executive Recruiting. https://www.burtchworks.com/wp-content/uploads/2019/06/Burtch-Works-Study_DS-PAP-2019.pdf.

Chodacki, J. 2019. Forecasting the Costs for Preserving and Promoting Access to Biomedical Data. Presentation to the National Academies Workshop on Forecasting Costs for Preserving and Promoting Access to Biomedical Data, July 12.

DHHS (Department of Health and Human Services). 2019. Request for Public Comments on a DRAFT NIH Policy for Data Management and Sharing and Supplemental DRAFT Guidance. 84 Federal Register 60398 (November 8, 2019). https://www.govinfo.gov/content/pkg/FR-2019-11-08/pdf/2019-24529.pdf.

Eklund, A., H. Knutsson, and T.E. Nichols. 2019. Cluster failure revisited: Impact of first level design and physiological noise on cluster false positive rates. Human Brain Mapping 40(7):2017-2032.

Eklund, A., T.E. Nichols, and H. Knutsson. 2016. Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences of the United States of America 113(28):7900-7905.

Ferguson, A. 2019. The Burden and Benefits of ‘Long-Tail’ Data Sharing. Presentation to the National Academies Workshop on Forecasting Costs for Preserving and Promoting Access to Biomedical Data, July 11.

Greene, R.D. 1981. DARCOM’s new program and cost control system. Army Research, Development, and Acquisition, July-August. https://asc.army.mil/docs/pubs/alt/archives/1981/Jul-Aug_1981.PDF.

Ioannidis, J.P.A. 2005. Why most published research findings are false. PLoS Medicine 2(8):e124. https://doi.org/10.1371/journal.pmed.0020124.

Kendall, F. 2017. Getting Defense Acquisition Right. Fort Belvoir, Va.: Defense Acquisition University Press.

Kennedy, D.N. 2014. Data persistence insurance. Neuroinformatics 12(3):361-363. http://doi.org/10.1007/s12021-014-9239-0.

Kennedy, D.N., S.A. Abraham, J.F. Bates, A. Crowley, S. Ghosh, T. Gillespie, M. Goncalves, et al. 2019. Everything matters: The ReproNim perspective on reproducible neuroimaging. Frontiers in Neuroinformatics 13:1.

Laney, D. 2015. Why and How to Measure the Value of Your Information Assets. Gartner Research. https://www.gartner.com/en/documents/3106719/why-and-how-to-measure-the-value-of-your-information-assets.

Laney, D. 2017. Infonomics. Abingdon, UK: Routledge.

Mankiw, N.G. 2017. Principles of Economics, 8th ed. Boston, Mass.: Cengage Learning.

Manski, C.F. 2019. Communicating uncertainty in policy analysis. Proceedings of the National Academy of Sciences of the United States of America 116(16):7634-7641.

Merrill, D. 2017. Economic perspectives for long-term digital preservation: Achieve zero data loss and geo-dispersion. White Paper, Hitachi Data Systems.

Moradi, E., B. Khundrakpam, J.D. Lewis, A.C. Evans, and J. Tohka. 2017. Predicting symptom severity in autism spectrum disorder based on cortical thickness measures in agglomerative data. NeuroImage 144(Pt A):128-141.

Nielson, J., J. Paquette, A.W. Liu, C.F. Guandique, C.A. Tovar, T. Inoue, K.-A. Irvine, et al. 2015. Topological data analysis for discovery in preclinical spinal cord injury and traumatic brain injury. Nature Communications 6:8581.

O’Neal, K. 2012. The First Step in Data: Quantifying the Value of Data. https://dama-ny.com/images/meeting/101112/quantifyingthevalueofdata.pdf.

PNAS (Proceedings of the National Academy of Sciences). 2016. Correction to Eklund et al., “Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates.” Proceedings of the National Academy of Sciences of the United States of America 113(33):E4929.

Rosenthal, D.S.H. 2017. The medium-term prospects for long-term storage systems. Library Hi Tech 35(1):11-31.

Schmarzo, B. 2016. Determining the economic value of data. InFocus, June 14. https://infocus.dellemc.com/william_schmarzo/determining-economic-value-data/.

Short, J.E., and S. Todd. 2017. What’s your data worth? MIT Sloan Management Review 58(3).

Tsao, C.W., and R.S. Vasan. 2015. Cohort Profile: The Framingham Heart Study (FHS): Overview of milestones in cardiovascular epidemiology. International Journal of Epidemiology 44(6):1800-1813.

Williams, R. 2019. Presentation to the National Academies Workshop on Forecasting Costs for Preserving and Promoting Access to Biomedical Data, July 11.
