Suggested Citation:"Appendix D: Soft Costs for Digital Preservation." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.

D

Soft Costs for Digital Preservation

Not all costs for creating and sustaining a biomedical information resource show up in the budget of the organizational units creating the data or operating the resource. Some “soft costs” might show in other parts of an organization as effort expended by users of the resource or as lost or delayed opportunities. Merrill (2017) identified seven categories of soft costs for long-term digital preservation in a corporate setting (see Box D.1).

Merrill’s enumeration of soft costs is not necessarily exhaustive for the biomedical-information domain. There may be other soft costs to consider. For example, there is investigator burden: the effort required of researchers to submit their data to a repository. Another soft cost is loss of confidence: data in a resource lose value when users do not use that resource owing to a lack of trust in it or a concern about the results obtained from it. Such loss might arise because there is not enough information to replicate the process that generated the data or to audit their handling once received by a resource. Loss could also arise from lack of curation of the resource or uncertain interpretation of the data owing to little or no metadata.

DIFFICULTY IN QUANTIFYING SOFT COSTS

Merrill (2017) notes that “soft costs are important, but may be hard to isolate, define, or measure. Many soft costs are qualitative in nature. At times they can become hard costs when unusual events happen (like declaring a disaster, or in a pre-trial rush to access data). Project related benefits (staff efficiency, risk) are usually characterized by soft costs, since the IT department does not have the burden to measure or reduce these costs.” Putting a dollar amount on soft costs so that they can be compared directly to hard costs does not seem feasible in most cases, but it is often possible to compare the relative soft costs of alternative approaches.

For example, considering Merrill’s soft cost of discovery time, there might be two approaches to supporting a repository of genetic sequences. In the first approach, the sequences can be retrieved only by accession number and organism name. In the second approach, there is an additional index that allows searching by sequence similarity. Discovery time is expected to be lower in the second approach for a task such as determining whether a new sequence duplicates an existing sequence.

As another example, consider Merrill’s cost of performance. One option for the sequence repository would be to internally support a service for alignment of a deposited sequence to an appropriate reference sequence. A second option is not to support such a service. In that case, an investigator needing a reference alignment would need to download the sequence in question, find the appropriate reference sequence somewhere, and locate and apply an appropriate tool to perform the alignment. Clearly, the second option has a higher cost of performance.

BOX D.1
Soft Costs as Defined by Merrill (2017)

Provisioning time “is the time and local effort required to acquire and present capacity for the retention period. Internal processes, procurement, provisioning steps, and delivery lead-time all contribute to this cost” (Merrill, 2017). This category does not include the direct costs for storage capacity itself, nor the ongoing costs of operating that capacity. An attraction of cloud storage is that provisioning time can be reduced from months to minutes. Long provisioning times can be detrimental to research if they delay results or access to those results by others.
Discovery time “is the time it takes a person or an application to find digital content. The format may have some impact on this time, as well the robustness of the meta-data management system. Risks can arise if the time required to discover (and restore) are unsatisfactory” (Merrill, 2017). A resource might get little or no use if locating information within it is cumbersome, say because of limited search capability.
Time to restore is “how long a person (or an application) has to wait for the data to be restored (to the last bit) after the request is made” (Merrill, 2017). This soft cost would pertain to cases in which data in a repository have to be moved from offline to online storage before they are accessed.
Cost of expansion “of the repository must be planned since the ever-increasing growth in stored information will require future increased capacity” (Merrill, 2017). Merrill classifies expansion as a soft cost because it does not appear in current budgets. However, in considering options for a biomedical information resource, the future expansion costs of the different options are an important facet.
Risk of loss (or loss expectancy) “is a calculated cost associated with the probability of losing data or digital content. The loss can occur from a variety of sources, including media failure, theft, corruption, transmission error, sabotage, etc.” (Merrill, 2017). While Merrill posits that this cost can be calculated in the commercial setting, it manifests as opportunity cost in the biomedical research setting, which is difficult to quantify. It is difficult to assign a dollar value to delayed or forgone discoveries, especially as the nature of those potential discoveries is challenging to foresee. In the biomedical research setting, this cost might have to be approached as setting a tolerable likelihood of loss and evaluating alternative approaches by whether they fall within that likelihood.
Cost of performance “is often a perceived issue with how long IT tasks take to complete. In a few cases performance can be linked to company revenue or direct costs, but usually is a point of complaint for the IT department. If projects can demonstrate business impact due to slow or inconsistent access or retrieval, then performance can become a hard cost to the preservation architecture” (Merrill, 2017). This soft cost is one of the most relevant, but also one of the most difficult to measure. Limitations on data, search, and services can all restrict or delay tasks that researchers want to perform with the information in a repository, thus retarding or reducing discovery (or, in cases in which a repository supports clinical uses, compromising treatment). Thus, there are at least two aspects to these costs: the additional time for tasks that are eventually completed and the lost knowledge from tasks that are forgone. While the first aspect can be characterized by the relative ease of performance for alternative approaches, even a qualitative comparison of the second aspect seems daunting. Another complication to this soft cost is that it needs to be evaluated in the context of available alternatives. In considering approaches to Repository C (or whether to establish Repository C at all), it is necessary to consider whether there is an alternative Repository D that would support at least some of the tasks that Repository C would support.
Cost of procurement is “the time it takes to select, quote, bid, negotiate and purchase infrastructure for digital preservation . . . This cycle is heavily weighted with staff from procurement. This cycle tends to occur every few years when older equipment needs to be replaced. In cloud or consumption models, the provisioning process is self-serviced. This cost is different from provisioning time (see above) in that time and effort are internal to IT, and lead-times are planned such that capacity is ready when it is needed. Provisioning time for required future resources will be reduced since the procurement process

Thus, while soft costs generally cannot be quantified easily, they can be compared across approaches. The committee believes that it is possible to do so in a disciplined manner. For example, for discovery time, one could make a list of search types that a repository user might want and tabulate for each alternative approach whether or not it supports each search type. Similarly, for cost of performance, one could make a list of likely tasks that a researcher might want to perform with the data. Then for a given approach, one could determine whether it “Does Not Support,” “Partially Supports,” or “Supports” each particular task. With such information, one could readily determine whether Approach C “dominates” Approach D, meaning that C has equal or lower soft costs than D across all facets, or isolate the trade-off points between C and D: on what specific facets does C have higher or lower soft costs than D?
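The tabulation and dominance check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a method from the report: the facet names, the ratings assigned to the two approaches, and the function names `dominates` and `trade_offs` are all hypothetical, and the sketch assumes that a higher support level corresponds to a lower soft cost on that facet.

```python
from enum import IntEnum


class Support(IntEnum):
    """Ordinal support levels; a higher level is assumed to mean a lower soft cost."""
    DOES_NOT_SUPPORT = 0
    PARTIALLY_SUPPORTS = 1
    SUPPORTS = 2


def _check_same_facets(a: dict, b: dict) -> None:
    # Both approaches must be rated on exactly the same list of facets.
    if a.keys() != b.keys():
        raise ValueError("approaches must be rated on the same facets")


def dominates(a: dict, b: dict) -> bool:
    """Return True if approach a is at least as supportive as b on every facet."""
    _check_same_facets(a, b)
    return all(a[facet] >= b[facet] for facet in a)


def trade_offs(a: dict, b: dict) -> dict:
    """Return the facets on which a and b differ, with both ratings."""
    _check_same_facets(a, b)
    return {facet: (a[facet], b[facet]) for facet in a if a[facet] != b[facet]}


# Hypothetical ratings for two approaches to a sequence repository.
approach_c = {
    "search by accession number": Support.SUPPORTS,
    "search by organism name": Support.SUPPORTS,
    "search by sequence similarity": Support.SUPPORTS,
    "reference alignment service": Support.DOES_NOT_SUPPORT,
}
approach_d = {
    "search by accession number": Support.SUPPORTS,
    "search by organism name": Support.SUPPORTS,
    "search by sequence similarity": Support.DOES_NOT_SUPPORT,
    "reference alignment service": Support.PARTIALLY_SUPPORTS,
}

if __name__ == "__main__":
    print("C dominates D:", dominates(approach_c, approach_d))
    print("D dominates C:", dominates(approach_d, approach_c))
    for facet, (c, d) in trade_offs(approach_c, approach_d).items():
        print(f"{facet}: C={c.name}, D={d.name}")
```

With these made-up ratings, neither approach dominates the other, so the comparison reduces to the two facets returned by `trade_offs`. Using an ordinal scale rather than dollar figures keeps the comparison qualitative, in line with the point that soft costs can be compared even when they cannot be priced.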

It is tempting to ignore soft costs in forecasting, since they may not be quantitative or they accrue outside the immediate organizational unit. However, they help characterize the usability and value of data for a community. Considering only hard costs might drive one to select options that have low direct costs but are difficult to use and provide little value (in which case, why support the resource at all?).

REFERENCE

Merrill, D. 2017. Economic perspectives for long-term digital preservation: Achieve zero data loss and geo-dispersion. White Paper, Hitachi Data Systems.