Soft Costs for Digital Preservation
Not all costs for creating and sustaining a biomedical information resource show up in the budget of the organizational units creating the data or operating the resource. Some “soft costs” might show up in other parts of an organization as effort expended by users of the resource or as lost or delayed opportunities. Merrill (2017) identified seven categories of soft costs for long-term digital preservation in a corporate setting (see Box D.1).
Merrill’s enumeration of soft costs is not necessarily exhaustive for the biomedical-information domain. There may be other soft costs to consider. One example is investigator burden: the effort required of researchers to submit their data to a repository. Another is loss of confidence: data in a resource lose value when users avoid that resource owing to a lack of trust in it or a concern about the results obtained from it. Such loss might arise because there is not enough information to replicate the process that generated the data or to audit their handling once received by a resource. It could also arise from a lack of curation of the resource or from uncertain interpretation of the data owing to little or no metadata.
DIFFICULTY IN QUANTIFYING SOFT COSTS
Merrill (2017) notes that “soft costs are important, but may be hard to isolate, define, or measure. Many soft costs are qualitative in nature. At times they can become hard costs when unusual events happen (like declaring a disaster, or in a pre-trial rush to access data). Project related benefits (staff efficiency, risk) are usually characterized by soft costs, since the IT department does not have the burden to measure or reduce these costs.” Putting a dollar amount on soft costs so that they can be compared directly to hard costs does not seem feasible in most cases, but it is often possible to compare the relative soft costs of alternative approaches.
For example, considering Merrill’s soft cost of discovery time, there might be two approaches to supporting a repository of genetic sequences. In the first approach, the sequences can be retrieved only by accession number and organism name. In the second approach, there is an additional index that allows searching by sequence similarity. Discovery time is expected to be lower in the second approach for a task such as determining whether a new sequence duplicates an existing sequence.
As another example, consider Merrill’s cost of performance. One option for the sequence repository would be to internally support a service for alignment of a deposited sequence to an appropriate reference sequence. A second option is not to support such a service. In that case, an investigator needing a reference alignment would need to download the sequence in question, find the appropriate reference sequence somewhere, and locate and apply an appropriate tool to perform the alignment. Clearly, the second option has a higher cost of performance.
Thus, while soft costs generally cannot be quantified easily, they can be compared across approaches. The committee believes that it is possible to do so in a disciplined manner. For example, for discovery time, one could make a list of search types that a repository user might want and tabulate for each alternative approach whether or not it supports each search type. Similarly, for cost of performance, one could make a list of likely tasks that a researcher might want to perform with the data. Then for a given approach, one could determine whether it “Does Not Support,” “Partially Supports,” or “Supports” each particular task. With such information, one could easily determine whether Approach C “dominates” Approach D, in terms of C having equal or lower soft costs than D across all facets, or isolate the trade-off points between C and D: on what specific facets does C have higher or lower soft costs than D?
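The tabulation described above could be sketched in code roughly as follows. This is only an illustration under stated assumptions: the three support levels are treated as an ordinal scale in which a higher level of support implies a lower soft cost, the facet names and ratings for the two approaches are hypothetical, and the function names `dominates` and `trade_offs` are introduced here for the example.

```python
from enum import IntEnum


class Support(IntEnum):
    """Ordinal support levels. Assumption: more support means lower soft cost."""
    DOES_NOT_SUPPORT = 0
    PARTIALLY_SUPPORTS = 1
    SUPPORTS = 2


def _check_same_facets(c, d):
    """Both approaches must be rated on the same list of facets."""
    if c.keys() != d.keys():
        raise ValueError("approaches must be rated on the same facets")


def dominates(c, d):
    """True if c has equal or lower soft cost than d on every facet."""
    _check_same_facets(c, d)
    return all(c[facet] >= d[facet] for facet in c)


def trade_offs(c, d):
    """Return (facets where c has lower soft cost, facets where d does)."""
    _check_same_facets(c, d)
    c_lower = [facet for facet in c if c[facet] > d[facet]]
    d_lower = [facet for facet in c if c[facet] < d[facet]]
    return c_lower, d_lower


# Hypothetical ratings for two approaches to a sequence repository.
approach_c = {
    "search by accession number": Support.SUPPORTS,
    "search by organism name": Support.SUPPORTS,
    "search by sequence similarity": Support.SUPPORTS,
    "align to reference sequence": Support.DOES_NOT_SUPPORT,
}
approach_d = {
    "search by accession number": Support.SUPPORTS,
    "search by organism name": Support.SUPPORTS,
    "search by sequence similarity": Support.DOES_NOT_SUPPORT,
    "align to reference sequence": Support.PARTIALLY_SUPPORTS,
}

if dominates(approach_c, approach_d):
    print("C dominates D")
elif dominates(approach_d, approach_c):
    print("D dominates C")
else:
    c_lower, d_lower = trade_offs(approach_c, approach_d)
    print("C has lower soft cost on:", c_lower)
    print("D has lower soft cost on:", d_lower)
```

With these illustrative ratings neither approach dominates the other, so the sketch reports the facets on which each one has the lower soft cost. The comparison remains ordinal: it ranks the approaches facet by facet without assigning dollar amounts.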
It is tempting to ignore soft costs in forecasting, because they may not be quantifiable or because they accrue outside the immediate organizational unit. However, they help characterize the usability and value of data for a community. Considering only hard costs might drive one to select options that have low direct costs but are difficult to use and provide little value (in which case, why support the resource at all?).
Merrill, D. 2017. Economic perspectives for long-term digital preservation: Achieve zero data loss and geo-dispersion. White Paper, Hitachi Data Systems.