5 Generating, Integrating, and Accessing Digital Data
Pages 113-140

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 113...
... That is changing now, however, as increasing numbers of biological collections have been digitized. This digitization of specimen data, combined with the cyberinfrastructure that underlies how digital data are stored, managed, and used, has fundamentally transformed the biological collections community (Ball-Damerow et al., 2019; Hedrick et al., 2020)
From page 114...
... In fact, the digitization of specimens and associated materials and the uploading of these digital data into online platforms has long been a requirement for funding programs such as the National Science Foundation (NSF) Living Stock Collections for Biological Research program and its successor, the Collections in Support of Biological Research program, among others.
From page 115...
... FIGURE 5-1 Publications using digitized natural history data provided and/or served by the National Science Foundation–supported Advancing Digitization of Biodiversity Collections (ADBC)
From page 116...
... These digitization workflows provide institutions that house biological collections with guiding principles that can be adapted to their varied needs, collection sizes, and capabilities. Additionally, workshops organized and sponsored by iDigBio and others have made digitization more widely adopted, better understood, and more efficient across the natural history collections community.
From page 117...
... Once specimens and their associated data have been digitized, the digital datasets can then be used in-house (e.g., tracking loans and users)
From page 118...
... A major global portal for natural history collections is hosted by the Global Biodiversity Information Facility (GBIF), while iDigBio hosts a portal for collections primarily based in the United States, and the Atlas of Living Australia (ALA)
From page 119...
... As the digitization of biological collections continues to create large and diverse datasets, an effective cyberinfrastructure will need to incorporate mechanisms to improve access to an ecosystem of digital repositories and enable the integration of diverse types of data. Recognizing the need for a more robust cyberinfrastructure, the Earth science community established EarthCube in 2011 with NSF funding from both the Directorate for Geosciences and the Office of Advanced Cyberinfrastructure of the Computer and Information Science and Engineering Directorate. Collaborative projects with the biological collections community (such as enhancing Paleontological and Neontological Data Discovery API and Earth-Life Consortium)
From page 120...
... Dark Data
While the majority of data generated today are immediately digitally captured, historical collections typically have a backlog of data that have yet to be digitized. The digital revolution and the increase in the accessibility of digitized specimen data have been so profound that undigitized collections are now referred to as "dark data" because they are essentially unavailable for modern scientific study without physical access to the specimens within institutions (Heidorn, 2008)
From page 121...
... For example, for natural history collections, it is estimated that more than 50 percent of vertebrate collections (Krishtalka et al., 2016) and 20 percent of herbarium specimens (Barbara Thiers, Director of the William and Lynda Steere Herbarium at the New York Botanical Garden, personal communication, 2020)
From page 122...
... Furthermore, the current data publishing model has no mechanism for effectively and efficiently returning user annotations of data to the original data providers for incorporation into the data stream, so this effort by data users is completely lost to the collections community. Leading aggregators such as GBIF, iDigBio, GCM,
From page 123...
... Variability in Data Quality and Format
As the quantity of digital data dramatically increases, the presence of incomplete data, data of questionable quality, and a lack of standardization limit both the roles that biological collections data can play in research and education and their usefulness. Issues such as incomplete data records and inaccurate or poorly transcribed data are ubiquitous and lead to limitations on the use of specimen digital data.
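The kinds of record-level problems described here (missing fields, badly transcribed values) can be illustrated with a small completeness check. The following Python is only a sketch under stated assumptions: the field names are borrowed from Darwin Core terms, and the choice of which fields count as required, along with the helper names `find_issues` and `_check_coordinate`, is hypothetical and not drawn from the report.

```python
from typing import Dict, List

# Assumption: these four Darwin Core-style fields are treated as required.
# The report does not prescribe a field list; this is illustrative only.
REQUIRED_FIELDS = ("scientificName", "eventDate",
                   "decimalLatitude", "decimalLongitude")


def _check_coordinate(record: Dict[str, str], name: str,
                      limit: float, issues: List[str]) -> None:
    """Flag a coordinate that is present but non-numeric or out of range."""
    raw = record.get(name)
    if raw is None or str(raw).strip() == "":
        return  # an absent value is already reported as missing
    try:
        value = float(raw)
    except (TypeError, ValueError):
        issues.append(f"{name} is not numeric")
        return
    if abs(value) > limit:
        issues.append(f"{name} out of range")


def find_issues(record: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems found in one specimen record."""
    issues: List[str] = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or str(value).strip() == "":
            issues.append(f"missing {name}")
    _check_coordinate(record, "decimalLatitude", 90.0, issues)
    _check_coordinate(record, "decimalLongitude", 180.0, issues)
    return issues


if __name__ == "__main__":
    # Hypothetical record with an empty date and a mistyped latitude.
    example = {
        "scientificName": "Quercus alba",
        "eventDate": "",
        "decimalLatitude": "138.9",
        "decimalLongitude": "-77.0",
    }
    for problem in find_issues(example):
        print(problem)
```

A check like this only flags candidate problems; deciding whether a flagged value is actually wrong still requires the original label or collection staff.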
From page 124...
... However, digital datasets are often not maintained and updated for a variety of reasons, ranging from insufficient resources and staff turnover to disputes related to intellectual property rights and to a simple lack of understanding that digital datasets are not static, one-off products.
From page 125...
... Because living stock and natural history collections databases were established in parallel using different types of identifiers, integrating them has proved to be quite complex, and these difficulties may preclude opportunities to integrate the data from these resources.
From page 126...
... Limited Mechanisms to Support a Cyberinfrastructure That Promotes Collaboration
The diversity of biological collections poses many challenges to the effective development and implementation of a cohesive, adaptable, and sustainable cyberinfrastructure that serves the entire collections community. For example, inherent differences between living and natural history collections such as differing needs and goals, compounded by external factors such as different funding opportunities and requirements, have thwarted collaborative efforts to integrate digital data from these collections.
From page 127...
... Thanks to new imaging techniques and technologies, the use of rare or fragile natural history collections is less invasive, and it is possible to carry out detailed examinations of specimen attributes without extensive
From page 128...
... Batch processing or automation and the use of optical character recognition (OCR)
From page 129...
... The use of convolutional neural networks, a form of machine learning that has been used for species identification (e.g., Carranza-Rojas et al., 2017) and the capture of trait information from specimen images and text such as whether a specimen is in flower or fruit (e.g., Lorieul et al., 2019)
From page 130...
... The natural history collections community has begun to use outside assistance in the digitization process in an effort to reduce the amount of dark data. The impact and contribution of citizen scientists and volunteers to the digitization effort have steadily increased through efforts such as Notes from Nature and the Smithsonian Transcription Center, among others.
From page 131...
... , and thus digitization is essential for future studies that aim to understand their biology and evolution.
Increasing Data Visibility
Although digitization and sharing data with online open access data portals continue to provide more data for research and education, vast amounts of data produced through research and collecting endeavors, such as project-based collections data, are still not publicly available.
From page 132...
... and disseminated falls on funding agencies, reviewers, and publishers. The NSF Directorate for Biological Sciences requires a data management plan as part of all research proposals, but while this is a prerequisite for funding for living stock collections, there is no requirement for digitization, publishing, or ensuring the long-term accessibility of specimens and their data for natural history collections.
From page 133...
... Machine learning and other forms of artificial intelligence may provide incremental increases in the annotation of certain collections, primarily through text recognition and OCR technologies using images of labels, card catalogs, or ledgers. A systematic and standardized approach to improving data quality will result in an optimized user experience.
From page 134...
... Some levels of integration of disparate datasets are currently being achieved on a national and global scale through various aggregators and individual museum data management systems, but more coordination between these aggregators and developers is needed to simplify and standardize the landscape. A cyberinfrastructure for biological collections could enable data integration while also providing annotation tools and a system for attribution of specimen data used in research, education, policy development, or other activities of this scope.
From page 135...
... A blockchain-inspired network has the necessary technological components to provide the identification of the various elements of the network while also tracking all transactions associated with each item. The network could take advantage of the existing identifiers commonly used in the collections community (GUIDs, DOIs, ORCIDs, etc.)
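To make the idea of identifying items and tracking every transaction associated with them more concrete, here is a minimal Python sketch of a hash-chained, append-only log. It is an illustration under stated assumptions, not the design the report proposes: the class name `ProvenanceLedger`, its methods, and the placeholder identifiers in the usage example are all hypothetical, and a single in-memory list stands in for what would be a distributed network.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: Dict[str, str]) -> str:
    """Hash an entry body deterministically (keys sorted before encoding)."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class ProvenanceLedger:
    """Append-only log in which each entry commits to the one before it."""

    def __init__(self) -> None:
        self.entries: List[Dict[str, str]] = []

    def record(self, specimen_id: str, actor_id: str, action: str) -> Dict[str, str]:
        """Append one transaction, linking it to the previous entry's hash."""
        previous_hash = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        body = {
            "specimen_id": specimen_id,   # e.g. a GUID or DOI (assumed format)
            "actor_id": actor_id,         # e.g. an ORCID (assumed format)
            "action": action,
            "previous_hash": previous_hash,
        }
        entry = dict(body, hash=_digest(body))
        self.entries.append(entry)
        return entry

    def history(self, specimen_id: str) -> List[Dict[str, str]]:
        """Return every recorded transaction for one specimen identifier."""
        return [e for e in self.entries if e["specimen_id"] == specimen_id]

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        previous_hash = GENESIS_HASH
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["previous_hash"] != previous_hash:
                return False
            if _digest(body) != entry["hash"]:
                return False
            previous_hash = entry["hash"]
        return True


if __name__ == "__main__":
    ledger = ProvenanceLedger()
    # Hypothetical placeholder identifiers, not real records.
    ledger.record("urn:example:specimen-1", "orcid:0000-0000-0000-0000", "loaned")
    ledger.record("urn:example:specimen-1", "orcid:0000-0000-0000-0000", "annotated")
    print(len(ledger.history("urn:example:specimen-1")), ledger.verify())
```

The point of the chaining is that attribution records become tamper-evident: changing any past entry alters its hash and invalidates every later link, which `verify` detects.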
From page 136...
... For example, researchers are increasingly interested in patterns of spatial, environmental, and genetic variation, particularly when evaluating how species might respond to climate change. Data from living stock and natural history collections, environmental databases, the National Ecological Observatory Network, and GenBank would all contribute to addressing these questions, and
From page 137...
... A national cyberinfrastructure for biological collections that will support these collections and facilitate their ever-growing base of end users will require collaboration, especially between the collections community and computer scientists and engineers, but also between collections staff from diverse collections types and communities (e.g., natural history and living stocks)
From page 138...
... Continual updating, augmenting, and improving digital data records using annotation tools and data assertions, for example, will greatly improve overall data quality and, in turn, lead to more comprehensive data integration and greater accessibility of digital data. However, mechanisms for data annotation and attribution require an interoperability of data and systems, which may be impeded by global indecision about the application of globally unique identifiers for specimen
From page 139...
... Moreover, a permanent national cyberinfrastructure that supports the needs noted above in terms of expanded digitization of dark data, improvement in data quality, and an increased accessibility to digital data would certainly spur data use. Without this resource, collections, both physical and digital, will continue to be underused.
From page 140...
... ;
• establish ongoing mechanisms for the biological collections community to meet, develop best practices, and work toward goals such as establishing and implementing unique identifiers, clear workflows, and standardized data pipelines; and
• promote and fund the development of a necessary national cyberinfrastructure, with appropriate tools and technology, to effect the efficient multi-layer integration of data and collections attribution.


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.