Managing Geoscience Data and Collections: Challenges and Practices
A number of steps are necessary for successful preservation of geoscience data and collections. This chapter outlines the practices and challenges involved in these steps. Key to the overall success of any preservation effort is an effective management plan, grounded in sound advice from the user community. The idea of user-community involvement was introduced in chapter 2, in the context of the Ice Core Working Group that advises managers at the National Ice Core Laboratory (Sidebar 2-11). Figure 4-1 illustrates how the user community interacts with other areas of management within the Ocean Drilling Program.
STORAGE OF GEOSCIENCE DATA AND COLLECTIONS
Storage, reduced to its most basic level, is the housing of material. Adequate storage is fundamental to the preservation and accessibility of data and collections. Storage is related to, but separate from, curation, which involves safeguarding, cataloging, and locating material; curation is discussed in the following section. A well-stored set of samples may not be curated, but a well-curated sample will be stored adequately.
The repositories surveyed as part of this study (see Appendix B) exhibited no general standards for data maintenance and storage. Consequently, practices vary widely. Cores, for example, are stored in settings that range from secure, climate-controlled buildings with well-built storage racks (e.g., ODP), to unimproved metal shipping containers, to boxes stacked on pallets outdoors, where they are exposed to the elements. Some cores require specialized storage and maintenance if they are to remain useful for long periods. Ice cores must be stored at –15°C or below, for example, and unconsolidated water-saturated cores such as those held at the ODP repositories (Sidebar 3-3) and the Minnesota Lacustrine Core Repository should be kept moist in a temperature-controlled environment. Repositories that handle these types of cores generally have facilities adequate for the task.
Approaches to storage and maintenance of seismic and well-log data are equally diverse. Some repositories hold only paper records, while many contain both paper and digital records. The digital files might be stored on either magnetic tape or CD-ROMs, the former in climate-controlled settings to slow their deterioration.
Without exception, storage conditions for any data must ensure the integrity of the data themselves as well as their containers, labels, and other metadata; otherwise, the data become useless (see Table 2-6). For example, hard-rock mineral samples placed in storage in 1990 at the Alaska Geological Survey were kept outdoors under tarpaulins for several years, and exposure-related deterioration of their identification tags rendered them useless (committee survey response, 2001). All data must be protected from the elements, although conventionally drilled cores and cuttings can be stored under less rigorous conditions than, for example, magnetic media, deep sea cores, paleontologic samples, or ice cores, which require temperature and humidity controls. Paper and digital collections require, at a minimum, a climate-controlled environment, which by some estimates costs six times more than standard core storage facilities (Robert Shafer, C&M Storage Inc., personal communication, 2001). Over time, however, even rock samples gradually deteriorate from oxidation, desiccation, or disaggregation. For example, more than half the original zinc core stored by the Tennessee Division of Geology has been lost because of inadequate protection from the elements (committee survey response, 2001).
As a result of the perception that rocks and cores can survive years without much attention, they are often stored temporarily under tarpaulins on pallets where they are exposed to adverse conditions. Unfortunately, temporary may become long-term, often resulting in the deterioration of the coverings and boxes, or at least their identifying labels, at which point the utility of the entire collection is lost. To prevent this mistake, the state geological surveys of Alaska (see Sidebar 3-5), Nevada, and Oklahoma have used sea-going shipping containers to store overflow cores until more permanent facilities can be built. Access is limited and not conducive to casual examination, but the vital documentation of sample identity remains intact.
The quality of space provides a degree of security necessary for all collections. While the commercial value of fossil specimens, gems, and meteorites requires that they be protected from theft, all collections deserve protection from other agents of loss and destruction, such as vandalism, weather, insects, mold, and even mishandling by staff and clients. Causes of losses of geoscience data held by state geological surveys include earthquake (Alaska), building collapse (Maine), flooding (Kentucky), collapse of shelving (North Carolina and Texas), and exposure (Tennessee) (see Appendix B for sources).
Effective Use of Space
Lack of available space is commonplace at the nation’s repositories (see Table 2-3a,b). The quality and amount of space devoted to geoscience collections are highly variable among institutions, and reflect, to some extent, funding and priority assigned by an institution’s upper management.
The physical layout of a repository involves several elements relating to space. In addition to space for collections, considerations include adequate processing areas for unpacking, washing, drying, cutting, and sorting samples; cataloging and palletizing; shipping and receiving; workflow from receipt through storage; safety and comfort of the staff; security for the collection; and sufficient weight-bearing capability of the shelving and floors (particularly to withstand the load from core collections). For effective use of space, the shelves, racks, cabinets, or drawers in repositories must be closely spaced yet accessible, often stacked high, and durable (so as not to require repeated replacement). Another space consideration is adequate layout and examination space (Figure 4-2), which, ideally, is near the storage area, with appropriate examination equipment (e.g., microscopes), services (e.g., sampling and photography), adequate lighting, and privacy (if necessary).
Ideally, storage facilities are designed to be expanded easily. Ease of expansion is usually a direct function of the value of the land on which the facility is sited. Good examples are C&M Storage in Texas (Sidebar 3-1) and the Ocean Drilling Program repository at Texas A&M University (Sidebar 3-3). The New Mexico Bureau of Geology and Mines repository at the New Mexico Institute of Mining and Technology (NMIMT) constructs additional core storage facilities relatively inexpensively and quickly by erecting 30- by 100-foot, uninsulated, ventilated storage facilities equipped with skylights. Since these are on the NMIMT campus, land-acquisition costs are zero. The recently constructed expandable core curation facilities at the state geological surveys of Kentucky and Ohio also provide multiple-use areas for outreach and education.
The facilities mentioned above are in the minority. Most repositories the committee surveyed (Appendix B) are nearly or entirely at capacity (Table 2-3a,b), unable to expand easily, and struggling with old and inappropriate cabinetry for their collections. Innovative actions, however, have allowed some organizations to forestall the need for additional real estate. For example, the National Ice Core Laboratory (see Sidebar 2-11), currently at 90 percent capacity, plans to change its racking system to an adjustable system that will use space more effectively and put the laboratory at 52 percent capacity. Compactor storage, wherein movable racks of drawers ride on rails, saves space and promotes safety and security of the collection. Compactors can postpone more expensive additions to facilities for years. (NSF funds almost all major museum purchases of compactors.) Other space-saving innovations include the use of forklifts with swiveling forks, which allow narrower aisles and therefore a higher density of shelving. Storage space also is saved by trimming and slabbing cores and retaining only a thin slab of the original material. This approach reduced the required storage space by 50 to 80 percent at the USGS facility in Lakewood, Colorado, but many repositories lack the financial resources to fully process all of the core in their collections. Although trimming and slabbing reduces the volume of material to be stored, it is a destructive technique that commonly reduces the types of future analyses that can be performed on the core and thus diminishes the core's value by some unknown amount. For example, slabbed core is inappropriate for some types of porosity and permeability measurements [1] critical to petroleum engineers and hydrologic modelers, among others.
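The National Ice Core Laboratory figures quoted above imply a simple ratio. As a rough illustration, and assuming the volume of stored material stays the same when the racking system changes (an assumption made here, not a statement from the survey), the adjustable system would provide about 1.7 times the present capacity. The function name is hypothetical.

```python
def capacity_gain(fill_before: float, fill_after: float) -> float:
    """Ratio of new to old storage capacity.

    Assumes the stored volume is unchanged, so that
    stored = fill_before * old_capacity = fill_after * new_capacity.
    """
    if not (0 < fill_before <= 1 and 0 < fill_after <= 1):
        raise ValueError("fill fractions must be in (0, 1]")
    return fill_before / fill_after


# Figures from the text: 90 percent full now, 52 percent full after re-racking.
gain = capacity_gain(0.90, 0.52)
print(f"Capacity multiplier: {gain:.2f}")
```

The same arithmetic applies to any repository that reports its fill level before and after a change in shelving.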
CURATION OF GEOSCIENCE DATA AND COLLECTIONS
All geoscience data and collections, whether cores and cuttings, rocks and minerals, paleontological specimens, or digital archives, require adequate staff to maintain and curate them in usable condition. Otherwise they quickly deteriorate, become permanently unusable, and ultimately are lost to future researchers. Data curation differs considerably from data storage. Storage in its simplest form is warehousing. Curation, on the other hand, is performing the maintenance necessary to safeguard, catalog, and locate samples or records, often bringing them into usable condition through preparation, and keeping them usable for the future. Curation requires protocols for processing specimen loans, accession and deaccession (since not everything can or should be archived), promoting an active research environment that uses the collection, ongoing conservation and preservation, and finding long-term funding to ensure the collection’s future preservation. When a collection is properly curated, its value will increase through time, and its scientific usefulness will span many decades or even generations (Cranbrook, 1997).
SIDEBAR 4-1 Calgary Core Research Centre
The Calgary Core Research Centre, in Calgary’s University Research Park, is operated as part of the Resources Division of the Alberta Energy and Utilities Board, an energy and utility regulatory agency. The center operates under a legislative mandate to collect, process, and preserve core, drill cuttings, and daily drilling reports from oil and gas wells in Alberta (Oil and Gas Conservation Act/Regulation, Part 11: Well Data, 11.010 to 11.040 and 12.150; further specified in Informational Letter IL-OG 76-14). The center also is responsible for providing public access to the material.
The 193,680-square-foot, climate-controlled facility serves more than 300 organizations. The staff of 28 manages drill cuttings from 109,202 wells in 236,950 trays (with 56 samples per tray) and core storage for 53,716 wells in 1,047,042 boxes. The center consists of a service and administrative area, research areas, a core repository, a repository for drill cuttings and daily drilling reports, a processing area for drill-cutting samples, and additional patron facilities. It contains 60 core research tables, 7 confidential core research rooms, 50 cubicles for examining drill cuttings, 2 seminar rooms, and 100 equipment lockers.
The Core Research Centre, considered among the best facilities of its type in the world by many who testified to the committee, has been used as a model for many design features of other repositories. For example, features that have been duplicated are the layout space at the Bureau of Economic Geology, University of Texas at Austin (see Sidebar 3-4) (Douglas Ratcliff, BEG, personal communication, 2001), and the forklift system at the Glenside Core Library in South Australia (Elinor Alexander and Brian Logan, Minerals and Energy Resources, South Australia, personal communication, 2001).
The facility is large enough to provide space for another 10 years of core at current accession rates. Of the center’s revenues, 70 percent are generated from service fees. The remainder of the budget comes from a combination of the Energy and Utilities Board’s well-license fees and the Alberta government (CAD $2.3 million per year [USD $1.4 million], January 25, 2002).
Committee Conclusions of Best Practices: (1) large, well-placed regional facility; (2) very good examination and screening space; (3) cost-recovery allocation; (4) provincial support; (5) large, complete regional holdings; (6) adequate fiscal support (as of 2001).
The logistics of handling large quantities of geoscience samples, digital data, and documents have staffing and financial consequences. Among the workflow considerations are packaging (creation, standardization, repair), labeling (standardization and formatting), organizing, and moving (loading, unloading, transporting, stacking, or shelving). Repeated handling of specimens must be planned carefully and minimized, if only to conserve staff energy. Each time specimens are handled, the opportunity for spillage, breakage, misplacement, or loss is introduced anew. A model facility for such considerations is the Calgary Core Research Centre in Alberta, Canada (see Sidebar 4-1).
Curation involves dedicated and skilled people. Salaries and wages for collections staff are among the largest expense items for most facilities. [2] Consequently, most facilities are short-handed, and curation is correspondingly backlogged. Several facilities overcome staffing shortages by employing part-time student help or volunteers. Typically, volunteers are retired professionals or interested enthusiasts. Reliance on either part-time employees or volunteers can create problems: accountability may suffer, work hours can be irregular and unpredictable, a higher degree of supervision is often necessary to avoid errors, and repeated training is necessary to handle turnover. Nonetheless, most museums and other curatorial facilities would be much worse off without dedicated volunteers.
The various types of collections (cores, rocks, minerals, gems, fossils, or data) have unique curatorial considerations in addition to the basic curatorial problems of staffing, space, identification, and access. Several case studies, outlined below, illustrate the complexity of curation and the critical roles that staff play.
Core and Cuttings Collections
Core collections reside in a variety of settings and receive varied degrees of curation and use. The principal curation challenge for core collections is managing their enormous volume and weight. An additional challenge is archiving core collections in an easily accessible manner. Because lack of space is a constant issue, particularly in public institutions, significant staff time is spent reducing the volume of core collections.
The USGS Core Research Center in Lakewood, Colorado (see Sidebar 3-2), currently has a staff of three and deals with 1,500 to 2,000 users annually. These users are predominantly from the petroleum industry and academia. The center also handles about 1,000 inquiries from people wanting information about the collection each year. With this level of staffing, the USGS can maintain the collection and provide some support services to users, but staff can do only limited processing (slabbing or photography) of new cores. Users needing more intensive processing services must be referred to outside services.
Collections staff at state geological surveys usually number one to two full-time employees, with additional part-time help (committee survey responses, 2001). Despite budget cuts (e.g., in Iowa and Kentucky), sample collections continue to grow annually at an average rate of about 2 percent. Growth could be greater, but is usually hampered by staff costs or space limitations. Nevertheless, collections staff continue to encourage collections use, while attempting to eliminate curatorial backlogs and encouraging better initial documentation (i.e., better metadata). Geoscience data and collections are used daily at virtually all state geological surveys. The few geological surveys that require collection users to make an appointment do so to schedule access to limited core examination space (e.g., Indiana) or to move boxes because of overcrowded aisles (e.g., Iowa and Kentucky).
Media Containing Subsurface Data
While some subsurface data are in paper format, the majority are electronic, gathered over the years using various techniques and equipment. These data present challenges unique to the electronic environment, such as data migration and equipment compatibility. A large volume of seismic data remains on older media such as film or various forms of magnetic tape (see Table 2-1). For example, until the 1980s, seismic data were stored on magnetic tape; now they are routinely preserved on server farms (NRC, 1995a,b). The very large volume of well log data (Table 2-1) solely in paper form presents challenges related to access and utility, as much as preservation of the medium itself.
Even if subsurface data have been transferred to or already exist in a digital medium, the data are not guaranteed immortality. Data can be lost because of obsolete formats, obsolete equipment, or physical degradation of the magnetic medium (particularly magnetic tape, which degrades more rapidly when storage facilities are not climate controlled or otherwise weatherproof, and should be rewritten about every 5 years). IHS is a for-profit company that makes large investments in migrating old seismic and other data into standardized digital form for distribution to customers (Ron Samuels, IHS, personal communication, 2001). In the long term, data migration and assimilation can add value as the dataset grows. For example, restoration of SeaSat (Appendix F) data demonstrated that constantly reworking data is more cost-effective than ignoring them (NRC, 1998). Digital storage of subsurface information is appropriate because digital data are increasingly being stored in smaller physical spaces with greater cost effectiveness, they can be duplicated and stored in different places, thereby safeguarding them from loss, and they can be accessed and shared more easily in digital format than they can be in a paper or tape format.
Because it is costly and time-consuming, migration of data is a challenge for smaller institutions and organizations that lack the necessary short-term funds. At THUMS (see Appendix F) in Long Beach, California, the seismic data collected for the Wilmington oil field in the 1970s were saved on 1,600-bpi tapes. These tapes were not readable in 1995 when THUMS staff tried to integrate the data with a three-dimensional seismic survey completed that year. Similarly, in 1991, Chevron estimated that 11 percent of its tape data were unreadable because of degraded and outdated storage media (Philippe Theys, Schlumberger Ltd., personal communication, 2001). Low-budget, not-for-profit entities such as DERL (see Sidebar 3-9) have no data migration plan in place, and therefore continue to work with paper, fiche, tapes, and other physical subsurface data storage media. To do otherwise would exclude some users who typically are unable to pay for-profit prices, especially users from smaller companies and academic settings.
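The guideline that magnetic tape should be rewritten about every 5 years lends itself to a simple inventory check. The sketch below is illustrative only: the record layout, the tape identifiers, and the dates are hypothetical, and a real repository would draw this information from its own inventory system.

```python
from dataclasses import dataclass
from datetime import date

REWRITE_INTERVAL_YEARS = 5  # guideline from the text: rewrite about every 5 years


@dataclass
class TapeRecord:
    tape_id: str        # hypothetical identifier
    last_written: date  # date the tape was last written or rewritten


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 February to 28 February if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def due_for_rewrite(tapes, today: date):
    """Return the IDs of tapes whose rewrite interval has elapsed as of `today`."""
    return [
        t.tape_id
        for t in tapes
        if today >= add_years(t.last_written, REWRITE_INTERVAL_YEARS)
    ]


# Hypothetical inventory used only to exercise the check.
tapes = [
    TapeRecord("T-001", date(1995, 3, 1)),
    TapeRecord("T-002", date(2000, 6, 15)),
]
overdue = due_for_rewrite(tapes, today=date(2001, 6, 1))
print(overdue)
```

A check of this kind identifies which media need attention; it does not address format or equipment obsolescence, which require a separate migration plan.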
Not all paleontological specimens need the same level of curation to be scientifically usable. A hierarchy of curation, described by Hughes et al. (2000) as a curatorial continuum, minimally requires that collections be safe from damage, mishandling, or loss. With an increased investment of staff resources, the scientific value of specimens increases as they progress through the continuum. Typically they are cleaned, sorted, boxed, identified, labeled, cataloged, and perhaps reconstructed. Preparation of fossil specimens involves cleaning and, in some cases, reconstruction of missing portions, which can be extremely time consuming. In the Smithsonian Paleobiology Collection, a cataloger can properly process 15 to 20 specimens per day (committee survey response, 2001).
Typically only a fraction of an institution’s collection is brought to a fully prepared state for a specific display, research, or educational purpose. Considering the size of most collections and the expense that would be incurred in fully preparing every specimen, the vast majority of fossil collections are retained unprepared (Hughes et al., 2000).
SIDEBAR 4-2 National Geophysical Data Center, Marine Geology and Geophysics Division
The National Geophysical Data Center (NGDC) in Boulder, Colorado, largely handles digital data collection, storage, and processing. As an indication of the scope and scale of the data storage issues, only the marine geology and oceanographic aspects of the NGDC mandate are described herein; however, environmental data in general are within their charge (see NGDC, 2001).
Marine geoscientific data are stored digitally at the NGDC’s Marine Geology and Geophysics Division (MGG). The MGG databases deliver 10 gigabytes of data each month over the Internet. The MGG databases include more than 5 gigabytes of scanned images of marine sediments and rocks, with an additional 2 to 2.5 gigabytes of digital data files, more than 76 gigabytes of multi-beam bathymetry, 2.7 gigabytes of hydrographic data, and 6.9 gigabytes of underwater geophysical trackline data. In addition, archived data include more than 3,100 microfilm reels and 30,000 square feet of seismic sections, among many other types of data. These data are accessed by scientists from various government agencies, by academic researchers, and by private citizens. Uses of these data include engineering studies in preparation for laying undersea cables, fish habitat and sea mammal studies, mineral exploration, international mapping studies, and commercial and sport fishing. The NGDC-MGG web site is: http://ngdc.noaa/mgg/mggd.html.
Committee Conclusions of Best Practices: (1) excellent on-line accessibility and availability of data and metadata; (2) broad, international user-community involvement; (3) coordinated information flow to and from user community.
The committee visited NGDC-MGG in June 2001.
Budgetary factors also influence the state of sample curation. For example, at the Smithsonian’s National Museum of Natural History (NMNH), budget limitations have prevented critical conservation of the fossil vertebrate and paleobotanical collections (or the replacement of the storage cabinets in which they reside) (committee survey response, 2001). In some recent years, the Department of Paleobiology has had no funds at all available to purchase even the most basic supplies, such as specimen boxes.
Rock and Mineral Collections
The curation of rock (including meteorite), mineral, and gem collections poses some unusual challenges. Meteorites are immensely popular among private collectors, so security of the collection is critical to maintaining its integrity and intellectual value. The Smithsonian’s National Meteorite Collection currently consists of 22,000 specimens, and it is growing by 50 to 100 specimens annually (committee survey response, 2001). A staff of one or two curators and one or two collections managers curates the collection and facilitates 400 to 500 loans annually for exhibition or research. The Smithsonian’s Mineral Collection consists of 500,000 rocks and minerals (including gems). It, too, is managed by a staff of two curators and two collections managers.
Security is a serious concern for both collections. Catalogs of the holdings are not readily available, and electronic access to inventories is viewed cautiously because of fear that publicizing the nature and size of the holdings will compound security problems.
Other Data and Documentation
The curation of paper and digital collections is very much like that in any library. Staff are necessary to accept, catalog, shelve, and maintain the collection if it is to function as intended. For example, the Kansas Geological Survey’s Data Resources Library is maintained by eight full-time employees and four part-time (student) employees (Kansas Geological Survey, 2001). In the private sector, the Denver Earth Resources Library uses two full-time employees and two part-time employees to maintain predominantly paper records (see Sidebar 3-9). The library is visited by 30 to 35 self-serve patrons per day.
While holdings of digital data are outside the committee’s immediate interest, the following discussion is included to illustrate that staffing needs for handling digital data are not insignificant and should be considered in any plan for improving access to metadata about physical collections. Data holdings at the National Geophysical Data Center’s Marine Geology and Geophysics Division require 4.5 full-time employees to manage and maintain the marine geophysical databases (see Sidebar 4-2).
Where digital records exist for incoming data, as is often the case at the NGDC, they are reproduced and held as a backup copy. Quality control is performed as data are entered into databases. NGDC has a center-wide metadata entry system following Federal Geographic Data Committee standards (FGDC, 2002). As technology changes, data are migrated to new formats and media as necessary. Archived data are in ASCII format, which can be converted to other formats. Contributors inspect and approve any data modified by NGDC before final posting for public distribution.
TABLE 4-1 Libraries and Geologic Repositories: A Comparison of Cataloging Practices
What do they store?
Libraries
• Printed materials (text and digital) plus audio-visual materials
Geologic repositories
• Systematics collections (rocks, fossils, etc.)
• Cores (rock, sediment, and ice)
• Geophysical data (digital, paper, film)
• Records (maps, photos, log books, etc.)
Cataloging practices
Libraries
• Conform to established international standards
• Mostly digital
Geologic repositories
• Few standards
• Some digital/some paper/some microfiche
• Little interoperability
SOURCE: Committee survey responses, site visits.
Data in digital media periodically require refreshing. At the NGDC, the staff continually needs to refresh software, hardware, and training to protect against media deterioration and technology evolution, and to guarantee accessibility and retrievability.
CD-ROM storage currently is one of the more popular forms of digital data storage. Within the USGS, paper documents, well logs, and seismic displays are scanned into image files and captured to CD-ROM, as are data stored on magnetic tapes (Linda Gundersen, USGS, personal communication, 2001). Benefits include a simple and low-cost replication process, ability to store multiple datasets (e.g., text, images, video, and audio), and random access of the information.
CATALOGING AND INDEXING
Specimens, samples, or other geoscience data that have no documentation about their origin (metadata) are of little or no scientific value. Materials without such documentation usually are not accessioned into collections and are prime candidates for deaccession efforts when staff time is available. Cataloging is the process of recording metadata in some centralized database, usually with some kind of index numbering system on index cards, ledger books, or computer software. Table 4-1 summarizes general differences between the state of cataloging in libraries and geoscience repositories.
Cataloging facilitates good management of data and collections, and greatly reduces the cost of using them. Without catalogs many collections are useless, except to the rare expert who knows a specific collection intimately. Cataloging is also necessary to gain a better estimation of the staffing and financial needs for properly curating a collection.
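To make the idea of a catalog record concrete, the following minimal sketch shows one way a repository might represent metadata in software and flag material that lacks documentation of its origin. The field names, the choice of which fields count as provenance, and the sample values are all hypothetical; they are not drawn from any repository's actual catalog.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatalogRecord:
    catalog_number: str                     # index number assigned by the repository
    material_type: str                      # e.g., "core", "cuttings", "fossil"
    locality: Optional[str] = None          # where the material was collected
    collector: Optional[str] = None         # who collected or donated it
    storage_location: Optional[str] = None  # shelf, rack, or drawer

# Hypothetical rule: these fields must be filled for the origin to be documented.
REQUIRED_PROVENANCE = ("locality", "collector")


def missing_provenance(record: CatalogRecord):
    """List the provenance fields that are empty for this record."""
    return [name for name in REQUIRED_PROVENANCE if not getattr(record, name)]


def build_index(records):
    """Map catalog numbers to records, rejecting duplicate numbers."""
    index = {}
    for record in records:
        if record.catalog_number in index:
            raise ValueError(f"duplicate catalog number: {record.catalog_number}")
        index[record.catalog_number] = record
    return index


records = [
    CatalogRecord("C-0001", "core", locality="Well A", collector="Survey staff",
                  storage_location="Rack 12"),
    CatalogRecord("C-0002", "cuttings"),  # no documentation of origin
]
index = build_index(records)
undocumented = [r.catalog_number for r in records if missing_provenance(r)]
print(undocumented)
```

Records flagged this way correspond to the materials described above as poor candidates for accession and prime candidates for deaccession.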
Uncataloged materials are almost impossible to use or loan, and most collections facilities have a backlog of uncataloged materials. [3] At the Smithsonian Institution, priority for cataloging depends on the commercial value of the specimens, the number of specimens acquired per year, and the size of backlog remaining from years in which large USGS or NASA transfers were accepted (committee survey response, 2001). This is especially true where gems, meteorites, or unusual and rare fossils are involved. In the Department of Paleobiology, cataloging primarily is focused on newly acquired type specimens because of their importance to the scientific community. Within the Smithsonian, the National Museum of Natural History is home to one of the premier geoscience collections in the world. It has an active cataloging program and a database of 5 million records describing 124 million lots of items (collections of fossils or other objects) (committee survey response, 2001). This large number represents just 10 percent of the records required to describe this collection adequately. With the shedding of staff from the Smithsonian’s Collection Management Program over the last 10 years, the rate of cataloging has declined significantly. Processing loan requests has been given priority, so that the larger scientific community neither notices nor is affected by the staffing shortage (committee survey response, 2001). The situation at many other museums is much worse. [4] Data on collections of special value (such as type specimens) almost always are available, but data on the great majority of collections are available only by physically examining the paper labels associated with the specimens. In the committee’s view, cataloging is an enormous and pressing need for effective use of the nation’s geoscience data and collections.
[3] For example, new purchases and exchanges at the Los Angeles County Museum of Natural History are cataloged immediately, but old material is cataloged only periodically (committee survey response, 2001). Other examples include the cuttings at BEG’s Midland facility, where cataloging is backlogged (committee survey response, 2001), and the Anaconda Mineral collection (see Sidebar 3-7), of which less than 20 percent is cataloged.
[4] For example, only 50,000 of the 3 million specimens held by the Paleontological Research Institution in Ithaca, New York, are cataloged.
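The scale of the cataloging backlog follows directly from the figures reported above for the Smithsonian and the Paleontological Research Institution. The short calculation below is a worked illustration using only those figures; the function name is hypothetical.

```python
def records_required(existing_records: int, percent_complete: int) -> int:
    """Total records needed if the existing records are `percent_complete` of the total."""
    return existing_records * 100 // percent_complete


# Smithsonian: 5 million records represent 10 percent of what is required.
total_needed = records_required(5_000_000, 10)
still_to_create = total_needed - 5_000_000
print(f"Records required: {total_needed:,}")
print(f"Records still to create: {still_to_create:,}")

# Paleontological Research Institution: 50,000 of 3 million specimens cataloged.
pri_percent_cataloged = 50_000 / 3_000_000 * 100
print(f"PRI cataloged: {pri_percent_cataloged:.1f} percent")
```

On these figures the Smithsonian would need about 50 million records in all, of which about 45 million remain to be created, and less than 2 percent of the Paleontological Research Institution holdings are cataloged.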
SIDEBAR 4-3 Profiling the Collections at the Smithsonian: A Tiered Approach to Collections Description
The Smithsonian was significantly affected by budgetary cuts throughout most of the late 1980s and 1990s. This resulted in the loss of many critical collections and curatorial staff positions and sharply curtailed other resources necessary for management of the national collections. One result is a backlog of 40,000 volcanic specimens awaiting accession into the Smithsonian’s Rock Collection. The USGS gave these specimens in 1995, but until they are curated, they remain available only to selected researchers, rather than to the research community as a whole.
To obtain a better estimation of the staffing and financial needs for properly curating these and its other estimated 124 million lots, the Smithsonian Institution is undertaking a museum-wide collections profiling assessment. Six irreducible factors are being measured: conservation (i.e., physical condition), processing (how much curation is necessary to bring specimens into full museum ownership), storage (from microscopic to building-sized requirements), arrangement (physical and intellectual sorting to provide access), identification, and current status of inventory. Pilot assessments have been performed, and the process is being fine-tuned. The full assessment will provide a means to plan and budget for staff and space needs.
Committee Conclusions of Best Practices: (1) research on and curation of holdings by staff; (2) large, diverse holdings of great national importance.
SOURCE: Sally Shelton, Smithsonian Institution, personal communication, 2001.
The Smithsonian Institution is experimenting with a tiered approach to collections description so that more general descriptions of collections will be available before detailed cataloging is completed. Such an approach might be suitable for other collections, as well. Sidebar 4-3 illustrates an approach currently in progress at the Smithsonian Institution.
Another challenge for users of geoscience data and collections is the lack of any national catalog. Researchers must search out each site or catalog individually and examine it for data or collections of interest. Such catalogs do exist for bibliographic materials in the geosciences, however.5 Maintained by libraries and archives, these catalogs provide a useful and successful model to follow. A standard system of describing, indexing, and formatting the catalog is essential to assist users in locating materials of interest and to allow interoperability among multiple databases and catalogs of materials.
Two key characteristics of these catalogs are their adherence to national standards for catalog records (metadata), description of items, and database interoperability (e.g., Library of Congress, 1995), and their use of widely accepted thesauri and terms for description. Adherence to these standards ensures that users of these catalogs can determine the appropriateness of material for their research or educational needs. These and similar catalogs have proven track records and have garnered worldwide acceptance. The invertebrate paleontology community took an important step toward common computerized standards with the development of a common data model (Morris, 2000) that can be used to relate different collections databases to each other.
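As a rough illustration of how a shared record format allows separate collections databases to be related to each other, the following Python sketch maps two hypothetical repository exports, each with its own field names, onto one common record and then searches them in a single pass. The repository names, field names, and specimen entries are invented for illustration only; they are not taken from Morris (2000) or from any Library of Congress standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogRecord:
    """Hypothetical shared record; not an actual published schema."""
    repository: str
    identifier: str
    taxon: str
    locality: str


# Hypothetical mappings: shared field name -> each repository's local field name.
FIELD_MAPS = {
    "repo_a": {"identifier": "cat_no", "taxon": "genus_species", "locality": "site"},
    "repo_b": {"identifier": "SpecimenID", "taxon": "Name", "locality": "CollectedAt"},
}

# Made-up exports standing in for two independently built catalogs.
RAW_EXPORTS = {
    "repo_a": [
        {"cat_no": 101, "genus_species": "Mucrospirifer mucronatus", "site": "Hypothetical Quarry A"},
        {"cat_no": 102, "genus_species": "Phacops rana", "site": "Hypothetical Quarry A"},
    ],
    "repo_b": [
        {"SpecimenID": "B-7", "Name": " mucrospirifer sp. ", "CollectedAt": "Hypothetical Roadcut B"},
    ],
}


def to_common(repository, raw):
    """Translate one repository-specific row into the shared record."""
    mapping = FIELD_MAPS[repository]
    return CatalogRecord(
        repository=repository,
        identifier=str(raw[mapping["identifier"]]),
        taxon=raw[mapping["taxon"]].strip().lower(),  # normalize case and spacing
        locality=raw[mapping["locality"]].strip(),
    )


def search_by_taxon(records, term):
    """Return every record whose normalized taxon contains the search term."""
    term = term.strip().lower()
    return [record for record in records if term in record.taxon]


# Merge both exports into one list of shared records, then search once.
merged = [
    to_common(repository, raw)
    for repository, rows in RAW_EXPORTS.items()
    for raw in rows
]
hits = search_by_taxon(merged, "Mucrospirifer")
```

In this sketch the only repository-specific knowledge lives in the mapping table, so adding a third catalog would mean adding one mapping entry rather than changing the search code. A real system would also need controlled vocabularies for the taxon and locality fields, which is the role the thesauri described above play.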
The committee concludes that inadequate cataloging is the single biggest inhibitor to productive use of even well-maintained geoscience collections in the United States. Sidebar 4-4 describes the Institute of Museum and Library Services (IMLS), a government agency that, since 1996, has provided funding on a competitive basis for improving access to information at museums and libraries. Although currently not supported by IMLS, cataloging efforts in the geosciences clearly fall within the institute’s mission (Robert Martin, IMLS, personal communication, 2001).
Unless geoscience data and collections are accessible, they are useless. Access to the data and collections themselves, however, is the second step in achieving full access. Access to information about the data and collections (e.g., metadata and catalogs) is the first step in any full-access process.
Before the electronic age, lists of data in collections were kept in serial logbooks or on alphabetical file cards. Someone familiar with the recordkeeping system had to search the data listing and determine the physical whereabouts of the desired data. Access depended on a high degree of institutional memory—that is, individuals who knew the history of the system and who knew and cared about its organization.
5 One example is GeoRef (http://www.georef.org).
SIDEBAR 4-4 Institute of Museum and Library Services
The Institute of Museum and Library Services (IMLS) is a government agency that allocates funds on a competitive basis to museums and libraries for improving access to information. The institute was formed as a result of the Museum and Library Services Act of 1996 (see IMLS, 2002), which moved responsibility for federal library programs from the Department of Education to the institute. The museum program received $28.7 million in fiscal year 1997 and $23.4 million in fiscal year 1999. The library program received $150 million in fiscal year 1997 and $166.2 million in fiscal year 1999.
In 1999 IMLS dispensed $170 million in grants. Grants typically run as long as 2 years. The institute’s programs foster the development of digital resources and linkages among and between libraries and museums, and assist museums and libraries in evaluating their programs. The IMLS Conservation Project Support program offers matching grants to museums that identify conservation needs and priorities and perform activities to ensure the safekeeping of their collections. Collections may be in one of four categories: 1) non-living; 2) systematics and natural history; 3) living plants; 4) living animals. Grants are available for five broad types of conservation activities: 1) surveys; 2) training; 3) research; 4) treatment; and 5) environmental improvements. In addition, the IMLS offers a Museum Assessment program that supports the assessment of museum operations, collections care, or public service that can result in more effective goals and plans for the museum’s future. IMLS has informal partnerships with NSF on initiatives such as the National Digital Library (DLF, 2002) and e-gov (USGSA, 2002).
Archives reliant on institutional memory are prone to degrade when staff members transfer, retire, or otherwise leave the institution. Today, computerized databases of collections holdings can be searched and queried by any number of descriptive parameters, even remotely over the Internet, utilizing much of the same technology developed by libraries. Yet, to a large extent, these systems are not in place for geoscience data and collections.
Traditionally, indexes and catalogs have been the means by which researchers learned about new research, data, and collections. Printed indexes of research findings and catalogs of collections were widely available and used for decades. With the advent of the digital age, many of these printed research tools were converted to electronic form (i.e., computerized; see below), allowing easy access and saving time for researchers. Bibliographic databases such as GeoRef (AGI, 2002a) and Oceanic Abstracts (CSA, 2002), as well as library catalogs, are used worldwide to facilitate the discovery of geoscience research.
Unfortunately, the tools for locating geoscience data and collections have not made the same successful transition. In some cases, attempts to keep accurate catalogs of holdings have ceased, while in others, a catalog may exist only onsite.6 Good tools for locating geoscience data and collections are not absent due to lack of interest. Rather, in many instances, funds to build electronic catalogs and provide Internet access are available only when garnered from various savings in operational funds; new money for these efforts is rarely afforded.
The limited extent to which paper catalogs and metadata have been converted into digital format (or computerized) is a missed opportunity to enhance the use of geoscience data and collections. This is particularly true as digital catalogs, coupled with current Internet technology, have increased tremendously the usage, value, and societal benefits of such holdings (see Sidebar 4-2). Almost all U.S. collections, particularly fossil and mineral collections, are cataloged incompletely, only a few catalogs are available over the Internet, and no comprehensive tools are available to search multiple repositories at one time.
In the United States, invertebrate paleontological collections are among the most numerous, but least computerized, of systematic natural history collections. A 1993 survey (Cooley et al., 1993) estimated that they were approximately 8 percent computerized. The USGS paleontological collections are mostly cataloged on paper and as multiple discrete collections (see Sidebar 2-10). While digital catalogs exist for different kinds of materials at the USGS, there is no unified catalog of the USGS holdings (Linda Gundersen, USGS, personal communication, 2002).
At the Smithsonian, as at all other museums, almost no specimens are accompanied by digital data when they arrive (committee survey response, 2001). Specimen data nearly always arrive as donor-generated labels, scientific publications, and maps that accompany the samples. Specimen data are prepared for computer entry by organizing them on handwritten forms. Although this seems cumbersome, the two-step process minimizes error and leaves a tangible trail. The goal is an error-free collections database. Older digital catalogs exist, but their life expectancy is limited. At the Smithsonian, many digital data reside on mainframe computers. Although such data are not in immediate danger of loss or damage, no transcription program is underway; consequently, they remain completely inaccessible even to institution staff (committee survey response, 2001). Data for the Smithsonian’s Department of Paleobiology are still managed with SELGEM, a database program developed in the late 1960s. Starting in Fall 2002, Paleobiology will use KE EMu,7 a catalog currently under construction that will incorporate digitized images, documents, spreadsheets, and databases. All SELGEM-based specimen data will be migrated to KE EMu, and data in analog formats that never have been in electronic form will be added manually to records. Once data are in KE EMu, they can be reported in a variety of formats. A significant collateral benefit of computerized catalogs is that they also function as an electronic backup of irreplaceable documents related to the collections.8
The situation is somewhat better in other branches of geoscience. The Minerals Management Service is in the process of scanning or digitizing all of its data, and soon it will accept only digital data (Gary Lore, MMS, personal communication, 2001). Other isolated repositories, collections, and projects have been successful in providing digital access to their resources. Some of the best examples include the Catalogue of Meteorites (Natural History Museum, London, 2002), Kansas Geological Survey (2001), the National Geophysical Data Center (NGDC, 2001), the National Ice Core Laboratory (NICL, 2001), the Ocean Drilling Program (ODP, 2002), and the Wyoming Oil and Gas Conservation Commission (see Sidebar 4-5).
The Smithsonian’s Rock Collection is a unique collection of Earth’s crust and mantle rocks. All of its 446,000 lots are computerized, making it the largest curated, completely computerized rock collection in the world. Furthermore, the catalog is accessible over the Internet. Access to most other Smithsonian collections is through curators and collections managers, usually through personal contact via telephone, e-mail, on-site visits, or at professional conferences. With the launch of the Smithsonian’s on-line KE EMu catalog, a broader audience will have access to a subset of data about most of the specimens in these collections.
SIDEBAR 4-5 Wyoming Oil and Gas Conservation Commission
The Wyoming Oil and Gas Conservation Commission (WOGCC) operates a geologic data management system with minimal staff and a bare-bones budget. Despite these limitations, the petroleum exploration database exists on a real-time, interactive web site (WOGCC, 2001b) that can be accessed from a field location. As a company reports well data from the field, it is saved immediately to the appropriate database and is available on the web. Accessible on the site are 35 databases containing about 12 million records. Using this system, WOGCC staff can issue permits for more than 1,000 wells per month, compared with past rates of about 1,200 per year. The cost to acquire the hardware and software and complete the entry of historic data was $60,000.
Committee Conclusions of Best Practices: (1) digital submission of data and metadata; (2) public access via Internet to real-time data and metadata updates.
SOURCE: Richard Marvel, WOGCC, personal communication, 2001.
The Internet revolution in data access has major implications for geoscience data and collections. The Internet gives multiple users simultaneous access to data that previously were inaccessible, 24 hours a day, 7 days a week, without requiring them to travel to individual repositories. For many institutions, the Internet also is responsible for increased interest in their collections. For example, the average number of hits per month on web pages related to petroleum information at the Kansas Geological Survey increased by 41 percent between 2000 and 2001 (Timothy Carr, Kansas Geological Survey, personal communication, 2002).
Several attempts have been made to improve the level of access to catalogs of geoscience data and collections across the United States. These include the GeoTrek metadata catalog (AGI, 2002b) developed by AGI with funding from DOE. As a prototype for a framework that allows access to digital geologic information, its value lies with the underlying data and the links that permit access to the data. Similar beginnings have been made, albeit on a smaller scale, by the Mines Ministries of several Canadian provinces and Australian states (see Appendix H for examples). Many smaller collections in the United States have benefited greatly from NSF funding for cataloging and computerization. For private entities in the business of providing geoscience information, an accessible catalog is crucial. Examples include the electronic catalogs of IHS Energy (IHS, 2002) and Veritas DGC, Inc. (Veritas, 2002).
7 KE EMu is an electronic museum management system that supports data capture, querying, and museum management functions through a client–server interface. It also includes a web interface for Internet or Intranet access to museum data resources. KE EMu is produced and supported by KE Software, an Australian company. (See http://www.kesoftware.com/Press/release5.html.)
8 Catalogs of the Department of Mineral Sciences within NMNH should be available on the KE EMu system in May 2002 (Anna Weitzman, NMNH, personal communication, 2002).
TABLE 4-2 Incentives for Improving the Ability to Find Information about Geoscience Data and Collections
• Significantly reduce the time spent searching for information and increase time spent in analysis and use
• Increased timeliness of data availability for emergencies (see for example the opening paragraph of the Executive Summary on Hutchinson, Kansas)
• Increased investment in exploration and extraction of state and national resources (with the attendant advantage of increasing state and federal revenues from taxes thereon)
• Increased use of collections for educational and scientific purposes
• Aid collections management when trying to determine the uniqueness or significance of samples.
DISCOVERY AND OUTREACH
Discovery entails identifying the existence and whereabouts of desired data and collections in addition to determining their availability, quality, and format. Discovery can occur in a variety of ways: the Internet provides one means of access to computerized records, whereas attribution at the end of journal articles, for example, leads the reader to a source of information. Outreach is another, more assertive form of enhancing discovery that has been applied successfully in the ocean geoscience community (see Sidebar 4-2).
Much can be gained by improving our ability to discover data and collections (see Table 4-2). In a 1992 paper, Blaine Taylor (1992, p. 193) states: “…we simply spend too much time and money searching for, collecting, and pre-processing data before we can even begin the analysis phase of our work. Recent studies9 indicate that as much as 80 percent of our engineers’ and geoscientists’ time can be spent in these efforts.” The goal of most organizations, whether public or private, is to shorten the discovery time so that the investigator or employee can spend more valuable time actually analyzing the data. Internet-based catalogs allow prospective users of physical samples to determine from afar whether they need to visit a facility, thus saving time. For example, users in Colorado, Texas, and Oklahoma have been able to explore online the holdings of the Wyoming Oil and Gas Conservation Commission in Casper, Wyoming (Richard Marvel, WOGCC, personal communication, 2002) (see Sidebar 4-5).
In many instances, however, researchers still gain knowledge of and access to data by traditional means: personal acquaintance, letters, onsite visits, telephone, fax, or e-mail (Wayne Ahr, Texas A&M University, personal communication, 2001). Consequently, data and collections are under-appreciated and under-utilized. More importantly, valuable scientific information that is lost to the discovery process cannot be used for subsequent analyses and interpretations, weakening both. For example, staff in the U.S. Army Corps of Engineers, which has 40 district offices nationwide, have been largely unsuccessful in obtaining funds to publish catalogs of their holdings, leaving their data and collections accessible to the public only with great difficulty (committee survey response, 2001). At the USGS, determining the location, existence, and availability of certain samples may require several phone calls or e-mails (Kevin McKinney, USGS, personal communication, 2001). Access to electronic catalogs of geoscience data and collections is therefore essential to facilitate discovery of these resources by the broadest range of potential users.
A basic tenet of science and engineering research is the precept that new discoveries build upon old ones. Scientists are taught to evaluate and acknowledge the research that has come before. This acknowledgement is accomplished through a system of attribution, by citing previous and related work. How to cite previous works is the subject of numerous style manuals and guides to research in all fields of study.10 It is notable, however, that the style manuals of the sciences rarely refer to or recommend the citation of data or collections (the attribution text in Sidebar 4-6 is one example of such a recommendation). This stands in sharp contrast to other areas of study (such as history and the arts) and reflects the historic reliance of geoscientists on personal contacts and personal knowledge of collections. The lack of attribution to geoscience data and collections only serves to promote their invisibility and to downgrade their value. The committee concludes that it is essential for the geoscience community to follow the lead of other sciences and begin to cite (i.e., acknowledge) use of and reliance upon data and collections. The NRC’s suggestion in its overview of NASA’s Distributed Active Archive Centers (DAACs) is one approach to this problem (NRC, 1998, p. 40): DAACs are encouraged to post on the Internet a list of publications that make use of their holdings, in a format that would permit an easy search for references with standard web tools. In another example, use of data holdings of the World Data Center for Paleoclimatology is referenced in a standardized manner (see NOAA, 2000).
SIDEBAR 4-6 Method of Attribution for Reports Using Ocean Drilling Program Data and Collections
“This research used samples and/or data provided by the Ocean Drilling Program (ODP). The ODP is sponsored by the U.S. National Science Foundation (NSF) and participating countries under management of Joint Oceanographic Institutions (JOI), Inc. Funding for this research was provided by _______________________.”
In addition, the words “Ocean Drilling Program,” “scientific ocean drilling,” or “ocean drilling” should be used as one of the keywords provided to journal or book publishers of your manuscripts. This will allow the legacy of the ODP to be tracked by bibliographic databases (e.g., GeoRef).
SOURCE: Frank Rack, JOI, personal communication, 2001.
Some of the most exciting discoveries occur through interdisciplinary research, which, by its very nature, requires researchers to work beyond their normal boundaries. Consequently, data and collections managers should reach beyond their traditional user communities to educate new users about the existence and utility of the geoscience data and collections they hold. This implies that the organizations archiving these data will have to engage in a certain degree of marketing. For example, NGDC (Sidebar 4-2) promotes its holdings via e-mail and the Internet, through mass mailings, at professional meetings, and with posters and CD-ROMs.
The Internet serves as an effective outreach tool, for example, by making available a wide selection of images of gem, mineral, rock, ore, meteorite, and fossil specimens, as well as related documentation (e.g., field notebooks, historic illustrations). Many museums, including the Smithsonian, hope to increase the number of collection users by expanding the awareness of their collections to audiences beyond those able to travel to the museums themselves. The Smithsonian and several other museums also have put a great deal of effort into traveling exhibits of various sorts. These traveling exhibits effectively bring the institution to large groups of people who might not otherwise have the opportunity to visit. Still, a traveling exhibit cannot reach everyone, whereas the Internet can make an exhibit available to every single home, library, or school quickly, cheaply, and simultaneously. This approach was pioneered by the University of California’s Museum of Paleontology (UCMP, 2002b), which was one of the first 25 sites on the World Wide Web in the early 1990s. The Library of Congress (2002), through its American Memory Project, also has been extremely successful in sharing its collections with the nation by these means.
SIDEBAR 4-7 Geoinformatics
The goal of geoinformatics is to construct multidisciplinary databases to facilitate extraction of knowledge from the geologic record. The geoinformatics community is planning a network through a collaborative research initiative undertaken by a consortium of universities and non-academic partners such as USGS, NOAA, NASA, BP Amoco, BEG, and the Geological Survey of Canada. Earth and computer scientists aim to establish a seamless and integrated network system of geoscience data with software tools for access, analysis, visualization, and modeling. The goal of the Geoinformatics Initiative is to develop a national infrastructure of databases and tools for earth science research.
SOURCE: Geoinformatics Network, 2001.
The application of geoinformatics may facilitate geoscience data outreach and discovery (see Sidebars 4-7 and 4-8). In such a scenario, all metadata would be digital and accessible over the Internet. Each sample could be located by its geographic coordinates, and metadata would record the circumstances under which the sample was collected, and provide quality control. Such a system would require standardized formats for data archiving, software support, and data-mining tools, and a knowledgeable end-user community. The Kansas Geological Survey’s (2001) Data Resources Library geoinformatics systems provide geoscience data over the Internet. Other state geological surveys that have sophisticated data retrieval capabilities over the Internet include Iowa (GEOSAM online database; Iowa Department of Natural Resources, 2002) and North Dakota (fossil, and soon core, database; North Dakota State Geological Survey, 2001).
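To make the scenario above more concrete, the following Python sketch shows one possible shape for a digital sample record that carries geographic coordinates and collection circumstances, a basic coordinate check as a stand-in for quality control, and a simple bounding-box search. The record fields, sample identifiers, coordinates, and function names are all hypothetical; they do not represent the format used by the Kansas, Iowa, or North Dakota systems or by any geoinformatics standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleMetadata:
    """Hypothetical metadata record for one physical sample."""
    sample_id: str
    latitude: float    # decimal degrees, north positive
    longitude: float   # decimal degrees, east positive
    collected_on: str  # ISO 8601 date string, e.g. "1998-07-14"
    collector: str
    method: str        # circumstances of collection, e.g. "core" or "outcrop"

    def is_valid(self):
        """Basic quality-control check: coordinates must be physically possible."""
        return -90.0 <= self.latitude <= 90.0 and -180.0 <= self.longitude <= 180.0


def within_box(samples, south, north, west, east):
    """Return valid samples inside a latitude/longitude bounding box.

    Simplification: the box is assumed not to cross the 180-degree meridian.
    """
    return [
        s for s in samples
        if s.is_valid()
        and south <= s.latitude <= north
        and west <= s.longitude <= east
    ]


# Made-up records; the third has an impossible latitude and fails the check.
samples = [
    SampleMetadata("S-001", 38.5, -98.0, "1998-07-14", "Collector A", "core"),
    SampleMetadata("S-002", 47.9, -100.5, "2000-06-02", "Collector B", "outcrop"),
    SampleMetadata("S-003", 95.0, -98.0, "1999-09-30", "Collector C", "core"),
]

# Search a box that contains only the first sample.
found = within_box(samples, south=37.0, north=40.0, west=-102.0, east=-94.6)
found_ids = [s.sample_id for s in found]
```

The point of the sketch is that once every record carries the same coordinate and provenance fields, a location query becomes a few lines of code. Records that fail validation are excluded rather than silently returned, which is one simple way metadata can support quality control.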
SIDEBAR 4-8 Smithsonian’s Research and Collections Information System
The Smithsonian’s National Museum of Natural History (NMNH) is creating a Research and Collections Information System that approaches an informatics-based system. The intention is to accomplish three main goals: allow collections management to better track the disposition of specimens acquired, loaned, borrowed, or disposed of, and their location; enable online access to all digital specimen data for the benefit of museum research, collections, and public programs staff, scientists, and the general public worldwide; and facilitate participation in national and international informatics initiatives. With a suite of software applications, which are used internationally, the staff has begun to implement the systems in a number of science departments. The software was chosen for its stability, ability to scale, flexibility for diverse NMNH disciplines, interoperability with other systems via conformance to international standards, and ability to customize. An estimated 40 million to 50 million records will adequately represent NMNH specimens at a cost of $55 million to $75 million over the next few years. Currently, funds for data entry are limited, so Smithsonian staff are exploring options for obtaining the needed amount (Ross Simons, Smithsonian Institution, personal communication, 2002).
SOURCE: Input during committee site visit to the Smithsonian Institution, April 2001.
There are multiple, necessary steps in preserving and making accessible geoscience data and collections. Digital catalogs available over the Internet are critical to successful management and use of geoscience data and collections. The existence of such catalogs generates multiple benefits: enhanced use of the collections, savings of time and money for users in finding material, and improved ability to plan for the financial and staffing needs of the collections. The current extent of cataloging in the United States is limited, however, and is the single greatest inhibitor of effective geoscience data and collections use. The backlog of cataloging in many institutions constitutes a significant burden in itself, and overloaded staff would benefit from digital submission of information about newly acquired geoscience data and collections.