We are the first generation to have the tools to study the Earth as a system. During the last few decades of the 20th century, the development of an array of technologies has made it possible to observe the Earth, collect large quantities of data related to components and processes of the Earth system, and store, analyze, and retrieve these data at will. These data can be registered to specific locations on the Earth's surface and can be integrated into spatial-temporal information systems and registered at the same scale and cartographic projection as other resource data.
Scientists can now perform environmental research that increases our understanding of the Earth system at all spatial scales, enhances resource management and environmental decision making, and improves our capabilities for predicting significant changes in the environment. Over the past decade, in particular, the observational, computational, and communications technologies have enabled the scientific community to undertake a broad range of interdisciplinary environmental research and assessment programs. At the international level, two of the most ambitious programs are the International Geosphere-Biosphere Program (IGBP) of the International Council of Scientific Unions (ICSU), and the World Climate Research Program, jointly sponsored by the World Meteorological Organization and ICSU. At the national level, these international research initiatives are supported through the federal interagency Global Change Research Program.
Global change research, by its nature and scope, is inherently complex. On the technical side, complexity increases with the number of different variables that are modeled, measured, or experimentally manipulated. These variables may interact with each other to a high degree, and these interactions include nonlinearities or discontinuities in space or time. In particular, a certain degree of complexity in global change research ensues from the sheer quantity of data at large spatial and temporal scales. Likewise, analogous degrees of complexity originate on the organizational side of research in how the work is structured, managed, and implemented due to the sizable number of investigators and participants across a range of disciplines.
The Global Change Research Program and other large research initiatives involve the interfacing of large volumes of diverse data, commonly combining several traditionally distinct disciplines, such as meteorology, oceanography, geology, biology, chemistry, and geography, or their related subdisciplines. "Data interfacing" may be defined as the coordination, combination, or integration of data for the purpose of modeling, correlation, pattern analysis, hypothesis testing, and field investigations at various scales. Because data from each discipline and subdiscipline are organized into data sets and databases that frequently possess unique or special attributes, their effective interfacing can be difficult.
Sound practices in database management are required to deal effectively with problems of complexity in global change studies and other large interdisciplinary research and assessment projects. Although considerable attention and resources have been devoted to this type of research in recent years, little guidance has been provided on overcoming the barriers frequently encountered in the interfacing of disparate data sets. And although there is a wealth of relevant experience at the working level in the research community, this experience generally has not been analyzed and organized to make it more readily available to researchers.
Because of the increasing importance of conducting interdisciplinary environmental research and assessments, both nationally and internationally, the Committee for a Pilot Study on Database Interfaces was charged to review and advise on data interfacing activities in that context. This report is the result of that study. The focus is on developing analytical and functional guidelines to help researchers and technicians engaged in interdisciplinary research—particularly those projects that involve both geophysical and ecological issues—to better plan and implement their supporting data management activities. It also is aimed at informing those individuals responsible for funding, managing, or evaluating such studies and activities.
SUMMARY OF CONCLUSIONS AND RECOMMENDATIONS
The committee used six case studies (1) to identify and to understand the most important problems associated with collecting, integrating, and analyzing environmental data from local to global spatial scales and over a very wide range of temporal scales, and (2) to elaborate the common barriers to interfacing data of disparate sources and types. Consistent with the committee's charge, the primary focus was on the interfacing of geophysical and ecological data. The committee derived a number of lessons from the case studies, and these lessons are summarized at the end of each case study and analyzed in Chapter 8. Some are generic in nature; others are more specific to a discipline or project.
The conclusions and recommendations are all based on the committee's analysis of the case studies and on additional research. They are organized according to four major areas of barriers or challenges to the effective interfacing of diverse environmental data. These are barriers deriving from the data themselves, from the users' needs, from organizational interactions, and from system considerations. In the final section the committee offers a set of broadly applicable principles—Ten Keys to Success—that can be used by scientists and data managers in planning and conducting data interfacing activities.
Addressing Barriers Deriving from the Data
The spatial and temporal scales of the disciplines important to environmental research vary enormously. Such variation was certainly typical of the ecological and geophysical data sets that were reviewed in this study. For instance, massive data sets that cover large areas are routinely collected and used in the physical sciences, while such data sets are much less common in ecological disciplines. These differences reflect distinct historical traditions, working methods, and judgments about what processes are important and the temporal and spatial scales on which they operate. As a result, it is difficult to find geophysical and ecological data sets with matching temporal and spatial scales. In addition, attempts to equalize scales through various methods of data manipulation run the risk of creating spurious patterns and correlations.
Recommendation 1. In the planning for interdisciplinary research, careful thought should be given to the implications of different inherent spatial and temporal scales and the processes they represent. These should be discussed explicitly in project planning documents. The methods used to accommodate or match inherent scales in different data types in any attempts to facilitate modeling and analysis should be carefully evaluated for their potential to produce artificial patterns and correlations.
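The risk named in Recommendation 1 can be illustrated with a minimal sketch. The data values, the block size, and the helper names `pearson` and `block_means` are all invented for illustration and do not come from the case studies; block averaging stands in for whatever rescaling method a project might actually use.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def block_means(values, block_size):
    """Average consecutive, non-overlapping blocks (a simple rescaling step)."""
    return [
        sum(values[i:i + block_size]) / block_size
        for i in range(0, len(values), block_size)
    ]

# Invented fine-scale observations, for illustration only.
fine_x = [0, 2, 1, 3, 2, 4]
fine_y = [2, 0, 3, 1, 4, 2]

r_fine = pearson(fine_x, fine_y)
r_coarse = pearson(block_means(fine_x, 2), block_means(fine_y, 2))
print(f"fine scale r = {r_fine:.2f}, after aggregation r = {r_coarse:.2f}")
```

In this contrived case the fine-scale correlation is weakly negative (-0.20), yet the block-averaged series correlate perfectly (1.00). The apparent relationship comes from the aggregation step and the shared trend, which is the kind of artifact that scale-matching methods should be checked for.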
Preliminary processing generally is necessary to develop useful derived data products from raw data. As a result, data sets unavoidably reflect certain scientific assumptions, perspectives, and value judgments. In addition, each processing step is associated with some kind and amount of statistical uncertainty.
Recommendation 2. Metadata should explicitly describe all preliminary processing associated with each data set, along with its underlying scientific purpose and its effects on the suitability of the data for various purposes. Metadata also should describe and quantify, to the extent feasible, the statistical uncertainty resulting from each processing step. Planning for studies that involve interfacing should explicitly consider the effects of preliminary processing on the utility of the resultant integrated data set(s). (Additional recommendations regarding metadata appear below.)
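One possible way to record per-step uncertainty, shown only as a sketch: the step names and values below are hypothetical, and the root-sum-of-squares rule is an assumption that holds only when the errors introduced by the steps are independent. The report itself does not prescribe a method.

```python
import math

# Hypothetical processing log: (step description, standard uncertainty in data units).
processing_steps = [
    ("sensor calibration", 3.0),
    ("spatial interpolation", 4.0),
]

def combined_uncertainty(steps):
    """Root-sum-of-squares combination; valid only if step errors are independent."""
    return math.sqrt(sum(u ** 2 for _, u in steps))

for description, u in processing_steps:
    print(f"{description}: {u}")
print("combined:", combined_uncertainty(processing_steps))
```

Keeping the individual entries, rather than only the combined figure, lets a later user judge which processing step dominates the uncertainty for a given purpose.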
The exceptionally large data volumes involved in global environmental research can pose significant challenges for existing methods of data storage, retrieval, and analysis, as well as for the organizational systems currently in place to support these activities.
Recommendation 3. All proposed data management and interfacing methods should be weighed carefully in terms of their ability to deal with large volumes of data. Assumptions that existing methods will continue to be suitable should be treated with caution.
The committee found that differences in scientific conventions among disciplines can be a severe impediment to data interfacing, significantly increasing the costs of achieving compatibility among data sets and in some cases preventing it completely. Some of these differences stem from fundamental dissimilarities in study design or purpose and others from traditional practice that varies from discipline to discipline.
Recommendation 4. Efforts to establish data standards should focus on a key subset of common parameters whose standardization would most facilitate data interfacing. Where possible, such standardization should be addressed in the initial planning and design phases of interdisciplinary research. Early attention to integrative modeling can help identify key incompatibilities. The data requirements, data characteristics and quality, and scales of measurement and sampling should be well defined at the outset.
In several of the case studies, essential ecological data sets were either missing or of unknown quality. In some cases, it was necessary to create such needed data sets by using historical data or by combining data from a range of ecological studies.
Recommendation 5. Agencies that perform or support environmental research and assessment generally, and global change research particularly, should identify and define key ecological data sets that do not exist but are important to their mission. A careful review should be made of options for finding, rescuing, or creating these crucial data, and funding should be set aside to implement the most feasible option(s).
Addressing Barriers Deriving from Users' Needs
Users' needs in global change research are exceptionally diverse, fluid, and difficult to predict. These characteristics require that data management systems and practices be designed for maximum ease of access, adaptability over time, and communication among all potential users. However, the committee concludes that existing practices frequently inhibit communication and exchange of ideas with the larger user community, as well as access to the data by secondary and tertiary users.
Recommendation 6. Project scientists and data managers should adopt the view that one of their primary responsibilities is the creation of long-lasting data and information resources for the broad research community. Data management systems and practices, particularly the development of metadata, should be designed to balance the needs of this larger user community with those of project scientists.
Addressing Barriers Deriving from Organizational Interactions
The committee concludes that the existing missions and attendant reward systems of research organizations act to inhibit the data sharing, mutual support, and interdisciplinary mindset needed for successful data interfacing. In many cases the stated aims of global change research programs are at odds with the collective understanding among staff within organizations of what their job responsibilities are and how they should be fulfilled.
Recommendation 7. Professional societies, research institutions, and funding and management agencies should reevaluate their reward systems in order to give deserved peer recognition to scientists and data managers for their contributions to interdisciplinary research. Granting and funding agencies, as well as program managers and university administrators, should provide tangible incentives to motivate scientists to participate actively in data management and data interfacing activities. Such incentives should extend to favorable consideration of those activities in performance reviews, including treating the production of value-added data sets as analogous to scientific publications.
Recommendation 8. Because organizational missions and reward systems inherently reflect a larger policy context, relevant policy issues should be included in the planning for interdisciplinary research. This should be accomplished in part through open communication between project scientists and appropriate policymakers that continues throughout the life of the project. Such communication will help provide a basis for developing cooperative arrangements between collaborating institutions that will provide strong incentives for and reduce barriers to sharing data.
The case studies considered by the committee covered a broad range of objectives, spatial and temporal scales, data sources, data management procedures, quantity and quality of data, and analytic and interpretive methods. From these observations and a consideration of the results of the case studies, the committee concludes that effective data management is an integral part of successful data interfacing. The committee also concludes that there is a critical need to educate scientists about data management principles and to foster improved working relationships between scientists and information management professionals.
Recommendation 9. Research universities should include courses in their curricula that provide environmental scientists with an in-depth understanding of the rationale for and principles of sound data management. Program managers and data managers, in their interactions with and training of environmental scientists, should emphasize how state-of-the-practice data management can provide immediate and long-lasting benefits to scientists, particularly those engaged in interdisciplinary research. At the same time, data managers need to be a part of the conceptual team from the beginning of a project and have equal status with principal investigators.
In its review of the factors that contribute to the success or failure of data interfacing efforts, the committee identified traditional concepts of data ownership as a serious impediment to success. Existing reward systems and traditional practice often combine to motivate scientists to treat data as personal property, even in the face of contractual agreements for data submission and sharing.
Recommendation 10. In order to encourage interdisciplinary research and to make data available as quickly as possible to all researchers, specific guidelines should be established for when and under what conditions data will be made available to users other than those who collected them. Such guidelines are particularly important when data collectors, data managers, and other users are in different organizations. In addition, adequate rewards should be established by the funders of research and publishers to motivate principal investigators to place all data in the public domain.
A major factor in planning for successful data interfacing is the choice of personnel and the institutional arrangements in which they work. The committee found many instances in which optimal interdisciplinary activities and data sharing were not possible because of unclear responsibilities, conflicting goals, misunderstanding, and outright rivalries. The added complexity of interdisciplinary research increases the severity of such common organizational problems. Even one organization or key player who refuses to share data, prepare documentation, participate in standards setting, or provide other vital project support in a collaborative effort can greatly diminish the probability of success.
Recommendation 11. In the planning of any interdisciplinary research program, as much consideration should be given to organizational and institutional issues as to technical issues. Every effort should be made to minimize the likelihood of misunderstanding, conflicts, and rivalries by establishing interorganizational relationships and procedures, creating effective reward structures, and creating new functions that explicitly support data interfacing.
Based on the case studies and related research, the committee concludes that insufficient attention is given in many interdisciplinary studies to quality control, beta testing of derived data products, creation of broadly useful value-added data sets, resolution of data compatibility problems, and the maintenance and security of key data sets on a long-term basis. Many of these functions are beyond the scope envisioned for existing data centers.
Recommendation 12. The agencies involved in supporting and carrying out interdisciplinary research should investigate the possibility of establishing one or more ecosystem data and information analysis centers to facilitate the exchange of data and access to data, help improve and maintain the quality of valuable data sets, and provide value-added services. A model for such a center is the Carbon Dioxide Information Analysis Center (CDIAC) at Oak Ridge National Laboratory. In addition, it would be wise to look closely at the potential synergism between any new ecosystem data and information analysis center and all other existing environmental data centers.
Addressing Barriers Deriving from System Considerations
The nature of interdisciplinary global change research makes it impossible to clearly define a detailed and stable set of user requirements. Classical methods of system design are therefore inappropriate because they do not provide for sufficient user input throughout the entire design effort and do not incorporate adequate provisions for flexibility and adaptability.
Recommendation 13. Hardware/software system development efforts should be based on a model that includes ongoing interaction with users as an integral part of the design process. In addition, system designers should work from the assumption that systems will never be finished, but will continue to evolve along with the data collected and users' needs. Designers therefore should use, to the greatest extent possible, modern database development approaches such as rapid prototyping, modular systems design, and object-oriented programming, which enhance system adaptability.
One of the conclusions that clearly emerged from the case studies is the critical role of system interoperability in supporting data interfacing efforts. Interoperability is the ability to readily connect different databases on separate hardware and software systems and perform data retrieval, analyses, and other applications without regard to the boundaries between the systems. Given current technology, this can be a difficult goal to achieve, and it currently requires the direct involvement of knowledgeable information management specialists. However, even when hardware and software systems are successfully connected, fundamental incompatibilities among the data themselves can still impede interfacing.
Recommendation 14. Program managers, project scientists, and data managers should review the interoperability of their hardware, software, and data management technologies to facilitate locating, retrieving, and working with data across several disciplines. However, this effort should be accompanied by parallel attempts to resolve inherent incompatibilities among data types that can thwart interfacing even when state-of-the-art hardware and software systems are seamlessly connected.
One of the most serious problems in the creation, integration, use, and management of large databases for interdisciplinary research is the lack of adequate metadata. Metadata enable users other than the principal investigator to make effective use of the data and to determine which applications the data may or may not be suited for. The committee found instances in the case studies where data sets had to be discarded because investigators did not provide the documentation needed for others to make use of them. It is important for researchers to understand that the incremental cost of including the necessary documentation at the time of data collection is small in comparison with the cost of attempting to reconstruct it at the end of the project or long after it has been completed, when reconstruction may be prohibitively expensive or impossible.
Recommendation 15. The production of detailed metadata should be a mandatory requirement of every study whose data might be used for interdisciplinary research. Metadata should be treated with the seriousness of a peer-reviewed publication and should include, at a minimum, a description of the data themselves, the study design and data collection protocols, any quality control procedures, any preliminary processing, derivation, extrapolation, or estimation procedures, the use of professional judgment, quirks or peculiarities in the data, and an assessment of features of the data that would constrain their use for certain purposes.
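The minimum contents listed in Recommendation 15 could be captured as a simple record with a completeness check. The following is a sketch under stated assumptions: the class name `DatasetMetadata`, its field names, and the sample entries are hypothetical and do not follow any particular metadata standard.

```python
from dataclasses import dataclass, fields

@dataclass
class DatasetMetadata:
    """Hypothetical record; one field per minimum item in Recommendation 15."""
    data_description: str = ""
    study_design: str = ""
    collection_protocols: str = ""
    quality_control: str = ""
    preliminary_processing: str = ""
    professional_judgment: str = ""
    known_peculiarities: str = ""
    use_constraints: str = ""

def missing_fields(record):
    """Return the names of fields that are still empty or whitespace-only."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]

# Illustrative, partially completed record.
record = DatasetMetadata(
    data_description="Monthly stream temperature (illustrative entry)",
    study_design="Fixed stations, monthly sampling (illustrative entry)",
)
print(missing_fields(record))
```

A check of this kind could be run before a data set is accepted into a shared archive, so that gaps in documentation are caught while the investigators can still fill them.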
The committee found that interdisciplinary research almost invariably involves using data in ways not initially envisioned by the original investigators. In many cases, new uses of data require backtracking along the data path in order to reformat, resummarize, reclassify, or otherwise adjust the data to make them suitable for current needs. In order to backtrack, detailed information should be available about the prior processing steps that were used to create the data sets being interfaced.
Recommendation 16. Metadata should contain enough information to enable users who are not intimately familiar with the data to backtrack to earlier versions of the data so that they can perform their own processing or derivation as needed. Where stand-alone documentation is not adequate (for large and complex data sets or where multiple users are simultaneously updating and modifying data), data managers should investigate the feasibility of incorporating an audit trail into the data themselves.
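An audit trail of the kind mentioned in Recommendation 16 might look like the following minimal sketch. The class `AuditedDataset`, its methods, and the sample values are hypothetical; a production system would also need to handle concurrent updates and storage limits, which this example ignores.

```python
class AuditedDataset:
    """Keeps every version of the data plus a log of the step that produced it."""

    def __init__(self, raw_values):
        self.versions = [list(raw_values)]  # versions[0] is the raw data
        self.log = []                       # log[i] describes versions[i + 1]

    def apply(self, description, func):
        """Apply func to each value of the latest version and record the step."""
        self.versions.append([func(v) for v in self.versions[-1]])
        self.log.append(description)

    def current(self):
        """Return a copy of the most recent version."""
        return list(self.versions[-1])

    def at_version(self, index):
        """Backtrack: return a copy of the data as they were at a given version."""
        return list(self.versions[index])

# Invented measurements in meters, for illustration only.
dataset = AuditedDataset([1.5, 2.0, 3.25])
dataset.apply("convert meters to centimeters", lambda v: v * 100)
dataset.apply("raise values below 200 cm to 200 cm", lambda v: max(v, 200.0))

print(dataset.log)
print(dataset.at_version(0))  # raw data remain recoverable
print(dataset.current())
```

Because each step is logged alongside the version it produced, a user unfamiliar with the data can return to any earlier version and apply different processing.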
The committee concludes that far too many environmental research projects give insufficient attention, in either the planning or the implementation stage, to the long-term archiving of their data sets. Data from studies that contribute significantly to our understanding of components and processes of the Earth system must be preserved and made accessible for future potential users of the data. There is a need to create a mindset within all elements of the research community that valuable data need to have a long-term life that extends far beyond the publication of the principal investigator's analyses.
Recommendation 17. In general, the presumption in environmental research should be that "data worth collecting are worth saving." Funding agencies therefore should consider stipulating that all research applicants include in their research plans well-conceived and adequately funded arrangements for data management and for the ultimate disposition of their data. While it is impossible to establish universal guidelines for funding, the committee's investigations suggest that setting aside 10 percent of the total project cost for data management would not be unreasonable. These cost estimates should include adequate funds for preparing thorough metadata that serve the needs of all potential users. In order for these requirements to be fully effective, however, the agencies must adequately support active archives and long-term data repositories. (See also Recommendation 12.)
There are no well-established and widely accepted protocols to assist scientists in deciding which data should be archived, in what formats they should be stored, and where and how they should be archived to maximize access for potential future users. Further, in several cases the committee found little attention given to the long-term maintenance of data sets once they were archived. It is important to note, however, that there are no technical barriers to keeping all data collected in research projects, even data-intensive ones that involve high-resolution imagery, because advances in data storage and retrieval capabilities have kept pace with the ever-growing volumes of data in all fields of science. It is typical that the ensemble of all previous data in any scientific discipline is modest in volume compared to present and anticipated annual volumes. Therefore, the issue is not unmanageable volumes of data; rather it is the maintenance of the data sets in accessible, usable form over time that is the challenge for long-term retention.
Recommendation 18. The committee is concerned about the gaps in the existing system for long-term retention and maintenance of environmental data. Funding agencies should provide guidelines that define the requirements for preparing data sets for long-term archiving. Educational and research institutions should be encouraged to incorporate strong data management and archival activities into every interdisciplinary project and should allocate sufficient funding to accomplish these functions. Professional recognition should be given to principal investigators and project data managers who perform these functions well.
TEN KEYS TO SUCCESS
The committee's investigations of the case studies and other related experience identified Ten Keys to Success, each of which incorporates both technical and cultural aspects. Keys 1 and 2 deal with the appropriate use of available information management technology. Keys 3, 4, and 5 describe design and management strategies. Keys 6, 7, and 8 refer to methods for accommodating the unavoidable realities of human behavior, motivation, and politics. Finally, keys 9 and 10 suggest ways of enhancing data interfacing by building the need for it into the structure of research programs.
Use appropriate information technology.
Start at the right scale.