The Carbon Dioxide Information Analysis Center
The Carbon Dioxide Information Analysis Center (CDIAC) at the Department of Energy's (DOE) Oak Ridge National Laboratory (ORNL) is internationally known and admired for its role in providing high-quality atmospheric data sets to the research community. These include time series measurements of carbon dioxide and methane at multiple stations around the world, as well as global estimates of the annual production of carbon dioxide from fossil fuel combustion and cement manufacture. In addition, CDIAC is active in efforts to "rescue" historical climate data that can provide useful comparisons with present data on trends in atmospheric conditions. One prominent example of this is a cooperative program with the Institute of Atmospheric Physics and the Institute of Geography in China.
While CDIAC does not directly engage in interfacing biological and geophysical data types, its data management program exemplifies the kinds of data gathering, quality control, documentation, and dissemination activities that are a necessary part of many data interfacing exercises. Depending on the nature of the interfacing effort, these activities could occur during the acquisition of geophysical and biological data or during the actual interfacing process itself. Despite past successes, the center's staff recognize that their data management model will not be adequate for meeting the challenges of processing and integrating larger volumes of data with shorter turnaround times. Because these challenges are common to the climate change research community as a whole, the committee believes that the following description of the center's data
management approach will be widely applicable. There are explicit lessons to be learned both from its current success and from the challenges it faces in the future as it scales up its data management efforts.
CDIAC is a part of ORNL's Environmental Sciences Division. It was founded in 1982 by DOE to provide identification, collection, quality assurance, documentation, and distribution for information on the biogeochemistry of carbon dioxide and the effects of carbon dioxide on vegetation and on the Earth's climate. The scope of CDIAC was subsequently expanded to include related global change topics, such as other greenhouse gases and the effects of climate change on the environment.
Other programs not part of CDIAC, but within the Environmental Sciences Division, include the Atmospheric Radiation Measurement (ARM) archive, which will hold large data volumes (1 to 5 terabytes/year) related to general circulation models (specifically, the representation of clouds and of moisture, heat, and energy transfers therein) and derived from high-speed, real-time samplers. The Environmental Sciences Division also houses a NASA Distributed Active Archive Center, which focuses on ground-based field program data (e.g., carbon in soils, vegetation cover). In large part, this Distributed Active Archive Center was sited at ORNL because of CDIAC's past experience and success, which are expected to be incorporated into and extended by the ORNL DAAC's data management scheme.
ORNL is facing significant technical and organizational challenges as it attempts to implement the new functions associated with the ARM Program and the Distributed Active Archive Center. These challenges are representative of those faced by the global change research community as large volumes of data from new sources become increasingly available. ORNL's experience with CDIAC is relevant and valuable, but these two new programs are different in important ways. First, data volumes will be much larger than those with which CDIAC staff are accustomed to dealing. Second, these programs will focus on real-time rather than historical data. Third, ORNL will be serving a much larger audience and will not be as close to the user community as CDIAC's staff currently is. Finally, ORNL will not always have the luxury of time that is now available to CDIAC to build relationships, perform intense quality assurance and quality control, and produce value-added products.
VARIABLES MEASURED AND SOURCES OF DATA
CDIAC produces aggregate data sets that summarize global and regional production of greenhouse gases such as carbon dioxide and methane; trace gas measurements in the atmosphere and oceans; long-term climate records in addition to temperature (e.g., precipitation, clouds,
atmospheric pressure, and storm climatologies); soil chemistry; coastal vulnerability to rising sea level; global distribution of ecosystem types; and the response of vegetation to elevated ambient carbon dioxide. These variables are derived from a variety of other direct and indirect measurements and estimates gathered from a range of sources. For example, the carbon dioxide emissions data sets are derived from United Nations energy production estimates, from Bureau of Mines cement data, and from DOE gas-flaring statistics (see ORNL, 1991).
This section describes CDIAC's strategy for the management of data. It shows that the center benefits from an unusual degree of freedom in its ability to select data sets for publication and to negotiate agreements with data sources. In addition, it reveals that the data management strategy depends for its success on the large amount of personal attention that each data set receives from the staff. These factors have been important reasons for CDIAC's success.
Selecting the Data Set
CDIAC staff stay abreast of current research issues by attending conferences, symposia, and workshops, by sponsoring workshops, and by interacting directly with researchers. Often, users will ask for information that does not yet exist. CDIAC's global carbon dioxide emissions data set is an example of a product that was created in anticipation of a need as well as in response to such a request. Another such product is the data set on coastal susceptibility to sea level rise. This data set uses Geographic Information System (GIS) technology to integrate data on sea level, erosion, coastline location, and elevation. It reflects scientists' increasing interest in using GIS as an integrative tool.
CDIAC prioritizes potential new data sets and then obtains feedback from sponsors and research groups. Political considerations sometimes influence the choice of what data to work on. The Chinese climate data project mentioned at the beginning of this chapter reflected a management decision to include a more globally diverse array of data. Sometimes a persistent principal investigator can influence the selection decision. Databases deemed to be of lesser scientific importance or whose credibility is in doubt because of methodology will get a lower ranking, whereas technically sound and scientifically important databases will be ranked higher. The size and source of the data are not important in the selection decision.
Contacting the Principal Investigator
CDIAC staff must convince the investigators to submit data to the center. This is because, with the exception of some DOE projects, the center does not have formal relationships that give it the "right" to acquire data. CDIAC staff emphasize that they will document the data, increase both the data's and the investigator's visibility, and remove the burden of responding to data requests. They also stress that the investigator will get full credit for the final data product, will have final sign-off authority on the data set, and can submit the data to the center in whatever format the investigator considers desirable. Because investigators can, and sometimes do, reject these offers, CDIAC staff must adopt a cooperative attitude. They may, for example, offer to wait for the data, while the investigator meets publication deadlines. The overall message is that any extra burden on the investigator will be minimized and that the center's involvement will result in a better product in the end.
Acquiring the Data
At the time that investigators submit data, CDIAC personnel attempt to get as much metadata as possible, including methods, reprints, contact names, and anecdotal information about the data. The contributing scientist is encouraged to send the data in whatever form is convenient. At this point, one person is assigned responsibility for the data set from start to finish. This expands the range of the staff's skills, because a variety of problems are common. In addition, a staff member will care more about a data set for which he or she has full responsibility than if oversight were fragmented. The lead staff person can draw on other expertise as needed.
Performing Quality Assurance and Quality Control
If data are submitted on hard copy, the CDIAC staff perform double data entry. For digital data, they perform virus checks and then make a backup. There is no standard quality assurance/quality control (QA/QC) methodology that is applied to all data sets, because each data set is unique, with its own peculiarities. Based on information from the investigator and past experience, CDIAC staff customize a QA/QC approach to each data set, depending on its characteristics. The operating assumption is that the submitted data are not clean. There are three elements that make this customized approach work: (1) ongoing interaction with the investigator to resolve problems; (2) the continuity that comes from having a single staffer with beginning-to-end responsibility for each data set;
and (3) experienced staff with scientific backgrounds relevant to the data sets.
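The double data entry step for hard-copy submissions can be illustrated with a short sketch: two operators key the same records independently, and any field where the two transcriptions disagree is reported for manual resolution against the original document. This is an illustrative sketch only; the function and record fields are hypothetical, not CDIAC's actual tooling.

```python
def compare_double_entry(entry_a, entry_b):
    """Compare two independently keyed transcriptions of the same
    hard-copy records and report every field where they disagree.
    Each disagreement is resolved manually against the original page."""
    mismatches = []
    for i, (rec_a, rec_b) in enumerate(zip(entry_a, entry_b)):
        for field in rec_a:
            if rec_a[field] != rec_b.get(field):
                mismatches.append((i, field, rec_a[field], rec_b.get(field)))
    return mismatches

# Two operators key the same station record; transposed digits slip into one copy.
first_pass  = [{"station": "A01", "year": 1950, "temp_c": 12.4}]
second_pass = [{"station": "A01", "year": 1950, "temp_c": 21.4}]
print(compare_double_entry(first_pass, second_pass))
```

The value of the technique is that two independent operators are unlikely to make the same keying error in the same place, so comparing the two passes catches most transcription mistakes.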
The lead person for a data set develops a preliminary QA/QC plan and then presents this to other staff for discussion. Following this, the plan is reviewed by the investigator and other experts. The plan includes items such as key thresholds and relationships that must be internally consistent. The QA/QC plan usually is an effective starting point, but surprises sometimes occur that require subsequent improvisation. The plan often necessitates successive passes through the data, because some problems mask others that do not become visible until the problem in the "foreground" is corrected.
All corrections are discussed with the investigator before any changes are made to the data. If the investigator concurs, the change is made and noted in the documentation. If the investigator does not concur, the data are left as is, but the value is flagged as suspicious. After all the visible problems are corrected, the data set is sent out to be "beta tested" by researchers who perform analyses with the data in an attempt to uncover errors or discrepancies that slipped through the QA/QC process. This is a key part of the QA/QC process at CDIAC and provides an opportunity to evaluate critically the data from several different perspectives. The beta test step is based on the recognition that the in-house QA/QC process cannot realistically replicate all the data manipulations that an analyst would be likely to perform.
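The flag-rather-than-change policy described above can be sketched as follows: each record is run through a set of checks, and values that fail are left untouched but tagged so that analysts can judge them for themselves. The checks, thresholds, and field names here are hypothetical examples, not CDIAC's actual QA/QC rules.

```python
def apply_qc(records, checks):
    """Run a list of named per-record checks. Values that fail a check
    are never altered; they are flagged as suspicious, mirroring the
    rule that uncorroborated values stay as-is in the data set."""
    for rec in records:
        rec["flags"] = [name for name, check in checks if not check(rec)]
    return records

# Hypothetical thresholds for a monthly precipitation record.
checks = [
    ("nonnegative", lambda r: r["precip_mm"] >= 0),
    ("below_record_max", lambda r: r["precip_mm"] <= 2000),
]
data = [{"precip_mm": 55.0}, {"precip_mm": -9.0}]
apply_qc(data, checks)
print(data[1]["flags"])  # the negative value is flagged, not changed
```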
The data set of global emissions of carbon dioxide provides many examples of typical data quality problems. This data set required integrating four data sets that were not originally intended to be integrated: energy statistics, cement production estimates, gas flaring estimates, and population estimates. In this case, it was frequently necessary to create new data by analyzing or converting existing data. For example, gas flaring figures for individual countries sometimes had to be estimated from crude oil production and per capita emissions estimates. In addition, political considerations have obscured data or put constraints on how they could be used or reported. For example, some countries have been reluctant to publish raw population data, and the United Nations specified that CDIAC's data set on carbon dioxide emissions could not contain such numbers. However, the data set does contain total carbon dioxide production and the per capita production. The center's staff also were unable to resolve discrepancies in politically sensitive issues, especially for United Nations energy statistics. These data often do not agree with those from other sources, such as private industry or the Organization for Economic Cooperation and Development. However, because approximately 80 percent of carbon dioxide emissions come from about 20 percent of the countries, the CDIAC staff judged that such problems in the relatively small remaining sources are not critical. When the CDIAC staff first produced the data set, they reviewed United Nations data from 1950 and found many discrepancies and suspect values, because this was the first time these data had been critically reviewed. Problems included issues such as multiple entries per year per country, or a given country being shown as exporting more coal than it produced. As a result of this positive collaboration experience, the United Nations now utilizes CDIAC as a beta test site for its data sets before public release.
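The kind of derivation mentioned above, estimating a country's gas-flaring emissions from its crude oil production when no direct flaring statistics exist, can be sketched as a simple calculation. The function and both coefficients below are illustrative placeholders, not CDIAC's published factors.

```python
def estimate_flaring_emissions(crude_oil_production, flaring_ratio, emission_factor):
    """Estimate CO2 emissions from gas flaring for a country that
    reports no direct flaring statistics.

    flaring_ratio   -- gas flared per unit of crude oil produced (assumed)
    emission_factor -- carbon released per unit of gas flared (assumed)
    """
    return crude_oil_production * flaring_ratio * emission_factor

# A country reporting 100 units of crude output but no flaring data.
estimate = estimate_flaring_emissions(100.0, 0.05, 0.7)
print(estimate)
```

In practice such derived values would themselves be documented as estimates, so that users can substitute their own coefficients.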
Documenting the Data
The center uses the "20-year rule"; that is, it prepares metadata that would make the data usable 20 years hence, when investigators who collected the data are no longer available for consultation. In the metadata, CDIAC especially emphasizes the limitations of the data and restrictions on possible uses. The documentation also discusses peculiarities and quirks that should be taken into account. In addition, it includes a hard copy of a subset of the data for validation purposes. This subset can be checked against the recipient's digital version to ensure that no problems have occurred during the transfer and loading of the data. Further, the documentation often includes one or more simple algorithms or derived variables to enable users to check on the integrity of the data set as a whole. For example, the documentation might contain the sum of a particular data parameter, added up over all records in the data set. Upon receipt of the data, the user could calculate this sum and compare it to the value in the documentation.
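The integrity check described above, comparing a documented sum of one parameter over all records against a recomputed value, takes only a few lines for the recipient to reproduce. The record layout and documented total below are hypothetical.

```python
def verify_checksum(records, parameter, documented_sum, tolerance=1e-6):
    """Recompute the sum of one parameter over every record and compare
    it to the total printed in the hard-copy documentation. A mismatch
    signals corruption during transfer or loading of the data set."""
    actual = sum(rec[parameter] for rec in records)
    return abs(actual - documented_sum) <= tolerance, actual

# Hypothetical data set: annual emissions with a documented total of 61.0.
records = [{"year": 1950, "emissions": 10.5},
           {"year": 1951, "emissions": 22.0},
           {"year": 1952, "emissions": 28.5}]
ok, total = verify_checksum(records, "emissions", 61.0)
print(ok, total)  # prints "True 61.0"
```

A tolerance is used rather than exact equality because floating-point totals computed on different machines may differ in the last digits.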
Both the data set and the documentation are reviewed by a team of independent reviewers. This review is made as rigorous as possible and is considered to be equivalent to a peer review of the data package and related metadata.
Distributing the Data
CDIAC staff distribute and publicize their data packages through as many avenues as possible. These include the NASA master directory; electronic bulletin boards; the CDIAC newsletter, CDIAC Communications, with more than 9,000 subscribers in 150 countries; university libraries; catalogs of CDIAC's data and information products; announcements sent to a network of newsletter and journal editors; and a mailing list compiled from conferences, sponsoring agencies, and past data requests.
The center sends out regular updates and special news items announcing new data products and revisions. The center also ensures that the investigators are kept up to date on requests for their data as well as
on feedback about the data. Periodic surveys of the entire user community are performed, and these typically achieve a 50 percent response rate.
CDIAC keeps multiple copies of each data set on different media and at different locations. The National Technical Information Service is used as a method for preserving and disseminating the center's reports and data.
A key feature of CDIAC's activities is the staff's understanding that successful data interfacing requires error correction and other quality control activities. Their experience shows clearly the importance of resolving discrepancies between data sets, clarifying ambiguities, investigating the implications of differences in measurement methods, backtracking from derived variables to the original raw data, and creating standardized measurements from a variety of sources. Unless these activities are performed thoroughly and accurately, data interfacing will not result in useful data sets.
As a consequence of working at this "hands-on" level with data, CDIAC staff identified several key prerequisites or premises that they felt were instrumental in their success in creating high-quality data sets. These premises reflect both technical and organizational factors and seem ideally suited to the scale of the center's data management activities to date. Many of these premises will be difficult to duplicate with the much larger volumes of data envisioned in the near future. Nevertheless, the center's staff have placed a high priority on developing ways to incorporate as many of these premises as possible in the expanded activities that will accompany their role as a data center. Each of the following paragraphs summarizes a distinct premise or prerequisite. Many of these are identical to those identified as essential to improving quality in manufacturing and service industries.
Strong commitment to service. A strong commitment to service is a primary goal. The staff have identified their market as the research community and expend a great deal of effort to stay in touch with researchers to find out what they want. They have avoided intricate, high-technology systems and instead emphasize producing high-quality data and useful documentation. They focus on answering the question, "What kinds of data should be in this directory?" rather than on state-of-the-art methods of data transfer. They feel it is more useful to send out a high-quality data set on a tape than a substandard data set on a more advanced medium. While they are interested in responding to their more sophisticated users, they realize they also must remain accessible to many less well-trained
users in developing countries. They do not see their role as being technology drivers.
CDIAC managers also place great emphasis on fulfilling users' requests completely and in a timely manner. In becoming a World Data Center, they were concerned about the requirement that such data centers accept all data submitted to them for a certain area. A related concern was that the resulting large volumes of data could overwhelm their ability to continue emphasizing data quality and effective service. As a result, they tripled their staff and upgraded their hardware in order to keep fulfilling their commitment to respond to users.
Collaborative mindset. CDIAC personnel emphasize a collaborative mindset. They recognize that there is little reward for researchers to manage data for use by others and therefore try to relieve them of this burden. They will accept data in any format that is convenient to researchers and will work with them to make data submission easy. This close interaction with researchers also helps the center evaluate what kinds of data products would be useful or worthwhile to the research community. In contrast, other data center programs that mandate a single format for submitting data have experienced difficulties and have created a motivation for researchers to circumvent the program.
Full credit for data sources. The CDIAC staff try to keep their methods and working relationships in accord with the research community's reward system. This effort creates additional incentives for researchers to provide their data and to participate in the sometimes complex and time-consuming QA/QC process. Data sources get full credit for the data, because they, not the CDIAC staff, are listed as authors on the data packages that the center produces. The staff negotiates with data sources in order to address concerns about others taking improper credit for the data. In some instances, they will agree to delay data submission until the source's analysis has reached a certain point or the results have been published. In addition, the staff recommend a citation format in the data packages, analogous to that for peer-reviewed journals, to help ensure that the sources get full credit.
No fee for services. There is no charge for CDIAC's services, and this helps the center build good working relationships with the user community. Not only does the center provide complete data sets with accompanying documentation, it also prepares regional subsets of data or customized combinations of specific data sets on request.
Emphasis on QA/QC and documentation. CDIAC staff emphasize the value added that QA/QC and metadata represent. They argue that their data cleanup and documentation make data sets much more accessible and valuable to the user community. The operating assumption is that no data set is clean. Sufficient time and resources therefore are allocated
for thorough beta testing of data packages. The center sends preliminary versions of data packages out to selected researchers, who then review the data from an analyst's perspective.
CDIAC staff take a long-term perspective, when necessary, in order to improve a source's data quality. For example, the center is involved in obtaining proxy climate records from China (e.g., early monsoons, rice harvest records). Staff are working with the aforementioned Institute of Atmospheric Physics and the Institute of Geography in China and have furnished them with PCs and data entry systems. The first data sets had numerous problems, such as the minimum for a variable being greater than the maximum, 50 days of snow in a single month, a precipitation value of 0 while the qualitative data indicated a rainy month, and 0 used to denote missing values. The project began in 1985, with the agreement signed in 1987. There was a long learning curve before the data quality improved, and the data sets were not ready to be published until November 1991. Although this is an extreme example, it is common for CDIAC to spend 1 to 2 years preparing a data set for publication.
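Checks for the kinds of errors found in the early proxy-record submissions, such as a minimum exceeding the maximum, more snow days than days in the month, or zero precipitation contradicting a qualitative description of a rainy month, can be sketched as simple per-record validation rules. The field names below are hypothetical.

```python
import calendar

def validate_month(rec):
    """Return a list of the detectable problems in one monthly record,
    patterned on errors seen in early proxy-record submissions."""
    problems = []
    if rec["temp_min"] > rec["temp_max"]:
        problems.append("min exceeds max")
    # Number of days in the record's calendar month.
    days_in_month = calendar.monthrange(rec["year"], rec["month"])[1]
    if rec["snow_days"] > days_in_month:
        problems.append("more snow days than days in month")
    if rec["precip_mm"] == 0 and rec["qualitative"] == "rainy":
        problems.append("zero precipitation in a rainy month")
    return problems

# A record exhibiting all three of the error patterns listed above.
bad = {"year": 1875, "month": 6, "temp_min": 21.0, "temp_max": 14.0,
       "snow_days": 50, "precip_mm": 0, "qualitative": "rainy"}
print(validate_month(bad))
```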
Use of raw data. CDIAC emphasizes providing original raw data rather than derived or processed data. For example, the center's staff made a scientific case for obtaining the raw, instead of the derived, data when working on the data that were ultimately published as Numeric Data Package (NDP) 20, Global Grid Point Surface Air Temperature (Jones et al., 1991). There were four records for each time and place, but the temperature typically differed among the four records. Jones had used an algorithm to decide which was the "correct" record. Rather than simply providing these processed data, the center furnished the raw data to enable users to try different algorithms and compare their results with those produced by Jones. However, it took additional time to acquire and then work with the raw data.
Emphasis on proper staff training. Based on the conviction that computer science skills alone do not provide the intuition needed for effective QA/QC, CDIAC's professional employees generally have a scientific background in addition to computer programming training. The center's location in Oak Ridge also has proved extremely valuable, because the staff have access to the expertise of a wide range of scientists when needed to help evaluate data.
Responsibility and rewards for staff. CDIAC managers assign one staff person to have responsibility for each data set. This increases skill levels and makes sure someone has the "big picture." In addition, staff care more about "their" data set than they would if responsibility were fragmented. Staff are rewarded for data management skill and success and for soliciting feedback from the user community. These policies contribute to a low turnover of staff, which in turn retains learning in the
organization and permits staff to continue improving their skills. This pool of experience also makes it easier to train new staff.
Added focus on nontechnical issues. CDIAC staff recognize that data management and distribution are not just a technical exercise. Equal emphasis is given to organizational and motivational issues. A major factor contributing to the center's success is that it is operated as a long-term program with secure funding at a consistent level.
Ability to be selective in accepting data sets. The center is not required to accept all data sets that may be submitted to it. Its ability to be selective means there is less danger of staff overload; as a result they can spend the time needed for intensive QA/QC and documentation.
Good working relationship with sponsor. Finally, the center has a good relationship with its sponsoring agency, the Department of Energy. The staff identified a single person, Fred Kuminoff, as CDIAC's champion during its initial years. The current management maintains a commitment to providing data-related services and has helped focus CDIAC on a role it could fulfill effectively.
Jones P.D., S.C.B. Raper, B.S.G. Cherry, C.M. Goodiss, T.M.L. Wigley, B. Santer, P.M. Kelly, R.S. Bradley, and H.F. Diaz. 1991. Numeric Data Package (NDP) 20, Global Grid Point Surface Air Temperature. Carbon Dioxide Information Analysis Center, Oak Ridge National Laboratory, Oak Ridge, Tenn.
Oak Ridge National Laboratory (ORNL). 1991. Trends '91. Thomas A. Boden et al., eds. Environmental Sciences Division, ORNL, Oak Ridge, Tenn.
Oak Ridge National Laboratory (ORNL). 1993. Carbon Dioxide Information Analysis Center: FY 1992 Activities. R.M. Cushman and F.W. Stoss, eds. Environmental Sciences Division, ORNL, Oak Ridge, Tenn.