4
Data Archiving in the Science Centers
THE IMPORTANCE OF ARCHIVAL ACCESS
Access to archival material is becoming increasingly important for all space science disciplines because data are often analyzed more than once and because scientists combine existing data sets across traditionally separate wavelength boundaries. The science centers have become archival centers, and today these online archives serve as the primary point of access to mission data, both raw and calibrated.
Not only are the archives the keepers of the raw observations, but they also provide direct access to calibrated versions of their data products, with online documentation and searchable databases linked to the literature. This “shrink-wrapped” feature of modern archives makes it easier for astronomers to combine data across various subdisciplines, a task that would have been difficult even a few years ago when all astronomers had their own sets of tools and did most of the data reduction themselves.
Archives are a necessary part of an ongoing mission in that they need to furnish rapid access to science-quality data. They also need to capture the relevant information for future recalibration and any modifications and changes that were made to the data reduction pipelines. This provenance information is mandatory not only for a consistent data set but also for legacy uses of the data.
SUSTAINABLE, LONG-TERM ARCHIVES
Archives play a role in efforts that go beyond the space astronomy mission at hand. In most cases, the data sets produced by NASA’s space astronomy missions will be a valuable asset for the community even decades after the mission’s completion; the International Ultraviolet Explorer (IUE) and the Infrared Astronomical Satellite (IRAS) are examples. The long-term preservation and continued curation of such data sets are extremely important. These responsibilities present particular challenges for the science centers and have become an important part of their long-term mission. A key question is this: Once a space astronomy mission has completed its operational lifetime, should its archive remain at the location that managed the archive during the mission, or should it be migrated to a central facility where economies of scale might provide a cheaper solution to long-term preservation?
The decision on where to keep a long-term archive should consider what makes an archive usable and sustainable for the community, beyond the minimal goals of preserving the bytes. The committee describes a sustainable archive as one that
- Continually facilitates the production of new scientific results;
- Has a strategic goal to enable more and better science;
- Contains high-quality, reliable data;
- Provides simple and useful scientific tools to a broad community;
- Provides user support to the novice as well as to the power user;
- Has many diverse uses (and users);
- Has a core group of users for whom it is an everyday tool;
- Collects metrics that track usage and science output;
- Is properly curated (e.g., errors discovered are documented and fixed);
- Adapts and evolves in response to community input; and
- Has an adequate mix of developers, scientists, and tech support staff.
In spite of the considerable efforts of archive staff to capture as much of the metadata about the particular instruments as possible, the scientists dealing with the quirks of an instrument over the years will always have a much more intimate understanding of the systematic errors in the data products. It is important for the mission to develop good metadata and documentation to ensure the long-term accessibility and usability of its mission data. NASA astronomy science centers can play an important role over the long term in capturing as much of this knowledge as possible during the mission phase, but they should also strive to retain the knowledge as long as necessary, using the above criteria.
ORGANIZATION BY WAVELENGTH
It is clear that there is a natural migration of older data sets into centralized facilities and that not every mission will (or should) retain its own separate archive. Although many archives specialize in broad wavelength ranges,1 those wavelength distinctions have loosened over time. There are also value-added services such as the Astrophysics Data System (ADS)2 (http://adswww.harvard.edu/) and the NASA/Infrared Processing and Analysis Center (IPAC) Extragalactic Database (NED)3 (http://nedwww.ipac.caltech.edu/), which link the data sets to the literature. Today many astronomers use these services several times a day. The wavelength-oriented archives remain the primary guardians of the data sets from their main missions: the Hubble Space Telescope (HST) at the Space Telescope Science Institute (STScI); the Compton Gamma Ray Observatory, Uhuru, the Advanced Satellite for Cosmology and Astrophysics, the Roentgen satellite (ROSAT), and many others at HEASARC; Einstein and Chandra at the Chandra X-ray Center (CXC) (http://cxc.harvard.edu/cda/); and IRAS, the Two Micron All Sky Survey (2MASS), Spitzer, and many others at IPAC.
1 The UV-optical data sets are migrating to the Multimission Archive at the Space Telescope Science Institute (MAST) at http://archive.stsci.edu/; the near- and far-infrared archives are at the Infrared Science Archive (IRSA) at http://irsa.ipac.caltech.edu/, at IPAC; and the high-energy data sets are moving to the High Energy Astrophysics Science Archive Research Center (HEASARC) at http://heasarc.gsfc.nasa.gov/ at the Goddard Space Flight Center.
2 ADS, operated by Harvard and funded by NASA, contains 4.8 million searchable bibliographic records. Full-text scans of many of these records are viewable free via a browse engine.
3 NED, operated by the Jet Propulsion Laboratory and under contract to NASA, contains 14 million names for over 9 million extragalactic objects and over 3.3 million bibliographic references.
This natural organization by wavelength has been rather efficient, since the data sets can be curated using the shared expertise at the respective science centers. The personnel at the centers have enormous collective expertise related to these missions. The help desks are staffed by scientists who have first-hand experience in developing and/or using these missions. The wavelength-specific software tools are also maintained and distributed through these channels.
Finding: Successful research using archival data sets is dependent on the resident expertise and corporate memory that reside at the science centers.
ARCHIVES AS A SYSTEM
Scientists try to stress the capabilities of any instrument they use to make new discoveries. As a result, most discoveries are made at the edges: the most distant quasar, the faintest arc in the image of a distant galaxy, or the weakest spectral line in a noisy spectrum. Each new space astronomy mission provides a new look at the universe, fainter than before or opening a new domain in the electromagnetic spectrum.
The multiwavelength data available in the different mission archives offer a way to create new “edges.” By combining data sets from different wavelengths, astronomers have found hundreds of brown dwarfs; discovered the most distant galaxies; discovered that the x-ray background was dominated by active galactic nuclei; and established the connection between gamma-ray sources and radio-bright active galaxies.
The NASA astronomy science centers played a crucial role in changing the scientific paradigm of how space science data are analyzed. Figure 4.1 quantifies archival data collected by HST and shows how retrievals increased following the release of the Hubble Deep Field data. The Hubble and Chandra Deep Fields and some of the selected areas—for example, the Great Observatories Origins Deep Survey (GOODS) at http://www.stsci.edu/science/goods/—are prime examples of collecting data at multiple wavelengths over the same area. Archival grants provided the initial motivation for astrophysicists to start analyzing archival data. Today it seems to be almost natural that many data sets are analyzed by numerous scientists, but 10 years ago this was the exception.
Finding: Continued access to mission data across a broad range of wavelengths is of utmost importance to the whole community.
As the use and reuse of data are crossing wavelength boundaries, it is important to consider what is necessary to support such activities. The most important capability from an astronomer’s perspective is that of locating an archive that contains data from a particular region of the sky, in a particular waveband, with a particular instrument. Doing so is possible today, but the procedure is cumbersome.
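As an illustration only, the kind of lookup described above (sky region, waveband, instrument) can be sketched as a filter over a registry of archive holdings. The `Holding` record, the `find_holdings` function, and every registry entry below are hypothetical; they do not describe any real archive's contents or interface, and the rectangular footprint is a deliberate simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Holding:
    """One hypothetical archive holding with a rectangular sky footprint (degrees)."""
    archive: str
    instrument: str
    waveband: str
    ra_min: float
    ra_max: float
    dec_min: float
    dec_max: float

    def covers(self, ra, dec):
        # Simplification: the footprint is a box that does not wrap through RA = 0.
        return (self.ra_min <= ra <= self.ra_max
                and self.dec_min <= dec <= self.dec_max)


def find_holdings(registry, ra, dec, waveband=None, instrument=None):
    """Return holdings covering (ra, dec), optionally filtered by waveband and instrument."""
    results = []
    for holding in registry:
        if not holding.covers(ra, dec):
            continue
        if waveband is not None and holding.waveband != waveband:
            continue
        if instrument is not None and holding.instrument != instrument:
            continue
        results.append(holding)
    return results


# Illustrative entries only; names and footprints are invented for this sketch.
REGISTRY = [
    Holding("archive-A", "uv-spectrograph", "ultraviolet", 10.0, 12.0, -6.0, -4.0),
    Holding("archive-B", "ir-camera", "infrared", 9.5, 11.5, -5.5, -3.5),
    Holding("archive-C", "xray-imager", "x-ray", 180.0, 182.0, 30.0, 32.0),
]

hits = find_holdings(REGISTRY, ra=10.5, dec=-5.0)
infrared_hits = find_holdings(REGISTRY, ra=10.5, dec=-5.0, waveband="infrared")
```

A single registry of this kind, queried once, is what would replace the cumbersome procedure of visiting each archive in turn.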
STANDARDIZATION AND REUSE OF TOOLS
To facilitate comparison, different archives have to be able to provide data in a common format, thereby enabling easy cross-matching of different catalogs and displays of images on the same scale and orientation. Astronomy has a long tradition of common standards, most notably, the Flexible Image Transport System (FITS) format. All astronomical software has been able to read FITS images and binary tables for at least two decades. By contrast, it took considerably longer to reach consensus on a common format for spectra. The FITS format was a very important step in allowing the utilization of different data sets for astronomy research.
Finding: Software tools that use standard data frameworks such as FITS provide the best means to cross-query wavelength-specific data sets.
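To make the idea of cross-matching catalogs concrete, the following is a minimal sketch that pairs sources from two catalogs by angular separation, assuming both catalogs already give positions as right ascension and declination in degrees in the same coordinate frame. The catalog entries, the field names (`id`, `ra`, `dec`), and the 1-arcsecond matching radius are assumptions made for illustration. The brute-force loop is adequate only for small catalogs; production services would use spatial indexing.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form, which stays numerically stable for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(catalog_a, catalog_b, radius_arcsec=1.0):
    """Pair each source in catalog_a with its nearest catalog_b source within the radius."""
    matches = []
    for a in catalog_a:
        best, best_sep = None, None
        for b in catalog_b:
            sep = angular_separation_deg(a["ra"], a["dec"], b["ra"], b["dec"]) * 3600.0
            if sep <= radius_arcsec and (best_sep is None or sep < best_sep):
                best, best_sep = b, sep
        if best is not None:
            matches.append((a["id"], best["id"], best_sep))
    return matches


# Hypothetical catalogs; positions are invented for this sketch.
optical = [
    {"id": "opt-1", "ra": 10.0000, "dec": -5.0000},
    {"id": "opt-2", "ra": 150.0000, "dec": 2.2000},
]
infrared = [
    {"id": "ir-7", "ra": 10.0001, "dec": -5.0001},
    {"id": "ir-9", "ra": 200.0000, "dec": 45.0000},
]

pairs = cross_match(optical, infrared, radius_arcsec=1.0)
```

Here only `opt-1` and `ir-7` lie within the matching radius of each other, so `pairs` holds a single match; the other sources have no counterpart.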
CURRENT STATUS
The National Virtual Observatory (NVO), at http://www.us-vo.org/, is beginning to coordinate standardization efforts and to provide the first data integration and federation tools and applications. To date, the staff from STScI, HEASARC, and CXC have played significant roles in defining standards within the context of the NVO itself and the NVO as a member of the International Virtual Observatory Alliance. Standards, however, are beneficial only if they are accepted. There are encouraging signs that science centers are implementing the virtual observatory (VO) standards. Indeed, much development work in the centers over the last year has been on increasing compatibility with the VO standards. The HEASARC DataScope,4 the IRSA Footprint Service,5 and the STScI Hubble Legacy Archive6 projects are good examples of these efforts. The upper management of the archives at the science centers has embraced this direction, and the archives are currently implementing medium-term measures to achieve VO standards.
NEAR FUTURE
If the current trend to strong collaboration continues, the archives supported by the science centers will form a homogeneous, easy-to-use system that is integrated from a user’s perspective. Each wavelength regime, however, will retain its own responsibilities for the long-term curation and preservation of the expertise. Such a system of archives needs to be sustainable. What does this sustainability imply? The committee concludes that archives have to form a system that does the following:
- Provides services that tap resources across the whole community, not just those from one center;
- Facilitates the adaptation of community-wide standards for data and services;
- Provides a mechanism for collaborating on and sharing broadly useful software with other archives and with the astrophysics community;
- Provides data, software, standards, and documentation;
- Offers, on a regular basis, tools to teach users and developers; and
- Supports international access to data and services.
Further discoveries will stem from the analysis of multiwavelength data sets. As access to remote data improves and user-friendly tools to support multiwavelength analysis become available, more astrophysicists are expected to rely on these archival data sets on a daily basis. Such a likely outcome will bring additional challenges and raised expectations. Reliability of the data archives will be crucial because more of the community’s research will depend on it. Performance will also be critical when users expect to get their data in seconds rather than hours.
Data curation and provenance are notoriously labor intensive and might present the biggest challenge. As data processing evolves and the archives store derived data sets, possibly from the combination of multiwavelength data (see mention of GOODS above), it will be increasingly important to track the processing trail of the derived products. In a world of more and more data, finding the relevant data sets and assessing their quality and reliability will also be increasingly important, so that the continuous evolution and curation of data, even old mission data, become crucial. Centers could take on more active roles in recent efforts to move data analysis software to the next level, in which a universal (common and distributed) analysis infrastructure supports many instrument-specific applications, some of which are developed in the community and some at the center.
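One way to picture a processing trail is as a record attached to each derived product that names its parent products and the software step that produced it. The sketch below is a hypothetical data model, not a description of any archive's actual provenance system; the class names, product names, pipeline names, and version strings are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ProcessingStep:
    """A single processing step: which software, at which version, produced a product."""
    software: str
    version: str


@dataclass
class DataProduct:
    """A data product that remembers its parent products and the step that created it."""
    name: str
    parents: List["DataProduct"] = field(default_factory=list)
    step: Optional[ProcessingStep] = None  # None marks a raw observation

    def trail(self):
        """Return (product name, step) pairs from the raw inputs to this product, oldest first."""
        entries = []
        for parent in self.parents:
            for entry in parent.trail():
                if entry not in entries:  # shared ancestors are listed once
                    entries.append(entry)
        if self.step is not None:
            entries.append((self.name, self.step))
        return entries


# Hypothetical products and pipelines, invented for this sketch.
raw_optical = DataProduct("optical_raw")
raw_xray = DataProduct("xray_raw")
cal_optical = DataProduct("optical_calibrated", [raw_optical],
                          ProcessingStep("optical_pipeline", "2.1"))
cal_xray = DataProduct("xray_calibrated", [raw_xray],
                       ProcessingStep("xray_pipeline", "1.4"))
combined = DataProduct("multiwavelength_catalog", [cal_optical, cal_xray],
                       ProcessingStep("catalog_merge", "0.3"))

trail = combined.trail()
```

Walking `trail` for the combined catalog lists both calibration steps before the merge step, which is the information a later user would need to judge whether a derived product predates a recalibration.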
As software technology evolves, it is expected that it will be progressively easier to calibrate the data as they are accessed, guaranteeing the most up-to-date version for everyone. Calibrating data as users extract a given data set will require ever more computational resources to be co-located with the archives. This will expand the level of services that the archives will be asked to provide.