
Currently Skimming:

2 The Current State of Data Integration in Science
Pages 6-17

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 6...
... Alex Szalay of Johns Hopkins University observed that many fields of science are becoming data intensive, and thus reliant on cyberinfrastructure. An example is the use of virtual observatories in astronomy, in which the database serves as a sort of laboratory in which an astronomer
From page 7...
... observed that a virtual observatory requires a global schema, a concept that has not worked very well in most enterprises. There have been numerous efforts to develop global schemas, but anticipating the many questions that might be posed of the data constitutes a significant barrier.
From page 8...
... The Large Synoptic Survey Telescope (LSST) project might accumulate hundreds of petabytes of data over the next decade, while the proposed Square Kilometer Array (SKA)
From page 9...
... With emerging capabilities, a biology laboratory will be able to produce over 100 billion base pairs a day, which begins to rival the 150 billion base pairs produced in all of 2007 by the Human Genome Institute. A laboratory that produces 100 billion base pairs per day will never be able to fully analyze those data, so researchers from the broader community will have to be called on.
From page 10...
... Dr. Clarke also noted that, with geospatial data, the goal is often more complex than producing single map layers (showing, for example, political boundaries, geography, or roads)
From page 11...
... He pointed to a research study that examined whether there are positive symptoms in schizophrenics associated with severe temporal gyrus dysfunction. Answering that question required integrating results from multiple data sources, from multiple sites, and multiple imaging modalities.
From page 12...
... Laura Haas of the IBM Almaden Research Center noted that the database community has been integrating workflows more generally, to include not just metadata but also information about the subsequent analysis, such as which data were selected, which analyses were performed, and what methods and software were used, all of which could apply to situations such as the one mentioned by Dr. Frazier.
From page 13...
... DATA-INTEGRATION TOOLS
Dr. Clarke pointed out that geospatial researchers have created the beginnings of a data-integration policy through the adoption of the National Spatial Data Infrastructure.
From page 14...
... But data stewardship operates on timescales that are more familiar to data archivists and research librarians, which are longer than the active professional life and interests of many of the researchers involved in the project that produced the data (and certainly longer than the tenure of a graduate student)
From page 15...
... CROSSCUTTING DISCUSSION
Michael Brodie of Verizon Communications brought up some more general issues of standards across enterprises. Establishing and maintaining standards in a very large community, whether a scientific community or an enterprise, is difficult because there are few general principles to help one decide how well a given standard will suit a particular data set, particularly when that data set is innovative or might be subject to novel reuse sometime in the future.
From page 16...
... He thinks the better model is distributed databases and distributed costs of maintaining them. David Maier of Portland State University raised the question of how to train people to work effectively with shared data.
From page 17...
... As noted above, the scale of scientific data is increasing rapidly. The benefits of integrating large volumes of data, multiple data sets from different sources, and multiple types of data are enormous, and this integration will enable science to advance more rapidly and in areas heretofore outside the realm of possibility.


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.