The National Academies Press


3 Improving Current Capabilities for Data Integration in Science
Pages 18-30


From page 18... ... Individual groups have little incentive to publish data, which slows the progress of the broader field. A new researcher in the domain is presented with a daunting data-discovery problem.
From page 19... ... Most popular transforms have been written multiple times by multiple labs, which is, of course, inefficient. Workshop participants said it was rarely easy to locate existing transformation software of interest, and some suggested that an online service to share transforms could be established.
From page 20... ... The same is true for all sorts of data manipulations, with similar kinds of code modules appearing over and over among ... [Footnote 1: The Semantic Web is an ambitious dream of deploying interlinked information via the Resource Description Framework (RDF) throughout the Web.]
From page 21... ... and others might act like an object-oriented database or even a content repository. Besides having different interfaces, federation engines also vary in capability, from "gateway" systems that allow simple queries against one source at a time while providing a common interface to all sources, to systems that let users apply the full power of their query language to gather or correlate information from multiple diverse sources.
From page 22... ... In this example, federation allows the scientist to pose the query without worrying about the geographic distribution of the data or about the different interfaces for the chemical stores, relational databases, and literature sources. The federation engine bridges this heterogeneity and drives the execution of the query across the different sources, reporting the results to the waiting scientist.
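The federation idea in this excerpt can be sketched in a few lines of Python. This is an illustration only, not the engine discussed in the report: the `KeyValueSource` and `ListSource` adapters, the `federated_query` function, and the sample records are all hypothetical. Each adapter hides its own storage layout behind the same `query` method, so the caller poses one query without knowing where or how the data live.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Predicate = Callable[[Record], bool]


class ListSource:
    """Hypothetical adapter over an in-memory list of records."""

    def __init__(self, name: str, rows: List[Record]) -> None:
        self.name = name
        self._rows = rows

    def query(self, predicate: Predicate) -> Iterable[Record]:
        return (row for row in self._rows if predicate(row))


class KeyValueSource:
    """Hypothetical adapter over a dict keyed by compound name."""

    def __init__(self, name: str, table: Dict[str, Record]) -> None:
        self.name = name
        self._table = table

    def query(self, predicate: Predicate) -> Iterable[Record]:
        for key, value in self._table.items():
            # Convert the native layout into the common record shape.
            record = {"compound": key, **value}
            if predicate(record):
                yield record


def federated_query(sources, predicate: Predicate) -> List[Record]:
    """Run one predicate against every source and tag each hit with its origin."""
    results = []
    for source in sources:
        for record in source.query(predicate):
            results.append({"source": source.name, **record})
    return results


chem = KeyValueSource("chem_store", {
    "aspirin": {"mass": 180.16},
    "caffeine": {"mass": 194.19},
})
papers = ListSource("literature", [
    {"compound": "aspirin", "title": "Paper A"},
    {"compound": "ibuprofen", "title": "Paper B"},
])

hits = federated_query([chem, papers], lambda r: r.get("compound") == "aspirin")
for hit in hits:
    print(hit)
```

A real federation engine would also push work down to each source and optimize the plan across them; this sketch simply applies the same filter everywhere and merges the results.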
From page 23... ... paradigm, which is commonly used in business, is an alternative approach for the integration of primarily structured data. Its first step is to extract data from various sources, which includes conversion into some common format.
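The extraction step described here, pulling data from several sources and converting it into one common format, might look like the Python sketch below. The excerpt does not name the paradigm, so the final load into a single table is an assumption based on the familiar extract-transform-load pattern. The two sample sources, their field names, and the `readings` table are hypothetical.

```python
import csv
import io
import json
import sqlite3

# Two hypothetical sources with different formats and units.
CSV_SOURCE = "sample_id,temp_c\ns1,21.5\ns2,19.0\n"
JSON_SOURCE = '[{"id": "s3", "temperature_f": 68.0}]'


def extract_csv(text):
    """Extract CSV rows into the common (sample_id, temp_c) format."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["sample_id"], float(row["temp_c"])) for row in reader]


def extract_json(text):
    """Extract JSON items, converting Fahrenheit to Celsius on the way."""
    return [
        (item["id"], round((item["temperature_f"] - 32) * 5 / 9, 2))
        for item in json.loads(text)
    ]


def load(rows):
    """Load rows in the common format into one in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sample_id TEXT PRIMARY KEY, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()
    return conn


conn = load(extract_csv(CSV_SOURCE) + extract_json(JSON_SOURCE))
rows = conn.execute(
    "SELECT sample_id, temp_c FROM readings ORDER BY sample_id"
).fetchall()
print(rows)
```

Unlike federation, which leaves data in place and queries it on demand, this approach copies everything into one store up front.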
From page 24... ... Dynamic algorithms for these tasks would enable federation. • Most federation engines today work only on traditional structured or semistructured data, though they can also return some uninterpreted fields, such as images.
From page 25... ... The bbc.openlinksw.com server periodically crawls this content and presents it for search and structured querying via SPARQL, the SQL equivalent for RDF. Additionally, this server, if used as a proxy for accessing other RDF content, caches this content and allows querying over the BBC data and other cached data.
From page 26... ... At the workshop, Amr Awadallah of Cloudera Computing described a popular example that illustrates the scalability of Hadoop for economically storing large amounts of scientific data: the Large Hadron Collider Tier 2 site at the University of Nebraska-Lincoln, which currently stores 400 TB of data. As scientific data sets continue to grow at exponential rates, the need is paramount for scalable, fault-tolerant systems that can both store and process data economically.
From page 27... ... The power of the overall MapReduce system (the distributed scheduling system that executes MapReduce jobs) ...
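For readers unfamiliar with the MapReduce model mentioned in this excerpt, the following Python sketch shows its three stages (map, shuffle, reduce) in a single process. It is not the Hadoop API and does none of the distributed scheduling the excerpt refers to; the function names and the sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


def run_job(records: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = run_job(["muon muon electron", "electron muon"])
print(counts)
```

In a real deployment the map and reduce calls run on many machines and the scheduler handles data placement and task failures; only the programming model is shown here.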
From page 28... ... Most commercial data-integration solutions are based on the relational model, with a few using XML as a target model. Such offerings are not likely to be of great help for integrating scientific data sets because there is not much support for some data types common to science, such as sequences, time series, and multidimensional arrays.
From page 29... ... Because of the limitations of relational DBMSs for supporting arrays, Maier reported that many scientific data end up in files using array data formats, such as NetCDF and HDF. While such formats directly support multidimensional arrays and provide appropriate access methods, they offer a file-per-dataset model and limited operations and hence are far from a full DBMS. They support interfaces to languages popular in scientific domains (C++, Fortran, Python) ...
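One way to see why arrays fit awkwardly into a relational DBMS is to store a small grid as one row per cell, a common workaround, and compare a column slice in SQL with direct indexing. The Python sketch below is an illustration under that assumption; the `cells` table and the sample grid are hypothetical and are not drawn from the workshop discussion.

```python
import sqlite3

# A small 2 x 3 array held natively as nested lists.
grid = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]

# Relational encoding: one row per cell, with explicit index columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cells (i INTEGER, j INTEGER, value REAL, PRIMARY KEY (i, j))"
)
conn.executemany(
    "INSERT INTO cells VALUES (?, ?, ?)",
    [(i, j, v) for i, row in enumerate(grid) for j, v in enumerate(row)],
)

# Column 1 of the array, first through SQL, then by direct indexing.
column_sql = [
    value
    for (value,) in conn.execute("SELECT value FROM cells WHERE j = 1 ORDER BY i")
]
column_direct = [row[1] for row in grid]

print(column_sql, column_direct)
```

Both routes return the same values, but the relational form stores two index values with every cell and has to impose the array ordering through a query, whereas an array format keeps shape and order implicit in the layout.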
From page 30... ... has recently begun development of an open-source database with fully native support for an array model, including an array-aware storage manager. In addition to a data model and algebra for multidimensional arrays, SciDB will support history and versioning of arrays, provenance, uncertainty annotations, and parallel execution of queries.



