

3 Scaling the Infrastructure for Data Management
Pages 41-57

The Chapter Skim interface presents the single passage algorithmically identified as most significant on each page of the chapter.


From page 41...
... is more powerful than an analysis that is limited to the records of just one bank. Unfortunately, scaling up the number of data sets combined in an analysis is difficult in practice because of heterogeneity in data representations and semantics, uneven data quality, and varying degrees of openness.
From page 42...
... Such an approach can interpolate between the current extremes of restrictive up-front standardization and free-form chaos. Rich semantics allows tools to be developed that can effectively exploit relationships, thus enabling improved discovery and navigation, and several supporting standards and technologies are emerging. However, current capabilities still depend heavily on defining well-formed models and structures up front. It may therefore be important to consider how to evolve data standards over time, so that as patterns are recognized in free-form entries they can be gradually folded into the structured portion of the representation.
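A minimal sketch of how such folding might work, assuming a hypothetical record format with a free-form field; the DOI pattern and the fold_recognized_patterns helper are illustrative only, not a scheme from the chapter:

```python
import re

# Hypothetical sketch: a record carries structured fields plus a free-form
# remainder. When a pattern is recognized in the free-form text, it is
# promoted ("folded") into the structured portion of the representation.
def fold_recognized_patterns(record: dict) -> dict:
    text = record.get("freeform", "")
    match = re.search(r"\bDOI:\s*(\S+)", text)      # a newly recognized pattern
    if match and "doi" not in record:
        record["doi"] = match.group(1)              # promote to structured field
        record["freeform"] = text.replace(match.group(0), "").strip()
    return record

record = {"title": "Some report",
          "freeform": "Full text at DOI: 10.1000/example.doi (open access)."}
print(fold_recognized_patterns(record))
```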
From page 43...
... A less daunting but still very ambitious goal is to track the provenance of data elements -- that is, their origin, movement, and processing histories. Provenance is also useful for purposes other than reasoning about quality, such as propagating data updates efficiently and attributing data properly in scientific publications.
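As an illustration of the idea (not a scheme proposed in the chapter), a provenance record can simply travel with each data element, accumulating one entry per processing step; the class and field names below are hypothetical:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical illustration: class and field names are not from the text.
@dataclass
class ProvenanceEvent:
    """One step in a data element's history: who touched it, and how."""
    actor: str       # system or person that processed the element
    operation: str   # e.g., "ingest", "clean", "join", "aggregate"
    timestamp: str

@dataclass
class DataElement:
    value: object
    origin: str                                   # original source of the element
    history: list = field(default_factory=list)   # movement and processing steps

    def record(self, actor, operation):
        """Append a processing step so the full history stays auditable."""
        self.history.append(ProvenanceEvent(
            actor, operation, datetime.now(timezone.utc).isoformat()))

# Provenance travels with the value through each processing stage.
e = DataElement(value=42.0, origin="bank_a/transactions.csv")
e.record("etl-job-7", "currency-normalization")
e.record("analyst-pipeline", "outlier-filter")
print(e.history[0].operation)   # currency-normalization
```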
From page 44...
... A long-standing example of hardware parallelism is the signal-processing integrated circuit, which can perform the Fast Fourier Transform as a single hardware operation. More recent developments are motivated by problems in network management and by the hardware developed to accelerate computer graphics.
From page 45...
... Another recent development in hardware parallelism is motivated by the graphics processing units (GPUs) developed to accelerate computer graphics.
From page 46...
... Data Stream Management Systems

Data stream management systems (DSMS) have emerged as a significant research topic over the past decade, with many research systems (Stream, Niagara, Telegraph, Aurora, Cougar)
From page 47...
... If the DSMS is programmed using a declarative query language, the query analyzer converts the textual queries into a collection of stream operators; in either case, a collection of interconnected stream operators is presented to the query optimizer.
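A minimal sketch of what such compilation might produce, assuming a toy DSMS in which each operator consumes and produces a stream of tuples; the select, project, and compile_query names are hypothetical stand-ins for the analyzer's output handed to the optimizer:

```python
from typing import Callable, Iterable, Iterator

# Hypothetical toy operators: a real DSMS compiles declarative query text
# into an interconnected graph of such operators.
Stream = Iterable[dict]
Operator = Callable[[Stream], Iterator[dict]]

def select(predicate) -> Operator:
    def op(stream):
        return (t for t in stream if predicate(t))
    return op

def project(*fields) -> Operator:
    def op(stream):
        return ({f: t[f] for f in fields} for t in stream)
    return op

def compile_query(operators) -> Operator:
    """Stand-in for the query analyzer: wire operators into one pipeline.
    The optimizer would reorder or fuse this graph before execution."""
    def plan(stream):
        for op in operators:
            stream = op(stream)
        return iter(stream)
    return plan

# Tuples flow through the interconnected operators one at a time.
packets = [{"src": "10.0.0.1", "bytes": 900}, {"src": "10.0.0.2", "bytes": 40}]
query = compile_query([select(lambda t: t["bytes"] > 100), project("src")])
print(list(query(packets)))   # [{'src': '10.0.0.1'}]
```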
From page 48...
... are normally split into eight 10-Gigabit Ethernet streams by specialized networking equipment, at the direction of the query optimizer. The InfoSphere Streams system applies many optimizations to the query graph for parallel and distributed processing: splitting streams to enable parallelism of expensive operators, coalescing query operators into processing elements to minimize data copying, and allocating processing-element instances to cluster nodes to maximize parallelism while minimizing data-copying and network-transmission overhead.
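A hedged sketch of the stream-splitting idea only (not InfoSphere Streams' actual mechanism): tuples are hash-partitioned by key so that several instances of an expensive operator can run in parallel, and the substreams are merged afterward. All names are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical illustration of stream splitting: tuples are hash-partitioned
# by key so that several instances of an expensive operator run in parallel.
def split_stream(tuples, n_parts, key):
    parts = [[] for _ in range(n_parts)]
    for t in tuples:
        parts[hash(t[key]) % n_parts].append(t)   # same key -> same partition
    return parts

def expensive_operator(partition):
    # Stand-in for a costly per-tuple computation.
    return [{"src": t["src"], "score": t["bytes"] ** 0.5} for t in partition]

tuples = [{"src": f"10.0.0.{i}", "bytes": i * 100} for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = split_stream(tuples, 4, "src")
    merged = [t for part in pool.map(expensive_operator, parts) for t in part]
print(len(merged))   # 8: the split substreams are merged back together
```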
From page 49...
... These gaps generally occur where complex synchronization would be involved, e.g., file locks and concurrent file access by processes on different servers. The difficulty of providing POSIX compliance in a very-large-scale cluster has motivated the development of non-POSIX-compliant file systems, for example, the Google file system and the Hadoop distributed file system (HDFS)
From page 50...
... File availability is ensured using replication; a file block is, by default, replicated to three storage hosts, although critical or frequently accessed files may have a higher degree of replication. An interesting aspect of the Google file system is that a single master server, backed by a hot spare, controls thousands of file-server nodes.
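A minimal sketch of this replica-placement idea, assuming a toy master that assigns each block to three distinct storage hosts; the function names and the random placement policy are illustrative, not the Google file system's actual algorithm:

```python
import random

# Hypothetical sketch of replicated block placement: a single master chooses,
# for each block, a set of distinct storage hosts (default replication = 3).
DEFAULT_REPLICATION = 3

def place_block(block_id, hosts, replication=DEFAULT_REPLICATION):
    """Pick `replication` distinct hosts for one block; hot or critical
    files would simply request a larger factor."""
    return random.sample(hosts, replication)

hosts = [f"node{i:03d}" for i in range(1000)]    # master tracks ~1,000 nodes
placement = {f"blk_{i}": place_block(f"blk_{i}", hosts) for i in range(4)}
print(placement["blk_0"])   # e.g., ['node412', 'node077', 'node903']
```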
From page 51...
... For example, Apache Hadoop is designed to use Elastic Compute Cloud servers accessing data stored in S3. The success of Amazon's cloud service has encouraged the development of other cloud computing offerings.
From page 52...
... , multiple query operators can execute in parallel, in a manner similar to the inter-operator parallelism exploited by data stream systems. Parallelizing database processing has been an active research topic for several decades.
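As an illustrative sketch of inter-operator parallelism (not drawn from any particular database), the pipeline below runs a scan operator and an aggregation operator concurrently, connected by a tuple queue; all names are hypothetical:

```python
import queue
import threading

# Hypothetical sketch of inter-operator parallelism: a scan operator and an
# aggregation operator run concurrently, connected by a tuple queue, so the
# aggregate consumes rows while the scan is still producing them.
DONE = object()   # sentinel marking end of stream

def scan(table, out):
    for row in table:
        out.put(row)
    out.put(DONE)

def aggregate(inp, result):
    total = 0
    while (row := inp.get()) is not DONE:
        total += row["amount"]
    result["sum"] = total

table = [{"amount": i} for i in range(10_000)]
q, result = queue.Queue(maxsize=1024), {}
t1 = threading.Thread(target=scan, args=(table, q))
t2 = threading.Thread(target=aggregate, args=(q, result))
t1.start(); t2.start(); t1.join(); t2.join()
print(result["sum"])   # 49995000
```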
From page 53...
... However, they also found that tuning parallel databases is often a difficult task requiring specialized expertise, whereas MapReduce systems are more readily configured to give good performance. Modern shared-nothing parallel databases include Teradata, Netezza, Greenplum, and extensions to Oracle and DB2.
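The contrast is easier to see with the MapReduce programming model in view. Below is the canonical word-count example written as plain Python functions; it sketches the map, shuffle, and reduce phases generically rather than any particular framework's API:

```python
from collections import defaultdict
from itertools import chain

# Sketch of the MapReduce model as plain functions; a real framework, not
# the programmer, handles partitioning, shuffling, and parallel execution.
def map_phase(document):
    for word in document.split():
        yield (word.lower(), 1)          # emit (key, value) pairs

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)        # group all values by key
    return groups

def reduce_phase(key, values):
    return key, sum(values)

docs = ["big data systems", "parallel data systems"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)   # {'big': 1, 'data': 2, 'systems': 2, 'parallel': 1}
```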
From page 54...
... NoSQL databases attempt to improve scaling by providing only weak or eventual consistency guarantees on the stored data. This eliminates much of the complexity and overhead of the strong consistency provided by conventional databases, overhead that is especially marked in a distributed setting. Examples of NoSQL databases include MongoDB (document store)
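A minimal sketch of eventual consistency, assuming a toy replicated key-value store with last-write-wins reconciliation; the Replica class and anti_entropy helper are illustrative and not how MongoDB or any particular system works:

```python
import time

# Hypothetical sketch of eventual consistency: replicas accept writes
# independently and converge later via last-write-wins reconciliation.
class Replica:
    def __init__(self):
        self.store = {}                  # key -> (timestamp, value)

    def write(self, key, value):
        self.store[key] = (time.time(), value)

    def read(self, key):
        return self.store.get(key, (None, None))[1]

def anti_entropy(a, b):
    """Merge two replicas: for each key, the newer write wins."""
    for key in set(a.store) | set(b.store):
        newest = max(a.store.get(key, (0.0, None)),
                     b.store.get(key, (0.0, None)),
                     key=lambda tv: tv[0])
        a.store[key] = b.store[key] = newest

r1, r2 = Replica(), Replica()
r1.write("user:7", {"name": "Ada"})      # the write lands on one replica only
print(r2.read("user:7"))                 # None: replicas have diverged
anti_entropy(r1, r2)
print(r2.read("user:7"))                 # {'name': 'Ada'} after convergence
```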
From page 55...
... ability to manage it. Parallel databases allow naive users to compose and execute complex programs over petabytes of data.
From page 56...
... Small changes to program phasing, data layout, and system configuration can have a very large effect on performance. Very large systems are typically shared by a user community, and the interactions among multiple concurrently executing parallel programs compound the problem of understanding performance.
From page 57...
... Ghemawat, S., H. Gobioff, and S.-T. Leung. 2003. The Google file system. Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03).

