
5 Large-Scale Data Representations
Pages 66-81

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 66...
... A closely related but somewhat more vague notion is that of a data feature. In many cases, a data feature is an externally defined property of the data that can be easily computed from the data or measured directly and then plugged into a data-processing algorithm.
From page 67...
... Each mathematical structure can be represented using many data structures, with different implementations supporting different operations and optimizing different metrics. • Derived mathematical structures.
From page 68...
... Additional considerations are needed for large-scale distributed computing environments that use computational resources from multiple computers. Reducing Storage and/or Communication In addition to reducing computation time, proper data representations can also reduce the amount of required storage (which translates into reduced communication if the data are transmitted over a network).
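As a minimal, hypothetical illustration of that storage point (the function names below are ours, not the report's): a mostly-zero vector stored as (index, value) pairs needs space proportional to its nonzeros rather than its nominal length, and those same pairs are all that would need to be transmitted over a network.

```python
# Sketch: a sparse (index, value) representation of a mostly-zero
# vector. Storage and communication scale with the number of nonzeros,
# not the vector's length. Names and sizes are illustrative.

def to_sparse(dense):
    """Keep only the nonzero entries as (index, value) pairs."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]

def to_dense(pairs, length):
    """Rebuild the full vector from the (index, value) pairs."""
    dense = [0] * length
    for i, v in pairs:
        dense[i] = v
    return dense

vector = [0] * 10_000
vector[7] = 3.5
vector[4242] = -1.0

pairs = to_sparse(vector)
assert pairs == [(7, 3.5), (4242, -1.0)]      # 2 pairs instead of 10,000 entries
assert to_dense(pairs, len(vector)) == vector  # lossless round trip
```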
From page 69...
... Other related techniques for obtaining low-dimensional representations include random projections, based either on the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984) or its more time-efficient versions (Ailon and Chazelle, 2010; Ailon and Liberty, 2011).
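A basic Gaussian random projection in the spirit of the Johnson-Lindenstrauss lemma can be sketched in a few lines; the dimensions and random seed below are illustrative assumptions, not values from the text. The projection matrix is chosen without looking at the data, yet pairwise distances are approximately preserved.

```python
import numpy as np

# Sketch: project n points from dimension d down to dimension k using
# a Gaussian matrix drawn independently of the data (data-oblivious).
# Sizes are illustrative; JL theory suggests k = O(log n / eps^2).
rng = np.random.default_rng(0)
n, d, k = 100, 10_000, 400

X = rng.standard_normal((n, d))               # n points in R^d
P = rng.standard_normal((d, k)) / np.sqrt(k)  # random projection, scaled
Y = X @ P                                     # n points in R^k

# Pairwise distances survive up to a (1 +/- eps) distortion.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
assert abs(orig - proj) / orig < 0.3
```

Note the projection matrix never depends on X, which is exactly what makes this approach "data-oblivious" in the sense discussed later in the chapter.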
From page 70...
... Dimensionality reduction refers to a broad class of methods that re-express data which are formally of very high dimension in terms of a small number of actual data points or attributes, of linear or nonlinear combinations of those data points/attributes, or of linear combinations of nonlinearly transformed data points/attributes. Such methods are most useful when one can view the data as a perturbed approximation of some low-dimensional scaffolding.
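A minimal sketch of this "perturbed low-dimensional scaffolding" picture, with sizes chosen arbitrarily for illustration: data of nominal dimension 50 are generated from a rank-2 structure plus small noise, and a truncated SVD (PCA without centering) re-expresses each point in just 2 coordinates.

```python
import numpy as np

# Sketch: data = rank-2 scaffolding in R^50 + small perturbation.
# A truncated SVD recovers a faithful 2-dimensional re-expression.
# All sizes and the noise level are illustrative assumptions.
rng = np.random.default_rng(1)
n, d, r = 500, 50, 2

B = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))  # rank-2 scaffolding
X = B + 0.01 * rng.standard_normal((n, d))                     # small perturbation

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X2 = U[:, :r] * s[:r]      # each point's 2 new coordinates
approx = X2 @ Vt[:r]       # rank-2 reconstruction of the data

# Nearly all of the structure lives in the 2 retained dimensions.
err = np.linalg.norm(X - approx) / np.linalg.norm(X)
assert err < 0.05
```

When the data are not well modeled by such a scaffolding (the heavy-tailed, high-variance case raised later in the chapter), no small set of coordinates achieves a comparably small reconstruction error.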
From page 71...
... Exploratory Data Analysis and Data Interpretation Both dimensionality reduction and clustering are highly useful tools for visualization or other tasks that aid in initial understanding and interpretation of the data. The resulting insights can be incorporated or refined in more sophisticated inference procedures.
From page 72...
... Many randomized and nonrandom sampling techniques are also known, including core-sets, data squashing, and CUR decompositions (Agarwal et al., 2005; DuMouchel, 2002; Mahoney and Drineas, 2009).
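One simple member of this family, much more basic than the CUR decompositions cited above, is row-norm sampling: keep a small random subset of rows, chosen with probability proportional to their squared norms and rescaled, so that the subsample's Gram matrix is an unbiased estimate of the full one. The sizes and seed below are illustrative assumptions.

```python
import numpy as np

# Sketch: sample s of the m rows of A with probability proportional to
# squared row norms, rescale, and use the sample's Gram matrix as an
# estimate of A^T A. The error shrinks roughly like 1/sqrt(s).
rng = np.random.default_rng(2)
m, n, s = 5000, 20, 500          # keep 500 of 5,000 rows (illustrative)

A = rng.standard_normal((m, n))
p = (A ** 2).sum(axis=1)
p = p / p.sum()                   # row-norm sampling probabilities

idx = rng.choice(m, size=s, p=p)
S = A[idx] / np.sqrt(s * p[idx, None])   # rescaled sampled rows

exact = A.T @ A
approx = S.T @ S                  # unbiased estimate from 10% of the rows
err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
assert err < 0.5
```

The rescaling by `1/sqrt(s * p)` is what makes the estimate unbiased: rows kept with low probability are weighted up to compensate.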
From page 73...
... The Challenge of Architecture and Algorithms: How to Extend Existing Methods to Massive Data Systems A large body of work currently exists for small-scale to medium-scale data analysis and machine learning, but much of this work is currently difficult or impossible to use for very-large-scale data because it does not interface well with existing large-scale systems and architectures, such as multicore processors or distributed clusters of commodity machines. Thus, a major challenge in large-scale data representation is to extend work that has been developed in the context of single machines and medium-scale data to be applicable to parallel, distributed processing and much larger-scale situations.
From page 74...
... The Challenge of Heavy-Tailed and High-Variance Data Although dimensionality reduction methods and related clustering and compact representations represent a large and active area of research with algorithmic and statistical implications, it is worth understanding their limitations. At a high level, these methods take advantage of the idea that if the data are formally high-dimensional but are very well-modeled by a low-dimensional structure -- that is, the data are approximately sparse in some sense -- then a small number of coordinates should suffice to describe the data.
From page 75...
... An archetypal problem for this challenge is the need for methods appropriate to the analysis of large but relatively unstructured graphs, e.g., social networks, biological networks, certain types of noisy graphical models, etc. These problems represent a large and growing domain of applications, but fundamental difficulties limit the use of traditional data representations in large-scale applications.
From page 76...
... Because the issues that arise in modern massive data applications of matrix and graph algorithms are very different than those in traditional numerical linear algebra and graph theory -- e.g., the sparsity and noise properties are very different, as are considerations with respect to communication and input/output cost models -- a central challenge will be to deal with those issues. The Challenge of Manipulation and Integration of Heterogeneous Data The manipulation and integration of heterogeneous data from different sources into a meaningful common representation is a major challenge.
From page 77...
... The Challenge of Understanding and Exploiting the Relative Strengths of Data-Oblivious Versus Data-Aware Methods Among the dimensionality reduction methods, there is a certain dichotomy, with most of the techniques falling into one of two broad categories. • Data-oblivious dimensionality reduction includes methods that compute the dimensionality-reducing mapping without using (or the knowledge of)
From page 78...
... This approach is popular with researchers in machine learning, statistics, and related fields. A challenge is to merge the benefits of data-oblivious and data-aware dimensionality reduction approaches.
From page 79...
... Thus, a major challenge in large-scale data representation is to combine and exploit the complementary strengths of these two approaches. The underlying representation should be able to support in an efficient manner operations required by both approaches.
From page 80...
... 2003. Laplacian eigenmaps for dimensionality reduction and data representation.
From page 81...
... 2000. Nonlinear dimensionality reduction by locally linear embedding.

