10 The Seven Computational Giants of Massive Data Analysis
Pages 146-160



From page 146...
... hardware platforms; accommodate a wide range of data formats, models, loss functions, and methods; provide an expressive but simple language in which humans can specify their data analysis goals; hide inessential aspects of data analysis from human users while providing useful visualizations of essential aspects of the analysis; provide seamless interfaces to other computational platforms, including scientific measurement platforms and databases; and provide many of the capabilities familiar from large-scale databases, such as checkpointing and provenance tracking. These systems should permit appropriate blends of autonomy and human oversight.
From page 147...
... Graph-theoretic computations, 4. Linear algebraic computations, 5.
From page 148...
... But the list of inferential giants is a helpful reminder of the ultimate goal of knowledge discovery.

BASIC STATISTICS

Types of Data Analyses and Computations

This class includes basic statistical tasks.
From page 149...
... For example, multidimensional counts are important in count-based methods such as association rules and in probabilistic inference in graphical models with discrete variables. Challenges and Examples of Notable Approaches The problems in this class become more difficult under sublinear models of computation, such as the streaming model.
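To make the streaming constraint concrete, the sketch below maintains approximate counts over a single pass using memory sublinear in the number of distinct items. It is one illustrative approach (a count-min-style sketch in Python), not a method singled out by the report; the width, depth, and hashing scheme are arbitrary choices.

```python
# A count-min-style sketch: approximate frequency counts from one
# streaming pass using sublinear memory. Width, depth, and the hashing
# scheme are illustrative choices.
import random

class CountMinSketch:
    def __init__(self, width=272, depth=5, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.seeds = [rng.getrandbits(64) for _ in range(depth)]  # one hash per row
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row, s in enumerate(self.seeds):
            yield row, hash((s, item)) % self.width

    def update(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def query(self, item):
        # Every row overestimates because of collisions, so the minimum
        # over rows is the tightest available estimate.
        return min(self.table[row][col] for row, col in self._cells(item))

stream = ["a"] * 1000 + ["b"] * 100 + ["c"] * 10
random.Random(1).shuffle(stream)
cms = CountMinSketch()
for item in stream:
    cms.update(item)
print(cms.query("a"), cms.query("b"), cms.query("c"))  # close to 1000, 100, 10
```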
From page 150...
... occurs in kernel discriminant analysis as well as in support vector machines. Other instances of this type of computation include k-means, mixture-of-Gaussians clustering, hierarchical clustering, spatial statistics of various kinds, spatial joins, the Hausdorff set distance, and many others.
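The defining feature of this class is computation over all pairs of points. A minimal sketch of the naive version, a direct O(N^2) Gaussian kernel summation in Python with illustrative data and bandwidth, shows the quadratic cost that fast N-body-style methods are designed to avoid:

```python
# A direct O(N^2) Gaussian kernel summation, the core computation shared
# by kernel density estimation and kernel discriminant analysis. The
# double loop is exactly the all-pairs cost that tree-based and
# multipole-style N-body methods attack at scale.
import math

def gaussian_kernel_sums(queries, references, bandwidth):
    """For each query point, sum the Gaussian kernel over all references."""
    h2 = 2.0 * bandwidth * bandwidth
    sums = []
    for q in queries:
        total = 0.0
        for r in references:  # every query-reference pair is visited
            d2 = sum((qi - ri) ** 2 for qi, ri in zip(q, r))
            total += math.exp(-d2 / h2)
        sums.append(total)
    return sums

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
print(gaussian_kernel_sums(points, points, bandwidth=1.0))
```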
From page 151...
... When the statistical model takes the form of a graph, graph-search algorithms remain important, but there is also a need to compute marginal probabilities and conditional probabilities over graphs, operations generally referred to as "inference" in the graphical models literature. In models with all discrete variables, graphical model inference requires deeply nested (many-variable)
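To illustrate why the nested summation is the bottleneck, and how exploiting graph structure helps, the sketch below computes the same marginal on a toy chain of binary variables two ways: by the brute-force nested sum, which is exponential in the number of variables, and by variable elimination, which is linear along the chain. The chain, potentials, and numbers are illustrative inventions, not taken from the report.

```python
# Exact marginal inference on a chain X0 - X1 - X2 - X3 of binary
# variables, with a pairwise potential psi on each edge and a unary
# potential phi3 on X3 (all values made up for illustration).
import itertools

psi = {(a, b): 2.0 if a == b else 1.0 for a in (0, 1) for b in (0, 1)}
phi3 = {0: 1.0, 1: 3.0}

def brute_force_marginal_x0():
    # The deeply nested (many-variable) sum: exponential in chain length.
    score = {0: 0.0, 1: 0.0}
    for x in itertools.product((0, 1), repeat=4):
        p = phi3[x[3]]
        for i in range(3):
            p *= psi[(x[i], x[i + 1])]
        score[x[0]] += p
    z = score[0] + score[1]
    return score[0] / z, score[1] / z

def elimination_marginal_x0():
    # Variable elimination: sum out X3, then X2, then X1, pushing each
    # sum inside the product; cost is linear in chain length.
    msg = dict(phi3)
    for _ in range(3):
        msg = {a: sum(psi[(a, b)] * msg[b] for b in (0, 1)) for a in (0, 1)}
    z = msg[0] + msg[1]
    return msg[0] / z, msg[1] / z

print(brute_force_marginal_x0())   # the two methods agree exactly,
print(elimination_marginal_x0())   # at very different cost
```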
From page 152...
... or kernel PCA, the kernel matrix can be so large as to prohibit even storing the matrix explicitly, motivating matrix-free algorithms where possible. Challenges and Examples of Notable Approaches Matrices that do not exhibit quickly decaying spectra (a quickly decaying spectrum indicates strong structure that can be exploited computationally) and near-singular matrices represent two common challenges for linear algebraic problems.
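A minimal sketch of the matrix-free idea, assuming a small radial-basis-function kernel over toy data: power iteration recovers the leading eigenvalue using only matrix-vector products, so the kernel matrix is never stored. Consistent with the challenge noted above, convergence of such iterations degrades as the spectral gap shrinks.

```python
# Power iteration with an implicit (matrix-free) kernel matrix: entries
# are recomputed on the fly, so memory is O(N) rather than O(N^2).
# Kernel, data, and iteration count are illustrative.
import math
import random

rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(100)]

def kernel(p, q, h=1.0):
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-d2 / (2.0 * h * h))

def matvec(v):
    # Apply the kernel matrix K to v without materializing K.
    n = len(points)
    return [sum(kernel(points[i], points[j]) * v[j] for j in range(n))
            for i in range(n)]

def leading_eigenvalue(steps=30):
    v = [1.0] * len(points)
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    lam = 0.0
    for _ in range(steps):
        w = matvec(v)
        lam = math.sqrt(sum(x * x for x in w))  # ||Kv|| -> largest eigenvalue
        v = [x / lam for x in w]
    return lam

print("estimated leading eigenvalue:", leading_eigenvalue())
```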
From page 153...
... , and second-order cone programming appear in various forms of support vector machines as well as more recent classifiers, and semidefinite programming appears in recent manifold learning methods such as maximum variance unfolding. It is only a matter of time before other standard types of optimization problems, such as geometric programming (which has not yet been adopted within statistical fields)
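The quadratic program behind the linear support vector machine can also be attacked in its primal form. Below is a minimal stochastic subgradient sketch on the regularized hinge loss (in the style of Pegasos-type methods), offered as a contrast to the quadratic-programming solvers discussed above rather than as any method the report describes; the toy data, step-size rule, and regularization constant are illustrative.

```python
# Stochastic subgradient descent on the regularized hinge loss, the
# primal objective of the linear SVM. All constants are illustrative.
import random

rng = random.Random(0)
data = []
for _ in range(200):
    x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    y = 1 if x[0] + x[1] > 0 else -1  # separable toy labels
    data.append((x, y))

def train(data, lam=0.01, steps=2000):
    w = [0.0, 0.0]
    for t in range(1, steps + 1):
        x, y = data[rng.randrange(len(data))]
        eta = 1.0 / (lam * t)  # classic 1/(lambda * t) step size
        margin = y * (w[0] * x[0] + w[1] * x[1])
        # Subgradient of (lam/2)||w||^2 + max(0, 1 - y<w, x>):
        g = [lam * w[i] - (y * x[i] if margin < 1.0 else 0.0) for i in range(2)]
        w = [w[i] - eta * g[i] for i in range(2)]
    return w

w = train(data)
errors = sum(1 for x, y in data if y * (w[0] * x[0] + w[1] * x[1]) <= 0)
print("weights:", w, "training errors:", errors)
```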
From page 154...
... The kinds of integrals arising in Bayesian statistical models are typically of high dimension for modern problems. The default approach for high-dimensional integration is Markov chain Monte Carlo (MCMC; Andrieu et al., 2003)
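A minimal random-walk Metropolis sampler, the simplest member of the MCMC family, illustrates the default approach; here it estimates the integral E[x^2] under a standard normal target. The target, proposal scale, and chain length are illustrative choices.

```python
# Random-walk Metropolis: sample from an unnormalized target density and
# estimate an expectation by averaging over the chain.
import math
import random

rng = random.Random(0)

def log_target(x):
    return -0.5 * x * x  # unnormalized log density of N(0, 1)

def metropolis(n_samples=50000, scale=1.0):
    x, samples = 0.0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, scale)
        # Accept with probability min(1, pi(proposal) / pi(x)).
        if math.log(rng.random()) < log_target(proposal) - log_target(x):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis()
burn = samples[5000:]                          # discard burn-in
print(sum(s * s for s in burn) / len(burn))    # roughly 1.0 = E[x^2]
```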
From page 155...
... Challenges and Examples of Notable Approaches Such problems are often combinatorial in nature, and thus various forms of problem-specific constraints are generally exploited to make the computations efficient. The sequential structure of genomics problems naturally leads to dynamic programming solutions, but for larger scales, greedy hierarchical solutions (Higgins and Sharp, 1988)
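A minimal sketch of the dynamic-programming pattern that the sequential structure invites, here the Needleman-Wunsch recurrence for global sequence alignment; the match, mismatch, and gap scores are illustrative.

```python
# Global sequence alignment score by dynamic programming (the
# Needleman-Wunsch recurrence). Scores are illustrative choices.
def align_score(a, b, match=1, mismatch=-1, gap=-2):
    # dp[i][j] = best score aligning a[:i] with b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = i * gap
    for j in range(1, len(b) + 1):
        dp[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # (mis)match
                           dp[i - 1][j] + gap,       # gap in b
                           dp[i][j - 1] + gap)       # gap in a
    return dp[len(a)][len(b)]

print(align_score("GATTACA", "GCATGCU"))
```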
From page 156...
... As noted earlier in this chapter, most work to date focuses on setting A, the "default setting" of a single processor with the entire data set fitting in random access memory, so much work remains to explore this space of algorithms and settings more completely.

REFERENCES

Agarwal, P.K., S
From page 157...
... In Advances in Neural Information Processing Systems 20 (J.C. Platt, D
From page 158...
... 2004. Fast Kernel Matrix-Vector Multiplication with Application to Gaussian Process Regression.
From page 159...
... 1998. Fast training of support vector machines using sequential minimal optimization.
From page 160...
... 2003. Graphical models, exponential families, and variational inference.

