10 The Seven Computational Giants of Massive Data Analysis
Pages 146-160



From page 146...
... hardware platforms; accommodate a wide range of data formats, models, loss functions, and methods; provide an expressive but simple language in which humans can specify their data analysis goals; hide inessential aspects of data analysis from human users while providing useful visualizations of essential aspects of the analysis; provide seamless interfaces to other computational platforms, including scientific measurement platforms and databases; and provide many of the capabilities familiar from large-scale databases, such as checkpointing and provenance tracking. These systems should permit appropriate blends of autonomy and human oversight.
From page 147...
... Graph-theoretic computations, 4. Linear algebraic computations, 5.
From page 148...
... But the list of inferential giants is a helpful reminder of the ultimate goal of knowledge discovery.

BASIC STATISTICS

Types of Data Analyses and Computations

This class includes basic statistical tasks.
From page 149...
... For example, multidimensional counts are important in count-based methods such as association rules and in probabilistic inference in graphical models with discrete variables. Challenges and Examples of Notable Approaches The problems in this class become more difficult under sublinear models of computation, such as the streaming model.
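To make the streaming constraint concrete, the sketch below maintains approximate counts over a single pass using memory sublinear in the number of distinct items. It is one illustrative approach (a count-min-style sketch in Python), not a method singled out by the report; the width, depth, and hashing scheme are arbitrary choices.

```python
# A count-min-style sketch: approximate frequency counts from one
# streaming pass using sublinear memory. Width, depth, and the hashing
# scheme are illustrative choices.
import random

class CountMinSketch:
    def __init__(self, width=272, depth=5, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.seeds = [rng.getrandbits(64) for _ in range(depth)]  # one hash per row
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row, s in enumerate(self.seeds):
            yield row, hash((s, item)) % self.width

    def update(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def query(self, item):
        # Every row overestimates because of collisions, so the minimum
        # over rows is the tightest available estimate.
        return min(self.table[row][col] for row, col in self._cells(item))

stream = ["a"] * 1000 + ["b"] * 100 + ["c"] * 10
random.Random(1).shuffle(stream)
cms = CountMinSketch()
for item in stream:
    cms.update(item)
print(cms.query("a"), cms.query("b"), cms.query("c"))  # close to 1000, 100, 10
```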
From page 150...
... occurs in kernel discriminant analysis as well as in support vector machines. Other instances of this type of computation include k-means, mixture-of-Gaussians clustering, hierarchical clustering, spatial statistics of various kinds, spatial joins, the Hausdorff set distance, and many others.
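The defining feature of this class is computation over all pairs of points. A minimal sketch of the naive version, a direct O(N^2) Gaussian kernel summation in Python with illustrative data and bandwidth, shows the quadratic cost that fast N-body-style methods are designed to avoid:

```python
# A direct O(N^2) Gaussian kernel summation, the core computation shared
# by kernel density estimation and kernel discriminant analysis. The
# double loop is exactly the all-pairs cost that tree-based and
# multipole-style N-body methods attack at scale.
import math

def gaussian_kernel_sums(queries, references, bandwidth):
    """For each query point, sum the Gaussian kernel over all references."""
    h2 = 2.0 * bandwidth * bandwidth
    sums = []
    for q in queries:
        total = 0.0
        for r in references:  # every query-reference pair is visited
            d2 = sum((qi - ri) ** 2 for qi, ri in zip(q, r))
            total += math.exp(-d2 / h2)
        sums.append(total)
    return sums

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
print(gaussian_kernel_sums(points, points, bandwidth=1.0))
```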
From page 151...
... When the statistical model takes the form of a graph, graph-search algorithms remain important, but there is also a need to compute marginal probabilities and conditional probabilities over graphs, operations generally referred to as "inference" in the graphical models literature. In models with all discrete variables, graphical model inference requires deeply nested (many-variable)
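To illustrate why the nested summation is the bottleneck, and how exploiting graph structure helps, the sketch below computes the same marginal on a toy chain of binary variables two ways: by the brute-force nested sum, which is exponential in the number of variables, and by variable elimination, which is linear along the chain. The chain, potentials, and numbers are illustrative inventions, not taken from the report.

```python
# Exact marginal inference on a chain X0 - X1 - X2 - X3 of binary
# variables, with a pairwise potential psi on each edge and a unary
# potential phi3 on X3 (all values made up for illustration).
import itertools

psi = {(a, b): 2.0 if a == b else 1.0 for a in (0, 1) for b in (0, 1)}
phi3 = {0: 1.0, 1: 3.0}

def brute_force_marginal_x0():
    # The deeply nested (many-variable) sum: exponential in chain length.
    score = {0: 0.0, 1: 0.0}
    for x in itertools.product((0, 1), repeat=4):
        p = phi3[x[3]]
        for i in range(3):
            p *= psi[(x[i], x[i + 1])]
        score[x[0]] += p
    z = score[0] + score[1]
    return score[0] / z, score[1] / z

def elimination_marginal_x0():
    # Variable elimination: sum out X3, then X2, then X1, pushing each
    # sum inside the product; cost is linear in chain length.
    msg = dict(phi3)
    for _ in range(3):
        msg = {a: sum(psi[(a, b)] * msg[b] for b in (0, 1)) for a in (0, 1)}
    z = msg[0] + msg[1]
    return msg[0] / z, msg[1] / z

print(brute_force_marginal_x0())   # the two methods agree exactly,
print(elimination_marginal_x0())   # at very different cost
```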
From page 152...
... or kernel PCA, the kernel matrix can be so large as to prohibit even storing the matrix explicitly, motivating matrix-free algorithms where possible. Challenges and Examples of Notable Approaches Matrices that do not exhibit quickly decaying spectra (a quickly decaying spectrum indicates strong structure that can be exploited computationally) and near-singular matrices represent two common challenges for linear algebraic problems.
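A minimal sketch of the matrix-free idea, assuming a small radial-basis-function kernel over toy data: power iteration recovers the leading eigenvalue using only matrix-vector products, so the kernel matrix is never stored. Consistent with the challenge noted above, convergence of such iterations degrades as the spectral gap shrinks.

```python
# Power iteration with an implicit (matrix-free) kernel matrix: entries
# are recomputed on the fly, so memory is O(N) rather than O(N^2).
# Kernel, data, and iteration count are illustrative.
import math
import random

rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(100)]

def kernel(p, q, h=1.0):
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-d2 / (2.0 * h * h))

def matvec(v):
    # Apply the kernel matrix K to v without materializing K.
    n = len(points)
    return [sum(kernel(points[i], points[j]) * v[j] for j in range(n))
            for i in range(n)]

def leading_eigenvalue(steps=30):
    v = [1.0] * len(points)
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    lam = 0.0
    for _ in range(steps):
        w = matvec(v)
        lam = math.sqrt(sum(x * x for x in w))  # ||Kv|| -> largest eigenvalue
        v = [x / lam for x in w]
    return lam

print("estimated leading eigenvalue:", leading_eigenvalue())
```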
From page 153...
... , and second-order cone programming appear in various forms of support vector machines as well as more recent classifiers, and semidefinite programming appears in recent manifold learning methods such as maximum variance unfolding. It is only a matter of time before other standard types of optimization problems, such as geometric programming (which has not yet been adopted within statistical fields)
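The quadratic program behind the linear support vector machine can also be attacked in its primal form. Below is a minimal stochastic subgradient sketch on the regularized hinge loss (in the style of Pegasos-type methods), offered as a contrast to the quadratic-programming solvers discussed above rather than as any method the report describes; the toy data, step-size rule, and regularization constant are illustrative.

```python
# Stochastic subgradient descent on the regularized hinge loss, the
# primal objective of the linear SVM. All constants are illustrative.
import random

rng = random.Random(0)
data = []
for _ in range(200):
    x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    y = 1 if x[0] + x[1] > 0 else -1  # separable toy labels
    data.append((x, y))

def train(data, lam=0.01, steps=2000):
    w = [0.0, 0.0]
    for t in range(1, steps + 1):
        x, y = data[rng.randrange(len(data))]
        eta = 1.0 / (lam * t)  # classic 1/(lambda * t) step size
        margin = y * (w[0] * x[0] + w[1] * x[1])
        # Subgradient of (lam/2)||w||^2 + max(0, 1 - y<w, x>):
        g = [lam * w[i] - (y * x[i] if margin < 1.0 else 0.0) for i in range(2)]
        w = [w[i] - eta * g[i] for i in range(2)]
    return w

w = train(data)
errors = sum(1 for x, y in data if y * (w[0] * x[0] + w[1] * x[1]) <= 0)
print("weights:", w, "training errors:", errors)
```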
From page 154...
... The kinds of integrals arising in Bayesian statistical models are typically of high dimension for modern problems. The default approach for high-dimensional integration is Markov chain Monte Carlo (MCMC; Andrieu et al., 2003)
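A minimal random-walk Metropolis sampler, the simplest member of the MCMC family, illustrates the default approach; here it estimates the integral E[x^2] under a standard normal target. The target, proposal scale, and chain length are illustrative choices.

```python
# Random-walk Metropolis: sample from an unnormalized target density and
# estimate an expectation by averaging over the chain.
import math
import random

rng = random.Random(0)

def log_target(x):
    return -0.5 * x * x  # unnormalized log density of N(0, 1)

def metropolis(n_samples=50000, scale=1.0):
    x, samples = 0.0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, scale)
        # Accept with probability min(1, pi(proposal) / pi(x)).
        if math.log(rng.random()) < log_target(proposal) - log_target(x):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis()
burn = samples[5000:]                          # discard burn-in
print(sum(s * s for s in burn) / len(burn))    # roughly 1.0 = E[x^2]
```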
From page 155...
... Challenges and Examples of Notable Approaches Such problems are often combinatorial in nature, and thus various forms of problem-specific constraints are generally exploited to make the computations efficient. The sequential structure of genomics problems naturally leads to dynamic programming solutions, but for larger scales, greedy hierarchical solutions (Higgins and Sharp, 1988)
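A minimal sketch of the dynamic-programming pattern that the sequential structure invites, here the Needleman-Wunsch recurrence for global sequence alignment; the match, mismatch, and gap scores are illustrative.

```python
# Global sequence alignment score by dynamic programming (the
# Needleman-Wunsch recurrence). Scores are illustrative choices.
def align_score(a, b, match=1, mismatch=-1, gap=-2):
    # dp[i][j] = best score aligning a[:i] with b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = i * gap
    for j in range(1, len(b) + 1):
        dp[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # (mis)match
                           dp[i - 1][j] + gap,       # gap in b
                           dp[i][j - 1] + gap)       # gap in a
    return dp[len(a)][len(b)]

print(align_score("GATTACA", "GCATGCU"))
```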
From page 156...
... As noted earlier in this chapter, most work to date focuses on setting A, the "default setting" of a single processor with the entire data set fitting in random access memory, so much work remains to explore this space of algorithms and settings more completely.

REFERENCES

Agarwal, P.K., S
From page 157...
... In Advances in Neural Information Processing Systems 20 (J.C. Platt, D
From page 158...
... 2004. Fast Kernel Matrix-Vector Multiplication with Application to Gaussian Process Regression.
From page 159...
... 1998. Fast training of support vector machines using sequential minimal optimization.
From page 160...
... 2003. Graphical models, exponential families, and variational inference.

