7 Building Models from Massive Data
Pages 93-119

From page 93...
... Statistical models are usually presented as a family of equations (mathematical formulas) that describe how some or all aspects of the data might have been generated.
From page 94...
... , yet both statisticians and machine learners can be found at both ends of the spectrum. In this report, terms like "statistical model" or "statistical approach" are understood to include rather than exclude machine learning.
From page 95...
... Moving beyond regression, multivariate models also specify a joint distribution, but without distinguishing response and predictor variables. The frequentist approach views the model parameters as unknown constants and estimates them by matching the model to the available training data using an appropriate metric.
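As a concrete illustration of the frequentist view described above (not drawn from the report; the data and names are invented), the parameters of a joint multivariate Gaussian model can be treated as unknown constants and estimated by matching the model to training data:

```python
# A minimal sketch: estimate the mean vector and covariance matrix of a joint
# (multivariate Gaussian) model from training data. Data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))           # stand-in training data with 3 variables

mu_hat = X.mean(axis=0)                  # estimated mean vector
sigma_hat = np.cov(X, rowvar=False)      # estimated covariance matrix

print("estimated mean:", mu_hat)
print("estimated covariance:\n", sigma_hat)
```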
From page 96...
... The posterior can be calculated from Bayes's theorem. Bayesian models replace the "parameter estimation" problem by the problem of defining a "good prior" plus computation of the posterior, although when coupled with a suitable loss function, Bayesian approaches can also produce parameter estimates.
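In symbols (the notation here is generic, not the report's), Bayes's theorem gives the posterior over parameters θ given data x as

```latex
p(\theta \mid x) \;=\; \frac{p(x \mid \theta)\, p(\theta)}{p(x)}
\;\propto\; p(x \mid \theta)\, p(\theta),
\qquad
p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta .
```

Defining a "good prior" thus amounts to choosing p(θ), and computing the posterior amounts to evaluating or approximating this expression.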
From page 97...
... The nonparametric perspective copes naturally with the fact that new phenomena often emerge as data sets increase in size. The distinction between parametric and nonparametric models is orthogonal to the frequentist/Bayesian distinction -- there are frequentist approaches to nonparametric modeling and Bayesian approaches to nonparametric modeling.
From page 98...
... For Bayesians the loss function is used to specify the aspects of the posterior distribution that are of particular interest to the data analyst and thereby guide the design of posterior inference and decision-making procedures. The use of loss functions encourages the development of partially specified models.
From page 99...
... and the more traditional statistical models. Some other data-analysis procedures try to find meaningful characterizations of the data that satisfy certain descriptive criteria but are not necessarily based on optimization.
From page 100...
... The goal is often to build statistical models that include one or more components of noise so that noise can be separated from signal, and thus relatively complex models can be used for the signal, while avoiding overly complex models that would find structure where there is none. The modeling of the noise component impacts not only the parameter estimation procedure, but also the often informal process of cleaning the data and assessing whether the data are of high-enough quality to be used for the task at hand.
From page 101...
... Some statistical modeling procedures -- such as trees, random forests, and boosted trees -- have built-in methods for dealing with missing values. However, many model-building approaches assume the data are complete, and so one is left to impute the missing data prior to modeling.
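A minimal sketch of imputation prior to modeling, using mean imputation and an ordinary linear model as illustrative (not prescribed) choices:

```python
# Illustrative only: fill missing values before fitting a model that assumes
# complete data. The tiny data set and mean-imputation strategy are stand-ins.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])
y = np.array([1.0, 2.0, 3.0, 2.5])

X_complete = SimpleImputer(strategy="mean").fit_transform(X)   # column-mean imputation
model = LinearRegression().fit(X_complete, y)
print(model.coef_, model.intercept_)
```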
From page 102...
... For example, most complex models are computationally intensive, and algorithms that work perfectly well with megabytes of data may become infeasible with terabytes or petabytes of data, regardless of the computational power that is available. Thus, in analyzing massive data, one must re-think the trade-offs between complexity and computational efficiency.
From page 103...
... There are also valid clustering procedures that are not based on optimization or statistical models. For example, in hierarchical agglomerative clustering, one starts with each single data point as a cluster, and then iteratively groups the two closest clusters to form a larger cluster; this process is repeated until all data are grouped into a single cluster.
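A minimal sketch of the procedure just described, written from that description (single-linkage distance is one possible reading of "closest"):

```python
# Hierarchical agglomerative clustering: start with each point as its own
# cluster and repeatedly merge the two closest clusters until one remains.
import numpy as np

def agglomerate(points):
    clusters = [[i] for i in range(len(points))]        # singleton clusters
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-linkage: distance between the closest pair of points
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
for left, right, dist in agglomerate(pts):
    print(left, "+", right, "at distance", round(dist, 3))
```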
From page 104...
... Bayesian methods are natural in this context because they work with a joint distribution over both observed and unobserved variables, so that elementary probability calculations can be used for statistical inference. Massive data may contain many variables that require complex probabilistic models, presenting both statistical and computational challenges.
From page 105...
... , but the two approaches lead to different estimates for the parameters. In the traditional statistical literature, the standard parameter estimation method for developing either a generative or discriminative model is maximum likelihood estimation (MLE)
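A minimal sketch of MLE for a discriminative model (logistic regression), with synthetic data and a generic numerical optimizer standing in for whatever fitting machinery a real system would use:

```python
# MLE for logistic regression: minimize the negative log-likelihood of the
# observed labels given the features. Data are synthetic.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
true_w = np.array([1.5, -2.0])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)

def neg_log_likelihood(w):
    z = X @ w
    return np.sum(np.logaddexp(0.0, z) - y * z)   # log(1 + e^z) - y*z, summed

w_hat = minimize(neg_log_likelihood, x0=np.zeros(2)).x
print("MLE estimate:", w_hat)                      # should be close to true_w
```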
From page 106...
... Efficient leverage of massive data is currently an important research topic in structured prediction, and this is likely to continue for the near future. Another active research topic is online prediction, which can be regarded both as modeling for sequential prediction and as optimization over massive data.
From page 107...
... Some relevant examples are the following:
• The shrinkage parameter in a Lasso or elastic-net logistic regression;
• The number of terms (trees) in a boosted regression model;
• The "cost" parameter in a support vector machine classifier, or the scale parameter of the radial kernel used;
• The number of variables included in a forward stepwise regression; and
• The number of clusters in a prototype model (e.g., mixture model)
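Such tuning parameters are commonly chosen by cross-validation. The sketch below grid-searches the SVM cost parameter and radial-kernel scale on a toy problem; the scikit-learn names and parameter grid are illustrative choices, not prescriptions from the report.

```python
# Choosing tuning parameters by cross-validation: grid-search the SVM "cost"
# parameter C and the radial-kernel scale gamma on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]},
    cv=5,                                  # 5-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```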
From page 108...
... It is also important to note that statistical model complexity and computational complexity are distinct. Given a model with fixed statistical complexity and for a fixed out-of-sample accuracy target, additional data allow one to estimate a model with more computational efficiency.
From page 109...
... Thus, once the best model has been chosen, its predictive performance should be evaluated on a different held-back test data set, because the selection step can introduce bias. Ideally, then, the following three separate data sets will be identified for the overall task:
• Training data.
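A minimal sketch of such a three-way split, with the 60/20/20 proportions chosen purely for illustration:

```python
# Illustrative three-way split: fit on training data, compare models on
# validation data, and report final performance on held-back test data.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
indices = rng.permutation(n)

n_train, n_valid = int(0.6 * n), int(0.2 * n)          # 60/20/20, an arbitrary choice
train_idx = indices[:n_train]
valid_idx = indices[n_train:n_train + n_valid]
test_idx = indices[n_train + n_valid:]

print(len(train_idx), len(valid_idx), len(test_idx))   # 6000 2000 2000
```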
From page 110...
... If models have to be fit repeatedly as the structure of the data changes, it is important to know how many training data are needed. It is much easier to build models with smaller numbers of observations.
From page 111...
... Standard error estimates usually accompany parameter and prediction estimates for traditional linear models. However, as statistical models have grown in complexity, estimation of secondary measures such as standard errors has not kept pace with prediction performance.
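One generic (and computationally demanding) way to attach a standard error to an estimate is the bootstrap; the sketch below, not taken from the report, resamples the data to estimate the standard error of a least-squares slope:

```python
# Bootstrap standard error of a regression slope: refit the model on many
# resampled data sets and take the standard deviation of the refitted slopes.
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

def slope(x, y):
    X = np.column_stack([np.ones_like(x), x])           # intercept + slope design
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

boot = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)                    # resample with replacement
    boot.append(slope(x[idx], y[idx]))

print("slope:", slope(x, y), "bootstrap SE:", np.std(boot))
```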
From page 112...
... CHALLENGES
Building effective models for the analysis of massive data requires different considerations than building models for the kinds of small data sets that have been more common in traditional statistics. Each of the topics outlined above faces new challenges when the data become massive, although having access to much more data also opens the door to alternative approaches.
From page 113...
... The Trade-Off Between Model Accuracy and Computational Efficiency As was discussed in Chapter 6, computational complexity is an important consideration in massive data analysis. Although complex models may be more accurate, significantly more computational resources are needed to carry out the necessary calculations.
From page 114...
... Nevertheless, as pointed out in the previous subsection, the availability of massive data means that there is a significant opportunity to build ever more complex models that can benefit from massive data, as long as this can be done efficiently. One approach is to discover important nonlinear features from a massive data set and use linear models with the discovered features.
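One concrete instance of that idea (chosen here for illustration, not drawn from the report) is to build random nonlinear features and then fit an ordinary linear least-squares model on them:

```python
# Random nonlinear (Fourier-style) features: map inputs through random cosine
# features, then fit a plain linear model in the new feature space.
import numpy as np

rng = np.random.default_rng(4)
n, d, n_features = 2000, 5, 200

X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)          # nonlinear synthetic target

W = rng.normal(size=(d, n_features))                    # random projection directions
b = rng.uniform(0, 2 * np.pi, size=n_features)          # random phases
Z = np.cos(X @ W + b)                                   # discovered nonlinear features

coef, *_ = np.linalg.lstsq(Z, y, rcond=None)            # linear model on the features
print("training MSE:", np.mean((Z @ coef - y) ** 2))
```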
From page 115...
... For example, a recent data mining competition focuses on using insurance claims data to predict hospitalization: Given extensive historical medical data on millions of patients (drugs, conditions, etc.), identify the subset of patients that will be hospitalized in the next year.
From page 116...
... In many massive data applications, data contain heterogeneous data types such as audio, video, images, and text. Much of the model-building literature and the current model-building arsenal focus on homogeneous data types, such as numerical data or text.
From page 117...
... This is because traditional statistical models can lead to estimators (such as MLE) that solve an optimization problem; moreover, appropriately formulated optimization formulations (such as k-means)
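As one illustration of that connection, k-means can be read as minimizing the within-cluster sum of squared distances; a minimal sketch of the alternating optimization (Lloyd's algorithm), with synthetic data:

```python
# k-means as optimization: alternate between assigning points to the nearest
# center and moving each center to the mean of its assigned points, which
# monotonically decreases the within-cluster sum of squares.
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)                   # assignment step
        new_centers = []
        for j in range(k):
            members = X[labels == j]
            new_centers.append(members.mean(axis=0) if len(members) else centers[j])
        centers = np.array(new_centers)                 # update step
    return centers, labels

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=0.0, size=(100, 2)), rng.normal(loc=5.0, size=(100, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```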
From page 118...
... For example, in a distributed computing environment with many computers that are loosely connected, the communication cost between different machines is high. In this scenario, some Bayesian models with parameters estimated using Monte Carlo simulation can relatively easily take advantage of the multiple machines by performing independent simulations on different machines.
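A minimal sketch of that pattern, using Python's multiprocessing on a single machine as a stand-in for loosely connected workers: each process runs an independent Metropolis-style simulation of a toy posterior, and the only communication is returning the finished samples.

```python
# Independent Monte Carlo simulations run in parallel; the workers never talk
# to each other, so communication cost is limited to collecting the results.
import numpy as np
from multiprocessing import Pool

def run_chain(seed, n_samples=5000):
    """Metropolis sampler for a standard-normal 'posterior' (a toy stand-in)."""
    rng = np.random.default_rng(seed)
    x, samples = 0.0, []
    for _ in range(n_samples):
        proposal = x + rng.normal()
        # accept with probability min(1, target(proposal) / target(x))
        if np.log(rng.random()) < 0.5 * (x * x - proposal * proposal):
            x = proposal
        samples.append(x)
    return np.array(samples)

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        chains = pool.map(run_chain, [0, 1, 2, 3])      # one independent chain per worker
    combined = np.concatenate(chains)
    print("posterior mean estimate:", combined.mean())
```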

