Summary
Pages 1-10

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 1...
... Traditional methods of analysis have been based largely on the assumption that analysts can work with data within the confines of their own computing environment, but the growth of "big data" is changing that paradigm, especially in cases in which massive amounts of data are distributed across locations. While the scientific community and the defense enterprise have long been leaders in generating and using large data sets, the emergence of e-commerce and massive search engines has led other sectors to confront the challenges of massive data.
From page 2...
... In particular, these fields have fueled the advent of cloud computing and other parallel and distributed platforms that seem well suited to massive data analysis. Moreover, innovations in the fields of machine learning, data mining, statistics, and the theory of algorithms have yielded
From page 3...
... , but the inferences and decisions made may refer to a different sampling criterion. This issue seems likely to be particularly severe in many massive data sets, which often consist of many subcollections of data, each collected according to a particular choice of sampling criterion and with little control over the overall composition.
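The sampling concern in this excerpt can be illustrated with a small sketch. It is not taken from the report: the `Subcollection` type, the `sampling_rate` field, and the numbers are hypothetical, and inverse-probability weighting is just one standard correction, usable only when the sampling rates are actually known.

```python
from dataclasses import dataclass


@dataclass
class Subcollection:
    """Hypothetical subcollection: observed values plus an assumed, known sampling rate."""
    values: list[float]
    sampling_rate: float  # fraction of the underlying subpopulation that was collected


def naive_mean(subs: list[Subcollection]) -> float:
    # Pools every observation, ignoring how each subcollection was sampled.
    pooled = [v for s in subs for v in s.values]
    return sum(pooled) / len(pooled)


def reweighted_mean(subs: list[Subcollection]) -> float:
    # Inverse-probability weighting: each observation stands in for
    # 1 / sampling_rate members of its subpopulation.
    total = sum(v / s.sampling_rate for s in subs for v in s.values)
    weight = sum(1 / s.sampling_rate for s in subs for _ in s.values)
    return total / weight


# Two subpopulations of equal size (8 each), sampled at very different rates.
subs = [
    Subcollection(values=[10.0, 10.0, 10.0, 10.0], sampling_rate=0.5),
    Subcollection(values=[20.0], sampling_rate=0.125),
]
print(naive_mean(subs))       # 12.0, pulled toward the over-sampled group
print(reweighted_mean(subs))  # 15.0, the mean of the combined population
```

In massive data sets the sampling rates are often unknown, which is the harder situation the excerpt describes; the sketch only shows why the mismatch matters.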
From page 4...
... And new tools are needed to bring humans into the data-analysis loop at all stages, recognizing that knowledge is often subjective and context-dependent and that some aspects of human intelligence will not be replaced anytime soon by machines. The current report is the result of a study that addressed the following charge:
• Assess the current state of data analysis for mining of massive sets and streams of data,
• Identify gaps in current practice and theory, and
• Propose a research agenda to fill those gaps.
From page 5...
... Finally, domain scientists and users of technology have an essential role to play in the design of any system for data analysis, and particularly so in the realm of massive data, because of the explosion of design decisions and possible directions that analyses can follow. The current report focuses on the technical issues -- computational and inferential -- that surround massive data, consciously setting aside major issues in areas such as public policy, law, and ethics that are beyond the current scope.
From page 6...
... Other sources of error that are prevalent in massive data include the high-dimensional nature of many data sets, issues of heterogeneity, biases arising from uncontrolled sampling patterns, and unknown provenance of items in a database. In general, data analysis is based on assumptions, and the assumptions underlying many classical data analysis methods are likely to be broken in massive data sets.
From page 7...
... As just alluded to, many data sets require semantic understanding that is currently beyond the reach of algorithmic approaches and for which human input is needed. This input may be obtained from the data analyst, whose judgment is needed throughout the data analysis process, from the framing of hypotheses to the management of trade-offs (e.g., errors versus
From page 8...
... • Many data sources operate in real time, producing data streams that can overwhelm data analysis pipelines. Moreover, there is often a desire to make decisions rapidly, perhaps also in real time.
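As a minimal sketch of one way to cope with a stream that is too large to store, the following uses reservoir sampling (Algorithm R) to keep a fixed-size uniform sample in bounded memory. The excerpt does not name this technique; it is offered here only as an illustration, and the function name `reservoir_sample` and its parameters are assumptions.

```python
import random
from typing import Iterable, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> list[T]:
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    reservoir: list[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 10 readings from a simulated stream without storing it all.
sample = reservoir_sample(range(1_000_000), k=10, rng=random.Random(0))
print(len(sample))  # 10
```

Memory use stays at k items regardless of stream length, so the downstream analysis sees a manageable sample. A uniform sample is not appropriate for every goal (rare events, for example, may be missed).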
From page 9...
... , and computational infrastructure will be a necessity in training the next generation of "data scientists." The same point, of course, can be made for academic research: significant new ideas will only emerge if academics are exposed to real-world massive data problems. Finally, the committee emphasizes that massive data analysis is not one problem or one methodology.
From page 10...
... But it is important to emphasize the need for flexibility and for tools that are sensitive to the overall goals of an analysis; massive data analysis cannot, in general, be reduced to turnkey procedures that consumers can use without thought. Rather, the design of a system for massive data analysis will require engineering skill and judgment, and deployment of such a system will require modeling decisions, skill with approximations, attention to diagnostics, and robustness.


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.