

Massive Data Sets Workshop: The Morning After
Pages 169-184

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 169...
... This makes it mandatory to disclose the data that have shaped one's views. In my case these were: children's growth data, census data, air traffic radar data, environmental data, hospital data, marketing research data, road quality data, agricultural and meteorological data, with sizes ranging from 3 Mbytes to 2 Gbytes.
From page 170...
... of monster sets. Data analysis goes beyond data processing and ranges from data analysis in the strict sense (non-automated, requiring human judgment based on information contained in the data, and therefore done in interactive mode, if feasible)
From page 171...
... All three prerequisites can be violated for large sets: the decisions may not be straightforward because of data complexity, the response may be too slow (the human side of the feedback loop is broken if response time exceeds the order of human think time, with the latter depending on the task under consideration), and it may be difficult to provide a rational basis for the next decision if one cannot visualize the preceding results.
From page 172...
... Processor speed does not scale well, since computational complexity tends to increase faster than linearly with data size. The position papers of Huber (1994b)
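A minimal Python sketch of this scaling point (not from the paper; the data sizes and operations are arbitrary) compares an O(n) pass over the data with an O(n^2) pairwise computation:

    import time
    import numpy as np

    def time_scaling(sizes):
        # Compare a linear-time pass with a quadratic-time pairwise computation.
        for n in sizes:
            x = np.random.rand(n)
            t0 = time.perf_counter()
            x.sum()                                # O(n): one pass over the data
            t_lin = time.perf_counter() - t0
            t0 = time.perf_counter()
            np.abs(x[:, None] - x[None, :]).sum()  # O(n^2): all pairwise distances
            t_quad = time.perf_counter() - t0
            print(f"n={n:5d}  linear={t_lin:.5f}s  quadratic={t_quad:.5f}s")

    time_scaling([1000, 2000, 4000])

Doubling n roughly doubles the linear timing but quadruples the quadratic one, which is the kind of superlinear growth the text refers to.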
From page 173...
... For very large sets, a clean tree structure is rather the exception than the rule. In particular, those sets often are composed during the analysis from several, originally unrelated sources (for example health data and environmental data, collected independently for different purposes)
From page 174...
... We can distinguish between at least four levels of derived data sets:
· raw data set: rarely accessed, never modified,
· base data set: frequently accessed, rarely modified,
· low level derived sets: semi-permanent,
· high level derived sets: transient.
The base set is a cleaned and reorganized version of the raw set, streamlined for fast access and easy handling.
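Read as a processing pipeline, the hierarchy might look like the Python sketch below; the file names, variables, and cleaning steps are hypothetical, and parquet is assumed only as a convenient fast-access format.

    import pandas as pd

    # raw data set: read once from the recording medium, never modified
    raw = pd.read_csv("raw_measurements.csv", dtype=str)

    # base data set: cleaned and reorganized, streamlined for fast access
    base = (raw.dropna(subset=["station", "date", "value"])
               .astype({"value": "float64"})
               .sort_values(["station", "date"]))
    base.to_parquet("base.parquet")

    # low-level derived set: semi-permanent summaries reused by many analyses
    monthly = (base.assign(month=base["date"].str[:7])
                   .groupby(["station", "month"], as_index=False)["value"].mean())
    monthly.to_parquet("monthly_means.parquet")

    # high-level derived set: transient, tied to the question of the moment
    suspect = monthly[monthly["value"] > monthly["value"].quantile(0.99)]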
From page 175...
... With large sets, processing time problems have more to do with storage access than with processor speed. To counteract that, one will have to produce small, possibly distributed, derived sets that selectively contain the required information and can be accessed quickly, rather than to work with pointers to the original, larger sets, even if this increases the total required storage space and creates tricky problems with keeping data integrity (e.g.
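A minimal sketch of that strategy, with hypothetical file and column names: the required variables are copied into a small, self-contained derived set instead of keeping pointers into the large base set.

    import pandas as pd

    needed = ["station", "date", "no2"]                      # only what the analysis uses
    derived = pd.read_parquet("base.parquet", columns=needed)
    derived.to_parquet("no2_subset.parquet")                 # duplicates some storage,
                                                             # but later passes are fast

    # Integrity caveat from the text: if base.parquet is later corrected,
    # this derived copy must be rebuilt or it silently goes stale.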
From page 176...
... We then must identify specific tasks that become harder, or more important, or both, with massive data sets, or with distributed processors and memory. In any case, one will need general, efficient subset operations that can operate on potentially very large base sets sitting on relatively slow storage devices.
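One way such a subset operation could look, sketched here under the assumption that the base set is a large CSV file on slow storage (the file name, columns, and predicate are hypothetical):

    import pandas as pd

    def subset(path, predicate, usecols=None, chunksize=1_000_000):
        """Return the rows of a large on-disk base set that satisfy `predicate`."""
        parts = []
        for chunk in pd.read_csv(path, usecols=usecols, chunksize=chunksize):
            parts.append(chunk[predicate(chunk)])   # filter each chunk as it streams in
        return pd.concat(parts, ignore_index=True)

    # Example: all 1995 records for one station.
    subset_1995 = subset("base_set.csv",
                         lambda df: (df["station"] == "BOS") & (df["year"] == 1995))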
From page 177...
... I have personally encountered at least two unrelated instances where leading bits were lost due to integer overflow, in one case because the subject matter scientist had underestimated the range of a variable, in the other case because a programmer had overlooked that short integers do not suffice to count the seconds in a day. I also remember a case of unusable data summaries calculated on-line by the recording apparatus (we noticed the programming error only because the maximum occasionally fell below the average)
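The second failure is easy to reproduce: a signed 16-bit ("short") counter cannot hold the 86,400 seconds in a day. The Python sketch below emulates such a counter; it illustrates the failure mode and is not the original recording code.

    SHORT_MAX = 2**15 - 1                      # 32,767, the largest signed 16-bit value

    def to_int16(x):
        """Wrap x into the signed 16-bit range (two's complement), as a short would."""
        x &= 0xFFFF
        return x - 0x10000 if x > SHORT_MAX else x

    seconds_per_day = 24 * 60 * 60             # 86,400, well beyond 32,767
    print(to_int16(seconds_per_day))           # prints 20864: the leading bits are lost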
From page 178...
... Models are the domain of subject matter specialists, not of statisticians; not all models are stochastic! Therefore, modelling is one of the areas least amenable to a unified treatment and thus poses some special challenges with regard to its integration into general purpose data analysis software through export and import of derived sets.
From page 179...
... For a non-trivial example of preprocessing, compare Ralph Kahn's description of the Earth Observing System and the construction of several layers of derived data sets. For one beginning with subsetting, see Eric Lander's description of how a geneticist will find the genes responsible for a particular disease: in a first step, the location in the human genome (which is a huge data set, 3 x 10^9 base pairs)
From page 180...
... But it is difficult to put the general notion of dimension reduction on a sound theoretical basis; exploratory projection pursuit comes closest, but as its computational complexity increases exponentially with dimension, it is not well suited to massive data sets.
From page 181...
... : remember that the k leading singular values yield the best approximation (in the square norm sense) to the data matrix by a matrix of rank k.
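This Eckart-Young property can be checked numerically. The numpy sketch below uses a random stand-in for the data matrix and verifies that the error of the rank-k truncation equals the root-sum-of-squares of the discarded singular values:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 30))          # stand-in for a data matrix

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 5
    X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]        # best rank-k approximation

    print(np.linalg.norm(X - X_k))              # Frobenius error of the truncation ...
    print(np.sqrt(np.sum(s[k:] ** 2)))          # ... equals the discarded singular values' RSS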
From page 182...
... Can general purpose data analysis take advantage of supercomputers? Dongarra's famous benchmark comparison (I have the version of April 13, 1995 in front of me)
From page 183...
... In data analysis, you never hit it right the first, or the second, or even the third time around, and it must be possible to play interactively with modifications, but without having to start everything from scratch. Rather than building a system on top of Smalltalk or LISP, we decided to augment our data analysis language ISP so that it acquired its own programming environment.
From page 184...
... Subset manipulation and other data base operations, in particular the linking of originally unrelated data sets, are very important. We need a data base management system with characteristics rather different from those of a traditional DBMS.
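As a hypothetical sketch of such a linking operation, two independently collected sets (hospital admissions and air-quality measurements, echoing the earlier health/environment example) are joined on the keys they happen to share; all names are illustrative, not from the paper.

    import pandas as pd

    health = pd.read_parquet("hospital_admissions.parquet")   # keyed by region and date
    air = pd.read_parquet("air_quality.parquet")              # collected independently

    linked = health.merge(air, on=["region", "date"], how="left",
                          validate="many_to_one")             # many admissions per measurement
    linked.to_parquet("admissions_with_air_quality.parquet")  # a new derived set for analysis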

