Currently Skimming:

From Massive Data Sets to Science Catalogs: Applications and Challenges
Pages 129-142

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 129...
... keywords: science data analysis, limitations of current methods, challenges for massive data sets, classification learning, clustering. NOTE: Both authors are affiliated with Machine Learning Systems Group, Jet Propulsion Laboratory, M/S 525-3660, California Institute of Technology, Pasadena, CA 91109, http://www-aig.jpl.nasa.gov/mls/.
From page 130...
... Both of these image databases are too large for manual visual analysis and provide excellent examples of the need for automated analysis tools. The POSS-II application demonstrates the benefits of using a trainable classification approach in a context where the transformation from pixel space to feature space is well-understood.
From page 131...
... In this case, the significant challenges in developing a cataloging system lie in the feature extraction stage: moving from a pixel representation to a relatively invariant feature representation.
1.2 Developing Science Catalogs from Data
In a typical science data cataloging problem, there are several important steps: 1.
From page 132...
... The total number of features measured for each object by SKICAT is 40, including magnitudes, areas, sky brightness, peak values, and intensity weighted and unweighted pixel moments. Some of these features are generic in the sense that they are typically used in analyses of astronomical image data [31]; other features such as normalized and non-linear combinations are derived from the generic set.
From page 133...
...
2.1.3 SKICAT Classification Results
Stable test classification accuracy rates of about 94% were obtained using RULER, compared to the original trees, which had an accuracy of about 90%. Note that such high classification accuracy results could only be obtained after expending significant effort on defining more robust features that captured sufficient invariances between various plates.
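As a rough worked example, assuming both figures are test-set accuracies (about 94% for RULER, about 90% for the original trees), the sketch below converts them to error rates and computes the relative reduction in error. The function name is illustrative and does not come from the chapter.

```python
def relative_error_reduction(baseline_accuracy_pct: float,
                             improved_accuracy_pct: float) -> float:
    """Return the fraction of the baseline error removed by the improved classifier."""
    baseline_error = 100.0 - baseline_accuracy_pct   # percentage points of error
    improved_error = 100.0 - improved_accuracy_pct
    if baseline_error <= 0:
        raise ValueError("baseline accuracy must be below 100%")
    return (baseline_error - improved_error) / baseline_error


# Approximate figures quoted in the text: ~90% for the trees, ~94% for RULER.
reduction = relative_error_reduction(90.0, 94.0)
print(f"relative error reduction: {reduction:.0%}")
```

Under that reading, a four-point gain in accuracy corresponds to removing roughly 40% of the errors made by the original trees.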
From page 134...
... The total combined volume of pre-Magellan Venus image data available from various past US and USSR spacecraft and ground-based observations represents only a tiny fraction of the Magellan data set. Thus, the Magellan mission has provided planetary scientists with an unprecedented data set for Venus science analysis.
From page 135...
... about the spatial and temporal evolution of features on the surface of the Sun; how to incorporate this prior information effectively in an automated cataloging system is a non-trivial technical issue. Another ongoing project involves the detection of atmospheric patterns (such as cyclones)
From page 136...
... Current work on SKICAT focuses on exploring the utility of clustering techniques to aid in scientific discovery. The basic idea is to search for clusters in the large data sets (millions to billions of entries in the sky survey catalog database)
From page 137...
... Methods which offer parsimony and insight will tend to be preferred over more complex methods which offer slight performance gains but at a substantial loss of interpretability.
3.3 Subjective Human Annotation of Data Sets for Classification Purposes
For scientific data, performance evaluation is often subjective in nature since there is frequently no "gold standard." As an example, consider the volcano detection problem: there is no way at present to independently verify whether any of the objects which appear to look like volcanoes in the Magellan-SAR imagery truly represent volcanic edifices on the surface of the planet.
From page 138...
... The SKICAT application is an example of manual feature extraction followed by greedy automated feature selection. The Venus application relies entirely on a reduction from high-dimensional pixel space to a low-dimensional principal component-based feature space.
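The "greedy automated feature selection" mentioned for SKICAT can be illustrated with a generic forward-selection loop. This is a minimal sketch and not the SKICAT procedure: the nearest-centroid scoring function, the toy data, and all names below are assumptions chosen only to keep the example self-contained.

```python
from typing import Callable, List, Sequence


def greedy_forward_selection(n_features: int,
                             score: Callable[[List[int]], float],
                             max_features: int) -> List[int]:
    """Add one feature at a time, keeping the one that most improves `score`."""
    selected: List[int] = []
    best_score = float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # Score every one-feature extension; ties go to the higher index.
        cand_score, cand_feature = max(
            (score(selected + [f]), f) for f in candidates
        )
        if cand_score <= best_score:
            break  # no extension improves the score, so stop early
        selected.append(cand_feature)
        best_score = cand_score
    return selected


def centroid_accuracy(rows: Sequence[Sequence[float]],
                      labels: Sequence[int],
                      features: List[int]) -> float:
    """Training accuracy of a nearest-centroid rule (a hypothetical stand-in scorer)."""
    classes = sorted(set(labels))
    centroids = {}
    for c in classes:
        members = [r for r, y in zip(rows, labels) if y == c]
        centroids[c] = [sum(r[f] for r in members) / len(members) for f in features]
    correct = 0
    for r, y in zip(rows, labels):
        def sq_dist(c: int) -> float:
            return sum((r[f] - m) ** 2 for f, m in zip(features, centroids[c]))
        if min(classes, key=sq_dist) == y:
            correct += 1
    return correct / len(rows)


# Toy data (invented): feature 0 separates the classes, features 1 and 2 do not.
ROWS = [(0.0, 5.0, 1.0), (0.1, 1.0, 0.0), (0.2, 4.0, 1.0),
        (1.0, 5.0, 1.0), (1.1, 1.0, 0.0), (0.9, 4.0, 1.0)]
LABELS = [0, 0, 0, 1, 1, 1]

selected = greedy_forward_selection(
    3, lambda fs: centroid_accuracy(ROWS, LABELS, fs), max_features=2
)
print(selected)
```

On this toy data the loop keeps only feature 0 and stops, because adding either remaining feature does not raise the score. A greedy search of this kind evaluates far fewer subsets than an exhaustive search, at the cost of possibly missing combinations of features that are only useful together.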
From page 139...
... In conclusion, we point out that although our focus has been on science-related applications, massive data sets are rapidly becoming commonplace in a wide spectrum of activities including healthcare, marketing, finance, banking, engineering and diagnostics, retail, and many others. A new area of research, bringing together techniques and people from a variety of fields including statistics, machine learning, pattern recognition, and databases, is emerging under the name: Knowledge Discovery in Databases (KDD)
From page 140...
... In Proceedings of the 1994 Computer Vision and Pattern Recognition Conference, CVPR-94, Los Alamitos, CA: IEEE Computer Society Press, pp. 302-309. Cattermole, P
From page 141...
... In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pp. 300-305, U


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.