Currently Skimming:

5 Shared Resources
Pages 31-39

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 31...
... • Yahoo Webscope is a reference library of large, scientifically useful, publicly available data sets for researchers to use. (Ron Brachman)
From page 32...
... Examples include health, particularly population health; financial markets; climate; and biodiversity. Ré used the latter as a specific example: broadly speaking, biodiversity research involves assembling information about Earth in various disciplines to make estimates of species extinction.
From page 33...
... However, synthetic knowledge bases are feature based and require a priori knowledge of what is sought from the data set. A participant noted that noise, including misspelled words and words that have multiple meanings, is a standard problem for optical character recognition (OCR)
From page 34...
... • The system can be used continuously to analyze large, complex data sets, generate new ideas, and serve as a test bed. Cleveland then described the divide and recombine method.
From page 35...
... Cleveland emphasized that the complexity of the data set is more critical to the computations than the overall size; however, size and complexity are often correlated. In response to a question from the audience, Cleveland stated that training students in these methods, even students who are not very familiar with computer science and statistics, is not difficult.
From page 36...
... Yahoo was interested in creating data sets for academics around the time of the AOL incident, and the AOL experience caused a slow start for Yahoo. Yahoo persisted, however, working on important measures to ensure privacy, and has developed the Webscope data sharing program.
From page 37...
... These data sets consist of freely available data of broad interest to the community and include Yahoo Webscope data, Common Crawl data gathered by the open-source community (240 TB), Earth science satellite data (40 TB of data from NASA)
From page 38...
... Ryland explained that AWS provides identity control and authentication features, including Web Identity Federation. He also described a science-oriented data service (Globus)
From page 39...
... A workshop participant posited that advanced tools, such as the AWS tools, enable students to use systems to do large-scale computation without fully understanding how it works. Ryland responded that this is a pattern in computer science: a new level of abstraction develops, and a compiled tool is developed.


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.