

5 Data, Representation, and Information
Pages 79-100

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 79...
... Gray couples this review with introspection about the ways in which database researchers approach these problems. Databases support storage and retrieval of information by defining, in advance, a complex structure for the data that supports the intended operations.
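The idea of a structure fixed in advance can be made concrete with a small sketch, not taken from the chapter, using Python's built-in sqlite3 module; the table and column names are purely illustrative.

```python
# Minimal sketch: a database declares its structure up front, then supports
# storage and retrieval against that pre-declared structure. Illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is defined in advance; it fixes the shape the data must follow.
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        amount   REAL NOT NULL
    )
""")

# Both storage and retrieval go through that structure.
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 12.50), ("bob", 7.25), ("alice", 3.00)],
)
for customer, total in conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
):
    print(customer, total)
```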
From page 80...
... Database industry leaders are all U.S.-based corporations: IBM, Microsoft, and Oracle are the three largest. There are several specialty vendors: Tandem sells over $1 billion/year of fault-tolerant transaction processing systems, Teradata sells about $1 billion/year of data-mining systems, and companies like Information Resources Associates, Verity, Fulcrum, and others sell specialized data- and text-mining software.
From page 81...
... Theoretical work on distributed databases led to prototypes that in turn led to products. Today, all the major database systems offer the ability to distribute and replicate data among nodes of a computer network.
From page 82...
... The database research community now has a major focus on stream data processing. Traditionally, databases have been stored locally and are
From page 83...
... pioneered distributed database technology and object-oriented database technology. Projects at Stanford University fostered deductive database technology, data integration technology, query optimization technology, and the popular Yahoo!
From page 84...
... The Ingres project went on to investigate distributed databases, database inference, active databases, and extensible databases.
From page 85...
... Today the parallel systems from IBM, Tandem, Oracle, Informix, Sybase, and Microsoft all have a direct lineage from the Wisconsin research on parallel database systems. The use of parallel database systems for data mining is the fastest-growing component of the database server industry.
From page 86...
... After a decade of experimentation, these research ideas evolved into the SQL database language. Having this high-level, non-procedural language was a boon both to application programmers and to database implementers.
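To illustrate what "non-procedural" buys the application programmer, here is a small sketch, again illustrative rather than from the chapter, that poses the same question declaratively in SQL and then procedurally in application code; the table and values are made up, and SQLite stands in for any SQL system.

```python
# Illustrative contrast: the same question asked non-procedurally (SQL) and
# procedurally (hand-written application code). Data and names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [("ann", "db", 90.0), ("bo", "ir", 80.0), ("cy", "db", 95.0)],
)

# Non-procedural: state what result is wanted; the system decides how to get it.
declarative = conn.execute(
    "SELECT dept, AVG(salary) FROM emp WHERE salary > 85 GROUP BY dept"
).fetchall()

# Procedural: the application spells out the scan, the filter, and the aggregate.
totals, counts = {}, {}
for name, dept, salary in conn.execute("SELECT name, dept, salary FROM emp"):
    if salary > 85:
        totals[dept] = totals.get(dept, 0.0) + salary
        counts[dept] = counts.get(dept, 0) + 1
procedural = [(dept, totals[dept] / counts[dept]) for dept in totals]

print(declarative, procedural)  # both compute the same per-department average
```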
From page 87...
... Today, there are very good tools for defining and querying traditional database systems, but there are still major research challenges in the traditional database field. The major focus is automating as many of the data administration tasks as possible, making the database system self-healing and self-managing.
From page 88...
... Similarly, in the history of computer science, our information needs and our information capabilities have driven parts of the research agenda. Information retrieval systems take some kind of information, such as text documents or pictures, and try to retrieve topics or concepts based on words or shapes.
From page 89...
... When our documents were few and short, the main problem was not to miss any, and the research at the time stressed algorithms that found related words via associations or improved recall with techniques like relevance feedback. Then, of course, several other advances -- computer typesetting and word processing to generate material and cheap disks to hold it -- led to much larger text collections.
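The chapter names relevance feedback but does not spell out an algorithm; as a concrete illustration, the sketch below uses one classic formulation, a Rocchio-style query update, with made-up term weights and conventional parameter values.

```python
# A minimal Rocchio-style relevance feedback sketch: move the query vector
# toward documents the user marked relevant and away from the others.
# Vectors are plain term-weight dicts; alpha/beta/gamma are conventional
# illustrative values, not figures from the chapter.
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms |= set(doc)
    updated = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in relevant) / max(len(relevant), 1)
        neg = sum(d.get(t, 0.0) for d in nonrelevant) / max(len(nonrelevant), 1)
        w = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
        if w > 0:
            updated[t] = w
    return updated

# The reformulated query also weights terms drawn from the relevant documents,
# which is how the technique pulls in related material and improves recall.
q = {"parallel": 1.0, "database": 1.0}
rel = [{"parallel": 0.8, "database": 0.9, "partitioning": 0.7}]
nonrel = [{"parallel": 0.6, "lines": 0.9}]
print(rocchio(q, rel, nonrel))
```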
From page 90...
... Cheaper storage led to larger and larger text collections online. Now there are many terabytes of data on the Web.
From page 91...
... This data stimulated a number of projects looking at how to handle bilingual material, including work on automatic alignment of the parallel texts, automatic linking of similar words in the two languages, and so on.4 A similar effect was seen with the Brown corpus of tagged English text, where the part of speech of each word (e.g., whether a word is a noun or a verb) was identified.
From page 92...
... Geographical data started showing up in machine-readable form during the 1980s, especially with the release of the Dual Independent Map Encoding (DIME) files after the 1980
From page 93...
... After all, if a CD-ROM contains about 300,000 times as many bytes per pound as a deck of punched cards, and a digitized video has about 500,000 times as many bytes per second as the ASCII script it comes from, video today should be about where text was in the 1960s. And indeed there are a few projects, most notably the Informedia project at Carnegie Mellon University, that experiment with video signals; they do not yet have ways of searching enormous collections, but they are developing algorithms that exploit whatever they can find in the video: scene breaks, closed-captioning, and so on.
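As a rough sanity check on the first of those ratios, the back-of-the-envelope arithmetic below uses assumed physical figures that are not from the chapter (an 80-column punched card of about 2.5 g holding 80 bytes; a 650 MB CD-ROM with its case at about 80 g) and lands in the same few-hundred-thousand range the text quotes.

```python
# Rough check of the "bytes per pound" comparison. All physical figures here
# are assumptions for illustration, not numbers taken from the chapter.
GRAMS_PER_POUND = 453.6

card_bytes_per_pound = 80 * GRAMS_PER_POUND / 2.5        # ~1.5e4 bytes/lb
cdrom_bytes_per_pound = 650e6 * GRAMS_PER_POUND / 80.0   # ~3.7e9 bytes/lb

ratio = cdrom_bytes_per_pound / card_bytes_per_pound
print(f"CD-ROM vs punched cards: about {ratio:,.0f}x bytes per pound")
# Prints a few hundred thousand, the same order of magnitude as the
# chapter's "about 300,000 times" figure.
```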
From page 94...
... And there are many other interesting projects specifically linked to an individual data source. Among examples:
· The British Library scanning of the original manuscript of Beowulf in collaboration with the University of Kentucky, working on image enhancement until the result of the scanning is better than reading the original;
· The Perseus project, demonstrating the educational applications possible because of the earlier Thesaurus Linguae Graecae project, which digitized all the classical Greek authors;
· The work in astronomical analysis stimulated by the Sloan Digital Sky Survey;
· The creation of the field of "forensic paleontology" at the University of Texas as a result of doing MRI scans of fossil bones;
· And, of course, the enormous amount of work on search engines stimulated by the Web.
From page 95...
... can push computer scientists to new algorithms. In both cases, synthesis of specific instances into concepts is a crucial problem.
From page 96...
... We demand more complex measurement and description, and fewer smoothing metaphors and lowest common denominators. Thus, to scientists, atoms appear as clouds of probability; evolution appears as a branching, labyrinthine bush in which some branches die out and others diversify.
From page 97...
... It uses sorting mechanisms, hypertextual display, animation, and the like to allow people to handle the evidence of this part of the past for themselves. This isn't cutting-edge computer science, of course, but it's darned hard and deeply disconcerting to some, for it seems to abdicate responsibility, to undermine authority, to subvert narrative, to challenge story.
From page 98...
... It explicitly linked the past to the present and held out a history of obvious and immediate use. But that quantitative social science history collapsed suddenly, the victim of its own inflated claims, limited method and machinery, and changing academic fashion.
From page 99...
... Manipulable histograms, maps, and time lines promise a social history that is simultaneously sophisticated and accessible. We have what earlier generations of social science historians dreamed of: a fast and widely accessible network linked to cheap and powerful computers running common software with well-established standards for the handling of numbers, texts, and images.

