
2 Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and Other Endeavors
Pages 22-40

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 22...
... Analyses of the information contained in these data sets have already led to major breakthroughs in fields ranging from genomics to astronomy and high-energy physics, encompassing every scale of the physical world. Yet much more remains, and the great increase in scale of the data creates complex challenges for traditional analysis techniques.
From page 23...
... Social media (e.g., Facebook, YouTube, Twitter) have exploded beyond anyone's wildest imagination, and today some of these companies have hundreds of millions of users.
From page 24...
... Initiatives in research and development that are leading to improved capabilities include the following:
• Dealing with highly distributed data sources,
• Tracking data provenance, from data generation through data preparation,
• Validating data,
• Coping with sampling biases and heterogeneity,
• Working with different data formats and structures,
• Developing algorithms that exploit parallel and distributed architectures,
• Ensuring data integrity,
• Ensuring data security,
• Enabling data discovery and integration,
• Enabling data sharing,
• Developing methods for visualizing massive data, and
• Developing scalable and incremental algorithms.
As data volumes increase, the ability to perform analysis on the data is constrained by the increasingly distributed nature of modern data sets.
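As an illustration of what "incremental" means in the last item of the list above, the following minimal sketch updates a running mean and variance one observation at a time, so the data never has to be held in memory or rescanned. It uses Welford's method, a standard technique chosen here for illustration and not one named in the report; the class name RunningStats and the sample values are hypothetical.

```python
class RunningStats:
    """Incrementally tracks count, mean, and variance (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the summary in O(1) time."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; NaN when fewer than two observations were seen."""
        if self.count < 2:
            return float("nan")
        return self._m2 / (self.count - 1)


# Hypothetical stream of values, consumed one at a time.
stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

Because each update touches only three numbers, the same summary can be maintained over a stream of any length.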
From page 25...
... Finally, challenges exist in better visualizing massive data sets. While there have been advances in visualizing data through various approaches, most notably geographic information system-based capabilities, better methods are required to analyze massive data, particularly data sets that are heterogeneous in nature and may exhibit critical differences in information that are difficult to summarize.
From page 26...
... Existing supercomputers are not well suited for data-intensive computations either, because while they maximize CPU cycles, they lack I/O bandwidth to the mass storage layer. Moreover, most supercomputers lack disk space adequate to store petabyte-size data sets over the multi-month periods that are required for a detailed exploratory analysis.
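The I/O constraint described above can be made concrete with a back-of-the-envelope calculation. The sketch below estimates how long a single sequential pass over a data set takes at a given sustained bandwidth. The 1 GB/s figure is an assumption chosen for illustration, not a number from the report, and the function name scan_time_days is hypothetical.

```python
SECONDS_PER_DAY = 86_400
PETABYTE = 10**15  # bytes, decimal convention


def scan_time_days(dataset_bytes, bandwidth_bytes_per_s):
    """Days needed for one sequential pass over the data set."""
    return dataset_bytes / bandwidth_bytes_per_s / SECONDS_PER_DAY


# Assumed sustained bandwidth to mass storage: 1 GB/s (illustrative only).
days = scan_time_days(PETABYTE, 10**9)
print(f"One pass over 1 PB at 1 GB/s takes about {days:.1f} days")
```

Under this assumption a single pass takes more than a week, which suggests why exploratory analysis that rereads the data many times is limited by I/O bandwidth and not by CPU cycles.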
From page 27...
... These models work well in the commercial setting, where enormous resources are spent on harvesting and collecting the data through actions such as Internet crawling, aerial photos for geospatial information systems, or collecting user data in search engines. Some of the technical trends that have been occurring to address the data challenges include the following: • Distributed systems (access, federation, linking, etc.)
From page 28...
... Such a pipeline may also extract data from operational databases and systems and put that data into environments where it can be prepared and fused with other data sets and staged into systems that support analysis. Challenges include co-utilization of services, workflow discovery, workflow sharing, and maintaining information on metadata, information pedigree, and information assurances as data moves through the workflow.
From page 29...
... Climate research also continues to grow at a rapid pace as climate models and satellite observations grow and are needed to support new discovery. NASA's Earth Science enterprise, for example, now manages data collections in the several-petabyte range.
From page 30...
... When a community of 100,000 people starts collecting high-resolution images, the aggregate data from amateur astronomers may easily outgrow the professional astronomy community.
Biological and Medical Research
A substantial amount of analysis is being performed using data collected by medical information systems, most notably patient electronic health records.
From page 31...
... Medical researchers are gathering together to share information about interventions and outcomes in order to perform retrospective analysis, and insurance companies continue to mine data to improve their own models. The genomics revolution is proceeding apace, with the cost of sequencing a single human genome soon to drop below $1,000.
From page 32...
... Several papers appearing in the top journals have used this facility.
Telecommunications and Networking
Managing a modern globe-spanning highly reliable communications network requires extensive real-time network monitoring and analysis capabilities.
From page 33...
... Further, the global network contains tens of thousands of network elements distributed worldwide. In addition to the large volumes of data involved, the major problem in telecommunications and networking data analysis is the complexity of the data sets (hundreds to thousands of distinct data feeds)
From page 34...
... The availability of, and interest in, such massive network data has increased as social media sites have become more prevalent; as data records for public functions (such as home sales records and criminal activity reports) have increasingly been made public; and as various corporations make such data available, at least in anonymized form (such as phone records, Internet movie databases, and the web-of-science)
From page 35...
... updating metrics as data changes. Geotemporal network data present still further challenges, due in part to the infrastructure constraints that inhibit transmitting and sharing geo-images and the lack of large-scale, well-validated, spatial data for locations of interest in network analyses.
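One way to read "updating metrics as data changes" is that a network metric should be adjusted when an edge arrives or disappears, without being recomputed from the full graph. The sketch below does this for node degree, the simplest such metric; it is an illustrative assumption and not a method described in the report, and the class name DegreeTracker and the node labels are hypothetical.

```python
from collections import defaultdict


class DegreeTracker:
    """Maintains node degrees of an undirected graph under edge updates."""

    def __init__(self):
        self.degree = defaultdict(int)
        self.edges = set()

    def add_edge(self, u, v):
        """Record a new edge; self-loops and duplicates are ignored."""
        edge = frozenset((u, v))
        if u == v or edge in self.edges:
            return
        self.edges.add(edge)
        self.degree[u] += 1
        self.degree[v] += 1

    def remove_edge(self, u, v):
        """Delete an edge if present and adjust both endpoint degrees."""
        edge = frozenset((u, v))
        if edge not in self.edges:
            return
        self.edges.remove(edge)
        self.degree[u] -= 1
        self.degree[v] -= 1


# Hypothetical stream of edge changes.
tracker = DegreeTracker()
tracker.add_edge("a", "b")
tracker.add_edge("a", "c")
tracker.add_edge("b", "c")
tracker.remove_edge("a", "b")
```

Each change costs constant time here. Metrics that depend on paths through the whole network, such as betweenness, are much harder to maintain this way.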
From page 36...
... Social network analysis technologies can be used to assess social media data, while social network theory can be used to address how people will connect via social media and how it will change the nature of their interactions. However, many challenges remain.
From page 37...
... How can one measure, assess, forecast, and alter social sentiment using diverse social media? What new scalable social-network techniques are needed for assessing sentiment, identifying sources of sentiment, tracking changes in groups and sentiment simultaneously, determining whether the opinion leaders across groups are the same or different, and so on?
From page 38...
... Transparency must be maintained, and the analysts must be able to check sources for any node or link in an extracted network. An increasing number of jobs require tracking information using social network data, and an increasing number of activities that individuals engage in can be discerned from information on the individual's social network, particularly when multiple networks, multiple types of nodes, and multiple relations can be overlain on one another.
From page 39...
... Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media. Available at http://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/viewFile/1521/1832.
From page 40...
... Comparing data from Twitter's streaming API with Twitter's firehose. In Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM)


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.