5 Data Management and Bioinformatics Challenges of Metagenomics
Pages 85-97

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 85...
... There are three nucleic acid sequence archives, all founded in the 1980s: GenBank, funded by the National Institutes of Health (NIH) through the National Library of Medicine; EMBL-Bank, funded by the European Molecular Biology Laboratory; and the DNA Databank of Japan 
From page 86...
... Despite the challenges arising from some of the new sequencing methods, timely deposition of raw sequence data to the Trace Archive by the metagenomics community will also be of great long-term community benefit. The nucleic acid sequence data archives are a primary source of experimentally determined DNA and RNA sequences.
From page 87...
... This is far from being a solved problem even for "complete" genomes. It will be even more difficult for the fragmentary sequences that will typically be obtained in metagenomics projects.
From page 88...
... UniProt is not a primary database, but rather a highly curated database of protein sequences, the vast majority of which are derived computationally from gene models in the nucleic sequence data archive. Not surprisingly, the growth of UniProt has been slower than that of the nucleic acid sequence archive (see Figure 5-2)
From page 89...
... (Data from the European Bioinformatics Institute.)
From page 90...
... All the new-generation technologies produce sequence read lengths that are short: 25-200 bases, compared with 800-1000 bases for Sanger capillary sequencing technologies.
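To make the read-length comparison concrete, the following is a minimal arithmetic sketch, not part of the original chapter. It estimates how many reads are needed to reach a given average coverage, using the standard relation reads = coverage × genome size / read length. The 5 Mb genome size, the 8x coverage target, and the function name `reads_for_coverage` are illustrative assumptions; only the read lengths come from the text.

```python
import math


def reads_for_coverage(genome_size_bp: int, read_length_bp: int, coverage: float) -> int:
    """Reads needed so that total sequenced bases equal coverage * genome size."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


# Assumed values for illustration only (not from the source text).
GENOME_SIZE_BP = 5_000_000  # hypothetical 5 Mb bacterial genome
TARGET_COVERAGE = 8.0       # hypothetical average coverage

# Read lengths quoted in the text: 25-200 bases (new-generation), 800-1000 bases (Sanger).
for label, read_length in [
    ("new-generation, 25 bases", 25),
    ("new-generation, 200 bases", 200),
    ("Sanger, 800 bases", 800),
    ("Sanger, 1000 bases", 1000),
]:
    n_reads = reads_for_coverage(GENOME_SIZE_BP, read_length, TARGET_COVERAGE)
    print(f"{label}: {n_reads:,} reads")
```

Under these assumptions, the shortest reads require 40 times as many reads as the longest Sanger reads for the same coverage, which is one reason short reads multiply the number of records that archives and assembly tools must handle.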
From page 91...
... If metagenomic sequence data are to be used to their fullest advantage, a metadata infrastructure, which defines the data that are to be collected and their semantics, is an urgent need. As indicated above, no single metadata standard will be appropriate for all samples.
From page 92...
... At the Department of Energy's Joint Genome Institute (Walnut Creek, CA), an existing microbial genome database project, Integrated Microbial Genomes, is being extended to cope with metagenomics data in a project called IMG/M. The objective of IMG/M is to integrate conventional microbial genomics data with data from metagenomics projects.
From page 93...
... BOX 5-1 The Metagenomic Data Deluge: Future Data Storage and Access Challenges. From the perspective of sequence data repositories, projected data storage needs for archiving Sanger-based capillary sequence data might not seem overly formidable. Every year disk space gets cheaper, with storage density increasing steadily.
From page 94...
... Not only will it be necessary for databases to include metadata about habitat and sample treatment, it will also be critical to document how the raw data have been processed, filtered, and analyzed. Maintenance and curation of metagenomics databases will greatly add to their value, but are expensive and will require consistent support.
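As an illustration only, a record that carries both sample metadata and a processing history might be sketched as follows. This is not a schema proposed in the chapter or used by any existing database: the class names, field names, and example values are all hypothetical. The free-form `extra_metadata` dictionary reflects the point made earlier that no single metadata standard will suit all samples.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProcessingStep:
    """One documented operation applied to the raw data (hypothetical structure)."""
    action: str                      # e.g. "filter", "assemble"
    tool: str                        # name of the software used
    version: str                     # exact version, so the step can be reproduced
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class SampleRecord:
    """Sample metadata plus an ordered processing history (hypothetical structure)."""
    sample_id: str
    habitat: str
    sample_treatment: str
    extra_metadata: Dict[str, str] = field(default_factory=dict)  # habitat-specific fields
    history: List[ProcessingStep] = field(default_factory=list)

    def add_step(self, step: ProcessingStep) -> None:
        """Append a step; order is preserved so the pipeline can be replayed."""
        self.history.append(step)

    def describe_history(self) -> List[str]:
        """Return one human-readable line per processing step, in order."""
        return [f"{i}. {s.action} with {s.tool} {s.version}"
                for i, s in enumerate(self.history, start=1)]


# Example with made-up values.
record = SampleRecord(
    sample_id="S001",
    habitat="soil",
    sample_treatment="0.2 micron filtration",
    extra_metadata={"depth_cm": "10"},
)
record.add_step(ProcessingStep("filter", "example-trimmer", "1.0", {"min_quality": "20"}))
record.add_step(ProcessingStep("assemble", "example-assembler", "2.3"))
print("\n".join(record.describe_history()))
```

Keeping the processing steps as an ordered list alongside the habitat and treatment fields means that a later user can see not only where a sample came from but also which filters and analyses produced the deposited data.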
From page 95...
... ANALYSIS OF METAGENOMIC SEQUENCE DATA. Data from metagenomics projects share features that will require the development of novel computational tools and perhaps a new paradigm for the analysis of DNA data. In genome projects, the organization of the DNA in the organism was well known: a circular chromosome and plasmids in bacteria and multiple chromosomes in eukaryotes.
From page 96...
... for the community to annotate genomic sequences in GenBank, EMBL-Bank, and DDBJ, the original authors' annotations, even if outdated, remain as primary annotations seen in the database. For instance, annotations added through curation at the appropriate model organism database are only very slowly being incorporated into central databases.
From page 97...
... However, the Microbe Project, a US government interagency group, has the appropriate broad membership to facilitate coordination and communication among the interested scientific communities (see Chapter 6)


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.