Currently Skimming:

3 On the Nature of Biological Data
Pages 35-56

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 35...
... Sequence data, such as those associated with the DNA of various species, have grown enormously with the development of automated sequencing technology. In addition to the human genome, a variety of other genomes have been collected, covering organisms including bacteria, yeast, chicken, fruit flies, and mice.2 Other projects seek to characterize the genomes of all of the organisms living in a given ecosystem even without knowing all of them beforehand.3 Sequence data generally ... 1 This discussion of data types draws heavily on H.V.
From page 36...
... , or other types of grammars. Patterns are also interesting in the exploration of protein structure data, microarray data, pathway data, proteomics data, and metabolomics data.
From page 37...
... . Examples of phenomena with a temporal dimension include cellular response to environmental changes, pathway regulation, dynamics of gene expression levels, protein structure dynamics, developmental biology, and evolution.
From page 38...
... The raw output of a microarray experiment is a listing of fluorescent intensities associated with spots in an array; apart from complicating factors, the brightness of these spots is an indication of the expression level of the transcript associated with them. On the other hand, the complicating factors are many, and in some cases ignoring these factors can render one's interpretation of microarray data completely irrelevant.
From page 39...
... For example, Tu et al. found that hybridization noise is strongly dependent on expression level, and in particular the hybridization noise is mostly Poisson-like for high expression levels but more complex at low expression levels.7
· Differential binding strengths for different probe-target combinations.
From page 40...
... is that they have large amounts of data, but the operations to be performed on the data are simple," and also that under such circumstances, "the modification of the database scheme is very infrequent, compared to the rate at which queries and other data manipulations are performed."15 The situation in biology is the reverse. Modern information technologies can handle the volumes of data that characterize 21st century biology, but they are generally inadequate to provide a seamless integration of biological data across multiple databases, and commercial database technology has proven to have many limitations in biological applications.16 For example, although relational databases have often been used for biological data management, they are clumsy and awkward to use in many ways.
From page 41...
... : http://www.ddbj.nig.ac.jp/
EMBL Nucleotide Sequence Databank: http://www.ebi.ac.uk/embl/index.html
PIR (Protein Information Resource): http://pir.georgetown.edu/
Swiss-Prot: http://www.expasy.ch/sprot/sprot-top.html
Biomolecular interactions
BIND (Biomolecular Interaction Network Database)
From page 42...
... : family and protein domains http://scop.mrc-lmb.cam.ac.uk/scop/
CATH (Protein Structure Classification Database): http://www.biochem.ucl.ac.uk/bsm/cath-new/index.html
Pfam: http://pfam.wustl.edu/
PROSITE database for protein family and domains: http://www.expasy.ch/prosite/
BLOCK: http://www.blocks.fhcrc.org/
Protein pathway
KEGG (Kyoto Encyclopedia of Genes and Genomes)
From page 43...
... More importantly, relational databases presume the existence of well-defined and known relationships between data records, whereas the reality of biological research is that relationships are imprecisely known -- and this imprecision cannot be reduced to probabilistic measures of relationship that relational databases can handle. Jagadish and Olken argue that without specialized life sciences enhancements, commercial relational database technology is cumbersome for constructing and managing biological databases, and most approximate sequence matching, graph queries on biopathways, and three-dimensional shape similarity queries have been performed outside of relational data management systems.
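To make "approximate sequence matching" concrete, here is a minimal sketch of one common form of it: filtering a collection of sequences by Levenshtein edit distance to a query. It is offered only as an illustration of why such queries sit awkwardly in a plain relational query language; the function names `edit_distance` and `approximate_matches`, and the choice of edit distance as the similarity measure, are assumptions and are not taken from the text.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def approximate_matches(query: str, sequences: List[str],
                        max_distance: int) -> List[str]:
    """Return the sequences within max_distance edits of the query
    (hypothetical helper, illustration only)."""
    return [s for s in sequences if edit_distance(query, s) <= max_distance]


# Toy sequences, invented for this example.
hits = approximate_matches("GATTACA", ["GACTATA", "GATTACA", "CCCCCCC"], 2)
```

An exact-match predicate would find only the identical sequence here; the distance threshold also admits the near match, which is the kind of query the passage describes as typically performed outside the database system.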
From page 44...
... ; detailed data provenance; extensive terminology management; rapid schema evolution; temporal data; and management for a variety of mathematical and statistical models of organisms and biological systems. Data organization and management present major intellectual challenges in integration and presentation, as discussed in Chapter 4.
From page 45...
... If data sources are always associated with data, any work based on that data will automatically have a link to the original source; hence proper acknowledgment of intellectual credit will always be possible. Without automated data provenance, it is all too easy for subsequent researchers to lose the connection to the original source.
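The idea of data that is "always associated" with its source can be sketched as a small data structure in which every derived value inherits the sources of its inputs. This is an illustrative sketch only, not a description of any particular provenance system; the names `Record` and `derive` and the source labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Record:
    """A data value that always carries the sources it came from."""
    value: float
    sources: Tuple[str, ...] = ()


def derive(func: Callable[..., float], *inputs: Record) -> Record:
    """Apply func to the input values and merge the inputs' sources,
    preserving order and dropping duplicates."""
    merged = []
    for rec in inputs:
        for src in rec.sources:
            if src not in merged:
                merged.append(src)
    return Record(func(*(r.value for r in inputs)), tuple(merged))


# Hypothetical source labels, for illustration only.
a = Record(2.0, ("lab-A/dataset-1",))
b = Record(3.0, ("lab-B/dataset-7",))
total = derive(lambda x, y: x + y, a, b)
# total.sources now names both original datasets.
```

Because the link to the source travels with the value, later work built on `total` cannot silently lose the attribution, which is the property the passage argues for.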
From page 46...
... Since the UPSIDE report was released in 2003, editors at two major life science journals, Science and Nature, have agreed in principle with the idea that publication entails a responsibility to make data freely available to the larger research community.21 Nevertheless, it remains to be seen how widely the UPSIDE principles will be adopted in practice. 20 The UPSIDE report contained five principles, but only three were judged relevant to the question of data sharing per se.
From page 47...
... In reality, when new biological structures, entities, and events have been uncovered in a particular biological context, they are often ... 22 Reprinted by permission from L.D. Stein, "Integrating Biological Databases," Nature Reviews Genetics 4(5)
From page 48...
... Stein, "Integrating Biological Databases," Nature Reviews Genetics 4(5)
From page 49...
... and profiles (e.g., pulmonary, cardiac, and psychological function tests, and cancer chemotherapeutic side effects); DNA sequence data, gene structure, and polymorphisms in sequence (and information to track haploid, diploid, or polyploid alleles, alternative splice sites, and polymorphisms observed as common variants)
From page 50...
... As a point of historical fact, most biological databases have been developed and maintained by individual research groups or research institutions. Initially, these databases were developed for individual use by these groups or institutions, and even when they proved to have value to the larger community, data management practices peculiar to those groups remained.
From page 51...
... To the maximum extent possible, the information contained in the database is intended to be machine-readable. The complete database is intended to enable researchers to:
· Query the database about complex relationships between molecules;
· View phenotype-altering mutations or functional domains in the context of protein structure;
· View or create de novo signaling pathways assembled from knowledge of interactions between molecules and the flow of information among the components of complex pathways;
· Evaluate or establish quantitative relationships among the components of complex pathways;
· View curated information about specific molecules of interest (e.g., names, synonyms, sequence information, biophysical properties, domain and motif information, protein family details, structure and gene data, the identities of orthologues and paralogues, BLAST results)
From page 52...
... -supported repository for single nucleotide polymorphisms and short deletion and insertion polymorphisms. These monitoring operations search for new information about the genes of interest to the various research groups associated with the Pharmacogenetics Research Network.
From page 53...
... , DNA sequence data, gene structure, and polymorphisms in sequence (and information to track haploid, diploid, or polyploid alleles; alternative splice sites; and polymorphisms observed as common variants), molecular and cellular phenotype data (e.g., enzyme kinetic measurements)
From page 54...
... For a new sequence that does not match any known sequence, gene prediction programs can be used to identify open reading frames, to translate DNA sequence into protein sequence, and to characterize promoter and regulatory sequence motifs. Gene prediction programs are also parameter-dependent, and the specifics of parameter settings must be retained if a future user is to make sense of the results stored in the database.
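As a rough illustration of the point about parameter dependence, the sketch below finds open reading frames on the forward strand, translates them with the standard genetic code, and stores the parameter settings alongside each result. It is a toy under stated assumptions, not a real gene prediction program: the names `find_orfs` and `OrfResult` and the `min_codons` parameter are hypothetical, only the three forward frames are scanned, and the input is assumed to contain only A, C, G, and T.

```python
from dataclasses import dataclass
from typing import Dict, List

# Standard genetic code, laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(dna: str) -> str:
    """Translate DNA (length a multiple of 3, ACGT only) into protein."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


@dataclass
class OrfResult:
    start: int                   # index of the A in the ATG start codon
    end: int                     # index just past the stop codon
    protein: str                 # translation, stop codon excluded
    parameters: Dict[str, int]   # settings that produced this result


def find_orfs(dna: str, min_codons: int = 3) -> List[OrfResult]:
    """Scan the three forward frames for ATG...stop runs that contain
    at least min_codons codons before the stop."""
    dna = dna.upper()
    params = {"min_codons": min_codons}
    results = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and CODON_TABLE.get(codon) == "*":
                if (i - start) // 3 >= min_codons:
                    results.append(OrfResult(start, i + 3,
                                             translate(dna[start:i]),
                                             dict(params)))
                start = None
    return results


orfs = find_orfs("CCATGAAATTTGGGTAACC", min_codons=3)
```

Raising `min_codons` to 5 makes the same sequence yield no open reading frames at all, which is why a stored result is hard to interpret unless the settings that produced it are kept with it.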
From page 55...
... Investigators use the Entrez retrieval system for cross-database searching of GenBank's collections of DNA, protein, and genome mapping sequence data, population sets, the NCBI taxonomy, protein structures from the Molecular Modeling Database (MMDB), and MEDLINE references (from the scientific literature)
From page 56...
... its absolute truth value)


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.