

4 Computational Tools
Pages 57-116



From page 57...
... Also, as algorithms for analyzing biological data have become more sophisticated and the capabilities of electronic computers have advanced, new kinds of inquiries and analyses have become possible. 4.1 THE ROLE OF COMPUTATIONAL TOOLS Today, biology (and related fields such as medicine and pharmaceutics)
From page 58...
... 4.2 TOOLS FOR DATA INTEGRATION2 As noted in Chapter 3, data integration is perhaps the most critical problem facing researchers as they approach biology in the 21st century. 2Sections 4.2.1, 4.2.4, 4.2.6, and 4.2.8 embed excerpts from S.Y.
From page 59...
... Transform the retrieved data into a common data model for data integration; 3. Provide a rich common data model for abstracting retrieved data and presenting integrated data objects to the end-user applications; 4.
From page 60...
... Legacy databases, which have been built around unique data definitions, are much less amenable to a standards-driven approach to data integration. Standards are indeed an essential element of efforts to achieve data integration of future datasets, but the adoption of standards is a nontrivial task.
From page 61...
... Cho, "Microarray Optimizations: Increasing Spot Accuracy and Automated Identification of True Microarray Signals," Nucleic Acids Research 30(12):e54, 2002, available at http://nar.oupjournals.org/cgi/content/full/30/12/e54; M
From page 62...
... Data federation often 6Reprinted by permission from L.D. Stein, "Integrating Biological Databases," Nature Reviews Genetics 4(5)
From page 63...
... It is clear that the size, shape, symmetry, folding pattern, and structural relationships of the systems in the human brain vary from individual to individual. This has been a source of considerable consternation and difficulty in research and clinical evaluations of the human brain from both the structural and the functional perspective.
From page 64...
... 10L.D. Stein, "Integrating Biological Databases," Nature Reviews Genetics 4(5)
From page 65...
... The common model for data derived from the underlying data sources is the responsibility of the mediator. This model must be sufficiently rich to accommodate various data formats of existing biological data sources, which may include unstructured text files, semistructured XML and HTML files, and structured relational, object-oriented, and nested complex data models.
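To make the idea concrete, the following is a minimal Python sketch of how a mediator might map records from heterogeneous sources (an unstructured flat text file, a semistructured XML fragment, and a relational row) into one common data model. All class names, field names, and example records are hypothetical illustrations, not the mediator systems discussed in the text.

import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class CommonRecord:
    """Common data model presented to end-user applications (hypothetical)."""
    source: str
    identifier: str
    description: str

def from_flat_text(line):
    # assume an unstructured source exporting "id<TAB>description" lines
    ident, desc = line.rstrip("\n").split("\t", 1)
    return CommonRecord("flat file", ident, desc)

def from_xml(fragment):
    # assume a semistructured <entry id="..."><desc>...</desc></entry> fragment
    node = ET.fromstring(fragment)
    return CommonRecord("xml", node.get("id"), node.findtext("desc", default=""))

def from_relational(row):
    # assume an (id, description) row fetched from a relational source
    return CommonRecord("relational", row[0], row[1])

records = [
    from_flat_text("P12345\tputative protein kinase"),
    from_xml('<entry id="G0042"><desc>cell-cycle regulator</desc></entry>'),
    from_relational(("locus_007", "hypothetical membrane transporter")),
]
for record in records:   # a uniform view, regardless of the underlying source format
    print(record)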
From page 66...
... that apply known interactions and causal relationships among proteins that regulate cell division to changes in an individual's DNA sequence, gene expression, and proteins in an individual tumor.18 The physician might use this information together with the BIS to support a decision on whether the inhibition of a particular protein kinase is likely to be useful for treating that particular tumor. Indeed, a major goal in the for-profit sector is to create richly annotated databases that can serve as testbeds for modeling pharmaceutical applications.
From page 67...
... 23R.J. Robbins, "Object Identity and Life Science Research," position paper submitted for the Semantic Web for Life Sciences Workshop, October 27-28, 2004, Cambridge, MA, available at http://lists.w3.org/Archives/Public/public-swls-ws/2004Sep/att-0050/position-01.pdf.
From page 68...
... Pellegrini-Toole, et al., "The EcoCyc Database," Nucleic Acids Research 30(1)
From page 69...
... A sample collection of ontology resources for controlled vocabulary purposes in the life sciences is listed in Table 4.1. 4.2.8.2 Ontologies for Automated Reasoning Today, it is standard practice to store biological data in databases; no one would deny that the volume of available data is far beyond the capabilities of human memory or written text.
From page 70...
... of a relationship that is defined only locally. An initiative coordinated by the World Wide Web Consortium seeks to explore how Semantic Web technologies can be used to reduce the barriers and costs associated with effective data integration, analysis, and collaboration in the life sciences research community, to enable disease understanding, and to accelerate the development of therapies.
From page 71...
... NBII (National Biological Information Infrastructure): NBII provides links to taxonomy sites for all biological disciplines.
From page 72...
... Although automated reasoning can potentially predict the response of a biological system to a particular stimulus, it is particularly useful for discovering inconsistencies or missing relations in the data, establishing global properties of networks, discovering predictive relationships between elements, and inferring or calculating the consequences of given causal relationships.33 As the number of discovered pathways and molecular networks increases and the questions of interest to researchers become more about global properties of organisms, automated reasoning will become increasingly useful. Symbolic representations of biological knowledge -- ontologies -- are a foundation for such efforts.
From page 73...
... Collado-Vides, S.M. Paley, A. Pellegrini-Toole, et al., "The EcoCyc Database," Nucleic Acids Research 30(1)
From page 74...
... Metadata makes it possible for data users to search, retrieve, and evaluate data set information from the NBII's vast network of biological databases by providing standardized descriptions of geospatial and biological data. A popular tool for the implementation of controlled metadata vocabularies is the Extensible Markup Language (XML)
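As a simple illustration of the idea, the sketch below writes a small metadata record as XML and rejects keywords that fall outside a controlled vocabulary. The element names and vocabulary terms are invented for illustration and do not reproduce the actual NBII or FGDC metadata schemas.

# Sketch: emitting a standardized metadata record with a controlled vocabulary.
# Element names and vocabulary terms below are illustrative only.
import xml.etree.ElementTree as ET

CONTROLLED_THEMES = {"amphibians", "wetlands", "invasive species"}  # hypothetical vocabulary

def make_metadata(title, theme, bbox):
    if theme not in CONTROLLED_THEMES:
        raise ValueError(f"theme '{theme}' is not in the controlled vocabulary")
    root = ET.Element("metadata")
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "themeKeyword").text = theme
    extent = ET.SubElement(root, "geographicExtent")
    for name, value in zip(("west", "east", "south", "north"), bbox):
        ET.SubElement(extent, name).text = str(value)
    return root

record = make_metadata("Frog call survey, 2003", "amphibians", (-93.5, -92.8, 44.6, 45.2))
print(ET.tostring(record, encoding="unicode"))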
From page 75...
... Next-generation annotation systems will have to be built in a highly modular and open fashion, so that they can accommodate new capabilities and new data types without anyone's having to rewrite the basic code. 4.2.10 A Case Study: The Cell Centered Database45 To illustrate the notions described above, it is helpful to consider an example of a database effort that implements many of them.
From page 76...
... The types of imaging data stored in the CCDB are quite heterogeneous, ranging from large-scale maps of protein distributions taken by confocal microscopy to three-dimensional reconstruction of individual cells, subcellular structures, and organelles. The CCDB can accommodate data from tissues and cultured cells regardless of tissue of origin, but because of the emphasis on the nervous system, the data model contains several features specialized for neural data.
From page 77...
... It is also desirable to exploit information in the database that is not explicitly represented in the schema.49 Thus, the CCDB project team is developing specific data types around certain classes of segmented objects contained in the CCDB. For example, the creation of a "surface data type" will enable users to query the original surface data directly.
From page 78...
... Ontologies for areas such as neurocytology and neurological disease are being built on top of the UMLS, utilizing existing concepts wherever possible and constructing new semantic networks and concepts as needed.54 In addition, imaging data in the CCDB are mapped to a higher level of brain organization by registering their location in the coordinate system of a standard brain atlas. Placing data into an atlas-based coordinate system provides one method by which data taken across scales and distributed across multiple resources can reliably be compared.55 Through the use of computer-based atlases and associated tools for warping and registration, it is possible to express the location of anatomical features or signals in terms of a standardized coordinate system.
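A minimal sketch of the coordinate-system idea: given an affine registration (the matrix and offset below are made-up numbers; in practice they are estimated by warping and registration software), the location of a labeled feature in a specimen image can be expressed in atlas coordinates.

# Sketch: mapping an image-space location into a standard atlas coordinate
# system with the affine part of a registration.  All numbers are illustrative.
import numpy as np

A = np.array([[ 0.50, 0.02, 0.00],      # hypothetical scale/rotation (voxels -> mm)
              [-0.02, 0.50, 0.00],
              [ 0.00, 0.00, 0.50]])
t = np.array([-20.0, -15.0, 5.0])        # hypothetical translation into atlas space

def to_atlas(voxel_xyz):
    """Express a voxel coordinate in atlas coordinates."""
    return A @ np.asarray(voxel_xyz, dtype=float) + t

print(to_atlas([120, 88, 40]))           # location of a labeled structure in atlas space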
From page 79...
... are emphasizing integrating database creation, curation, and sharing into the process of ecological science: for example, the NSF Biological Databases and Informatics program60 (which includes research into database algorithms and structures, as well as developing particular databases) and the Biological Research Collections program, which provides around $6 million per year for computerizing existing biological data.
From page 80...
... For issues of representing observations or collections, an important element is the Darwin Core, a set of XML metadata standards for describing a biological specimen, including observations in the wild and preserved items in natural history collections. Where ITIS attempts to improve communicability by achieving agreement on precise name usage, Darwin Core64 (and similar metadata efforts)
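For illustration, the sketch below emits a Darwin Core-style occurrence record for a preserved specimen. The term names are commonly used Darwin Core terms, but the namespace URI, the exact element set, and all record values here are assumptions to be checked against the published standard.

# Sketch of a Darwin Core-style occurrence record for a preserved specimen.
# Namespace and values below are assumptions, not taken from a real collection.
import xml.etree.ElementTree as ET

DWC = "http://rs.tdwg.org/dwc/terms/"    # assumed namespace; verify against the standard
ET.register_namespace("dwc", DWC)

occ = ET.Element(f"{{{DWC}}}Occurrence")
for term, value in {
    "scientificName": "Rana pipiens",
    "basisOfRecord": "PreservedSpecimen",
    "institutionCode": "EXAMPLE-MUSEUM",   # hypothetical collection
    "catalogNumber": "HERP-000123",        # hypothetical identifier
    "eventDate": "1998-06-14",
    "decimalLatitude": "44.97",
    "decimalLongitude": "-93.23",
}.items():
    ET.SubElement(occ, f"{{{DWC}}}{term}").text = value

print(ET.tostring(occ, encoding="unicode"))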
From page 81...
... GBIF accomplishes this query access through the use of data standards (such as the Darwin Core) and Web services, an information technology (IT)
From page 82...
... New ways to "see" interactions and associations are therefore needed in life sciences research. The most complex data visualizations are likely to be representations of networks.
From page 83...
... and result in physical molecular representations that one can hold in one's hand. These efforts have required the development and testing of software for the representation of physical molecular models to be built by autofabrication technologies, linkages between molecular descriptions and computer-aided design and manufacture approaches for enhancing the models with additional physical characteristics, and integration of the physical molecular models into augmented-reality interfaces as inputs to control computer display and interaction.
From page 84...
... 74K. Fukuda, et al., "Toward Information Extraction: Identifying Protein Names from Biological Papers," Pacific Symposium on Biocomputing 1998, 707-718.
From page 85...
... and B. Jacq, "Detecting Gene Symbols and Names in Biological Texts: A First Step Toward Pertinent Information Extraction," Genome Informatics 9:72-80, 1999.
From page 86...
... (2000) described two information extraction applications in biology based on templates: EMPathIE extracted from journal articles details of enzyme and metabolic pathways; PASTA extracted the roles of amino acids and active sites in protein molecules.
From page 87...
... . The digital abstraction includes much of the essential information of the system, without including complicating higher- and lower-order biochemical properties.81 The comparison of the state of the art in computational analysis of DNA sequences and protein sequences speaks in part to the enormous advantage that the digital string abstraction offers when appropriate.
From page 88...
... Graph theory has been applied profitably to the problem of identifying structural similarities among proteins.89 In this approach, a graph represents a protein, with each node representing a single amino acid residue and labeled with the type of residue, and edges representing either peptide bonds or close spatial proximity. Recent work in this area has combined graph theory, data mining, and information theoretic techniques to efficiently identify such similarities.90 87For more on the influence of DNA methylation on genetic regulation, see R
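The following toy sketch illustrates the graph representation described here: each residue becomes a labeled node, peptide bonds and close spatial contacts become edges, and two proteins are compared by the overlap of their labeled edges. It is a simplified illustration, not the published subgraph-mining or information-theoretic methods cited above; residues, coordinates, and the cutoff are made up.

# Sketch: protein structures as labeled graphs, compared by labeled-edge overlap.
from itertools import combinations
import math

def build_graph(residues, coords, contact_cutoff=6.0):
    """residues: one-letter codes; coords: (x, y, z) C-alpha positions (toy values)."""
    edges = set()
    for i in range(len(residues) - 1):                       # peptide-bond edges
        edges.add(frozenset([(i, residues[i]), (i + 1, residues[i + 1])]))
    for i, j in combinations(range(len(residues)), 2):       # spatial-proximity edges
        if math.dist(coords[i], coords[j]) <= contact_cutoff:
            edges.add(frozenset([(i, residues[i]), (j, residues[j])]))
    return edges

def edge_label(edge):
    """Reduce an edge to its unordered pair of residue types."""
    return frozenset(label for _, label in edge)

def similarity(edges_a, edges_b):
    labels_a = [edge_label(e) for e in edges_a]
    labels_b = [edge_label(e) for e in edges_b]
    shared = sum(min(labels_a.count(l), labels_b.count(l)) for l in set(labels_a))
    return shared / max(len(labels_a), len(labels_b))

# two tiny made-up fragments (residues and fake C-alpha coordinates)
frag1 = build_graph("GAVLK", [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (11.4, 0, 0), (11.4, 3.8, 0)])
frag2 = build_graph("GAVIK", [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (11.4, 0, 0), (15.2, 0, 0)])
print(similarity(frag1, frag2))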
From page 89...
... Algorithms are needed to search, sort, align, compare, contrast, and manipulate data related to a wide variety of biological problems and in support of models of biological processes on a variety of spatial and temporal scales. For example, in the language of automated learning and discovery, research is needed to develop algorithms for active and cumulative learning; multitask learning; learning from labeled and unlabeled data; relational learning; learning from large datasets; learning from small datasets; learning with prior knowledge; learning from mixed-media data; and learning causal relationships.92 The computational algorithms used for biological applications are likely to be rooted in mathematical and statistical techniques used widely for other purposes (e.g., Bayesian networks, graph theory, principal component analysis, hidden Markov models)
From page 90...
... A sequence similarity search program compares a query sequence (an uncharacterized sequence) of interest with already characterized sequences in a public sequence database (e.g., databases of The Institute for Genomic Research (TIGR)
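A toy illustration of the search step: the sketch below scores a query sequence against a small in-memory "database" by counting shared k-mers. Real programs such as BLAST use far more elaborate seeding, extension, and statistics; the sequences and names here are made up.

# Sketch: rank characterized sequences by shared k-mers with the query (toy data).
def kmers(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def search(query, database, k=3):
    """database: dict mapping sequence name -> characterized sequence."""
    q = kmers(query, k)
    hits = [(len(q & kmers(seq, k)), name) for name, seq in database.items()]
    return sorted(hits, reverse=True)    # best-scoring characterized sequences first

database = {                             # toy stand-in for a public sequence database
    "kinase_A": "MSTNPKPQRKTKRNTNRRPQDVKFPGG",
    "transporter_B": "MKLVINLAVLGLLSSAPAFA",
}
print(search("MSTNPKPQRKAKRNTN", database))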
From page 91...
... An example of a simple hidden Markov model for a compositional and signal search for a gene in a sequence sampled from a bacterial genome is shown in Figure 4.3. The model is first "trained" on sequences from the reference database and generates the probable frequencies of different nucleotides at any given position on the query sequence to estimate the likelihood that a sequence is in a different "state" (such as a coding region)
From page 92...
... If the combination having the highest overall probability exceeds a threshold determined using gene sequences in the reference database, the query sequence is concluded to be a gene. 4.4.5 Sequence Alignment and Evolutionary Relationships A remarkable degree of similarity exists among the genomes of living organisms.104 Information about the similarities and dissimilarities of different types of organisms presents a picture of relatedness between species (i.e., between reproductive groups)
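The sketch below implements a toy version of this idea: a two-state (coding vs. noncoding) hidden Markov model with Viterbi decoding over a nucleotide sequence. The transition and emission probabilities stand in for values that would be trained on a reference database; they are illustrative numbers, not real parameters, and the decision threshold step is omitted.

# Toy two-state HMM for gene finding, decoded with the Viterbi algorithm.
import math

states = ("noncoding", "coding")
trans = {"noncoding": {"noncoding": 0.95, "coding": 0.05},
         "coding":    {"noncoding": 0.05, "coding": 0.95}}
emit = {"noncoding": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
        "coding":    {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20}}
start = {"noncoding": 0.9, "coding": 0.1}

def viterbi(seq):
    """Return the most probable state path and its log-probability."""
    v = [{s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}]
    back = []
    for base in seq[1:]:
        row, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: v[-1][p] + math.log(trans[p][s]))
            row[s] = v[-1][best_prev] + math.log(trans[best_prev][s]) + math.log(emit[s][base])
            ptr[s] = best_prev
        v.append(row)
        back.append(ptr)
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[-1][last]

path, logp = viterbi("ATGCGCGGCGGCTAAATATAT")
print(logp, "".join("C" if s == "coding" else "." for s in path))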
From page 93...
... One type of molecular phylogenetic tree, for example, might represent the amino acid sequence of a protein found in several different species. The tree is created by aligning the amino acid sequences of the protein in question from different species, determining the extent of differences between them (e.g., insertions, deletions, or substitutions of amino acids)
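As a toy illustration of this procedure, the sketch below computes pairwise difference fractions between pre-aligned amino acid sequences and repeatedly joins the closest pair (a naive UPGMA-style grouping). Real phylogenetic methods use substitution models and more careful tree inference; the sequences and species assignments are invented.

# Sketch: crude phylogenetic grouping from aligned amino acid sequences.
aligned = {                      # toy pre-aligned sequences of one protein
    "human": "MKT-LLVAA",
    "mouse": "MKT-LLVSA",
    "fly":   "MRSALLVTA",
    "yeast": "MRSALIVTA",
}

def p_distance(a, b):
    """Fraction of aligned positions (ignoring shared gaps) that differ."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    return sum(x != y for x, y in pairs) / len(pairs)

def upgma_like(seqs):
    clusters = {name: (name,) for name in seqs}
    dist = {(a, b): p_distance(seqs[a], seqs[b]) for a in seqs for b in seqs if a < b}
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)                 # join the closest pair
        merged = f"({a},{b})"
        members = clusters.pop(a) + clusters.pop(b)
        new_dist = {}
        for c in clusters:                             # naive average linkage
            d = sum(p_distance(seqs[x], seqs[y]) for x in members for y in clusters[c])
            new_dist[tuple(sorted((merged, c)))] = d / (len(members) * len(clusters[c]))
        clusters[merged] = members
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new_dist)
    return next(iter(clusters))

print(upgma_like(aligned))       # prints a nested grouping such as ((fly,yeast),(human,mouse))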
From page 94...
... Another problem is the tendency of highly divergent sequences to group together when being compared regardless of their true relationships. This occurs because of a background noise problem -- with only a limited number of possible sequence letters (20 in the case of amino acid sequences)
From page 95...
... There is an increasing recognition of the importance of genetic variation for medicine and developmental biology and for understanding the early demographic history of humans.108 In particular, variation in the human genome sequence is believed to play a powerful role in the origins of and prognoses for common medical conditions.109 The total number of unique mutations that might exist collectively in the entire human population is not known definitively and has been estimated at upward of 10 million,110 which in a 3 billion basepair genome corresponds to a variant every 300 bases or less. Included in these are single-nucleotide polymorphisms (SNPs)
From page 96...
... Gut, "Molecular Haplotyping at High Throughput," Nucleic Acids Research 30(19)
From page 97...
... are the discrimination of genes with significant changes in expression relative to the presence of a disease, drug regimen, or chemical or hormonal exposure. To illustrate the power of large-scale analysis of gene data, an article in Science by Gaudet and Mango is instructive.120 A comparison of microarray data taken from Caenorhabditis elegans embryos lacking a pharynx with microarray data from embryos having excess pharyngeal tissue identified 240 genes that were preferentially expressed in the pharynx, and further identified a single gene as directly regulating almost all of the pharynx-specific genes that were examined in detail.
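A minimal sketch of this kind of discrimination: flag genes whose expression differs between two conditions using a two-sample t-test over toy replicate values (SciPy). A real analysis would also normalize across arrays and correct for multiple testing; the gene names and numbers below are made up.

# Sketch: screen for genes with significant expression changes between conditions.
from scipy import stats

expression = {   # gene -> (replicates in condition A, replicates in condition B); toy values
    "gene_A":       ([8.1, 7.9, 8.3], [2.1, 2.4, 1.9]),
    "housekeeping": ([5.0, 5.2, 4.9], [5.1, 4.8, 5.0]),
    "gene_B":       ([6.6, 6.9, 7.1], [3.0, 3.3, 2.8]),
}

ALPHA = 0.01
for gene, (cond_a, cond_b) in expression.items():
    t, p = stats.ttest_ind(cond_a, cond_b)
    call = "DIFFERENTIAL" if p < ALPHA else "unchanged"
    print(f"{gene:12s}  t={t:6.2f}  p={p:.4f}  {call}")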
From page 98...
... Churchill, "Analysis of Variance for Gene Expression Microarray Data," Journal of Computational Biology 7(6)
From page 99...
... While many analyses of microarray data consider a single snapshot in time, expression levels of course vary over time, especially due to the cellular life cycle. A challenge in analyzing microarray time-series data is that cell cycles may be unsynchronized, making it difficult to correctly identify correlations between data samples that have similar expression behavior.
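One simple (and admittedly naive) way to illustrate handling such phase differences is to take the best correlation over a range of circular time shifts, as in the sketch below; the expression profiles are toy data, not a method advocated in the text.

# Sketch: best Pearson correlation between two profiles over circular time shifts.
import numpy as np

def best_shifted_correlation(x, y, max_shift):
    x, y = np.asarray(x, float), np.asarray(y, float)
    best = (-1.0, 0)
    for s in range(-max_shift, max_shift + 1):
        r = np.corrcoef(x, np.roll(y, s))[0, 1]
        if r > best[0]:
            best = (r, s)
    return best   # (correlation, shift in samples)

gene1 = [1, 3, 7, 3, 1, 3, 7, 3]      # toy periodic expression profile
gene2 = [7, 3, 1, 3, 7, 3, 1, 3]      # same cycle, shifted phase
print(best_shifted_correlation(gene1, gene2, max_shift=3))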
From page 100...
... Bayesian analysis allows one to make inferences about the possible structure of a genetic regulatory pathway on the basis of microarray data, but even advocates of such analysis recognize the need for experimental test. One work goes so far as to suggest that it is possible that automated processing of microarray data can suggest interesting experiments that will shed light on causal relationships, even if the existing data themselves don't support causal inferences.129 4.4.8 Data Mining and Discovery 4.4.8.1 The First Known Biological Discovery from Mining Databases130 By the early 1970s, the simian sarcoma virus had been determined to cause cancer in certain species of monkeys.
From page 101...
... Vinayaka, Z.Z. Hu, et al., "PIRSF Family Classification System at the Protein Information Resource," Nucleic Acids Research 32(Database issue)
From page 102...
... . Many prokaryotic ATRs are predicted to be required for EC 4.2.1.28 based on the genome context of the corresponding genes.
From page 103...
... , which is soluble when it folds properly, becomes insoluble when one of the intermediates along its folding pathway misfolds and forms an aggregation that damages nerve cells.136 Due to the importance of the functional conformation of proteins, many efforts have been made to predict computationally the three-dimensional structure of a protein from its amino acid sequence. Although experimental determination of protein structure based on X-ray crystallography and nuclear magnetic resonance yields protein structures in high resolution, it is slow, labor-intensive, and expensive and thus not appropriate for large-scale determination.
From page 104...
... A number of tools for protein structure prediction have been developed, and progress in prediction by these methods has been evaluated by the Critical Assessment of Protein Structure Prediction (CASP) experiment held every two years since 1994.137 In a CASP experiment, the amino acid sequences of proteins whose experimentally determined structures have not yet been released are published, and computational research groups are then invited to predict structures of these target sequences using their methods and any other publicly available information (e.g., known structures that exist in the Protein Data Bank (PDB)
From page 105...
... At 30 percent sequence identity, the fraction of incorrectly aligned residues is about 20 percent, and the number rises sharply with further decreases in sequence similarity. This limits the usefulness of comparative modeling.138 If no template structure (or fold)
From page 106...
... It is fair to say that computation will play an important role in the success of mass spectrometry as the tool of choice for proteomics. Mass spectrometry is also coming into its own for protein expression studies.
From page 107...
... However, the pattern of such values across all 60 cell lines can provide insight into the mechanisms of drug action and drug resistance. Combined with molecular structure data, these activity patterns can be used to explore the NCI database of 460,000 compounds for growth-inhibiting effects in these cell lines, and can also provide insight into potential target molecules and modulators of activity in the 60 cell lines.
From page 108...
... In addition to the targets assessed one at a time, others have been measured en masse as part of a protein expression database generated for the 60 cell lines by 2D polyacrylamide gel electrophoresis. Each compound displays a unique "fingerprint" pattern, defined by a point in the 60D space (one dimension for each cell line)
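A small sketch of this pattern comparison: each compound's activity profile across the 60 cell lines is treated as a 60-dimensional vector, and compounds are compared by Pearson correlation of their fingerprints. The compound names and activity values below are random stand-ins, not NCI data.

# Sketch: compare compound "fingerprints" (points in 60-dimensional space) by correlation.
import numpy as np

rng = np.random.default_rng(0)
fingerprints = {                         # compound -> toy activity across 60 cell lines
    "compound_X": rng.normal(size=60),
    "compound_Y": rng.normal(size=60),
}
# a hypothetical analog of compound_X, with a slightly perturbed fingerprint
fingerprints["compound_X_analog"] = fingerprints["compound_X"] + rng.normal(scale=0.2, size=60)

names = list(fingerprints)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = np.corrcoef(fingerprints[a], fingerprints[b])[0, 1]
        print(f"{a} vs {b}: r = {r:+.2f}")   # similar fingerprints suggest similar mechanisms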
From page 109...
... As noted by Swedlow et al., this problem requires the following: a segmentation algorithm to find the vesicles and to produce a list of centroids, volumes, signal intensities, and so on; a tracker to define trajectories by linking centroids at different time points according to a predetermined set of rules; and a viewer to display the analytic results overlaid on the original movie.2 OME provides a mechanism for linking together various analytical modules by specifying data semantics that enable the output of one module to be accepted as input to another. These semantic data types of OME describe analytic results such as "centroid," "trajectory," and "maximum signal," and allow users, rather than a predefined standard, to define such concepts operationally, including in the machine-readable definition the processing steps that produce it (e.g., the algorithm and the various parameter settings used)
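The sketch below illustrates the module-linkage idea only in spirit: each analysis module declares the semantic type it consumes and produces, and a pipeline runner checks that each step's output type matches the next step's input type. The type names, modules, and functions are hypothetical and are not the actual OME interfaces.

# Sketch: chaining analysis modules by declared semantic input/output types.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Module:
    name: str
    consumes: str        # semantic type of the input ("Image", "CentroidList", ...)
    produces: str        # semantic type of the output
    run: Callable

def segment(image):      # toy segmenter: returns centroids of detected "vesicles"
    return [(10, 12), (40, 41), (42, 40)]

def track(centroids):    # toy tracker: links centroids into a single trajectory
    return [centroids]

pipeline: List[Module] = [
    Module("segmenter", consumes="Image", produces="CentroidList", run=segment),
    Module("tracker", consumes="CentroidList", produces="TrajectoryList", run=track),
]

def execute(pipeline, data, data_type="Image"):
    for mod in pipeline:
        if mod.consumes != data_type:          # semantic type check between modules
            raise TypeError(f"{mod.name} expects {mod.consumes}, got {data_type}")
        data, data_type = mod.run(data), mod.produces
    return data

print(execute(pipeline, data="movie_frame_0"))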
From page 110...
... The resulting volume rendition permits direct inspection of internal structures, without a precomputed segmentation or surface extraction step, through the use of multidimensional transfer functions. As seen in the visualizations in Figure 4.6, the resolution of the CT scan allows subtleties such as the definition of the cochlea, the modiolus, the implanted electrode array, and the lead wires that connect the array to a head-mounted connector.
From page 111...
... FIGURE 4.5 Visualizations of mutant (left) and normal (right)
From page 112...
... The resolution of the scan allows definition of the shanks and tips of the implanted electrode array. Volumetric image processing was used to isolate the electrode array from the surrounding tissue, highlighting the structural relationship between the implant and the bone.
From page 113...
... .150 Launched in 2002, the CCDB contains structural and protein distribution information derived from confocal, multiphoton, and electron microscopy for use by the structural biology and neuroscience communities. In the case of neurological images, most of the imaging data are referenced to a higher level of brain organization by registering their location in the coordinate system of a standard brain atlas.
From page 114...
... . Finally, a standard coordinate system allows the same brain region to be sampled repeatedly so that data can be accumulated over time.
From page 115...
... · General. The program should accept a wide selection of data types, including common formats, units, precisions, ranges, and file sizes.

