SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 76

While the choice of δ for a multi-alignment scoring scheme is conceptually a function of K arguments, δ is often defined in terms of an underlying pairwise scoring function δ'. For example, the sum-of-pairs score is defined as δ(a1, a2, . . ., aK) = Σ i<j δ'(ai, aj), where one must let δ'(−, −) = 0. In essence, the sum-of-pairs multi-alignment score is the sum of the scores of the K(K − 1)/2 pairwise alignments it induces. Another common scheme is the consensus score, which defines δ(a1, a2, . . ., aK) as max/min{Σ i δ'(c, ai)}, where the maximum or minimum is taken over every symbol c of the alphabet together with the gap symbol −. The symbol c that gives the best score is said to be the consensus symbol for the column, and the concatenation of these symbols is the consensus sequence. In effect, the consensus multi-alignment score is the sum of the scores of the K pairwise alignments of the sequences versus the consensus. The example of Figure 3.8 is such a scoring scheme, where δ' is the scoring scheme of Figure 3.1.

While we do not show it here, the problem of determining minimal phylogenies mentioned at the start of this subsection can also be modeled as an instance of a multiple sequence alignment problem by choosing a δ for columns that suitably encodes the tree relating the sequences (Sankoff, 1975). However, the more general phylogeny problem requires that one also determine the tree that produces the minimal score. This daunting task essentially requires the exploration of the space of all possible trees with K vertices. So in practice, evolutionary biologists have put a great deal of effort into designing heuristic algorithms for the phylogeny problem, and there is much debate about which of these is best.

K-Best Alignments

The alignment algorithm in the section "The Basic Dynamic Programming Algorithm" above reports an optimal alignment that is clearly a function of the choice of scoring scheme.
Unfortunately, biologists have not yet ascertained which scoring schemes are "correct" for a given comparison domain. This uncertainty has suggested the problem of listing all alignments near the optimum in the hope of generating the biologically correct alignment.
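To make concrete how a column score depends on the underlying pairwise scheme, the sum-of-pairs and consensus column scores defined before this subsection can be sketched in Python as follows. This is a minimal illustration, not the chapter's own code: the unit-cost function pair_cost stands in for δ' (it is not the scheme of Figure 3.1), and the names GAP, sum_of_pairs, and consensus are hypothetical.

```python
from itertools import combinations

GAP = "-"  # assumed gap symbol


def pair_cost(a, b):
    """Hypothetical pairwise cost δ': 0 for equal symbols (so δ'(-,-) = 0), else 1."""
    return 0 if a == b else 1


def sum_of_pairs(column, delta=pair_cost):
    """Sum δ' over all K(K-1)/2 unordered pairs of symbols in one column."""
    return sum(delta(a, b) for a, b in combinations(column, 2))


def consensus(column, alphabet, delta=pair_cost, best=min):
    """Return (score, symbol) for the best consensus symbol of one column.

    Candidates are every alphabet symbol plus the gap symbol. Pass best=max
    for a similarity scheme, best=min for a difference (cost) scheme.
    """
    candidates = list(alphabet) + [GAP]
    scored = [(sum(delta(c, a) for a in column), c) for c in candidates]
    return best(scored, key=lambda pair: pair[0])


column = ["A", "A", "C", GAP]
print(sum_of_pairs(column))       # sum over the 6 induced pairs
print(consensus(column, "ACGT"))  # (score, consensus symbol)
```

Swapping pair_cost for a different δ' changes both column scores, which is the sense in which the optimal alignment depends on the choice of scoring scheme.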
From the point of view of the edit graph formulation, the K-best problem is to deliver the K-best shortest source-to-sink paths, a problem much studied in the operations research literature. Indeed, there is an O(MN + KN) time and space algorithm, immediately available from this literature (Fox, 1973), that delivers the K-best paths over an edit graph. The algorithm delivers these paths/alignments in order of score, and K does not need to be known a priori: the next best alignment is available in O(N) time. The essential idea of the algorithm is to keep, at each vertex v, an ordered list of the score of the next best path to the sink through each edge out of v. The next best path is traceable using these ordered lists and is extracted, and the lists are appropriately updated.

If all one desires is an enumeration, not necessarily in order of score, of all alignments that are within ε of the optimal difference D(A, B), then a simpler method is available that requires only the matrix S of the dynamic programming computation. While not any faster in time, the simpler alternative below does require only O(MN) space. One can imagine tracing back all paths from the sink to the source in a recursive fashion. The essential idea of the algorithm is to limit the traceback to only those paths of score not greater than D(A, B) + ε. Suppose one reaches vertex (i, j) and the score of the path thus far traversed from the sink to this vertex is T(i, j). Then one traces back to predecessor vertices (i − 1, j), (i − 1, j − 1), and (i, j − 1) if and only if, respectively:

S(i − 1, j) + δ(ai, −) + T(i, j) ≤ D(A, B) + ε,
S(i − 1, j − 1) + δ(ai, bj) + T(i, j) ≤ D(A, B) + ε,
S(i, j − 1) + δ(−, bj) + T(i, j) ≤ D(A, B) + ε.

This procedure is very simple, space economical, and quite fast.
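The bounded traceback just described can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: delta is a hypothetical unit-cost scheme (0 for a match, 1 for a mismatch or gap), S is the standard difference matrix with S[i][j] the optimal cost of aligning the first i symbols of A with the first j symbols of B, and the function names are invented for the example.

```python
GAP = "-"  # assumed gap symbol; sequences must not contain it


def delta(a, b):
    """Hypothetical unit-cost scheme: 0 for a match, 1 for a mismatch or gap."""
    return 0 if a == b else 1


def dp_matrix(A, B):
    """Fill S, where S[i][j] is the minimum cost of aligning A[:i] with B[:j]."""
    M, N = len(A), len(B)
    S = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        S[i][0] = S[i - 1][0] + delta(A[i - 1], GAP)
    for j in range(1, N + 1):
        S[0][j] = S[0][j - 1] + delta(GAP, B[j - 1])
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            S[i][j] = min(
                S[i - 1][j] + delta(A[i - 1], GAP),
                S[i - 1][j - 1] + delta(A[i - 1], B[j - 1]),
                S[i][j - 1] + delta(GAP, B[j - 1]),
            )
    return S


def near_optimal(A, B, eps):
    """Yield (cost, row_A, row_B) for every alignment of cost <= D(A, B) + eps."""
    S = dp_matrix(A, B)
    bound = S[len(A)][len(B)] + eps  # D(A, B) + eps

    def trace(i, j, T, top, bottom):
        # T is the cost of the path traversed so far from the sink to (i, j).
        if i == 0 and j == 0:
            yield T, "".join(reversed(top)), "".join(reversed(bottom))
            return
        if i > 0:  # predecessor (i - 1, j): a_i aligned with a gap
            c = delta(A[i - 1], GAP)
            if S[i - 1][j] + c + T <= bound:
                yield from trace(i - 1, j, T + c, top + [A[i - 1]], bottom + [GAP])
        if i > 0 and j > 0:  # predecessor (i - 1, j - 1): a_i aligned with b_j
            c = delta(A[i - 1], B[j - 1])
            if S[i - 1][j - 1] + c + T <= bound:
                yield from trace(i - 1, j - 1, T + c, top + [A[i - 1]], bottom + [B[j - 1]])
        if j > 0:  # predecessor (i, j - 1): gap aligned with b_j
            c = delta(GAP, B[j - 1])
            if S[i][j - 1] + c + T <= bound:
                yield from trace(i, j - 1, T + c, top + [GAP], bottom + [B[j - 1]])

    yield from trace(len(A), len(B), 0, [], [])


for cost, row_a, row_b in near_optimal("AC", "A", 1):
    print(cost, row_a, row_b)
```

Because each test compares a lower bound on the cost of any completed path through that edge against D(A, B) + ε, every branch that is followed leads to at least one reported alignment, so no work is wasted on dead ends. The recursion depth is at most M + N, so for long sequences an explicit stack would be needed in place of Python recursion.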
A classic example of the need for affine gap costs was presented in a paper by Smith and Fitch (1983) comparing the α and β chicken hemoglobin chains. For a setting of the gap costs that gave the biologically correct alignment, there were 17 optimal alignments, 1,317 alignments within 5 percent of the optimum, and 20,137,655 within 20 percent of the optimum. This kind of exponential growth suggests that perhaps rather than list alignments, one should report the best possible