001 /*
002 * CrochemoreLandauZivUkelson.java
003 *
004 * Copyright 2003 Sergio Anibal de Carvalho Junior
005 *
006 * This file is part of NeoBio.
007 *
008 * NeoBio is free software; you can redistribute it and/or modify it under the terms of
009 * the GNU General Public License as published by the Free Software Foundation; either
010 * version 2 of the License, or (at your option) any later version.
011 *
012 * NeoBio is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
013 * without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
014 * PURPOSE. See the GNU General Public License for more details.
015 *
016 * You should have received a copy of the GNU General Public License along with NeoBio;
017 * if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330,
018 * Boston, MA 02111-1307, USA.
019 *
020 * Proper attribution of the author as the source of the software would be appreciated.
021 *
022 * Sergio Anibal de Carvalho Junior mailto:sergioanibaljr@users.sourceforge.net
023 * Department of Computer Science http://www.dcs.kcl.ac.uk
024 * King's College London, UK http://www.kcl.ac.uk
025 *
026 * Please visit http://neobio.sourceforge.net
027 *
028 * This project was supervised by Professor Maxime Crochemore.
029 *
030 */
031
032 package neobio.alignment;
033
034 import java.io.Reader;
035 import java.io.IOException;
036
037 /**
038 * This abstract class is the superclass of both global and local sequence alignment
039 * algorithms (with linear gap penalty function) due to Maxime Crochemore, Gad Landau and
040 * Michal Ziv-Ukelson (2002).
041 *
042 * <P>This implementation derives from the paper of M.Crochemore, G.Landau and
043 * M.Ziv-Ukelson, <I>A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Scoring
044 * Matrices</I> (available here as
045 * <A HREF="doc-files/Crochemore_Landau_Ziv-Ukelson_algorithm.pdf">PDF</A> or
046 * <A HREF="doc-files/Crochemore_Landau_Ziv-Ukelson_algorithm.pdf">Postscript</A>).</P>
047 *
048 * <P>It employs Lempel-Ziv compression techniques to speed up the classic dynamic
049 * programmin approach to sequence alignment (see {@linkplain NeedlemanWunsch} and
050 * {@linkplain SmithWaterman} classes). It reduces the time and space complexity from
051 * O(n<SUP>2</SUP>) to O(n<SUP>2</SUP>/log n). In fact, the complexity is actually O(h
052 * n<SUP>2</SUP>/log n), where 0 <= h <= 1 is a real number denoting the entropy of the
053 * text (a measure of how compressible it is). This means that, the more compressible a
054 * sequence is, the less memory the algorithm will require, and the faster it will
055 * run.</P>
056 *
057 * <P>The idea behind this improvement is to identify repetitions in the sequences and
058 * reuse the computation of their alignments. The first step is, therefore, to parse the
059 * sequences into LZ78-like factors. LZ78 is a popular compression algorithm of the
060 * Lempel-Ziv familiy due to J.Ziv and A.Lempel (1978). This factorisation is accomplished
061 * by the {@linkplain FactorSequence} class (for more information about this
062 * factorisation, please refer to the specification of that class) that builds a
063 * doubly-linked list of factors. Each factor is an instance of the {@linkplain Factor}
064 * class (refer to the specification of this class for more information).</P>
065 *
066 * <P>Once the sequences have been parsed, the algorithm builds a matrix of blocks, called
067 * block table, that is vaguely similar to the dynamic programming matrix used by both
068 * <CODE>NeedlemanWunsch</CODE> and <CODE>SmithWaterman</CODE>. Each block contains an
069 * instance of an {@linkplain AlignmentBlock} (please refer to its specification for more
070 * information on what information is stored) and represents the alignment beteween one
071 * factor of each sequence. This block table is, in fact, a partition of the alignment
072 * graph.</P>
073 *
074 * <P>Consider a block B which corresponds to the alignment of factor F1 = xa from
075 * sequence S1 and factor F2 = yb from sequence S2. Here, F1 extends a previous factor of
076 * S1 with character a, while F2 extends a previous factor of S2 with character b. We can
077 * define the input border of B as the set of values at the left and top borders of block
078 * B, and the output border as the set of values at the right and bottom borders of B.
079 * Moreover, we can define the following prefix blocks of B:</P>
080 *
081 * <UL>
082 * <LI><B>Left prefix</B> - is the block that contains the alignment of factor F1 with
083 * a factor F2' = y (a prefix of factor F2).
084 * <LI><B>Diagonal prefix</B> - is the block that contains the alignment of factor F1' = x
085 * (a prefix of factor F1) with a factor F2' = y (a prefix of factor F2).
086 * <LI><B>Top prefix</B> - is the block that contains the alignment of factor F1' = x (a
087 * prefix of factor F1) with factor F2.
088 * </UL>
089 *
090 * <P>Note that each factor has a pointer to its prefix factor, called ancestor (see the
091 * specification of the <CODE>Factor</CODE> class). This pointer makes it easy to retrieve
092 * any of the three prefix blocks of B in constant time.</P>
093 *
094 * <P>Rather than computing each value of the alignment block B, the algorithm will only
095 * compute the values on its input and output borders. This is precisely what makes it
096 * more efficient.</P>
097 *
098 * <P>In this class there is a general specification of how the block table is computed
099 * (see the {@link #computeBlockTable computeBlockTable} method for details). The actual
100 * method depends on the subclasses. In general, there are two phases:</P>
101 *
102 * <UL>
103 * <LI><B>Encoding</B> - the structure of a block B is studied and represented in an
104 * efficient way by computing weights of optimal paths connecting each entry of its input
105 * border to each entry of its output border. This information is encoded in a DIST matrix
106 * where DIST[i,j] stores the weight of an optimal paths connecting entry i of the input
107 * border to entry j the output border of B.
108 * <LI><B>Propagation</B> - the input border of a block B is retrieved (from the left and
109 * top blocks of B) and its output border is computed with the help of the DIST matrix.
110 * </UL>
111 *
112 * <P>In fact, for each block, only one column of the DIST matrix needs to be computed,
113 * all other columns are actually retrieved from its prefix blocks. This is precisely what
114 * is accomplished by the {@link #assembleDistMatrix assembleDistMatrix} method found in
115 * this class (it is general enough for both global and local alignment versions of the
116 * algorithm.</P>
117 *
118 * <P>From the DIST matrix, we obtain the OUT matrix defined as
119 * <CODE>OUT[i,j] = I[i] + DIST[i,j]</CODE> where I is the input border array. This means
120 * that the OUT matrix is the DIST matrix updated by the input border of a block. The
121 * output border is then computed from the OUT matrix by taking the maximum value of each
122 * column. This class also have a general method for assembling the input border (see
123 * {@link #assembleInputBorder assembleInputBorder}</P>
124 *
125 * <P>The OUT matrix is encoded by the {@linkplain OutMatrix} class that takes as
126 * both a DIST matrix and an input border array. Note that <B>it does not compute the OUT
127 * matrix</B>, it just stores the necessary information to retrieve a value at any
128 * position of the matrix.</P>
129 *
130 * <P>A naive approach to compute the output border of a block from the OUT matrix of size
131 * n x n would take a time proportional to n<SUP>2</SUP>. However, it happens that, due to
132 * the nature of the DIST matrix, both DIST and OUT matrices are Monge arrays, which
133 * implies that they are also <I>totally monotone</I>. This property allows the
134 * computation of the output border of B in linear time with the SMAWK algorithm (see the
135 * specification of the {@linkplain Smawk} class for more information on SMAWK).</P>
136 *
137 * <P>This class contains a general specification that is pertinent to both global and
138 * local versions of the algorithm. For more information on each version of, please refer
139 * to the appropriate subclass.</P>
140 *
141 * <P><B>A note about the performance of these algorithms.</B> Although theoretical
142 * results suggest that these algorithms are faster and consume less memory than the
143 * classical methods, in practice it is hard to realise their performance gains.
144 *
145 * <P>These algorithms are extremely complex and require the storage of many extra
146 * pointers and other auxiliary data for each block (see the <CODE>AlignmentBlock</CODE>
147 * class for more details). Hence, even though the space requirement is
148 * O(n<SUP>2</SUP>/log n), which is less than O(n<SUP>2</SUP>), in practice, for most of
149 * the cases these algorithms will take more time and memory space than their clasical
150 * counterparts (we have to keep in mind that the Big Oh notation ignores all constants
151 * involved).</P>
152 *
153 * <P>Therefore, in order to realise the full power of these algorithms, they have to be
154 * used with extremly large and redundant sequences. This will allow a proper
155 * reutilisation of the computations and, maybe, provide an improvement in terms of space
156 * and run time. For instance, it is easy to devise such a sequence if we use a
157 * one-character alphabet because, in this case, a sequence is factorised into a series
158 * of factors that are a prefix of the next.</P>
159 *
160 * @author Sergio A. de Carvalho Jr.
161 * @see CrochemoreLandauZivUkelsonGlobalAlignment
162 * @see CrochemoreLandauZivUkelsonLocalAlignment
163 * @see NeedlemanWunsch
164 * @see SmithWaterman
165 * @see FactorSequence
166 * @see AlignmentBlock
167 * @see OutMatrix
168 * @see Smawk
169 * @see #computeBlockTable
170 * @see #assembleDistMatrix
171 */
172 public abstract class CrochemoreLandauZivUkelson extends PairwiseAlignmentAlgorithm
173 {
174 /**
175 * A constant that indicates that the source of an optimal path has been reached in a
176 * block and that the trace back procedure to retrieve a high scoring alignment can
177 * stop.
178 */
179 protected static final byte STOP_DIRECTION = 0;
180
181 /**
182 * A constant that indicates that the left direction must be followed to reach the
183 * source of an optimal path in a block during the trace back procedure to retrieve a
184 * high scoring alignment.
185 */
186 protected static final byte LEFT_DIRECTION = 1;
187
188 /**
189 * A constant that indicates that the diagonal direction must be followed to reach the
190 * source of an optimal path in a block during the trace back procedure to retrieve a
191 * high scoring alignment.
192 */
193 protected static final byte DIAGONAL_DIRECTION = 2;
194
195 /**
196 * A constant that indicates that the top direction must be followed to reach the
197 * source of an optimal path in a block during the trace back procedure to retrieve a
198 * high scoring alignment.
199 */
200 protected static final byte TOP_DIRECTION = 3;
201
202 /**
203 * The first factorised sequence being aligned.
204 */
205 protected FactorSequence seq1;
206
207 /**
208 * The second factorised sequence being aligned.
209 */
210 protected FactorSequence seq2;
211
212 /**
213 * The block table, which is a matrix of alignment blocks where each block represents
214 * the alignment between one factor of each sequence.
215 */
216 protected AlignmentBlock[][] block_table;
217
218 /**
219 * Number of rows of the block table. It is determined by the number of factors of the
220 * first sequence.
221 */
222 protected int num_rows;
223
224 /**
225 * Number of columns of the block table. It is determined by the number of factors of
226 * the second sequence.
227 */
228 protected int num_cols;
229
230 /**
231 * An instance of the <CODE>Smawk</CODE> class that implements the SMAWK algorithm to
232 * compute the column maxima of a totally monotone matrix. It is used to speed up the
233 * computation of the output border of a block.
234 */
235 protected Smawk smawk = new Smawk();
236
237 /**
238 * An instance of the <CODE>OutMatrix</CODE> class that encodes the OUT matrix of a
239 * block when supplied with the DIST matrix and the input border array of a block.
240 * Note that it does not compute the OUT matrix itselft, it just stores the necessary
241 * information to retrieve a value at any position of the matrix.
242 *
243 * <P>This object is then used to compute the output border of a block with the
244 * <CODE>Smawk</CODE> class. Note that the <CODE>OutMatrix</CODE> class implements the
245 * <CODE>Matrix</CODE> interface as required by the <CODE>Smawk</CODE> class.
246 *
247 * @see Matrix
248 * @see Smawk
249 */
250 protected OutMatrix out_matrix = new OutMatrix ();
251
252 /**
253 * Loads sequences into <CODE>FactorSequence</CODE> instances. In case of any error,
254 * an exception is raised by the constructor of <CODE>FactorSequence</CODE> (please
255 * check the specification of that class for specific requirements).
256 *
257 * <P>A <CODE>FactorSequence</CODE> is an LZ78-like factorisation of the sequences
258 * being aligned.
259 *
260 * @param input1 Input for first sequence
261 * @param input2 Input for second sequence
262 * @throws IOException If an I/O error occurs when reading the sequences
263 * @throws InvalidSequenceException If the sequences are not valid
264 * @see FactorSequence
265 */
266 protected void loadSequencesInternal (Reader input1, Reader input2)
267 throws IOException, InvalidSequenceException
268 {
269 // load sequences into instances of CharSequence
270 this.seq1 = new FactorSequence(input1);
271 this.seq2 = new FactorSequence(input2);
272
273 // determine the block table's dimensions
274 this.num_rows = seq1.numFactors();
275 this.num_cols = seq2.numFactors();
276 }
277
278 /**
279 * Frees pointers to loaded sequences and the the block table so that their data can
280 * be garbage collected.
281 */
282 protected void unloadSequencesInternal ()
283 {
284 this.seq1 = null;
285 this.seq2 = null;
286 this.block_table = null;
287 }
288
289 /**
290 * Computes the block table (the result depends on subclasses, see
291 * <CODE>computeBlockTable</CODE> for details) and requests subclasses to retrieve an
292 * optimal alignment between the loaded sequences. The actual product depends on the
293 * subclasses which can produce a global (see
294 * <CODE>CrochemoreLandauZivUkelsonGlobalAlignment</CODE>) or local alignment (see
295 * <CODE>CrochemoreLandauZivUkelsonLocalAlignment</CODE>).
296 *
297 * <P>Subclasses are required to implement the <CODE>buildOptimalAlignment</CODE>
298 * abstract method defined by this class according to their own method.</P>
299 *
300 * @return an optimal alignment between the loaded sequences
301 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible
302 * with the loaded sequences.
303 * @see CrochemoreLandauZivUkelsonGlobalAlignment
304 * @see CrochemoreLandauZivUkelsonLocalAlignment
305 * @see #computeBlockTable
306 * @see #buildOptimalAlignment
307 */
308 protected PairwiseAlignment computePairwiseAlignment ()
309 throws IncompatibleScoringSchemeException
310 {
311 // compute block table
312 computeBlockTable ();
313
314 // build and return an optimal global alignment
315 PairwiseAlignment alignment = buildOptimalAlignment ();
316
317 // allow the block table to be garbage collected
318 block_table = null;
319
320 return alignment;
321 }
322
323 /**
324 * Computes the block table (the result depends on subclasses, see
325 * <CODE>computeBlockTable</CODE> for details) and requests subclasses to locate the
326 * score of the highest scoring alignment between the two sequences in the block
327 * table. The result depends on the subclasses, and either a global alignment
328 * (see <CODE>CrochemoreLandauZivUkelsonGlobalAlignment</CODE>) or local alignment
329 * score (see <CODE>CrochemoreLandauZivUkelsonLocalAlignment</CODE>) will be produced.
330 *
331 * <P>Subclasses are required to implement the <CODE>locateScore</CODE> abstract
332 * method defined by this class according to their own method.</P>
333 *
334 * <P>Note that this method calculates the similarity value only (it doesn't trace
335 * back into the block table to retrieve the alignment itself).</P>
336 *
337 * @return the score of the highest scoring alignment between the loaded sequences
338 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible
339 * with the loaded sequences.
340 * @see CrochemoreLandauZivUkelsonGlobalAlignment
341 * @see CrochemoreLandauZivUkelsonLocalAlignment
342 * @see #locateScore
343 */
344 protected int computeScore () throws IncompatibleScoringSchemeException
345 {
346 // compute block table
347 computeBlockTable ();
348
349 // get score
350 int score = locateScore ();
351
352 // allow the block table to be garbage collected
353 block_table = null;
354
355 return score;
356 }
357
358 /**
359 * Computes the block table. This method is a general specification of how the block
360 * table should be computed. It creates the block table according to the number of
361 * factors of each sequence. It then goes over each position of the block table,
362 * retrieves the corresponding factors from each sequence, and repasses the
363 * information to the subclasses that will do the actual computation of each block
364 * using the scoring scheme previously set.
365 *
366 * <P>There are four different cases that defines four abstract methods in this class,
367 * which subclasses must implement:</P>
368 *
369 * <UL>
370 * <LI><B>createRootBlock</B> - creates and computes block at row 0, column 0;
371 * <LI><B>createFirstColumnBlock</B> - creates and computes blocks at column 0 which
372 * corresponds to alignment blocks between factors of sequence 1 and an empty string;
373 * <LI><B>createFirstRowBlock</B> - creates and computes blocks at row 0 which
374 * corresponds to alignment blocks between factors of sequence 2 and an empty string;
375 * <LI><B>createBlock</B> - creates and computes blocks at row > 0 and column > 0
376 * which corresponds to alignment blocks between one factor of sequence 1 and one
377 * factor of sequence 2;
378 * </UL>
379 *
380 * <P>Note that each factor has a serial number which indicates its order in the list
381 * of factors of a sequence. This number will match with the row and column index of
382 * a block in the block table. For instance, if a block has factors F1 and F2 with
383 * serial numbers 12 and 53, respectively, this means that this block is found at row
384 * 12, column 53 of the block table.</P>
385 *
386 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible
387 * with the loaded sequences.
388 * @see #createRootBlock
389 * @see #createFirstColumnBlock
390 * @see #createFirstRowBlock
391 * @see #createBlock
392 */
393 protected void computeBlockTable () throws IncompatibleScoringSchemeException
394 {
395 Factor factor1, factor2;
396 int r, c, max_length;
397
398 // create block table
399 block_table = new AlignmentBlock[num_rows][num_cols];
400
401 // find the length of the longest sequence (number of characters)
402 max_length = Math.max(seq1.numChars(), seq2.numChars());
403
404 // prepares the OUT matrix object
405 out_matrix.init (max_length, scoring.maxAbsoluteScore());
406
407 // start at the root of each trie
408 factor1 = seq1.getRootFactor();
409 factor2 = seq2.getRootFactor();
410
411 // check if roots' indexes are both zero
412 if (factor1.getSerialNumber() != 0 || factor2.getSerialNumber() != 0)
413 throw new IndexOutOfBoundsException ("Unexpected factor index.");
414
415 // initiate first cell of block table
416 block_table[0][0] = createRootBlock (factor1, factor2);
417
418 // compute first row
419 for (c = 1; c < num_cols; c++)
420 {
421 factor2 = factor2.getNext();
422
423 // check if factor2's index equals its column number (except for
424 // the last factor that can be a repetition of a previous one)
425 if (c < num_cols - 1 && factor2.getSerialNumber() != c)
426 throw new IndexOutOfBoundsException ("Unexpected factor index.");
427
428 block_table[0][c] = createFirstRowBlock (factor1, factor2, c);
429 }
430
431 for (r = 1; r < num_rows; r++)
432 {
433 factor1 = factor1.getNext();
434
435 // check if factor1's index equals its row number (except for
436 // the last factor that can be a repetition of a previous one)
437 if (r < num_rows - 1 && factor1.getSerialNumber() != r)
438 throw new IndexOutOfBoundsException ("Unexpected factor index.");
439
440 // go back to the root of sequence 2
441 factor2 = seq2.getRootFactor();
442
443 if (factor2.getSerialNumber() != 0)
444 throw new IndexOutOfBoundsException ("Unexpected factor index.");
445
446 // compute first column of current row
447 block_table[r][0] = createFirstColumnBlock (factor1, factor2, r);
448
449 for (c = 1; c < num_cols; c++)
450 {
451 factor2 = factor2.getNext();
452
453 // check if factor2's index equals its column number (except for
454 // the last factor that can be a repetition of a previous one)
455 if (c < num_cols - 1 && factor2.getSerialNumber() != c)
456 throw new IndexOutOfBoundsException ("Unexpected factor index.");
457
458 // compute row r, col c
459 block_table[r][c] = createBlock (factor1, factor2, r, c);
460 }
461 }
462 }
463
464 /**
465 * Assembles the DIST matrix of a block by retrieving the DIST columns of its prefix
466 * blocks. In fact, it also stores pointers to the owner block for each column
467 * retrieved. These pointers are later used during the trace back procedure that
468 * builds an optimal alignment from the information computed in the block table. This
469 * method is general enough to suit both global and local alignment versions of the
470 * algorithm.
471 *
472 * @param block the block for which the DIST matrix is needed
473 * @param dim the dimension of the DIST matrix
474 * @param r the row index of this block in the block table
475 * @param c the column index of this block in the block table
476 * @param lc the number of columns of the alignment block
477 * @return the DIST matrix
478 */
479 protected int[][] assembleDistMatrix (AlignmentBlock block, int dim, int r, int c,
480 int lc)
481 {
482 AlignmentBlock ancestor;
483 Factor parent;
484 int[][] dist;
485 int i;
486
487 dist = new int[dim][];
488
489 // columns to the left of lc
490 parent = block.factor2.getAncestor();
491 for (i = lc - 1; i >= 0; i--)
492 {
493 ancestor = block_table[r][parent.getSerialNumber()];
494 block.ancestor[i] = ancestor;
495 dist[i] = ancestor.dist_column;
496 parent = parent.getAncestor();
497 }
498
499 // column lc
500 dist[lc] = block.dist_column;
501 block.ancestor[lc] = block;
502
503 // columns to the right of lc
504 parent = block.factor1.getAncestor();
505 for (i = lc + 1; i < dim; i++)
506 {
507 ancestor = block_table[parent.getSerialNumber()][c];
508 block.ancestor[i] = ancestor;
509 dist[i] = ancestor.dist_column;
510 parent = parent.getAncestor();
511 }
512
513 return dist;
514 }
515
516 /**
517 * Assembles the input border of a block by retrieving the values at the output
518 * borders of the left and top blocks. This method is general enough to suit both
519 * global and local alignment versions of the algorithm. Note that it can be used to
520 * assemble the input border of any block but the root one (it will cause an
521 * <CODE>ArrayIndexOutOfBoundsException</CODE>.
522 *
523 * @param dim dimension of the input border
524 * @param r row index of the block in the block table
525 * @param c column index of the block in the block table
526 * @param lr number of row of the block
527 * @return the block's input border
528 */
529 protected int[] assembleInputBorder (int dim, int r, int c, int lr)
530 {
531 AlignmentBlock left = null, top = null;
532 int[] input;
533 int i;
534
535 input = new int [dim];
536
537 // set up pointers to the left and top blocks (if applicable)
538 if (c > 0) left = block_table[r][c-1];
539 if (r > 0) top = block_table[r-1][c];
540
541 for (i = 0; i < dim; i++)
542 {
543 if (i < lr)
544 {
545 if (left != null)
546 input[i] = left.output_border[left.factor2.length() + i];
547 else
548 // there is no block to the left, so set a big negative value
549 // to make sure it will not be used (unfortunately, MIN_VALUE
550 // can overflow to a positive value when substracted by any
551 // number, so we use half of it as a workaround)
552 input[i] = Integer.MIN_VALUE / 2;
553 }
554 else if (i == lr)
555 {
556 if (left != null)
557 input[i] = left.output_border[left.factor2.length() + i];
558 else
559 // no need to check if top is not null
560 // (because we assume this is not the root block)
561 input[i] = top.output_border[i - lr];
562 }
563 else
564 {
565 if (top != null)
566 input[i] = top.output_border[i - lr];
567 else
568 // there is no top block (see note for the left case)
569 input[i] = Integer.MIN_VALUE / 2;
570 }
571 }
572
573 return input;
574 }
575
576 /**
577 * Traverses a block to retrieve a part of an optimal alignment from the specified
578 * source in the output border to an entry in the input border.
579 *
580 * @param block the block to be traversed
581 * @param source the source of the path in the output border
582 * @param gapped_seq1 the StringBuffer to where the gapped sequence 1 is written to
583 * @param tag_line the StringBuffer to where the tag_line is written to
584 * @param gapped_seq2 the StringBuffer to where the gapped sequence 2 is written to
585 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible
586 * with the loaded sequences.
587 */
588 protected void traverseBlock (AlignmentBlock block, int source,
589 StringBuffer gapped_seq1, StringBuffer tag_line, StringBuffer gapped_seq2)
590 throws IncompatibleScoringSchemeException
591 {
592 char char1, char2;
593
594 while (block.direction[source] != STOP_DIRECTION)
595 {
596 char1 = block.factor1.getNewChar();
597 char2 = block.factor2.getNewChar();
598
599 switch (block.direction[source])
600 {
601 case LEFT_DIRECTION:
602 gapped_seq1.insert (0, GAP_CHARACTER);
603 tag_line.insert (0, GAP_TAG);
604 gapped_seq2.insert (0, char2);
605
606 block = getLeftPrefix (block);
607 break;
608
609 case DIAGONAL_DIRECTION:
610 gapped_seq1.insert (0, char1);
611 if (char1 == char2)
612 if (useMatchTag())
613 tag_line.insert (0, MATCH_TAG);
614 else
615 tag_line.insert (0, char1);
616 else if (scoreSubstitution(char1, char2) > 0)
617 tag_line.insert (0, APPROXIMATE_MATCH_TAG);
618 else
619 tag_line.insert (0, MISMATCH_TAG);
620 gapped_seq2.insert (0, char2);
621
622 block = getDiagonalPrefix (block);
623 source --;
624 break;
625
626 case TOP_DIRECTION:
627 gapped_seq1.insert (0, char1);
628 tag_line.insert (0, GAP_TAG);
629 gapped_seq2.insert (0, GAP_CHARACTER);
630
631 block = getTopPrefix (block);
632 source --;
633 break;
634 }
635 }
636 }
637
638 /**
639 * This method is a shorthand to retrieve the left prefix of a block from the block
640 * table.
641 *
642 * @param block the block
643 * @return the block's left prefix
644 */
645 protected AlignmentBlock getLeftPrefix (AlignmentBlock block)
646 {
647 int prefix_row = block.factor1.getSerialNumber();
648 int prefix_col = block.factor2.getAncestorSerialNumber();
649
650 return block_table[prefix_row][prefix_col];
651 }
652
653 /**
654 * This method is a shorthand to retrieve the diagonal prefix of a block from the
655 * block table.
656 *
657 * @param block the block
658 * @return the block's diagonal prefix
659 */
660 protected AlignmentBlock getDiagonalPrefix (AlignmentBlock block)
661 {
662 int prefix_row = block.factor1.getAncestorSerialNumber();
663 int prefix_col = block.factor2.getAncestorSerialNumber();
664
665 return block_table[prefix_row][prefix_col];
666 }
667
668 /**
669 * This method is a shorthand to retrieve the top prefix of a block from the block
670 * table.
671 *
672 * @param block the block
673 * @return the block's top prefix
674 */
675 protected AlignmentBlock getTopPrefix (AlignmentBlock block)
676 {
677 int prefix_row = block.factor1.getAncestorSerialNumber();
678 int prefix_col = block.factor2.getSerialNumber();
679
680 return block_table[prefix_row][prefix_col];
681 }
682
683 /**
684 * Computes the root block of the block table. See subclasses for actual
685 * implementation.
686 *
687 * @param factor1 the factor of the first sequence being aligned
688 * @param factor2 the factor of the second sequence being aligned
689 * @return the root block
690 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible
691 * with the loaded sequences.
692 */
693 protected abstract AlignmentBlock createRootBlock (Factor factor1, Factor factor2)
694 throws IncompatibleScoringSchemeException;
695
696 /**
697 * Computes a block at the first row (row zero) of the block table, which corresponds
698 * to an alignment block between one factor of sequence 2 and an empty string. See
699 * subclasses for actual implementation.
700 *
701 * @param factor1 the factor of the first sequence being aligned
702 * @param factor2 the factor of the second sequence being aligned
703 * @param col the column index of block in the block table
704 * @return the computed block
705 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible
706 * with the loaded sequences.
707 */
708 protected abstract AlignmentBlock createFirstRowBlock (Factor factor1, Factor factor2,
709 int col) throws IncompatibleScoringSchemeException;
710
711 /**
712 * Computes a block at the first column (column zero) of the block table, which
713 * corresponds to an alignment block between one factor of sequence 1 and an empty
714 * string. See subclasses for actual implementation.
715 *
716 * @param factor1 the factor of the first sequence being aligned
717 * @param factor2 the factor of the second sequence being aligned
718 * @param row the row index of block in the block table
719 * @return the computed block
720 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible
721 * with the loaded sequences.
722 */
723 protected abstract AlignmentBlock createFirstColumnBlock (Factor factor1,
724 Factor factor2, int row) throws IncompatibleScoringSchemeException;
725
726 /**
727 * Computes a block of the block table, which corresponds to an alignment block
728 * between one factor of sequence 1 and one factor of sequence 2. See subclasses for
729 * actual implementation.
730 *
731 * @param factor1 the factor of the first sequence being aligned
732 * @param factor2 the factor of the second sequence being aligned
733 * @param row the row index of block in the block table
734 * @param col the column index of block in the block table
735 * @return the computed block
736 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible
737 * with the loaded sequences.
738 */
739 protected abstract AlignmentBlock createBlock (Factor factor1, Factor factor2,
740 int row, int col) throws IncompatibleScoringSchemeException;
741
742 /**
743 * Retrieves an optimal alignment between the loaded sequences. See subclasses for
744 * actual implementation.
745 *
746 * @return the computed block
747 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible
748 * with the loaded sequences.
749 */
750 protected abstract PairwiseAlignment buildOptimalAlignment ()
751 throws IncompatibleScoringSchemeException;
752
753 /**
754 * Locates the score of the highest scoring alignment between the two sequences in the
755 * block table after is thas been computed. See subclasses for actual implementation.
756 *
757 * @return the score of the highest scoring alignment between the loaded sequences
758 * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible
759 * with the loaded sequences.
760 */
761 protected abstract int locateScore ();
762 }