<<Up     Contents

Longest-common subsequence problem

The Longest-common subsquence problem is the search for an efficient method of finding the longest common subsequence (LCS). This computer science problem has gained promience thanks in part to the Bioinformatics field.

An old method of searching for LCS was to employ a brute force policy: Given a sequence X, determine all possible of subsequences of X, and check to see if each subsequence was a subsequence of Y, keeping track of the longest subsequence found. Each subsequence of X would be in the set of {1,2,3,4,....,k}. Using number theory proofs, we find that there would be 2k subsequences of X. This would be in exponential time, making this search extremely ineffective for long sequences, such as human DNA strands.

Four Steps to LCS, Polynomial Time Edition

1. Analyze LCS properties
Many computer scientists have written papers on LCS properties, including one where LCS has an optimal-substructure property.

The Optimal-Substructure of an LCS Theorem is

Let X = < x1,...,xm > and Y = < y1,...,yn > be sequences, and let Z = < z1,...,zk > be any LCS of X and Y.

  1. If xm = yn, then zk = xm = yn and Zk-1 is an LCS of Xm-1 and Yn-1.
  2. If xm ≠ yn, then zk ≠ xm, implies that Z is an LCS of Xm-1 and Y.
  3. If xm ≠ yn, then zk ≠ yn implies that Z is an LCS of X and Yn-1.

2. Devise a recursive solution
Computer scientists have agreed on the formula:

<math> c[i,j] = \left\{\begin{matrix} 0, & \mbox{if i = 0 or j = 0} \\ c[i-1,j-1]+1, & \mbox{if i,j}>\mbox{0 and }x_i = y_j \\ max(c[i,j-1],c[i-1,j]) & \mbox{if i,j}>\mbox{0 and }x_i \ne y_j \end{matrix}\right. </math>

The recursive formula relies on the Optimal Substructure Theorem in Part 1. It involves establishing a recurrence for the optimal value. c[i,j] is the length of the LCS in sequences Xi and Yj. The first part of the formula says if either one of the sequences has length 0, then there can be no proper subsequence of a null sequence. The other two parts breaks the LCS apart into smaller subproblems until we reach the null sequence.

3. Compute the LCS

 LCS-Delta(X,Y)
 m <- LENGTH[X];
 n <- LENGTH[Y];
 for i <- 1 to m, do c[i,0] <-0;
 for j <- 0 to n, do c[0,j] <-0;
 b <- c;
 for i <- 1 to m, do {
     for j <- 1 to m do {
          if (xi = yj) {
              c[i,j] <- c[i-1,j-1]+1;
              b[i,j] <- "UP-LEFT";
          }
          else if (c[i-1,j] >= c[i,j-1]) {
              c[i,j] <- c[i-1,j];
              b[i,j] <- "UP";
          }
          else
              c[i,j] <- c[i,j-1];
              b[i,j] <- "LEFT";
          }
     }
  return c and b

4. Construct the LCS

<math>\begin{bmatrix} ! & 0 & 0 & 0 & 0 & 0\\
 0 & UP-LEFT & LEFT & 0 & 0 & 0 \\  
 0 & 0 & 0 & UP-LEFT & 0  & 0\\  
 0 & 0 & 0 & 0 & UP-LEFT & 0 \\  
 0 & 0 & 0 & 0 & UP & LEFT \\
\end{bmatrix} </math>

Starting from the (m,n) position, we follow the direction of the arrows, until we reach the end of the LCS, denoted by !. The values in c-table corresponding to the "UP-LEFT" positions in the b-table would be an element of LCS. The LCS is constructed in reverse order using this method. The construction method takes O(m + n) time to complete, therefore it runs in linear time.

Modern LCS search methods have yielded algorithms which have cut down the exponential time to polynomial time. The LCS continues to be examined by computer scientists, trying to find an even faster time, perhaps one in logarithmic/linear time.

wikipedia.org dumped 2003-03-17 with terodump