Longest Common Subsequence

X = <x₁, x₂, x₃, ...,x_m> is a sequence

Z = <z₁, z₂, z₃, ..., z_k> is a subsequence of X if z₁ = x_{i₁}, z₂ = x_{i₂}, ..., z_k = x_{i_k} for some indices 0 < i₁ < i₂ < ... < i_k ≤ m.

Given sequences X and Y, Z is a common subsequence if Z is a subsequence of both X and Y. If no common subsequence of X and Y is longer than Z, then Z is a longest common subsequence (LCS).

X = < A, G, C, G, T, A, G >

Y = < G, T, C, A, G, A >

Common subsequences include < G, C, A >, < G, C, G, A > and < G, T, A, G >. Since no common subsequence of length 5 exists, the LCS length is 4. The LCS is not unique: < G, C, G, A >, < G, T, A, G > and < G, C, A, G > are all LCSs.
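The definitions above can be checked mechanically. The following is a minimal Python sketch; the helper names is_subsequence and is_common_subsequence are illustrative and do not come from the notes, and the sequences are written as strings for brevity.

```python
def is_subsequence(z, x):
    """Return True if z can be obtained from x by deleting zero or more elements."""
    it = iter(x)
    # `elem in it` advances the iterator until it finds elem, so successive
    # matches are forced to occur at strictly increasing positions of x.
    return all(elem in it for elem in z)


def is_common_subsequence(z, x, y):
    """Return True if z is a subsequence of both x and y."""
    return is_subsequence(z, x) and is_subsequence(z, y)


X = "AGCGTAG"
Y = "GTCAGA"
print(is_common_subsequence("GCGA", X, Y))  # True
print(is_common_subsequence("GTC", X, Y))   # False: in X, no C follows the T
```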

Finding the LCS:

We could enumerate all subsequences of X (length m, so 2^m subsequences) and of Y (length n, so 2ⁿ subsequences), then search for matches and sort them by length. This is an exponential algorithm.

Instead, LCS has an optimal substructure property based on prefixes. If X = <x₁, ..., x_m> then X_i = <x₁, ..., x_i> is the i-th prefix of X, and X₀ is empty.
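For comparison, the brute-force idea can be sketched in Python as below. This is only an illustration of why the approach is exponential; the function names all_subsequences and lcs_brute_force are assumptions, not names from the notes, and the code is practical only for very short inputs.

```python
from itertools import combinations


def all_subsequences(seq):
    """Return the set of distinct subsequences of seq as tuples.

    There are 2^len(seq) index subsets to enumerate, hence the exponential cost.
    """
    return {
        tuple(seq[i] for i in idx)
        for k in range(len(seq) + 1)
        for idx in combinations(range(len(seq)), k)
    }


def lcs_brute_force(x, y):
    """Return every longest common subsequence of x and y, in sorted order."""
    common = all_subsequences(x) & all_subsequences(y)  # the "hits"
    best = max(len(z) for z in common)  # the empty tuple is always common
    return sorted(z for z in common if len(z) == best)


for z in lcs_brute_force("AGCGTAG", "GTCAGA"):
    print("".join(z))
```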

Theorem 16.1 (Optimal substructure for LCS)

If X = <x₁, ..., x_m> and Y = <y₁, ..., y_n> are sequences, let Z = <z₁, ..., z_k> be any LCS of X and Y.

1. If x_m = y_n, then z_k = x_m = y_n and Z_{k-1} is an LCS of X_{m-1} and Y_{n-1}.

2. If x_m ≠ y_n, then z_k ≠ x_m ⇒ Z is an LCS of X_{m-1} and Y.

3. If x_m ≠ y_n, then z_k ≠ y_n ⇒ Z is an LCS of X and Y_{n-1}.

Proof:

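Case 1 (x_m = y_n): If z_k ≠ x_m, then we could append x_m = y_n to Z to get a common subsequence of X and Y of length k + 1, contradicting the assumption that Z is an LCS. So z_k = x_m = y_n. The prefix Z_{k-1} has length k - 1 and is a common subsequence of X_{m-1} and Y_{n-1}. It is in fact an LCS: if not, there exists a common subsequence W of X_{m-1} and Y_{n-1} with | W | > k - 1, and appending x_m = y_n to W gives a common subsequence of X and Y of length greater than k, a contradiction.

Case 2 (x_m ≠ y_n and z_k ≠ x_m): Z is a common subsequence of X_{m-1} and Y. If there were a common subsequence W of X_{m-1} and Y with | W | > k, then W would also be a common subsequence of X and Y, contradicting the assumption that Z is an LCS.

Case 3 (x_m ≠ y_n and z_k ≠ y_n): Symmetric to Case 2, with the roles of X and Y exchanged.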

This means that to find the LCS of X and Y:

1. If x_m = y_n, find an LCS of X_{m-1} and Y_{n-1} and append x_m.

2. If x_m ≠ y_n:

   a. find an LCS of X_{m-1} and Y

   b. find an LCS of X and Y_{n-1}

   and take the longer of 'a.' or 'b.'.

Thus we start with small problems, find their LCSs, and grow our solution:

Let c[i,j] = | W |, where W is an LCS of X_i and Y_j. Then

c[i,j] = 0                             if i = 0 or j = 0

       = c[i-1,j-1] + 1                if i,j > 0 and x_i = y_j

       = max (c[i,j-1], c[i-1,j])      if i,j > 0 and x_i ≠ y_j

We could use this recurrence to write an exponential-time recursive algorithm. However, there are only Θ(mn) distinct sub-problems, so we use dynamic programming (DP). Store c[i,j] together with b[i,j], which points to the optimal sub-problem chosen when computing c[i,j]. Then c[m,n] contains the length of the LCS, and the b table, followed backwards from b[m,n], gives the directions used to build it.
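As an illustration, the recurrence for c[i,j] can be written top-down in Python. Without caching, the recursion would repeat sub-problems and take exponential time; the sketch below adds memoization through functools.lru_cache so that each (i, j) pair is solved once. This is a swapped-in top-down variant, not the bottom-up table algorithm of the notes, and the name lcs_length_memo is an assumption. It computes only the length, and the recursion depth can reach m + n, so it suits short inputs.

```python
from functools import lru_cache


def lcs_length_memo(x, y):
    """Length of an LCS of x and y, from the recurrence for c[i, j]."""

    @lru_cache(maxsize=None)
    def c(i, j):
        if i == 0 or j == 0:          # an empty prefix has an empty LCS
            return 0
        if x[i - 1] == y[j - 1]:      # x_i = y_j (Python sequences are 0-indexed)
            return c(i - 1, j - 1) + 1
        return max(c(i, j - 1), c(i - 1, j))

    return c(len(x), len(y))


print(lcs_length_memo("AGCGTAG", "GTCAGA"))  # 4
```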

LCS-Length (X, Y)

m ← length[X]

n ← length[Y]

for i ← 1 to m

   do c[i,0] ← 0

for j ← 0 to n

   do c[0,j] ← 0

for i ← 1 to m

   do for j ← 1 to n

      do if x_i = y_j

            then c[i,j] ← c[i-1, j-1] + 1

                 b[i,j] ← "\"

            else if c[i-1, j] ≥ c[i, j-1]

               then c[i,j] ← c[i-1, j]

                    b[i,j] ← "↑"

               else c[i,j] ← c[i, j-1]

                    b[i,j] ← "←"

return c and b

Running time = O(mn), since each of the (m+1)(n+1) table entries takes O(1) time to compute.
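A possible Python rendering of LCS-Length is sketched below. It assumes 0-indexed Python sequences (so x_i becomes x[i - 1]) and uses the strings "diag", "up" and "left" as stand-ins for the three arrows; these labels and the function name lcs_length are choices made for this sketch only.

```python
def lcs_length(x, y):
    """Fill the c (length) and b (direction) tables for sequences x and y."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay 0: an empty prefix has an empty LCS.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
                b[i][j] = "diag"
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j] = c[i - 1][j]
                b[i][j] = "up"
            else:
                c[i][j] = c[i][j - 1]
                b[i][j] = "left"
    return c, b


c, b = lcs_length("AGCGTAG", "GTCAGA")
print(c[7][6])  # 4, the length of an LCS
```

The two nested loops visit each table entry once, which matches the O(mn) bound.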

Note:

"\" means x_i = y_j

"↑" means x_i ≠ y_j and c[i-1, j] ≥ c[i, j-1]

"←" means x_i ≠ y_j and c[i-1, j] < c[i, j-1]

Only the "\" diagonal arrows lengthen the LCS: each one contributes the element x_i = y_j.

Print-LCS (b, X, i, j)

if i = 0 or j = 0

   then return

if b[i,j] = "\"

   then Print-LCS (b, X, i-1, j-1)

        print x_i

else if b[i,j] = "↑"

   then Print-LCS (b, X, i-1, j)

else Print-LCS (b, X, i, j-1)

This takes O(m + n) time, since at least one of i and j is decremented in each recursive call.
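The traceback can also be sketched in Python. The version below is iterative rather than recursive (a substitution made here to avoid Python's recursion limit on long inputs) and returns the LCS as a list instead of printing it. It assumes a direction table b whose entries are the strings "diag", "up" and "left" standing in for the three arrows; both that encoding and the name collect_lcs are assumptions of this sketch.

```python
def collect_lcs(b, x, i, j):
    """Follow the arrows in b from (i, j) back to the border; return the LCS as a list."""
    out = []
    while i > 0 and j > 0:
        if b[i][j] == "diag":      # x_i = y_j is part of the LCS
            out.append(x[i - 1])   # x is 0-indexed in Python
            i, j = i - 1, j - 1
        elif b[i][j] == "up":
            i -= 1
        else:                      # "left"
            j -= 1
    out.reverse()                  # elements were collected back to front
    return out


# Hand-built arrow table for x = "AB", y = "B" (rows i = 0..2, columns j = 0..1).
b_example = [[None, None], [None, "up"], [None, "diag"]]
print("".join(collect_lcs(b_example, "AB", 2, 1)))  # prints B
```

Each loop iteration decrements i, j, or both, so the loop runs at most m + n times.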
