Longest common subsequence problem
From Wikipedia, the free encyclopedia
The longest common subsequence problem (LCS) is finding the longest subsequence common to all sequences in a set of sequences (often just two). It is a classic computer science problem, the basis of diff, and has applications in bioinformatics.
It should not be confused with the longest common substring problem (substrings are necessarily contiguous).
Contents |
[edit] Complexity
The problem is NP-hard for the general case of an arbitrary number of input sequences.[1] When the number of sequences is constant, the problem is solvable in polynomial time by dynamic programming (see Solution below). Assume you have N sequences of lengths . A naive search would be trying, for all possible lengths, for all the sequences, all possible subsequences, which numbers
- .
A dynamic programming (DP) approach gives a solution in
There exist methods with lower complexity,[2] which often depend on the length of the LCS, the size of the alphabet, or both.
Notice that the LCS is not necessarily unique; for example the LCS of "ABC" and "ACB" is either "AB" or "AC". Indeed the LCS problem is often defined to be finding all common subsequences of a maximum length. This problem inherently has higher complexity, as the number of such subsequences is exponential in the worst case,[3] even for only two input strings.
[edit] Solution for two sequences
Given the sequences and
Here + denotes concatenation, and max gives the longest sequence.
Since this problem has an optimal substructure property, it can be solved by dynamic programming. See[4] for an accessible explanation.
The rationale for this recurrence is that, if the last characters of two sequences are equal, they must be part of the LCS. You could never get a larger LCS by matching xm to yj where j < n, and vice versa. If they are not equal, check what gives the largest LCS of keeping xm and yn, and use that. If you want to find all the longest common subsequences, the LCS should be denoted as a set of sequences, and max should return both solutions if they are equally long.
[edit] Relation to other problems
For two strings and , the length of the shortest common supersequence is[2]
The edit distance when only insertion and deletion is allowed (no substitution), or when the cost of the substitution is the double of the cost of an insertion or deletion, is:
[edit] Dynamic programming solution
[edit] Computing the length of the LCS
The function below takes as input sequences X[1..m]
and Y[1..n]
computes the LCS between X[1..i]
and Y[1..j]
for all 1 ≤ i ≤ m
and 1 ≤ j ≤ n
, and stores it in C[i,j]
. C[m,n]
will contain the length of the LCS of X
and Y
.
function LCSLength(X[1..m], Y[1..n]) C = array(0..m, 0..n) for i := 0..m C[i,0] = 0 for j := 1..n C[0,j] = 0 for i := 1..m for j := 1..n if X[i] = Y[j] C[i,j] := C[i-1,j-1] + 1 else: C[i,j] := max(C[i,j-1], C[i-1,j]) return C[m,n]
Alternatively, memoization could be used.
[edit] Reading out an LCS
The following function backtraces the choices taken when computing the C
table. If the last characters in the prefixes are equal, they must be in an LCS. If not, check what gave the largest LCS of keeping xi and yj, and make the same choice. Just choose one if they were equally long. Call the function with i=m
and j=n
.
function backTrace(C[0..m,0..n], X[1..m], Y[1..n], i, j) if i = 0 or j = 0 return "" else if X[i] = Y[j] return backTrace(C, X, Y, i-1, j-1) + X[i] else if C[i,j-1] > C[i-1,j] return backTrace(C, X, Y, i, j-1) else return backTrace(C, X, Y, i-1, j)
[edit] Reading out all LCSs
If choosing xi and yj would give an equally long result, read out both resulting subsequences. This is returned as a set by this function. Notice that this function is not polynomial, as it might branch in almost every step if the strings are similar.
function backTraceAll(C[0..m,0..n], X[1..m], Y[1..n], i, j) if i = 0 or j = 0 return {} else if X[i] = Y[j] return {Z + X[i] for all Z in backTraceAll(C, X, Y, i-1, j-1)} else R := {} if C[i,j-1] ≥ C[i-1,j] R := backTraceAll(C, X, Y, i, j-1) if C[i-1,j] ≥ C[i,j-1] R := R ∪ backTraceAll(C, X, Y, i-1, j) return R
[edit] Print the diff
This function will backtrack through the C matrix, and print the diff between the two sequences. Notice that you will get a different answer if you exchange ≥
and <
, with >
and ≤
below.
function printDiff(C[0..m,0..n], X[1..m], Y[1..n], i, j) if i > 0 and j > 0 and X[i] = Y[j] printDiff(C, X, Y, i-1, j-1) print " " + X[i] else if j > 0 and (i = 0 or C[i,j-1] ≥ C[i-1,j]) printDiff(C, X, Y, i, j-1) print "+ " + Y[j] else if i > 0 and (j = 0 or C[i,j-1] < C[i-1,j]) printDiff(C, X, Y, i-1, j) print "- " + X[i]
[edit] Example
Let X be "XMJYAUZ
" and Y be "MZJAWXU
". The longest common subsequence between X and Y is "MJAU
". The table C
shown below, which is generated by the function LCSlength
, shows the lengths of the longest common subsequences between prefixes of X and Y. The ith row and jth column shows the length of the LCS between X1..i and Y1..j.
| 0 1 2 3 4 5 6 7 | M Z J A W X U -----|----------------- 0 | 0 0 0 0 0 0 0 0 1 X | 0 0 0 0 0 0 1 1 2 M | 0 1 1 1 1 1 1 1 3 J | 0 1 1 2 2 2 2 2 4 Y | 0 1 1 2 2 2 2 2 5 A | 0 1 1 2 3 3 3 3 6 U | 0 1 1 2 3 3 3 4 7 Z | 0 1 2 2 3 3 3 4
The underlined numbers show the path the function backTrack
would follow from the bottom right to the top left corner, when reading out an LCS. If the current symbols in X and Y are equal, they are part of the LCS, and we go both up and left. If not, we go up or left, depending on which cell has a higher number. This corresponds to either taking the LCS between X1..i − 1 and Y1..j, or X1..i and Y1..j − 1
[edit] Optimization
Several optimizations can be made to the algorithm above to speed it up for real-world cases.
[edit] Reduce the problem set
The C matrix in the naive algorithm grows quadratically with the lengths of the sequences. For two 100-item sequences, a 10,000-item matrix would be needed, and 10,000 comparisons would need to be done. In most real-world cases, especially source code diffs and patches, the beginnings and ends of files rarely change, and almost certainly not both at the same time. If only a few items have changed in the middle of the sequence, the beginning and end can be eliminated. This reduces not only the memory requirements for the matrix, but also the number of comparisons that must be done.
function LCS(X[1..m], Y[1..n]) m_start := 1 m_end := m n_start := 1 n_end := n trim off the matching items at the beginning while m_start ≤ m_end and n_start ≤ n_end and X[m_start] = Y[n_start] m_start := m_start + 1 n_start := n_start + 1 trim off the matching items at the end while m_start ≤ m_end and n_start ≤ n_end and X[m_end] = Y[n_end] m_end := m_end - 1 n_end := n_end - 1 C = array(m_start-1..m_end, n_start-1..n_end) only loop over the items that have changed for i := m_start..m_end for j := n_start..n_end the algorithm continues as before ...
In the best case scenario, a sequence with no changes, this optimization would completely eliminate the need for the C matrix. In the worst case scenario, a change to the very first and last items in the sequence, only two additional comparisons are performed.
[edit] Reduce the comparison time
Most of the time taken by the naive algorithm is spent performing comparisons between items in the sequences. For textual sequences such as source code, you want to view lines as the sequence elements instead of single characters. This can mean comparisons of relatively long strings for each step in the algorithm. Two optimizations can be made that can help to reduce the time these comparisons consume.
[edit] Reduce strings to hashes
A hash function or checksum can be used to reduce the size of the strings in the sequences. That is, for source code where the average line is 60 or more characters long, the hash or checksum for that line might be only 8 to 40 characters long. Additionally, the randomized nature of hashes and checksums would guarantee that comparisons would short-circuit faster, as lines of source code will rarely be changed at the beginning.
There are three primary drawbacks to this optimization. First, an amount of time needs to be spent beforehand to precompute the hashes for the two sequences. Second, additional memory needs to be allocated for the new hashed sequences. However, in comparison to the naive algorithm used here, both of these drawbacks are relatively minimal.
The third drawback is that of collisions. Since the checksum or hash is not guaranteed to be unique, there is a small chance that two different items could be reduced to the same hash. This is unlikely in source code, but it is possible. A cryptographic hash would therefore be far better suited for this optimization, as its entropy is going to be significantly greater than that of a simple checksum. However, the setup and computational requirements of a cryptographic hash may not be worth it for small sequence lengths.
[edit] Reduce strings to identity integers
In source code, there will often be lines that are exactly the same. This could include an empty line between two functions, or a trailing brace, or closing comment, and so on. The algorithm does not need to distinguish between specific instances of these similar lines. That is, one blank line is the same as any other blank line. Therefore, the previous optimization can be taken one step further by reducing each line to an identity integer. String buckets, such as Python dict
dictionary objects, Perl hashes, or C++ STL map
objects, are well-suited for this optimization.
- Create an empty bucket.
- For each item in the sequence, look to see if it is already present in the bucket.
- If not, add the item to the bucket and tag it with a new identity integer.
- If so, get the identity integer from the bucket.
- Replace the item with its identity integer.
- Repeat.
This method has advantages over the previous optimization. First, it does not suffer from the hash collision problem, as language-specific bucket objects are almost always coded to handle the case of two different items with the same hash. More importantly, each item is reduced from a string to a unique integer, meaning that each comparison is now an integer comparison. There is overhead to the identity allocation, but in terms of the original naive algorithm, this optimization is almost always worth it.
[edit] Reduce the required space
If only the length of the LCS is required, the matrix can be reduced to a 2×min(n,m) matrix with ease, or to min(m,n)+1 vector (smarter) as the dynamic programming approach only needs the current and previous columns of the matrix. Hirschberg's algorithm allows the construction of the optimal sequence itself in the same quadratic time and linear space bounds.[5]
[edit] References
- ^ David Maier (1978). "The Complexity of Some Problems on Subsequences and Supersequences". J. ACM 25: 322–336. ACM Press. doi: .
- ^ a b L. Bergroth and H. Hakonen and T. Raita (2000). "A Survey of Longest Common Subsequence Algorithms". SPIRE 00: 39–48. IEEE Computer Society. doi: .
- ^ Ronald I. Greenberg (2003-08-06). "Bounds on the Number of Longest Common Subsequences". v2.
- ^ Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein (2001). "15.4", Introduction to Algorithms, second edition, MIT Press and McGraw-Hill, 350-355. ISBN 0-262-53196-8.
- ^ Hirschberg, D. S. (1975). "A linear space algorithm for computing maximal common subsequences". Communications of the ACM 18 (6): 341–343. doi: .