Talk:Longest common subsequence problem

From Wikipedia, the free encyclopedia

Earlier comments can be found at Talk:Longest_common_subsequence_problem/Archive

Contents

[edit] Cleanup 2005-08-07

Since I am activally using and contributing to this page (I am the one who posted the diff algorithm) to create a generic history tool I think Zer000 is right it lay out sucks.... for that reason I will undertake writting a new one once I start I will post a link to the draft page here for comment. (unsigned by Sarek1024, 2006-05-14)

(I forgot to save this edit to the talk page earlier). I took the liberty of rewriting the page. Feel free to revert or change. I moved the old talk to Talk:Longest_common_subsequence_problem/Archive. -- Nils Grimsmo 05:54, 8 August 2006 (UTC)
The quality control effort really screwed the page up.... first of not everyone knows python so it is better to offer the algorithm in traditional psedo-code... second back tracking is *NOT* needed so why include it (the rivest algorithm may not be the most efficent but it is the simplest to understand)... getting rid of the LCS matrix makes stuff more confusing and harder to debug for someone implementing it (I started editing the page when I first started playing with LCS about 6 months ago and found it to be an ideal reference [short on theory but long on practicle stuff])... also the diff algorithm should also be included since it illustrates a good real life example of using the LCS for something useful. (unsigned by Sarek1024, 2006-08-08)
I will not be offended if you revert it. Some views:
* I might agree to the argument about Python, as there is some syntax noise there (initialising the C matrix, for instance.)
* The algorithms from CLRS01 does use backtracking. The "UP"/"LEFT" table? The algorithm PRINT-LCS from CLRS01 reads this, and backtracks. See also the section "Improving your code" on page 355.
* PRINT-LCS from CLRS01 prints a random LCS if there are more than one. This could be a better approach from an educational view, though. But then it should be stated clearly in the article that you shold modify your approach if you want all the LCS.
* I feel that the four step explanation in CLRS01 is too long for an encyclopedia. Theorem 15.1 Step 1 is enough for people to understand the problem. It might be better to write it as in CLRS01 (as it was here previously) though.
I am sorry for the brutal cleanup. If more people feel like a revert, just go ahead.
Nils Grimsmo 05:54, 8 August 2006 (UTC)
I have now changed it a bit. Maybe you like it better.
* Finding a single LCS is the default problem again.
* The code is split into one function which finds a single LCS, and one which finds all. The former should be very easy to understand.
* I have gotten rid of some odd Python specific syntax (initializing the C array). What is left is the syntax of the for-loops (range) and some stuff in bt_all.
* Added a diff print function.
Nils Grimsmo 14:51, 8 August 2006 (UTC)
I wrote a crude Python to pseudo-code translator. Do you people like this pseudo-code better than the Python variant then? Nils Grimsmo 07:38, 9 August 2006 (UTC)
While I'm not too keen on the Python code, it's not horrible. I think that while Nils' LCS2 version may be more language-agnostic, it may be harder to read due to its additional verbosity. As it stands, the Python code is at least trim and concise. RickOsborne 14:54, 2 October 2006 (UTC)
I tried to make an example and print the C matrix combined with the choices you make (which are the choices you have to redo when you backtrack). I am no HTML wiz, so I had a little trouble getting the output compact. I have uploaded a dump of it. Does anybody have an idea of how to make the source code more compact? -- Nils Grimsmo 18:05, 13 August 2006 (UTC)

[edit] Number of LCSs

Does anybody know a tight bound on the maximal number of longest common subsequences? (As in less than O(mm), m \leq n). If you take for example the strings XYXYXY...XY and YXYXYX...YX, there are 2^\frac{m}{2} LCSs, right? For XYZXYZ...XYZ and ZYXZYX...ZYX there are 3^\frac{m}{3}. We have that r^\frac{m}{r} is maximal for r = e. Is e^\frac{m}{e} a bound? -- Nils Grimsmo 17:56, 13 August 2006 (UTC)

This was just plain wrong. The strings XYZXYZ and ZYXZYX, have 10 LCSs, of length 3. And e^\frac{6}{e} \approx 9.09 < 10. But pretty please, doesn't anybody have a bound somewhere in a paper? -- Nils Grimsmo 07:42, 16 August 2006 (UTC)
I believe the following will help: http://arxiv.org/abs/cs.DM/0301030 . This too: http://www.research.att.com/~njas/sequences/A094837 . Zerotalk 10:48, 16 August 2006 (UTC)
Thanks a bunch! -- Nils Grimsmo 09:29, 17 August 2006 (UTC)

[edit] LCP?

Hi folks, I dunno if it's just an artifact of an earlier version or an artifact of my wayward brain, but is "LCP" a typo for LCS in the Python code and there abouts? Also, this Python doesn't run as is, right? The matrix C isn't initialized. Isn't the LCS algorithm understood to take two sequences as its input? --babbage 09:00, 27 October 2006 (UTC)

  • You are right about the typo. Thanks for noticing! I added a comment in the code about the initialisation of the C matrix. The reason I did not do create the matrix inside in the function is that this is really ugly in python:
C = [[0] * (m+1) for i in xrange(n+1)]
But maybe it should have been put there anyway. Klem fra Nils Grimsmo 07:18, 28 October 2006 (UTC)

[edit] Notation

Currently the notation is not consistent between the definition of the recurrence and the pseudo-code (my fault). Which should be used of X1..m and X[1..m]? Or maybe something else? Or does it not matter that two different notations are used? Klem fra Nils Grimsmo 14:29, 10 December 2006 (UTC)

And BTW, this should be probably be uniform across all similar string problems/algorithms which use DP solutions, such as edit distance, longest common substring, shortest common supersequence, etc.. Klem fra Nils Grimsmo 21:26, 11 December 2006 (UTC)


[edit] Relation to other problems

For two strings X_{1 \dots m} and Y_{1 \dots n}, the Levenshtein distance (edit distance) is[1]

d(X,Y) = n + m - 2 \cdot \left|\textrm{LCS}(X,Y)\right|

I don't think this is true. For example, if we take the words RHYTHM and ALGORITHM, then \left|\textrm{LCS}(X,Y)\right|=4 so the Levenshtein distance should become 15 - 2 \cdot 4 = 7, but it should be 6.

You are right. Good thing you noticed! Klem fra Nils Grimsmo 15:03, 21 December 2006 (UTC)

[edit] Solution for two sequences

I think the following equation is wrong.

\textrm{LCS}\left(X_{1 \dots i},Y_{1 \dots j}\right) = \begin{cases} \textrm{LCS}\left(X_{1 \dots i-1},Y_{1 \dots j-1}\right) + x_{i}                                                                & \mbox{ if } x_i = y_j \\ \end{cases}

The correct one should be

\textrm{LCS}\left(X_{1 \dots i},Y_{1 \dots j}\right) = \begin{cases} \textrm{LCS}\left(X_{1 \dots i-1},Y_{1 \dots j-1}\right) + 1                                                                & \mbox{ if } x_i = y_j \\ \end{cases}

Isn't it? —The preceding unsigned comment was added by 140.115.213.121 (talk) 04:59, 2 February 2007 (UTC).

Sorry, the original one is correct. I misunderstood the equation. —140.115.213.121.

[edit] Needs an example

I think the article could really do with an example demonstrating the LCS between two strings. I'm not sure I understand it well enough to add one, though... Tjwood 10:18, 5 March 2007 (UTC)

I added an example now. Does it make things clearer? Klem fra Nils Grimsmo 08:25, 6 March 2007 (UTC)