Talk:Longest common subsequence problem

From Wikipedia, the free encyclopedia

Earlier comments can be found at Talk:Longest_common_subsequence_problem/Archive

1 Cleanup 2005-08-07
2 Number of LCSs
3 LCP?
4 Notation
5 Relation to other problems
6 Solution for two sequences
7 Needs an example
8 Is the Identity Integers reasoning wrong?

[edit] Cleanup 2005-08-07

Since I am activally using and contributing to this page (I am the one who posted the diff algorithm) to create a generic history tool I think Zer000 is right it lay out sucks.... for that reason I will undertake writting a new one once I start I will post a link to the draft page here for comment. (unsigned by Sarek1024, 2006-05-14)

(I forgot to save this edit to the talk page earlier). I took the liberty of rewriting the page. Feel free to revert or change. I moved the old talk to Talk:Longest_common_subsequence_problem/Archive. -- Nils Grimsmo 05:54, 8 August 2006 (UTC)

The quality control effort really screwed the page up.... first of not everyone knows python so it is better to offer the algorithm in traditional psedo-code... second back tracking is *NOT* needed so why include it (the rivest algorithm may not be the most efficent but it is the simplest to understand)... getting rid of the LCS matrix makes stuff more confusing and harder to debug for someone implementing it (I started editing the page when I first started playing with LCS about 6 months ago and found it to be an ideal reference [short on theory but long on practicle stuff])... also the diff algorithm should also be included since it illustrates a good real life example of using the LCS for something useful. (unsigned by Sarek1024, 2006-08-08)

I will not be offended if you revert it. Some views:

* I might agree to the argument about Python, as there is some syntax noise there (initialising the C matrix, for instance.)

* The algorithms from CLRS01 does use backtracking. The "UP"/"LEFT" table? The algorithm PRINT-LCS from CLRS01 reads this, and backtracks. See also the section "Improving your code" on page 355.

* PRINT-LCS from CLRS01 prints a random LCS if there are more than one. This could be a better approach from an educational view, though. But then it should be stated clearly in the article that you shold modify your approach if you want all the LCS.

* I feel that the four step explanation in CLRS01 is too long for an encyclopedia. Theorem 15.1 Step 1 is enough for people to understand the problem. It might be better to write it as in CLRS01 (as it was here previously) though.

I am sorry for the brutal cleanup. If more people feel like a revert, just go ahead.

Nils Grimsmo 05:54, 8 August 2006 (UTC)

I have now changed it a bit. Maybe you like it better.

* Finding a single LCS is the default problem again.

* The code is split into one function which finds a single LCS, and one which finds all. The former should be very easy to understand.

* I have gotten rid of some odd Python specific syntax (initializing the C array). What is left is the syntax of the for-loops (range) and some stuff in bt_all.

* Added a diff print function.

Nils Grimsmo 14:51, 8 August 2006 (UTC)

I wrote a crude Python to pseudo-code translator. Do you people like this pseudo-code better than the Python variant then? Nils Grimsmo 07:38, 9 August 2006 (UTC)

While I'm not too keen on the Python code, it's not horrible. I think that while Nils' LCS2 version may be more language-agnostic, it may be harder to read due to its additional verbosity. As it stands, the Python code is at least trim and concise. RickOsborne 14:54, 2 October 2006 (UTC)

I tried to make an example and print the C matrix combined with the choices you make (which are the choices you have to redo when you backtrack). I am no HTML wiz, so I had a little trouble getting the output compact. I have uploaded a dump of it. Does anybody have an idea of how to make the source code more compact? -- Nils Grimsmo 18:05, 13 August 2006 (UTC)

[edit] Number of LCSs

Does anybody know a tight bound on the maximal number of longest common subsequences? (As in less than $O (m m)$ , $m \leq n$ ). If you take for example the strings XYXYXY...XY and YXYXYX...YX, there are $2^\frac{m}{2}$ LCSs, right? For XYZXYZ...XYZ and ZYXZYX...ZYX there are $3^\frac{m}{3}$ . We have that $r^\frac{m}{r}$ is maximal for $r = e$ . Is $e^\frac{m}{e}$ a bound? -- Nils Grimsmo 17:56, 13 August 2006 (UTC)

This was just plain wrong. The strings XYZXYZ and ZYXZYX, have 10 LCSs, of length 3. And $e^\frac{6}{e} \approx 9.09 < 10$ . But pretty please, doesn't anybody have a bound somewhere in a paper? -- Nils Grimsmo 07:42, 16 August 2006 (UTC)

I believe the following will help: http://arxiv.org/abs/cs.DM/0301030 . This too: http://www.research.att.com/~njas/sequences/A094837 . Zero^talk 10:48, 16 August 2006 (UTC)

Thanks a bunch! -- Nils Grimsmo 09:29, 17 August 2006 (UTC)

[edit] LCP?

Hi folks, I dunno if it's just an artifact of an earlier version or an artifact of my wayward brain, but is "LCP" a typo for LCS in the Python code and there abouts? Also, this Python doesn't run as is, right? The matrix C isn't initialized. Isn't the LCS algorithm understood to take two sequences as its input? --babbage 09:00, 27 October 2006 (UTC)

You are right about the typo. Thanks for noticing! I added a comment in the code about the initialisation of the C matrix. The reason I did not do create the matrix inside in the function is that this is really ugly in python:

C = [[0] * (m+1) for i in xrange(n+1)]

But maybe it should have been put there anyway. Klem fra Nils Grimsmo 07:18, 28 October 2006 (UTC)

[edit] Notation

Currently the notation is not consistent between the definition of the recurrence and the pseudo-code (my fault). Which should be used of X_1..m and X[1..m]? Or maybe something else? Or does it not matter that two different notations are used? Klem fra Nils Grimsmo 14:29, 10 December 2006 (UTC)

And BTW, this should be probably be uniform across all similar string problems/algorithms which use DP solutions, such as edit distance, longest common substring, shortest common supersequence, etc.. Klem fra Nils Grimsmo 21:26, 11 December 2006 (UTC)

[edit] Relation to other problems

For two strings $X_{1 \dots m}$ and $Y_{1 \dots n}$ , the Levenshtein distance (edit distance) is^[1]

$d(X,Y) = n + m - 2 \cdot \left|\textrm{LCS}(X,Y)\right|$

I don't think this is true. For example, if we take the words RHYTHM and ALGORITHM, then $\left|\textrm{LCS}(X,Y)\right|=4$ so the Levenshtein distance should become $15 - 2 \cdot 4 = 7$ , but it should be 6.

You are right. Good thing you noticed! Klem fra Nils Grimsmo 15:03, 21 December 2006 (UTC)

[edit] Solution for two sequences

I think the following equation is wrong.

$\textrm{LCS}\left(X_{1 \dots i},Y_{1 \dots j}\right) = \begin{cases} \textrm{LCS}\left(X_{1 \dots i-1},Y_{1 \dots j-1}\right) + x_{i} & \mbox{ if } x_i = y_j \\ \end{cases}$

The correct one should be

$\textrm{LCS}\left(X_{1 \dots i},Y_{1 \dots j}\right) = \begin{cases} \textrm{LCS}\left(X_{1 \dots i-1},Y_{1 \dots j-1}\right) + 1 & \mbox{ if } x_i = y_j \\ \end{cases}$

Isn't it? —The preceding unsigned comment was added by 140.115.213.121 (talk) 04:59, 2 February 2007 (UTC).

Sorry, the original one is correct. I misunderstood the equation. —140.115.213.121.

[edit] Needs an example

I think the article could really do with an example demonstrating the LCS between two strings. I'm not sure I understand it well enough to add one, though... Tjwood 10:18, 5 March 2007 (UTC)

I added an example now. Does it make things clearer? Klem fra Nils Grimsmo 08:25, 6 March 2007 (UTC)

[edit] Is the Identity Integers reasoning wrong?

The only advantage I can see identity integers having over hash functions is that they don't have collisions, at the cost of taking longer to calculate. The article currently makes it sound like they have more advantages.. eg:

In source code, there will often be lines that are exactly the same. This could include an empty line between two functions, or a trailing brace, or closing comment, and so on. The algorithm does not need to distinguish between specific instances of these similar lines. That is, one blank line is the same as any other blank line. Therefore, the previous optimization can be taken one step further by reducing each line to an identity integer.

Yet any hash function will yield the same hash value for two lines that are the same, else it's not a hash function. So the above point seems fairly moot..

More importantly, each item is reduced from a string to a unique integer, meaning that each comparison is now an integer comparison.

As above.

Identity integers also lose this property:

Additionally, the randomized nature of hashes and checksums would guarantee that comparisons would short-circuit faster, as lines of source code will rarely be changed at the beginning.

"randomized nature" leads me to believe they're considering a hash stored as a sequence of bytes to be compared one-by-one, rather than a sequence of integers which can be compared more efficiently. What it's saying is that without a hash you typically need to look through many characters to test equality. With a "randomized nature" hash the comparison is faster, typically just one or two bytes/characters of the hash need to be examined, or perhaps just one integer if the hash is encoded as a sequence of integers. It's not really guaranteed to short-circuit though since there may be collisions and for certainty you should fall back to string comparison if the hashes are equal. With identity integers, "randomized nature" is not required since equality is guaranteed to be determined with a single integer comparison, making it most efficient, although in practice the advantage over a hash using integer comparison may be negligible.
Atomota (talk) 08:12, 22 November 2007 (UTC)

So unless I'm misunderstanding something, should the article be changed? (I don't fully understand how the function works yet, so in the case that I'm wrong I'd rather not change/make a mess of things =P, and also have never worked in perl/c++)

Themania 05:20, 20 April 2007 (UTC)

Also I forgot to mention the largest advantage of hash values - you can calculate them just once for a file, and then compare that file against other files. Very useful!

Themania 05:24, 20 April 2007 (UTC)

In traceback since R is {} , we can omit it in the first if .

Talk:Longest common subsequence problem

From Wikipedia, the free encyclopedia

Contents

[edit] Cleanup 2005-08-07

[edit] Number of LCSs

[edit] LCP?

[edit] Notation

[edit] Relation to other problems

[edit] Solution for two sequences

[edit] Needs an example

[edit] Is the Identity Integers reasoning wrong?

Views

Navigation

Interaction

Search