Talk:Levenshtein distance

From Wikipedia, the free encyclopedia

[edit] Incorrect Calculation of 'cost'

"if s[i-1] = t[j-1]" should have been "if s[i] = t[j]", since s and t are declared as "char s[1..m], char t[1..n]", and the loops go from 1 to m and 1 to n. The i-1 and j-1 offsets are needed only in implementations in other languages where string character indexing starts at 0 (e.g. C). Fixed. --Anon

[edit] Incorrect Date

Levenshtein's paper was published in 1966, not 1965. cf. V. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics – Doklady 10, pp. 707-710, 1966. ralian 03:15, 11 April 2007 (UTC)


[edit] Mention Competing Utilities

This page should link to competing utilities such as wdiff, since the Levenshtein distance and diff, wdiff, sdiff, etc. all address the same set of problems (i.e., what is the difference between two strings/files).

[edit] Inconsistent Pseudocode

Why in the pseudocode do int arrays start at index 0 and char arrays start at index 1?

Because it simplifies the algorithm. Anything else would require confusing offsets to be inserted for one array or the other. If you read the correctness section you can see why this numbering fits with the invariants. Deco 02:15, 19 Feb 2005 (UTC)

Not really. It's confusing and inconsistent. Anyone just converting the pseudocode into real code (like I was doing earlier) will end up with a program that generates array out of bound errors. It was easy enough to fix when I looked at what it was actually doing, but is still terribly annoying. The offsets would be less confusing than having arrays start at different indexes.

I agree that the array indexing is confusing. Adding offsets doesn't change the algorithm; it's just another way to show it. The current form is not especially useful given that most languages lack features as nice as "declare int d[0..lenStr1, 0..lenStr2]"; "declare int d[size1,size2]" is much more common. More importantly, you can express code written in the first style using the second, but not the other way around, which makes the second more practical for pseudocode. Jheriko 05:41, 8 March 2006 (UTC)

The purpose of pseudocode is not to make the algorithm easily implementable in your-limited-language-of-choice (if it was, we would write algorithms in it, not pseudocode.) It is to give the clearest explanation of the algorithm, and conceptually, this algorithm indexes strings from one and the distance matrix from 0. As stated in the invariants section, this is because D[0][0] means the distance from the zero-length string to the zero-length string (and the algorithm needs this) and string[..0] must be the zero-length string, not a length-1 string. I've just implemented this in my-limited-language-of-choice, and I thought about it this way (without looking at this page -- I came here to check myself), although of course I didn't implement it that way. On a lighter note, this reminded me of XKCD #163 --Taejo|대조 08:15, 7 September 2007 (UTC)
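
For readers following this thread, the indexing convention can be sketched in Python, where strings are 0-indexed: the 1-based pseudocode character s[i] becomes s[i - 1], while the matrix keeps its 0-based rows and columns so that d[0][0] is the distance between two empty prefixes. This is only an illustrative sketch; the name levenshtein_matrix is hypothetical and does not come from the article.

```python
def levenshtein_matrix(s, t):
    """Return the matrix d, where d[i][j] is the distance between
    the first i characters of s and the first j characters of t."""
    m, n = len(s), len(t)
    # Rows 0..m and columns 0..n: index 0 stands for the empty prefix.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters of the prefix of s
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters of the prefix of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Pseudocode s[i] (1-based) is s[i - 1] in a 0-based language.
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d
```

The distance between the full strings is the bottom-right entry, d[len(s)][len(t)].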

[edit] Haskell Example

[edit] Complicated Haskell

Why is the Haskell code so complicated? Isn't this enough?


editDistance :: Eq a => [a] -> [a] -> Int
editDistance s [] = length s
editDistance [] t = length t
editDistance (s:ss) (t:ts) = minimum [ (if s == t then 0 else 1) + editDistance ss ts,
                                       1 + editDistance ss (t:ts),
                                       1 + editDistance (s:ss) ts ]

Okay, I replaced the original code below with the shorter code above. Jirka

min3 :: Int->Int->Int->Int
min3 x y z = min x (min y z)


cost :: Char->Char->Int
cost a b
    | a == b = 0
    | otherwise = 1


sumCosts :: String->Char->Int
sumCosts [] a = 0
sumCosts (s:ss) a = cost s a + sumCosts ss a


editDistance :: String->String->Int
editDistance [] [] = 0
editDistance s [] = sumCosts s '-'
editDistance [] t = sumCosts t '-'
editDistance (s:ss) (t:ts) = min3 (cost s t + editDistance ss ts)
                                  (cost s '-' + editDistance ss (t:ts))
                                  (cost '-' t + editDistance (s:ss) ts)

[edit] On memoization

Also... I really don't think Haskell memoizes like this page claims it does. That would either take tons of space or one crafty LRU cache.... Could we get a reference for that, outside of Wikipedia?

--Daren Brantley 04:15, 16 July 2005 (UTC)

If you want to verify Haskell's memoization features, visit the Dynamic programming article where the claim is also made. I'm sure somebody could give you a reference to the Haskell standard. --132.198.104.164 21:38, 18 July 2005 (UTC)

On the other hand, I wrote both these statements, so I may just be wrong. Deco 05:14, 20 August 2005 (UTC)

I think the current Haskell implementation is wrong. The Levenshtein algorithm should use no more than O(n²) time, and this is exponential. It's possible to write a memoizing implementation, but Haskell doesn't do it automatically. (The Ruby one is wrong, too, and probably some others.)
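
To illustrate the difference between the plain recursion and an explicitly memoized one, here is a minimal Python sketch. It uses the same three-way recurrence as the short recursive Haskell code, but caches results per pair of suffix positions with functools.lru_cache, so each of the (len(s)+1) × (len(t)+1) subproblems is solved only once. The names edit_distance and go are hypothetical, and this is not the code that was in the article.

```python
from functools import lru_cache


def edit_distance(s, t):
    """Levenshtein distance via explicitly memoized recursion."""

    @lru_cache(maxsize=None)
    def go(i, j):
        # Distance between the suffixes s[i:] and t[j:].
        if i == len(s):
            return len(t) - j  # insert the rest of t
        if j == len(t):
            return len(s) - i  # delete the rest of s
        cost = 0 if s[i] == t[j] else 1
        return min(cost + go(i + 1, j + 1),  # match or substitute
                   1 + go(i + 1, j),         # delete s[i]
                   1 + go(i, j + 1))         # insert t[j]

    return go(0, 0)
```

Without the cache decorator the same function still returns correct results, but takes exponential time. With it, the recursion depth still grows with the input length, so very long strings may hit Python's recursion limit.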

An O(n²) version:

 import Data.Array  -- provides array and (!)

 distance str1 str2 = let
     (l1, l2) = (length str1, length str2)
     istr1 = (0, undefined) : zip [1..] str1
     istr2 = (0, undefined) : zip [1..] str2
     table = array ((0, 0), (l1, l2))
         [ ((i, j), value i c1 j c2) | (i, c1) <- istr1, (j, c2) <- istr2 ]
     value 0 _  j _  = j
     value i _  0 _  = i
     value i c1 j c2 = minimum [ table ! (i-1, j) + 1,
                                 table ! (i, j-1) + 1,
                                 table ! (i-1, j-1) + cost ]
         where cost = if c1 == c2 then 0 else 1
   in
     table ! (l1, l2)

The following version isn't O(n²), because // copies arrays and !! is linear in the element number -- using lazy evaluation as above is the key for solving that:

 distance str1 str2 =
    last $ elems $ foldl update table [ (i,j) | i <- [1..length str1] , j <- [1..length str2] ]
    where table = initial (length str1  , length str2 )
          update table (i,j) =  table // [((i,j),value)]
              where value =
                        minimum [  table ! (i-1 , j)     + 1     -- deletion
                                ,  table ! (i   , j-1)   + 1     -- insertion
                                ,  table ! (i-1 , j-1)   + cost  -- substitution
                                ]
                            where cost = if str1!!(i-1) == str2!!(j-1) then 0 else 1 
 initial (b1,b2) = array ((0,0),(b1,b2)) [ ((i,j), value (i,j)) | i <- [0 .. b1] , j <- [0..b2]]
     where value (0,j) = j
           value (i,0) = i
           value (i,j) = 0

[edit] An O(n) in space, faster, stricter, tail recursive version

Here is an O(n) version, using iteration over an initial row. This is much faster (with GHC; have not tried others) since

  • it is O(n) in space,
  • laziness that GHC's strictness analysis leaves in place is removed at one point using seq,
  • it is tail recursive in the outer iterations
       distance :: String -> String -> Int
       distance s1 s2 = iter s1 s2 [0..length s2] where
               iter (c:cs) s2 row@(e:es) =
                       iter cs s2 (e' : rest e' c s2 row) where
                               e' = e + 1
               iter [] _ row = last row
               iter _ _ _ = error "iter (distance): unexpected arguments"
               rest e c (c2:c2s) (e1:es@(e2:es')) =
                       seq k (k : rest k c c2s es) where
                               k = (min (e1 + if c == c2 then 0 else 1) $
                                       min (e+1) (e2+1))
               rest _ _ [] _ = []
               rest _ _ _ _ = error "rest (distance): unexpected arguments"

-- Abhay Parvate 05:33, 4 July 2006 (UTC)
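
The row-by-row idea behind the O(n)-space version can also be sketched in Python. This is a hedged illustration rather than a translation of the Haskell code: it keeps only the previous row and the row being built, and the name distance_two_rows is hypothetical.

```python
def distance_two_rows(s1, s2):
    """Levenshtein distance keeping only two rows of the table."""
    # Row 0: distance from the empty prefix of s1 to each prefix of s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        cur = [i]  # column 0: delete the first i characters of s1
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            cur.append(min(prev[j - 1] + cost,  # substitution or match
                           prev[j] + 1,         # deletion
                           cur[j - 1] + 1))     # insertion
        prev = cur
    return prev[-1]
```

Memory use is proportional to len(s2), so passing the shorter string as s2 keeps the rows small.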

[edit] Levenshtein's nationality/ethnicity

I don't see any particular reason to point out his ethnicity/religion in this article. If you want to put it in his article, be my guest -- but please provide documentation of some sort.--SarekOfVulcan 21:22, 3 November 2005 (UTC)

I believe the point of the edit was to indicate that Levenshtein was not an ethnic Russian, just a Jewish guy who lived in Russia. As you suggest, I think such fine points of Levenshtein's nationality are best reserved for his own article, if anyone can drag up enough info to write it. Deco 00:23, 4 November 2005 (UTC)
I dragged up enough on the web to stub it.--SarekOfVulcan 00:41, 4 November 2005 (UTC)
It might help when learning to pronounce the name of the algorithm. -- Mikeblas 08:13, 3 January 2006 (UTC)

[edit] Minimality

I can transform "kitten" into "sitting" in two steps:

  • kitten
  • (delete "kitten")
  • sitting (insert "sitting" at end)

Can someone explain why this is not acceptable? Are we talking single-character edits here? --P3d0 17:48, 16 January 2006 (UTC)

I went ahead and added "of a single character" to the intro sentence. Please feel free to revert if I have this wrong. --P3d0 17:51, 16 January 2006 (UTC)
I guess it doesn't hurt to be pedantic. Deco 18:20, 16 January 2006 (UTC)
Heh... Well, if you use "diff" you wind up with a list of edits which are most definitely not single-character edits. Plus, I think I'm a fairly reasonable person, and I didn't realize it was single-character edits until I read the algorithm. --P3d0 23:46, 16 January 2006 (UTC)
Sorry for the confusion, I hope it's clearer now. :-) Deco 00:15, 17 January 2006 (UTC)
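
As a concrete check of the single-character restriction, the following Python sketch fills the usual distance table and then walks back through it to list one minimal sequence of single-character edits. It is only an illustration under that assumption; the function name edit_script and the wording of the operation strings are hypothetical.

```python
def edit_script(s, t):
    """Return one minimal list of single-character edits turning s into t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)

    # Walk back from the bottom-right corner, recording each edit taken.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and s[i - 1] == t[j - 1]
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if same else 1):
            if not same:
                ops.append(f"substitute {s[i - 1]!r} -> {t[j - 1]!r}")
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(f"delete {s[i - 1]!r}")
            i -= 1
        else:
            ops.append(f"insert {t[j - 1]!r}")
            j -= 1
    ops.reverse()
    return ops
```

For "kitten" and "sitting" the returned list has three entries, one per single-character edit, which matches the distance of 3 rather than the two whole-string steps proposed at the top of this section.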

[edit] Implementations

I don't believe the implementations are relevant to the article. Some of them are even wrong - and they certainly aren't more illustrative than the pseudo-code. Currently there are 18 implementations: 2 for C++, C#, 2 for Lisp, Haskell, Java, Python, Ruby, 2 for Scheme, Prolog, VB.NET, Visual FoxPro, Actionscript, Perl, Ocaml and Lua. I vote for removing all of them from the article; if anyone finds these useful, just create a page (like www.99-bottles-of-beer.net) with a sample implementation for each language under the sun, and add a link to it. DanielKO 00:21, 12 July 2006 (UTC)

This has a way of happening on wikis. This is why I started a separate wiki for source code. Moving them there is not an option though due to license conflicts - GFDL-licensed code just isn't very useful. Wikisource no longer accepts source code. I suggest we eradicate them all. Deco 00:37, 12 July 2006 (UTC)

Certainly. Get rid of all those implementations and just leave the pseudocode. I truly don't see the point of all this. Zyxoas (talk to me - I'll listen) 16:49, 12 July 2006 (UTC)

I just removed them all. Maybe the Implementations section should be rewritten. DanielKO 23:25, 12 July 2006 (UTC)

Potential license compatibility downstream is no reason to delete material, even if the material is source code. That's like saying there should be no programming books at WikiBooks released under the GFDL.

If there's a wider proposal to remove all source code from Wikipedia (and hopefully come up with a sister project to move it to) then I'll accept deleting the implementations. If the quality of the implementations is lacking, this can hardly be relevant to their deletion, because even the pseudocode was incorrect at one point.

The implementations provide useful, working examples of the pseudocode for readers. --71.192.61.193 01:54, 13 July 2006 (UTC)

I didn't say license compatibility was a problem. You totally misinterpreted me. I think source code at Wikipedia is totally okay. I would not have removed all of the implementations, maybe just all but 2 or 3. Deco 02:11, 13 July 2006 (UTC)
I don't think that it's reasonable to have that many implementations. The page is "unprintable", because there's too much irrelevant stuff on it (who needs a Prolog implementation anyway?). But I agree that we should be consistent, and choose just n (for small values of n) languages to be used on all algorithm articles; everything else should not be in an encyclopedia. So indeed, let's use this article as an extreme example of how bad the situation may become, and propose a sister project for algorithms. --DanielKO 22:49, 14 July 2006 (UTC)

Well even though it's said here that the implementations were removed, someone must have put them back. I went ahead and removed the Ruby implementation because it didn't work.

I agree that the page is long and unprintable. What if we moved these to separate pages? Something like Levenshtein distance (C++)? Is that poor style, since most parentheticals in an encyclopedia are standardized to things like "album", "film", "computer" and not flavors of programming language? --71.254.12.146 00:55, 17 July 2006 (UTC)

That's certainly not reasonable; but maybe a Levenshtein distance (implementations). I think we should try asking this in other algorithm articles saturated with implementations and see if together we can create some basic guidelines. I would suggest that if an article has more than 3 implementations, they should be moved from the main article. What do you think? --DanielKO 08:57, 17 July 2006 (UTC)

That seems reasonable. We *should* try to standardize and therefore get the input from other articles and their editors. --69.54.29.23 15:01, 17 July 2006 (UTC)

Didn't there used to be more history on this page? What's with all the implementations? This article is just silly. ONE is enough. The technicalities of the Haskell code are so irrelevant. 65.100.248.229 01:43, 27 July 2006 (UTC)

The number of implementations is a known problem. What do you think of the proposal?

I think the technicalities of the haskell code are extremely relevant, but the Haskell implementation is disputed.

I don't see any sign that material was removed from the page's history. --72.92.129.85 03:20, 27 July 2006 (UTC)

I'd like to remove the majority of implementations from this page. Visual FoxPro? Is it really necessary? From a good encyclopedia I would expect pseudocode and a supporting implementation in a well-known language such as C. Can anyone cite me a good standard algorithm text that has implementations in so many languages? If no one has any objection I will remove most of the implementations tomorrow. I am also against an article called Levenshtein distance (implementations); if such an article were to exist in an encyclopedia, its purpose would be to describe existing implementations, not to provide new ones. It is my understanding that in an encyclopedia we should be documenting existing entities, not providing new implementations or producing original research. New299 13:09, 18 September 2006 (UTC)

I've cleaned out many of the major offenders that were either obscure or just hideous (long lines, comment headers). It would be nice to have a C version for sentimentality, but C++ usually subsumes programming examples in most textbooks these days. Should we keep the VisualBasic implementation? --71.169.128.40 23:28, 18 September 2006 (UTC)

There was this deletion: "(23:59, 19 September 2006) Simetrical (→Implementations - Wikipedia is not a code repository. It's especially not a code repository for that not-explicitly-GFDLed Python script.)". I think it is OK not to store code here. But as code may be extremely useful for those who want to implement the algorithm the links should be kept. I remember there was a link to code for python and perl. These links should be restored! --148.6.178.137 07:54, 20 September 2006 (UTC)

I don't think the article now stands as a "mere collection of ... source code" (quoted from WP:NOT). It probably used to. The python script should be just a link anyway, regardless of copyright issues. I've done that and removed the hideous VisualBasic source code since no one made a motion to oppose it. We'll see if they do now. Currently, the page prints in under 10 pages on my system. --69.54.29.23 12:41, 20 September 2006 (UTC)

Seems to me that Visual FoxPro is as valid as the next language: that's why I keep restoring it. If we want to drop the article down to one or two reference implementations (C++, probably), I don't have an issue with it. Since the Haskell implementation looks nothing like the others, I'd vote for keeping that one as well, regardless of how widely-used it is. --SarekOfVulcan 21:09, 9 October 2006 (UTC)

I would allow PL/SQL, Pascal or even COBOL before I had VFP. But none of those are there. Not just because of its market adoption (read "popularity") or lack thereof, but simply because it's not pedagogical or contributing anything except bytes to the article. --71.169.130.172 02:43, 10 October 2006 (UTC)
COBOL I'd buy for historical interest. :-) I don't see how Pascal is more pedagogical than VFP, though.--SarekOfVulcan 22:21, 10 October 2006 (UTC)

The size of the rendered page is about 44kB. If it gets too large, the VFP implementation should probably be first to go. --71.161.222.65 22:21, 23 October 2006 (UTC)

I think this article should include as many implementations as possible. I came here via Google search: 'edit distance' (couldn't recall the spelling of Levenshtein) looking for a Ruby implementation. Wikipedia ideally is a collection of all knowledge, regardless if the concept manifests itself as prose or code. Clearly the different implementations are conceptually different because of the facilities the language provides, and should all be included.

That's actually a common misconception. Wikipedia is not intended as a collection of all knowledge. It is intended as an encyclopedia, which is one form of collection of knowledge. For instance, Wikipedia does not include primary source material (for which see Wikisource), even though primary source material is clearly knowledge.
It is inappropriate to include a large number of implementations here, because it is redundant, distracting, and each additional is unlikely to provide much marginal benefit. It is better, if we need any at all, to prefer one or two in languages which are both (1) reasonably common, so readers are more likely to have seen them; and (2) easy to read for this algorithm, that is, not requiring a lot of hacks or bookkeeping. --FOo 07:35, 15 November 2006 (UTC)
According to Wikipedia:Algorithms_on_Wikipedia#Code_samples:
"...avoid writing sample code unless it contributes significantly to a fundamental understanding of the encyclopedic content"
See also the rest of the arguments there. I feel that this is an algorithm where the implementations contribute very little to the understanding. The implementations should be moved somewhere under wikibooks:Algorithm_implementation. See wikibooks:Algorithm_implementation/Sorting/Quicksort for an excellent example. Klem fra Nils Grimsmo 08:43, 8 December 2006 (UTC)

[edit] Python

def distance(a,b):
   "Calculates the Levenshtein distance between a and b."
   n, m = len(a), len(b)
   d=[[0]*(m+1) for i in range(n+1)]
   for i in range(n+1):
       d[i][0]=i
   for j in range(m+1):
       d[0][j]=j
   cost=0
   for i in range(1,n+1):
       for j in range(1,m+1):
           cost=0
           if a[i-1] != b[j-1]:
               cost=1                        
           delete = d[i-1][j]+1
           add = d[i][j-1]+1
           change = d[i-1][j-1]+cost
           d[i][j] = min(add, delete, change)
           #add the following if statement for Damerau-Levenshtein distance
           #if(i>1 and j>1 and a[i-1]==b[j-2] and a[i-2]==b[j-1]):
           #       d[i][j] = min(d[i][j],d[i-2][j-2]+cost)
   return d[n][m]

This implementation was posted to the article on November 13, 2006.[1]

I don't think it belongs on the page. We have a Ruby, Java, C++ and Perl implementation, already. Does Python deserve an implementation? Python's a popular language, but does it offer a serious pedagogical difference over the others? Seems like more declarative programming language array notation to me. Also, it should be placed in alphabetical order with the rest.

If people do think it belongs, let's have at least one person verify it works, format the syntax, and decide whether the commented out Damerau-Levenshtein should be shown or not--I doubt it. --71.169.130.108 23:56, 13 November 2006 (UTC)

[edit] Haskell

I removed the Haskell example, for two reasons:

  1. it was the only instance of Haskell code on Wikipedia outside the Haskell article itself
  2. it relied on compiler support for memoization, which is "not guaranteed" according to the accompanying text.

If anyone knows of a language, not terribly obscure, in which memoization is guaranteed by the standard, please add an implementation in such a language. I think the clarity of such an implementation would be a great addition to the article.

For more information on my recent edits to example-code articles, and proposal to get rid of a lot of relatively esoteric example code along the lines that other editors have suggested in this section, see Wikipedia talk:WikiProject Programming languages#Category:Articles_with_example_code_proposal_and_call_for_volunteers. --Quuxplusone 01:34, 4 December 2006 (UTC)


[edit] Ruby

Doesn't work, though. Consider "cow" and "crow". Edit distance is 1 (one insertion) but this gives 3. --FOo 19:45, 2 September 2007 (UTC)

Sorry. I didn't think that through. For what it's worth, here's a version which follows the rules. The ragged arrays in Ruby make the code more awkward, so I emulated a multidimensional array with a set of [row,col] values. Sethnic 20:40, 2 September 2007 (UTC)

 def Levenshtein.compute(s1, s2)
   d = {}
   (0..s1.size).each do |row|
     d[[row, 0]] = row
   end
   (0..s2.size).each do |col|
     d[[0, col]] = col
   end
   (1..s1.size).each do |i|
     (1..s2.size).each do |j|
       cost = 0
       if (s1[i-1] != s2[j-1])
         cost = 1
       end
       d[[i, j]] = [d[[i - 1, j]] + 1,
                    d[[i, j - 1]] + 1,
                    d[[i - 1, j - 1]] + cost
                   ].min
     end
   end
   return d[[s1.size, s2.size]]
 end

Two years ago I wrote here a ruby version that was shorter and correct.
http://en.wikipedia.org/w/index.php?title=Levenshtein_distance&diff=prev&oldid=21404938 Why not use it in the article?

def editDistance(a, b)
  case
    when a.empty? then b.length
    when b.empty? then a.length
    else [(a[0] == b[0] ? 0 : 1) + editDistance(a[1..-1], b[1..-1]),
          1 + editDistance(a[1..-1], b),
          1 + editDistance(a, b[1..-1])].min
  end
end

Jonas Fagundes (talk) 23:01, 19 December 2007 (UTC)

Precisely because if we allow for one language, then everybody will want an implementation in their favorite language, and then this page becomes just a God awful glob of code, and any additional and useful non-code information may be lost in the mess. If PERHAPS each live example were thoroughly explained (esp. in terms of advantages, security, random qualities, etc), then I think it's justifiable that we could add one. As it stands, I have no problem with all kinds of implementations being on the talk page. (primarily for the reason above). Some additional ideas are found in my response "RE: Implementations." below. I believe there is also a link on the article to add your own version to a wikimedia related page. (IMO multiple versions ought to be allowable, but who knows what the guys over there would do if you added yours and there's already a ruby implementation.) Of course, I'm hoping here that our pseudocode version is correct. Root4(one) 00:06, 20 December 2007 (UTC)
Right. It's http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance. Looks like there already is a Ruby implementation. But your solution, if it is indeed correct, looks more elegant to me (but it does appear to use a strange Ruby-centric syntax). I'd vote for putting it in there, if my vote counted (which, of course, it might). Root4(one) 00:12, 20 December 2007 (UTC)
I tested it two years ago. But any proof that it is wrong or any improvement will be appreciated. It works with any sequence (the container should implement [] and the elements should implement equals) not just strings.

Jonas Fagundes (talk) 02:54, 27 December 2007 (UTC)

[edit] Common Lisp version broken

Using the Common Lisp implementation described in the article, I get notably wrong results:

[11]> (levenshtein-distance "cow" "cow")
3

Whoever added this should probably fix it. The Levenshtein distance from any string to itself should be zero. --FOo 01:11, 28 August 2006 (UTC)

Thanks for reporting this. I've deleted until it can be fixed. It wasn't a very elegant implementation anyway, combined with it having been translated from the Python implementation. I'm sure somebody will be able to contribute something more useful in the future. --71.161.221.31 03:55, 28 August 2006 (UTC)

[edit] Dynamic Time Warping

Can someone please explain the difference between DTW and Levenshtein distance?

The algorithms look almost identical. Perhaps a translation of both into C would clarify this?
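
As one possible way to see the difference, here is a side-by-side sketch in Python (not C). It assumes the common textbook form of dynamic time warping over numeric sequences with absolute difference as the local cost; that choice and the names levenshtein and dtw are assumptions, not something taken from either article. Both fill a table from three neighbouring cells, but Levenshtein charges 1 for each insertion or deletion, while DTW charges nothing for repeating an element and only sums the local differences along the alignment path.

```python
def levenshtein(a, b):
    """Edit distance: every insertion, deletion or substitution costs 1."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[n][m]


def dtw(x, y):
    """DTW cost (assumed form): sum of |x[i] - y[j]| along the best path."""
    inf = float("inf")
    n, m = len(x), len(y)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            # No fixed penalty for stretching: only the local cost is added.
            d[i][j] = local + min(d[i - 1][j],
                                  d[i][j - 1],
                                  d[i - 1][j - 1])
    return d[n][m]
```

For example, with x = [1, 2, 3] and y = [1, 2, 2, 3], levenshtein counts one insertion, whereas dtw aligns the single 2 with both 2s at no cost.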

[edit] RE: Implementations.

I removed the implementations and replaced them with a link even though we already had one. It seems they were ignoring it anyway. I didn't even look at the source code to see how good the implementations were... I'm not sure I care. I'm not against people sharing their implementations, even multiple implementations per language, especially if one happens to use O(m) space instead of O(mn), or if there is some advantage to a particular implementation over another. If two implementations in the article were compared and contrasted, with benefits and disadvantages for both, that might be encyclopedic and useful for a more in-depth investigation of the calculation of this metric. But all I saw was just code. And I'm not even sure it was good code at that. Root4(one) 03:17, 28 September 2007 (UTC)