Talk:Levenshtein distance
From Wikipedia, the free encyclopedia
Inconsistent Pseudocode
Why in the pseudocode do int arrays start at index 0 and char arrays start at index 1?
- Because it simplifies the algorithm. Anything else would require confusing offsets to be inserted for one array or the other. If you read the correctness section you can see why this numbering fits with the invariants. Deco 02:15, 19 Feb 2005 (UTC)
Not really. It's confusing and inconsistent. Anyone just converting the pseudocode into real code (like I was doing earlier) will end up with a program that generates array-out-of-bounds errors. It was easy enough to fix when I looked at what it was actually doing, but it is still terribly annoying. The offsets would be less confusing than having arrays start at different indexes.
I agree that the array indexing is confusing. Adding offsets doesn't change the algorithm; it's just another way to show it. The current form is not especially useful, given that most languages lack features as nice as "declare int d[0..lenStr1, 0..lenStr2]"; "declare int d[size1,size2]" is much more common. More importantly, though, you can express code written in the first style using the second, but not the other way around, which makes the second more practical for pseudocode. Jheriko 05:41, 8 March 2006 (UTC)
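To make the offset question concrete, here is a minimal Python sketch of how the pseudocode's 0..lenStr table and 1-based strings could map onto ordinary 0-based arrays. The function name `levenshtein` and all variable names are my own choices for illustration, not taken from the article. The table gets one extra row and column, and the pseudocode's character i becomes `s[i - 1]`.

```python
def levenshtein(s, t):
    """Levenshtein distance using only 0-based arrays (hypothetical helper name)."""
    m, n = len(s), len(t)
    # Pseudocode "int d[0..m, 0..n]" needs (m + 1) x (n + 1) cells.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete the first i characters of s
    for j in range(n + 1):
        d[0][j] = j          # insert the first j characters of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # The pseudocode's 1-based s[i] is s[i - 1] in a 0-based string.
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]
```

The only places where the two numbering schemes meet are the table allocation and the character comparison, which is probably why a literal transcription of the pseudocode tends to fail with an out-of-bounds access on exactly those lines.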
Haskell Example
Complicated Haskell
Why is the Haskell code so complicated? Isn't this enough?
    editDistance :: Eq a => [a] -> [a] -> Int
    editDistance s [] = length s
    editDistance [] t = length t
    editDistance (s:ss) (t:ts) = minimum [ (if s == t then 0 else 1) + editDistance ss ts,
                                           1 + editDistance ss (t:ts),
                                           1 + editDistance (s:ss) ts ]
Okay, I replaced the original code below with the shorter code above. Jirka
    min3 :: Int->Int->Int->Int
    min3 x y z = min x (min y z)

    cost :: Char->Char->Int
    cost a b
        | a == b = 0
        | otherwise = 1

    sumCosts :: String->Char->Int
    sumCosts [] a = 0
    sumCosts (s:ss) a = cost s a + sumCosts ss a

    editDistance :: String->String->Int
    editDistance [] [] = 0
    editDistance s [] = sumCosts s '-'
    editDistance [] t = sumCosts t '-'
    editDistance (s:ss) (t:ts) = min3 (cost s t + editDistance ss ts)
                                      (cost s '-' + editDistance ss (t:ts))
                                      (cost '-' t + editDistance (s:ss) ts)
On memoization
Also... I really don't think Haskell memoizes like this page claims it does. That would either take tons of space or one crafty LRU cache.... Could we get a reference for that, outside of Wikipedia?
--Daren Brantley 04:15, 16 July 2005 (UTC)
If you want to verify Haskell's memoization features, visit the Dynamic programming article where the claim is also made. I'm sure somebody could give you a reference to the Haskell standard. --132.198.104.164 21:38, 18 July 2005 (UTC)
- On the other hand, I wrote both these statements, so I may just be wrong. Deco 05:14, 20 August 2005 (UTC)
I think the current Haskell implementation is wrong. The Levenshtein algorithm should use no more than O(n²) time, and this is exponential. It's possible to write a memoizing implementation, but Haskell doesn't do it automatically. (The Ruby one is wrong, too, and probably some others.)
An O(n²) version:
    import Data.Array

    distance str1 str2 =
        let (l1, l2) = (length str1, length str2)
            istr1 = (0, undefined) : zip [1..] str1
            istr2 = (0, undefined) : zip [1..] str2
            -- lazily built table: each cell refers to its neighbours
            table = array ((0, 0), (l1, l2))
                      [ ((i, j), value i c1 j c2) | (i, c1) <- istr1, (j, c2) <- istr2 ]
            value 0 _ j _ = j
            value i _ 0 _ = i
            value i c1 j c2 = minimum [ table ! (i-1, j) + 1,
                                        table ! (i, j-1) + 1,
                                        table ! (i-1, j-1) + cost ]
                where cost = if c1 == c2 then 0 else 1
        in table ! (l1, l2)
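For comparison, the point that memoization has to be written explicitly can be sketched in Python. This is only an illustration under my own naming (`edit_distance` and the inner helper `go` are hypothetical), using the standard library's `functools.lru_cache` as the memo table. Without the decorator, the inner function is the plain exponential recursion; with it, each pair of suffix positions is computed only once.

```python
from functools import lru_cache


def edit_distance(s, t):
    """Recursive definition with an explicit memo table (names are hypothetical)."""

    @lru_cache(maxsize=None)
    def go(i, j):
        # Distance between the suffixes s[i:] and t[j:].
        if i == len(s):
            return len(t) - j
        if j == len(t):
            return len(s) - i
        cost = 0 if s[i] == t[j] else 1
        return min(cost + go(i + 1, j + 1),  # match or substitute
                   1 + go(i + 1, j),          # delete s[i]
                   1 + go(i, j + 1))          # insert t[j]

    return go(0, 0)
```

There are at most (len(s) + 1) * (len(t) + 1) distinct argument pairs, so the cached version does a quadratic amount of work. The recursion depth grows with the string lengths, so this sketch is only suitable for short inputs.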
The following version isn't O(n²), because // copies arrays and !! is linear in the element number -- using lazy evaluation as above is the key for solving that:
    import Data.Array

    distance str1 str2 =
        last $ elems $ foldl update table
            [ (i,j) | i <- [1..length str1], j <- [1..length str2] ]
      where
        table = initial (length str1, length str2)
        update table (i,j) = table // [((i,j), value)]
          where
            value = minimum [ table ! (i-1, j)   + 1     -- deletion
                            , table ! (i,   j-1) + 1     -- insertion
                            , table ! (i-1, j-1) + cost  -- substitution
                            ]
              where cost = if str1!!(i-1) == str2!!(j-1) then 0 else 1
        initial (b1,b2) = array ((0,0),(b1,b2))
                            [ ((i,j), value (i,j)) | i <- [0 .. b1], j <- [0..b2] ]
          where
            value (0,j) = j
            value (i,0) = i
            value (i,j) = 0
An O(n) in space, faster, stricter, tail recursive version
Here is an O(n) version, using iteration over an initial row. This is much faster (with GHC; have not tried others) since
- it is O(n) in space,
- laziness left behind by GHC's strictness analysis is removed at one point using seq,
- it is tail recursive in the outer iterations
    distance :: String -> String -> Int
    distance s1 s2 = iter s1 s2 [0..length s2]
      where
        iter (c:cs) s2 row@(e:es) = iter cs s2 (e' : rest e' c s2 row)
          where e' = e + 1
        iter [] _ row = last row
        iter _ _ _ = error "iter (distance): unexpected arguments"
        rest e c (c2:c2s) (e1:es@(e2:es')) = seq k (k : rest k c c2s es)
          where k = (min (e1 + if c == c2 then 0 else 1) $ min (e+1) (e2+1))
        rest _ _ [] _ = []
        rest _ _ _ _ = error "rest (distance): unexpected arguments"
-- Abhay Parvate 05:33, 4 July 2006 (UTC)
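The same row-by-row idea can be sketched in Python for readers who don't read Haskell. This is an illustration under my own naming, not a translation anyone has benchmarked. It keeps only the previous row, so the extra space is proportional to the length of the second string. Python evaluates eagerly, so nothing corresponding to seq is needed here.

```python
def distance(s1, s2):
    """Row-by-row Levenshtein distance keeping only one previous row."""
    row = list(range(len(s2) + 1))      # distances from "" to each prefix of s2
    for c in s1:
        new_row = [row[0] + 1]          # first column: delete every character so far
        for j, c2 in enumerate(s2, start=1):
            new_row.append(min(row[j - 1] + (0 if c == c2 else 1),  # substitute or match
                               new_row[j - 1] + 1,                  # insert c2
                               row[j] + 1))                         # delete c
        row = new_row
    return row[-1]
```

In terms of the Haskell version above, `row[j - 1]` plays the role of e1, `row[j]` of e2, and `new_row[j - 1]` of the previously computed entry e.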
Levenshtein's nationality/ethnicity
I don't see any particular reason to point out his ethnicity/religion in this article. If you want to put it in his article, be my guest -- but please provide documentation of some sort.--SarekOfVulcan 21:22, 3 November 2005 (UTC)
- I believe the point of the edit was to indicate that Levenshtein was not an ethnic Russian, just a Jewish guy who lived in Russia. As you suggest, I think such fine points of Levenshtein's nationality are best reserved for his own article, if anyone can drag up enough info to write it. Deco 00:23, 4 November 2005 (UTC)
- I dragged up enough on the web to stub it.--SarekOfVulcan 00:41, 4 November 2005 (UTC)
- It might help when learning to pronounce the name of the algorithm. -- Mikeblas 08:13, 3 January 2006 (UTC)
Minimality
I can transform "kitten" into "sitting" in two steps:
- kitten
- (delete "kitten")
- sitting (insert "sitting" at end)
Can someone explain why this is not acceptable? Are we talking single-character edits here? --P3d0 17:48, 16 January 2006 (UTC)
- I went ahead and added "of a single character" to the intro sentence. Please feel free to revert if I have this wrong. --P3d0 17:51, 16 January 2006 (UTC)
- I guess it doesn't hurt to be pedantic. Deco 18:20, 16 January 2006 (UTC)
- Heh... Well, if you use "diff" you wind up with a list of edits which are most definitely not single-character edits. Plus, I think I'm a fairly reasonable person, and I didn't realize it was single-character edits until I read the algorithm. --P3d0 23:46, 16 January 2006 (UTC)
- Sorry for the confusion, I hope it's clearer now. :-) Deco 00:15, 17 January 2006 (UTC)
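To illustrate what "single-character edits" means for this example, here is a small Python sketch that turns "kitten" into "sitting" one character at a time. The helper names `substitute` and `insert` are hypothetical, introduced only for this illustration.

```python
def substitute(s, i, ch):
    """Replace the character at index i with ch."""
    return s[:i] + ch + s[i + 1:]


def insert(s, i, ch):
    """Insert ch before index i (i == len(s) appends)."""
    return s[:i] + ch + s[i:]


steps = ["kitten"]
steps.append(substitute(steps[-1], 0, "s"))  # kitten -> sitten
steps.append(substitute(steps[-1], 4, "i"))  # sitten -> sittin
steps.append(insert(steps[-1], 6, "g"))      # sittin -> sitting
edit_count = len(steps) - 1                  # three single-character edits
```

Each step changes exactly one character, and three such steps are needed, which is why the distance between the two words is 3 rather than 2.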
Implementations
I don't believe the implementations are relevant to the article. Some of them are even wrong, and they certainly aren't more illustrative than the pseudocode. Currently there are 18 implementations: 2 for C++, C#, 2 for Lisp, Haskell, Java, Python, Ruby, 2 for Scheme, Prolog, VB.NET, Visual FoxPro, Actionscript, Perl, OCaml and Lua. I vote for removing all of them from the article; if anyone finds these useful, just create a page (like www.99-bottles-of-beer.net) with a sample implementation for each language under the sun, and add a link to it. DanielKO 00:21, 12 July 2006 (UTC)
- This has a way of happening on wikis. This is why I started a separate wiki for source code. Moving them there is not an option though due to license conflicts - GFDL-licensed code just isn't very useful. Wikisource no longer accepts source code. I suggest we eradicate them all. Deco 00:37, 12 July 2006 (UTC)
Certainly. Get rid of all those implementations and just leave the pseudocode. I truly don't see the point of all this. Zyxoas (talk to me - I'll listen) 16:49, 12 July 2006 (UTC)
I just removed them all. Maybe the Implementations section should be rewritten. DanielKO 23:25, 12 July 2006 (UTC)
Potential license compatibility problems downstream are no reason to delete material, even if the material is source code. That's like saying there should be no programming books at Wikibooks released under the GFDL.
If there's a wider proposal to remove all source code from Wikipedia (and hopefully come up with a sister project to move it to) then I'll accept deleting the implementations. If the quality of the implementations is lacking, this can hardly be relevant to their deletion, because even the pseudocode was incorrect at one point.
The implementations provide useful, working examples of the pseudocode for readers. --71.192.61.193 01:54, 13 July 2006 (UTC)
- I didn't say license compatibility was a problem. You totally misinterpreted me. I think source code at Wikipedia is totally okay. I would not have removed all of the implementations, maybe just all but 2 or 3. Deco 02:11, 13 July 2006 (UTC)
- I don't think that it's reasonable to have that many implementations. The page is "unprintable" because there's too much irrelevant stuff on it (who needs a Prolog implementation anyway?). But I agree that we should be consistent, and choose just n (for small values of n) languages to be used on all algorithm articles; everything else should not be in an encyclopedia. So indeed, let's use this article as an extreme example of how bad the situation can become, and propose a sister project for algorithms. --DanielKO 22:49, 14 July 2006 (UTC)
Well even though it's said here that the implementations were removed, someone must have put them back. I went ahead and removed the Ruby implementation because it didn't work.
I agree that the page is long and unprintable. What if we moved these to separate pages? Something like Levenshtein distance (C++)? Is that poor style, since most parentheticals in an encyclopedia are standardized to things like "album", "film", "computer" and not flavors of programming language? --71.254.12.146 00:55, 17 July 2006 (UTC)
- That's certainly not reasonable; but maybe a Levenshtein distance (implementations). I think we should try asking this in other algorithm articles saturated with implementations and see if together we can create some basic guidelines. I would suggest that if an article has more than 3 implementations, they should be moved from the main article. What do you think? --DanielKO 08:57, 17 July 2006 (UTC)
That seems reasonable. We *should* try to standardize and therefore get the input from other articles and their editors. --69.54.29.23 15:01, 17 July 2006 (UTC)
Didn't there used to be more history on this page? What's with all the implementations? This article is just silly. ONE is enough. The technicalities of the Haskell code are so irrelevant. 65.100.248.229 01:43, 27 July 2006 (UTC)
The number of implementations is a known problem. What do you think of the proposal?
I think the technicalities of the Haskell code are extremely relevant, but the Haskell implementation is disputed.
I don't see any sign that material has been removed from the page's history. --72.92.129.85 03:20, 27 July 2006 (UTC)
I'd like to remove the majority of implementations from this page. Visual FoxPro? Is it really necessary? From a good encyclopedia I would expect pseudocode and a supporting implementation in a well-known language such as C. Can anyone cite a good standard algorithms text that has implementations in so many languages? If no one has any objection I will remove most of the implementations tomorrow. I am also against an article called Levenshtein distance (implementations); if such an article were to exist in an encyclopedia, its purpose would be to describe existing implementations, not to provide new ones. It is my understanding that in an encyclopedia we should be documenting existing entities, not providing new implementations or producing original research. New299 13:09, 18 September 2006 (UTC)
I've cleaned out many of the major offenders that were either obscure or just hideous (long lines, comment headers). It would be nice to have a C version for sentimentality, but C++ usually subsumes programming examples in most textbooks these days. Should we keep the Visual Basic implementation? --71.169.128.40 23:28, 18 September 2006 (UTC)
There was this deletion: "(23:59, 19 September 2006) Simetrical (→Implementations - Wikipedia is not a code repository. It's especially not a code repository for that not-explicitly-GFDLed Python script.)". I think it is OK not to store code here. But as code may be extremely useful for those who want to implement the algorithm, the links should be kept. I remember there was a link to code for Python and Perl. These links should be restored! --148.6.178.137 07:54, 20 September 2006 (UTC)
I don't think the article now stands as a "mere collection of ... source code" (quoted from WP:NOT). It probably used to. The Python script should be just a link anyway, regardless of copyright issues. I've done that and removed the hideous Visual Basic source code since no one made a motion to oppose it. We'll see if they do now. Currently, the page prints in under 10 pages on my system. --69.54.29.23 12:41, 20 September 2006 (UTC)
Seems to me that Visual FoxPro is as valid as the next language: that's why I keep restoring it. If we want to drop the article down to one or two reference implementations (C++, probably), I don't have an issue with it. Since the Haskell implementation looks nothing like the others, I'd vote for keeping that one as well, regardless of how widely-used it is. --SarekOfVulcan 21:09, 9 October 2006 (UTC)
- I would allow PL/SQL, Pascal or even COBOL before I had VFP. But none of those are there. Not just because of its market adoption (read "popularity") or lack thereof, but simply because it's not pedagogical or contributing anything except bytes to the article. --71.169.130.172 02:43, 10 October 2006 (UTC)
The size of the rendered page is about 44kB. If it gets too large, the VFP implementation should probably be first to go. --71.161.222.65 22:21, 23 October 2006 (UTC)
I think this article should include as many implementations as possible. I came here via Google search: 'edit distance' (couldn't recall the spelling of Levenshtein) looking for a Ruby implementation. Wikipedia ideally is a collection of all knowledge, regardless if the concept manifests itself as prose or code. Clearly the different implementations are conceptually different because of the facilities the language provides, and should all be included.
- That's actually a common misconception. Wikipedia is not intended as a collection of all knowledge. It is intended as an encyclopedia, which is one form of collection of knowledge. For instance, Wikipedia does not include primary source material (for which see Wikisource), even though primary source material is clearly knowledge.
- It is inappropriate to include a large number of implementations here, because it is redundant and distracting, and each additional one is unlikely to provide much marginal benefit. It is better, if we need any at all, to prefer one or two in languages which are both (1) reasonably common, so readers are more likely to have seen them; and (2) easy to read for this algorithm, that is, not requiring a lot of hacks or bookkeeping. --FOo 07:35, 15 November 2006 (UTC)
- According to Wikipedia:Algorithms_on_Wikipedia#Code_samples:
- "...avoid writing sample code unless it contributes significantly to a fundamental understanding of the encyclopedic content"
- See also the rest of the arguments there. I feel that this is an algorithm where the implementations contribute very little to the understanding. The implementations should be moved somewhere under wikibooks:Algorithm_implementation. See wikibooks:Algorithm_implementation/Sorting/Quicksort for an excellent example. Klem fra Nils Grimsmo 08:43, 8 December 2006 (UTC)
Python
    def distance(a, b):
        "Calculates the Levenshtein distance between a and b."
        n, m = len(a), len(b)
        d = [[0]*(m+1) for i in range(n+1)]
        for i in range(n+1):
            d[i][0] = i
        for j in range(m+1):
            d[0][j] = j
        cost = 0
        for i in range(1, n+1):
            for j in range(1, m+1):
                cost = 0
                if a[i-1] != b[j-1]:
                    cost = 1
                delete = d[i-1][j]+1
                add = d[i][j-1]+1
                change = d[i-1][j-1]+cost
                d[i][j] = min(add, delete, change)
                # add the following if statement for Damerau-Levenshtein distance
                # if (i>1 and j>1 and a[i-1]==b[j-2] and a[i-2]==b[j-1]):
                #     d[i][j] = min(d[i][j], d[i-2][j-2]+cost)
        return d[n][m]
This implementation was posted to the article on November 13, 2006.[1]
I don't think it belongs on the page. We have a Ruby, Java, C++ and Perl implementation, already. Does Python deserve an implementation? Python's a popular language, but does it offer a serious pedagogical difference over the others? Seems like more declarative programming language array notation to me. Also, it should be placed in alphabetical order with the rest.
If people do think it belongs, let's have at least one person verify it works, format the syntax, and decide whether the commented out Damerau-Levenshtein should be shown or not--I doubt it. --71.169.130.108 23:56, 13 November 2006 (UTC)
Haskell
I removed the Haskell example, for two reasons:
- it was the only instance of Haskell code on Wikipedia outside the Haskell article itself
- it relied on compiler support for memoization, which is "not guaranteed" according to the accompanying text.
If anyone knows of a language, not terribly obscure, in which memoization is guaranteed by the standard, please add an implementation in such a language. I think the clarity of such an implementation would be a great addition to the article.
For more information on my recent edits to example-code articles, and proposal to get rid of a lot of relatively esoteric example code along the lines that other editors have suggested in this section, see Wikipedia talk:WikiProject Programming languages#Category:Articles_with_example_code_proposal_and_call_for_volunteers. --Quuxplusone 01:34, 4 December 2006 (UTC)
Common Lisp version broken
Using the Common Lisp implementation described in the article, I get notably wrong results:
    [11]> (levenshtein-distance "cow" "cow")
    3
Whoever added this should probably fix it. The Levenshtein distance from any string to itself should be zero. --FOo 01:11, 28 August 2006 (UTC)
Thanks for reporting this. I've deleted it until it can be fixed. It wasn't a very elegant implementation anyway, and it had been translated from the Python implementation. I'm sure somebody will be able to contribute something more useful in the future. --71.161.221.31 03:55, 28 August 2006 (UTC)
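A report like the one above suggests a cheap sanity check that any posted implementation could be run through before it goes into the article. The Python sketch below is only an illustration: `check_basic_properties`, `broken` and `reference` are hypothetical names, and `broken` is a made-up stand-in that happens to return 3 for "cow" and "cow"; it is not the removed Lisp code.

```python
def check_basic_properties(dist, words):
    """Return a list of human-readable failures for a distance function."""
    failures = []
    for a in words:
        if dist(a, a) != 0:
            failures.append(f"dist({a!r}, {a!r}) = {dist(a, a)}, expected 0")
        if dist(a, "") != len(a):
            failures.append(f"dist({a!r}, '') = {dist(a, '')}, expected {len(a)}")
        for b in words:
            if dist(a, b) != dist(b, a):
                failures.append(f"dist({a!r}, {b!r}) is not symmetric")
    return failures


def broken(a, b):
    # Hypothetical buggy implementation; NOT the removed Lisp code.
    return max(len(a), len(b))


def reference(a, b):
    # Compact row-based Levenshtein distance used as the known-good case.
    row = list(range(len(b) + 1))
    for x in a:
        prev, row = row, [row[0] + 1]
        for j, y in enumerate(b, start=1):
            row.append(min(prev[j - 1] + (x != y), row[j - 1] + 1, prev[j] + 1))
    return row[-1]
```

Calling `check_basic_properties(broken, ["cow"])` reports the identity failure, while the same call with `reference` returns an empty list. Passing these checks does not prove an implementation correct; it only catches obvious breakage such as a non-zero distance from a string to itself.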