Talk:Stemming algorithm
From Wikipedia, the free encyclopedia
Redirect to Stemming
- I feel this topic would be best addressed under the Stemming topic, which currently redirects here. Instead, I would like to see this article and Stemmer redirect to Stemming, and then define and describe stemming algorithms under Stemming. Note that the other-language editions use the single topic of Stemming. Major commercial applications, the most popular and likely point of reference for readers, call the technology 'stemming'. I also think it makes more sense because stemming is a process. Please let me know your thoughts. Josh Froelich 20:10, 11 December 2006 (UTC)
Merge with Stemmer
- I agree this should be merged. I would also prefer to get rid of the commercial references to Google or any software. Google did not invent stemming or do anything revolutionary to it. User:Jfroelich 12/02/2006
- The contents of the Stemmer topic are now merged into this article. This involved deleting one or two duplicate further-reading links, merging the three sentences on lemmas, merging the two introductory sentences, merging the examples, separating out the history, and merging the language-specific challenges. I also took the liberty of adding a short note on stemming error. Josh Froelich 20:10, 11 December 2006 (UTC)
Usage in commercial software
- I think that suggesting Firefox's "Find in the Page" feature uses a stemming algorithm is perhaps a bit misleading. While searching for "fish" returns results for "fishing", it appears, to me at least, to be merely a substring match ("fish" is a substring of "fishing"). Had it used a stemming algorithm, then searching for "mice" would return results for "mouse" as well. - Unknown user
- I'm not an expert on this subject, but it seems as though that bit of information is misleading, if not a complete fallacy. - Unknown user
- 90% of the popular stemming algorithms make the same mistake of not recognizing that mice and mouse share a stem. This is referred to as an understemming error. Two popular techniques to reduce this error are to use a rule-based algorithm with rules devoted to such exceptions, or to use a 'stemming exceptions dictionary', which is simply a list (or hash table or associative array) of the exceptions and their correct stems that is checked before the linguistic part of the algorithm runs. In the case of Firefox, I did not bother to look (though I am a Firefox user!), but most likely it is not using an exceptions dictionary, probably just a basic linguistic stemmer, which routinely produces the type of error you mention. It should be noted that some do not consider this an error at all: they exclude special cases like mice and mouse from the goals of a stemmer because such cases are rare and insignificant. Any analyst knows that only certain models are biased by anomalies in the data. Does it really matter that the weight is off for mouse? Maybe, but probably not. It did in your case.
I personally would have implemented a stemming exceptions list, but there are tradeoffs with this as well. The list takes time to query. Millions upon millions of words need to be stemmed, and the stemming of each word needs to complete quickly. The lookup in the exceptions list adds only microseconds to each call, but that adds up quickly. So the question becomes which is more important: the time it takes to stem each word, or the stemming accuracy as measured by the reduction in understemming error. There is a second problem inherent in any list-based approach: maintaining such a list is a practical nightmare, and exhaustiveness for any language is an unrealistic goal. In the end you come back to the same question: is it really worth worrying about minor exception cases like mice and mouse? That is why many do not include such a list.
There is also the problem of what is popular. The Porter stemmer, arguably the most popular, is outdated (its author now works on newer stuff), and it never included such a list. There are no well-known exception lists per language. Unfortunately, this page is very commercially oriented and displays a lack of understanding of the subject matter; it should be improved. Josh Froelich 03:13, 11 December 2006 (UTC)
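To make the distinction in this thread concrete, here is a minimal Python sketch of the three behaviours discussed: a plain substring match, a naive suffix-stripping stemmer, and the same stemmer with an exceptions dictionary checked first. Everything in it is an assumption for illustration: the names `STEM_EXCEPTIONS`, `SUFFIXES`, `strip_suffix`, `stem`, and `substring_match` are hypothetical, the suffix rules are a toy stand-in rather than the Porter algorithm, and nothing here reflects how Firefox is actually implemented.

```python
# Hypothetical exceptions dictionary: irregular forms mapped to their stems.
STEM_EXCEPTIONS = {
    "mice": "mouse",
    "geese": "goose",
    "feet": "foot",
}

# Toy suffix list, tried in order; a stand-in for a real rule-based stemmer.
SUFFIXES = ("ing", "es", "s")


def strip_suffix(word: str) -> str:
    """Naive suffix stripping; illustrative only, not the Porter algorithm."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem(word: str, use_exceptions: bool = True) -> str:
    """Check the exceptions dictionary first, then fall back to the rules."""
    word = word.lower()
    if use_exceptions and word in STEM_EXCEPTIONS:
        return STEM_EXCEPTIONS[word]
    return strip_suffix(word)


def substring_match(query: str, text: str) -> bool:
    """Plain substring search: no stemming involved at all."""
    return query.lower() in text.lower()


if __name__ == "__main__":
    # Substring search finds "fish" in "fishing" but cannot relate mice/mouse.
    print(substring_match("fish", "fishing"), substring_match("mice", "mouse"))
    # Without the dictionary, "mice" and "mouse" get different stems.
    print(stem("mice", use_exceptions=False), stem("mouse", use_exceptions=False))
    # With the dictionary, both map to the same stem.
    print(stem("mice"), stem("mouse"))
```

The dictionary lookup is a single hash probe per word, which is the per-word cost mentioned above; the maintenance burden is visible in the fact that `STEM_EXCEPTIONS` covers only the three entries someone thought to add.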