Stemming algorithm

From Wikipedia, the free encyclopedia

A stemming algorithm, or stemmer, is a computer program or algorithm for reducing inflected (or sometimes derived) words to their stem, base or root form — generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. The algorithm has been a long-standing problem in computer science; the first paper on the subject was published in 1968. The process of stemming, often called conflation, is useful in search engines for query expansion or indexing and other natural language processing problems.

A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word, "fish".

Contents

[edit] Algorithms and Methods

There are several types of stemming algorithms. Some techniques used are suffix stripping and lookup table replacement.

While much of the work in this area has focused on the English language (with significant use of the Porter Stemmer algorithm), other languages have been investigated including at least French, Italian, Spanish, Portuguese, German, Dutch, Swedish, Norwegian, Danish, Russian, Finnish, Hebrew, and Arabic. Hebrew and Arabic are still considered difficult research languages for stemming. English stemmers are fairly trivial (with only occasional problems, such as "dries" being the third-person singular present form of the verb "dry", "axes" being the plural of "axe" as well as "axis"); but stemmers become harder to design as the morphology, orthography, and character encoding of the target language becomes more complex. For example, an Italian stemmer is more complex than an English one (because of more possible verb inflections), a Russian one is more complex (more possible noun declensions), a Hebrew one is even more complex (due to non-catenative morphology and a writing system without vowels), and so on. On the other hand, stemmers for true isolating languages such as Vietnamese can be even simpler than those for English.

A more complex approach to the problem of determining a stem of a word is lemmatisation. This process involves first determining the part of speech of a word, and applying different normalization rules for each part of speech. The part of speech is first detected prior to attempting to find the root since for some languages, the stemming rules change depending on a word's part of speech.

[edit] Stemming Errors

There are two error measurements in stemming algorithms, overstemming and understemming. Overstemming is an error where two separate inflected words are stemmed to the same root, but should not have. Understemming is an error where two separate inflected words should be stemmed to the same root, but are not. Stemming algorithms attempt to minimize each type of error, although reducing one type can lead to increasing the other.

[edit] History

The first ever published stemmer was written by Julie Beth Lovins in 1968. [1] This paper was remarkable for its early date, and had great influence on later work in this area.

A later stemmer was written by Martin Porter, and published in the July 1980 issue of the journal Program. This stemmer became very widely used, and became the de-facto standard algorithm used for English stemming. Dr. Porter received the Tony Kent Strix award in 2000 for his work on stemming and information retrieval.

Many implementations of this algorithm were written and freely distributed. Unfortunately, many of these implementations contained subtle flaws, and as a result systems using these stemmers performed less well than they ought. To eliminate this source of error, around the year 2000 Martin Porter released an official free-software implementation of the algorithm. Over the next few years, he extended this work by building Snowball, a framework for writing stemming algorithms, and implemented an improved English stemmer together with stemmers for several other languages.

[edit] Use and effectiveness in information retrieval

Stemmers are common elements in query systems such as Web search engines, since a user who runs a query on "daffodils" would probably also be interested in documents that contain the word "daffodil" (without the s). The effectiveness of stemming for English query systems were soon found to be rather limited, however, and this has led early Information retrieval researchers to deem stemming irrelevant in general.[2]. An alternative approach, based on searching for n-grams rather than stems, may be used instead. Also, recent research has shown greater benefits for retrieval in other languages.[3][4]

[edit] Usage in commercial products

Google search adopted word stemming in 2003.[5] Previously a search for "fish" would not have returned "fishing". Other software search algorithms vary in their use of word stemming. Programs that simply search for substrings obviously will find "fish" in "fishing" but when searching for "fishes" will not find occurrences of the word "fish".

[edit] See Also

[edit] References

  1. ^ Julie Beth Lovins (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics 11:22–31.
  2. ^ Ricardo Baeza-Yates and Berthier Ribeiro-Neto (1999). Modern Information Retrieval. ACM Press/Addison Wesley.
  3. ^ Jaap Kamps, Christof Monz, Maarten de Rijke and Börkur Sigurbjörnsson (2004). Language-dependent and Language-independent Approaches to Cross-Lingual Text Retrieval. In: C. Peters, J. Gonzalo, M. Braschler and M. Kluck, eds. Comparative Evaluation of Multilingual Information Access Systems. Springer Verlag, pp. 152–165.
  4. ^ Eija Airio (2006). Word normalization and decompounding in mono- and bilingual IR. Information Retrieval 9:249–271.
  5. ^ http://www.google.com/support/bin/static.py?page=searchguides.html&ctx=basics#stemming

[edit] Further reading

  • Dawson, J.L. (1974) Suffix Removal for Word Conflation, Bulletin of the Association for Literary and Linguistic Computing, 2(3): 33-46
  • Frakes, W.B. (1984) Term Conflation for Information Retrieval, Cambridge University Press
  • Frakes, W.B. & Fox, C.J. (2003) Strength and Similarity of Affix Removal Stemming Algorithms, SIGIR Forum, 37: 26-30
  • Frakes, W. B. (1992) Stemming algorithms, Information retrieval: data structures and algorithms, Prentice-Hall, Inc., Upper Saddle River, NJ
  • Hafer, M.A. & Weiss, S.F., (1974) Word segmentation by letter successor varieties, Information Processing & Management 10 (11/12), 371-386.
  • Harman, D., (1991) How effective is suffixing? Journal of the American Society for Information Science 42 (1), 7-15.
  • Hull, D.A. (1996) Stemming Algorithms – A Case Study for Detailed Evaluation, JASIS, 47(1): 70-84
  • Hull, D.A. & Grefenstette, G. (1996) A Detailed Analysis of English Stemming Algorithms, Xerox Technical Report
  • Kraaij, W. & Pohlmann, R., 1996: Viewing stemming as recall enhancement, in H-P. Frei, D. Harman, P. Schauble & R. Wilkinson (eds.), Proceedings of the 17th ACM SIGIR conference held at Zurich, August 18-22, pp.40-48.
  • Krovetz, R. (1993) Viewing Morphology as an Inference Process, In Proceedings of ACM-SIGIR93, pp191-203
  • Lennon, M., Pierce, D.S., Tarry, B.D. & Willett, P. (1981) An Evaluation of some Conflation Algorithms for Information Retrieval, Journal of Information Science, 3: 177-183
  • Lovins, J. (1971) Error Evaluation for Stemming Algorithms as Clustering Algorithms, JASIS, 22: 28-40
  • Lovins, J. B. "Development of a Stemming Algorithm." Mechanical Translation and Computational Linguistics 11, 1968, 22--31.
  • Marie-Claire, J. and Smith, D. (2005) Conservative stemming for search and indexing
  • Paice, C.D. (1990) Another Stemmer, SIGIR Forum, 24: 56-61
  • Paice, C.D. (1996) Method for Evaluation of Stemming Algorithms based on Error Counting, JASIS, 47(8): 632-649
  • Popovic, M. and Willett, P., (1992) The effectiveness of stemmng for natural language access to Slovene textual data, Journal of the American Society for Information Science, 43(5), 384-390.
  • Porter, M.F. (1980) An Algorithm for Suffix Stripping, Program, 14(3): 130-137
  • Savoy, J., (1993) Stemming of French words based on grammatical categories Journal of the American Society for Information Science, 44(1), 1-9.
  • Ulmschneider, J.E. & Doszkocs, (1983) A practical stemming algorithm for online search assistance, Online Review, 7(4), 301-318.
  • Xu, J. & Croft, W.B., (1998) Corpus-based stemming using coocurrence of word variants, ACM Transactions on Information Systems, 16(1), 61-81.

[edit] External links

This article was originally based on material from the Free On-line Dictionary of Computing, which is licensed under the GFDL.

In other languages