Stop words
From Wikipedia, the free encyclopedia
Stop words, or stopwords, is the name given to words that are filtered out before or after the processing of natural language data (text). Hans Peter Luhn, one of the pioneers of information retrieval, is credited with coining the phrase and using the concept in his design and implementation of KWIC indexing programs.[citation needed]
One way to view stopwords is through Claude Shannon's model of information, in which there is a sender, encoder, medium, decoder, and receiver. A message sent by a sender is first encoded, transferred over a medium, then decoded by the receiver. While the message passes over the medium, noise may disrupt or distort it. This is analogous to the children's whisper game, or to speaking to someone over a bad cell phone connection. Noise makes the message harder to interpret and reduces its usefulness and informative quality. In written or spoken natural language communication, stop words can be viewed as a type of signal noise that hinders the ability to quickly ascertain the relevance of search results or the meaning and importance of words in a document. Filtering out such words makes the message clearer or more useful.[citation needed]
Typically, stop words are filtered based on their level of 'usefulness' within a given context or usage, such as a search engine. Search engines filter out stopwords to reduce index size (which is partly measured by the number of distinct words in the index)[citation needed], or to help users obtain better results from their search queries. A word that appears in almost every document searched gives the search engine no way to distinguish among documents and rank them appropriately.[citation needed]
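The ranking argument can be illustrated with a minimal Python sketch. The function name document_frequency and the three-document collection are hypothetical and chosen only for illustration; real search engines use more elaborate statistics.

```python
from collections import Counter

def document_frequency(docs):
    """Return, for each word, the fraction of documents that contain it."""
    counts = Counter()
    for doc in docs:
        # Count each word once per document, however often it repeats.
        counts.update(set(doc.lower().split()))
    return {word: n / len(docs) for word, n in counts.items()}

# Hypothetical three-document collection, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]

df = document_frequency(docs)

# A word present in every document cannot rank one document above another.
ubiquitous = sorted(word for word, fraction in df.items() if fraction == 1.0)
print(ubiquitous)  # ['the']
```

In this toy collection a query for "the" matches every document equally, whereas "cat" matches only two of the three and so carries some ability to discriminate.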
A stoplist (or stop list), the name commonly given to a set or list of stopwords, is typically language-specific[citation needed], and it may contain not only words but also other character sequences, such as numbers and punctuation.[citation needed] A search engine or other natural language processing system may contain a variety of stoplists, one per language, or it may contain a single multilingual stoplist.
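The two arrangements can be contrasted in a short Python sketch. The stoplists below are tiny hypothetical samples rather than lists taken from any real system, and filter_tokens is a name introduced here for illustration.

```python
# Tiny hypothetical stoplists; real ones are far longer.
STOPLISTS = {
    "en": {"the", "a", "of", "and"},
    "de": {"der", "die", "das", "und"},
}

# A single multilingual stoplist, built here as the union of the per-language sets.
MULTILINGUAL = set().union(*STOPLISTS.values())

def filter_tokens(tokens, stoplist):
    """Drop every token whose lowercase form appears in the stoplist."""
    return [t for t in tokens if t.lower() not in stoplist]

tokens = "the plants die and the soil dries".split()

per_language = filter_tokens(tokens, STOPLISTS["en"])
merged = filter_tokens(tokens, MULTILINGUAL)

print(per_language)  # ['plants', 'die', 'soil', 'dries']
print(merged)        # ['plants', 'soil', 'dries']
```

The merged list also removes the English verb "die", because "die" is a German article. This is one possible trade-off of a multilingual stoplist: it needs no language detection, but it can discard words that are meaningful in another language.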
Some of the more frequently used stop words for English include "a", "of", "the", "I", "it", "you", and "and".[citation needed] These are generally regarded as 'functional words', which carry little meaning of their own (they are less important for communication). The assumption is that, when assessing the contents of natural language, the meaning can be conveyed more clearly, or interpreted more easily, by ignoring the functional words.
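As a minimal sketch, filtering against a stoplist can be written in a few lines of Python. The stoplist here contains only the seven example words above, tokenization is simplified to splitting on whitespace, and the name remove_stop_words is an assumption made for this example.

```python
# The seven example words from the text; real stoplists are much longer.
STOPLIST = {"a", "of", "the", "i", "it", "you", "and"}

def remove_stop_words(text, stoplist=STOPLIST):
    """Split on whitespace and drop tokens whose lowercase form is in the stoplist."""
    return [tok for tok in text.split() if tok.lower() not in stoplist]

kept = remove_stop_words("I read the history of the printing press and you did not")
print(kept)  # ['read', 'history', 'printing', 'press', 'did', 'not']
```

Storing the stoplist as a set keeps each membership test at roughly constant cost, which matters when the same list is applied to every token in a large collection.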
Some text mining tools offer customizable stop lists. When performing KWIC indexing, extracting a list of keywords, or performing concept mining, text classification, or another natural language processing task, a common step is to remove the most frequent words manually through the use of a stop list.[citation needed] Some tools go as far as to automatically ignore the X most frequent words, regardless of the stop list. The stop list, in this regard, is a form of 'background knowledge' that is controlled by human input rather than automated. This is sometimes viewed unfavorably in natural language processing, as brute-force methods are seen as overly simple and inelegant. It also turns out that the top 10 words in an index (be it a search engine index, a KWIC index, or some other form) tend to be functional words, such as articles, as mentioned above.
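The automated variant, which ignores the most frequent words without consulting a hand-built list, might look like the following Python sketch. The function names and the sample sentence are hypothetical, and the cutoff n stands in for the "X" mentioned above.

```python
from collections import Counter

def top_n_words(tokens, n):
    """Return the set of the n most frequent tokens."""
    return {word for word, _ in Counter(tokens).most_common(n)}

def drop_top_n(tokens, n):
    """Remove the n most frequent tokens, with no hand-written stoplist."""
    frequent = top_n_words(tokens, n)
    return [t for t in tokens if t not in frequent]

# Hypothetical sample: "the" occurs 4 times, "and" 3 times, "cat" twice.
tokens = "the cat and the dog and the bird and the cat".split()

print(sorted(top_n_words(tokens, 2)))  # ['and', 'the']
print(drop_top_n(tokens, 2))           # ['cat', 'dog', 'bird', 'cat']
```

Unlike a manual stop list, this approach adapts to the collection at hand, but it has no way of sparing a frequent word that happens to be important in that collection.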
There is no definitive list of stop words that all natural language processing tools incorporate.[citation needed] Not all NLP tools use a stoplist. Some tools specifically avoid the use of a stoplist in order to support phrase searching. The use of a stemming algorithm may reduce part of the rationale for, or dependence on, a stoplist to filter out words.[citation needed]
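The conflict between stoplists and phrase searching can be seen in a small Python sketch. The stoplist and the query are hypothetical examples chosen to make the effect obvious, and index_terms is a name introduced here.

```python
# Hypothetical stoplist containing common English function words.
STOPLIST = {"to", "be", "or", "not", "the", "a", "of"}

def index_terms(text, stoplist):
    """Lowercase the text, split on whitespace, and drop stoplisted tokens."""
    return [t for t in text.lower().split() if t not in stoplist]

query = "to be or not to be"

with_stoplist = index_terms(query, STOPLIST)
without_stoplist = index_terms(query, set())

print(with_stoplist)     # []
print(without_stoplist)  # ['to', 'be', 'or', 'not', 'to', 'be']
```

With the stoplist applied, nothing remains of the phrase to search for, so a tool that must match exact phrases has reason to index every word.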
See also
- Text mining
- Concept mining
- Information extraction
- Natural language processing
- Query expansion
- Stemming
- Search engine indexing
External links
- A List of English Stop Words (about 3 kilobytes).
- A list of stop words in English and other languages
- The Snowball project currently provides lists of stopwords for English, French, Spanish, German, Portuguese, Italian, Dutch, Swedish, Norwegian, Danish, Russian, Finnish and Hungarian as part of a software stemmer project. These lists are used in other software, such as the Perl Lingua::StopWords module.