Language model


A statistical language model assigns a probability P(w_1, ..., w_n) to a sequence of n words by means of a probability distribution.
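Formally, such a probability can be decomposed with the chain rule, so that each word is conditioned on the words that precede it:

    P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_1, ..., w_{n-1})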

Language modeling is used in many natural language processing applications such as speech recognition, machine translation, part-of-speech tagging, parsing and information retrieval. Estimating the probability of sequences can become difficult in large corpora: phrases and sentences can be arbitrarily long, so many sequences are never observed during training of the language model (the data sparseness problem). For that reason these models are often approximated using smoothed n-gram models, which truncate the conditioning history to the previous n-1 words.
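As an illustration (not part of the original article), the following is a minimal sketch of a smoothed bigram model (n = 2) in Python; the tiny corpus and the choice of add-one (Laplace) smoothing are assumptions made for the example:

    # Minimal bigram language model with add-one (Laplace) smoothing.
    # The corpus below is an illustrative assumption, not real training data.
    from collections import defaultdict

    corpus = [
        ["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "cat", "ran"],
    ]

    unigram_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        for i, token in enumerate(tokens):
            unigram_counts[token] += 1
            if i > 0:
                bigram_counts[(tokens[i - 1], token)] += 1

    vocab_size = len(unigram_counts)

    def bigram_prob(prev, word):
        """P(word | prev) with add-one smoothing, so unseen bigrams
        still receive a small non-zero probability."""
        return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

    def sentence_prob(sentence):
        """Approximate P(w_1, ..., w_n) as a product of smoothed bigram
        probabilities over the sentence plus boundary markers."""
        tokens = ["<s>"] + sentence + ["</s>"]
        p = 1.0
        for prev, word in zip(tokens, tokens[1:]):
            p *= bigram_prob(prev, word)
        return p

    print(sentence_prob(["the", "cat", "sat"]))  # seen sequence: higher probability
    print(sentence_prob(["the", "dog", "ran"]))  # unseen bigram: smoothed, non-zero

Add-one smoothing is the simplest such scheme; in practice more refined methods such as Good-Turing or Kneser-Ney smoothing are commonly preferred.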

In speech recognition and in data compression, such a model tries to capture the properties of a language and to predict the next word in a sequence.

When used in information retrieval, a language model is associated with each document in a collection. With query Q as input, retrieved documents are ranked by the probability that the document's language model M_d would generate the terms of the query, P(Q | M_d).
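As a sketch of how such ranking could work (the two toy documents, the unigram document model, and the Jelinek-Mercer smoothing weight LAMBDA are assumptions of the example, not details from the article):

    # Query-likelihood ranking: score each document by P(Q | M_d) under a
    # unigram document model, smoothed against the whole collection.
    from collections import Counter

    documents = {
        "d1": "the language model assigns probabilities to word sequences".split(),
        "d2": "speech recognition systems predict the next word".split(),
    }

    collection = [w for doc in documents.values() for w in doc]
    collection_counts = Counter(collection)
    collection_len = len(collection)

    LAMBDA = 0.5  # interpolation weight between document and collection models

    def query_likelihood(query, doc):
        """P(Q | M_d): product over query terms of the smoothed unigram
        probability under the document's language model."""
        doc_counts = Counter(doc)
        doc_len = len(doc)
        p = 1.0
        for term in query:
            p_doc = doc_counts[term] / doc_len
            p_col = collection_counts[term] / collection_len
            p *= LAMBDA * p_doc + (1 - LAMBDA) * p_col
        return p

    query = "language model".split()
    ranked = sorted(documents,
                    key=lambda d: query_likelihood(query, documents[d]),
                    reverse=True)
    print(ranked)  # documents ordered by P(Q | M_d)

Smoothing against the collection keeps P(Q | M_d) non-zero when a query term is missing from a document, mirroring the role smoothing plays for n-gram models above.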

