Latent Dirichlet allocation
Latent Dirichlet allocation (LDA) is a method for automatically determining the topics of a collection of documents, developed by David Blei, Andrew Ng, and Michael Jordan using graphical models and variational methods.
LDA is a generative model of documents that attempts to learn a set of topics, along with a distribution over words for each topic, so that each document may be viewed as a mixture of various topics.
For example, an LDA model might have topics CAT and DOG. The CAT topic has probabilities of generating various words: the words tabby, kitten, and of course cat will have high probability given this topic. The DOG topic likewise has probabilities of generating each word: puppy and dachshund might have high probability. Words without special relevance, like the (see function word), will have roughly even probability across topics.
A document is generated by picking a distribution over topics (i.e., mostly about DOG, mostly about CAT, or a bit of both), and, given this distribution, picking the topic of each specific word. Then words are generated given their topics. (Notice that words are considered to be independent given the topics. This is a standard bag-of-words assumption, and it makes the individual words exchangeable.)
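This generative process can be sketched in a few lines of code. The following is a minimal illustration, not Blei et al.'s implementation; it reuses the hypothetical CAT and DOG topics above, and all probabilities, words, and the document length are made-up values for illustration.

    import random

    # Hypothetical word distributions for two topics (illustrative values).
    topics = {
        "CAT": {"tabby": 0.3, "kitten": 0.3, "cat": 0.3, "the": 0.1},
        "DOG": {"puppy": 0.4, "dachshund": 0.4, "the": 0.2},
    }

    def generate_document(topic_mixture, length=10):
        """Generate a bag of words: pick a topic for each word position,
        then draw the word from that topic's distribution."""
        words = []
        for _ in range(length):
            # Choose a topic according to the document's topic mixture.
            # (In full LDA, this mixture is itself drawn from a Dirichlet prior.)
            topic = random.choices(list(topic_mixture),
                                   weights=list(topic_mixture.values()))[0]
            # Choose a word according to the chosen topic's word distribution.
            dist = topics[topic]
            word = random.choices(list(dist), weights=list(dist.values()))[0]
            words.append(word)
        return words

    # A document that is mostly about DOG, a bit about CAT.
    print(generate_document({"DOG": 0.8, "CAT": 0.2}))

Because each word is drawn independently given its topic, the output is an unordered bag of words, which is exactly the exchangeability assumption described above.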
Learning the various distributions (the set of topics, their associated word probabilities, the topic of each word, and the particular topic mixture of each document) is a problem of Bayesian inference, which can be carried out using variational methods (or with Markov chain Monte Carlo methods, which tend to be quite slow in practice).
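As one concrete example of variational inference for LDA, scikit-learn's LatentDirichletAllocation class implements a variational Bayes algorithm. The sketch below fits a two-topic model to a toy corpus; the documents and parameter settings are illustrative choices, not taken from the original paper.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # A toy corpus of documents (bag-of-words counts are all LDA sees).
    docs = [
        "the tabby kitten chased the cat",
        "the puppy and the dachshund played",
        "the cat watched the puppy",
    ]

    # Turn the documents into a document-term count matrix.
    counts = CountVectorizer().fit_transform(docs)

    # Fit a two-topic LDA model with variational Bayes.
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(counts)

    # Each row is a document's inferred mixture over the two topics.
    print(doc_topics)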
Topic modeling is a classic problem in information retrieval. See also Tf-idf.
External links
- Blei, D. M.; Ng, A. Y.; Jordan, M. I. (2003). "Latent Dirichlet allocation". Journal of Machine Learning Research 3: 993-1022.
- LDA implementation in C by Blei.