Document-term matrix

From Wikipedia, the free encyclopedia

A document-term matrix is used in natural language processing to represent a collection of documents as a single mathematical object: a matrix whose entries record how often each term occurs in each document. Representing the collection this way makes it possible to process it as a whole with the tools of linear algebra.

General Concept

A document-term matrix describes the terms that appear in a set of documents: each row corresponds to a document and each column to a term. For instance, if one has the following two (short) documents:

  • D1 = "I like databases"
  • D2 = "I hate hate databases",

then the document-term matrix would be:

        I   like   hate   databases
  D1    1      1      0           1
  D2    1      0      2           1

which shows which documents contain which terms and how many times they appear.
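As a sketch of how such a matrix can be computed, the following Python snippet builds the count matrix for D1 and D2. Tokenizing by lower-casing and whitespace splitting is a simplifying assumption, and the vocabulary here comes out in alphabetical rather than first-occurrence order:

```python
from collections import Counter

def document_term_matrix(docs):
    """Build a document-term matrix as a list of count rows.

    Returns (terms, matrix), where terms is the sorted vocabulary and
    matrix[i][j] counts how often term j occurs in document i.
    """
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({t for doc in tokenized for t in doc})
    matrix = [[Counter(doc)[t] for t in terms] for doc in tokenized]
    return terms, matrix

terms, matrix = document_term_matrix(["I like databases",
                                      "I hate hate databases"])
# terms  -> ['databases', 'hate', 'i', 'like']
# matrix -> [[1, 0, 1, 1], [1, 2, 1, 0]]
```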

Note that more sophisticated weighting schemes can be used in place of raw counts; a typical example, among others, is tf-idf (term frequency-inverse document frequency).
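As an illustration, tf-idf weights could be computed over the same toy corpus as below. This is a minimal sketch using the unsmoothed formula idf = log(N / df); real implementations usually smooth the idf term:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Reweight raw term counts with tf-idf: tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({t for doc in tokenized for t in doc})
    n = len(tokenized)
    # df[t] = number of documents containing term t
    df = {t: sum(1 for doc in tokenized if t in doc) for t in terms}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[t] * math.log(n / df[t]) for t in terms])
    return terms, rows

terms, weights = tfidf_matrix(["I like databases",
                               "I hate hate databases"])
# Terms shared by every document ('i', 'databases') get weight 0,
# while 'hate' in D2 gets 2 * log(2) ~ 1.386.
```

Note how tf-idf downweights terms that appear everywhere and so carry little information about any one document.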

Choice of Terms

Each row of the matrix can be viewed as a vector representing a document. In the vector space model, which is normally the model underlying a document-term matrix, the goal is to represent the topic of a document by the frequencies of semantically significant terms. The terms are the semantic units of the documents. For Indo-European languages, it is often assumed that nouns, verbs and adjectives are the most significant categories, and that words from those categories should be kept as terms. Adding collocations as terms improves the quality of the vectors, especially when computing similarities between documents.
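One simple way to add collocations to the vocabulary is to treat adjacent word pairs (bigrams) as extra terms. The sketch below is a naive illustration, not a full collocation extractor, which would also filter candidates by part of speech and statistical association:

```python
def terms_with_bigrams(doc):
    """Tokenize a document into unigram terms plus adjacent-pair
    collocations, so multiword units become columns of the matrix."""
    tokens = doc.lower().split()
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

terms_with_bigrams("I like relational databases")
# -> ['i', 'like', 'relational', 'databases',
#     'i like', 'like relational', 'relational databases']
```

With "relational databases" as a single term, two documents about that topic look similar even if their other wording differs.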

Applications

Improving search results

Latent semantic analysis (performing eigenvalue decomposition on the document-term matrix) can improve search results by disambiguating polysemous words and searching for synonyms of the query. However, searching in the high-dimensional continuous space is much slower than searching the standard trie data structure of search engines.
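A minimal sketch of this idea, assuming NumPy: take the singular value decomposition of the document-term matrix, keep the top k singular values, fold a query into the latent space, and rank documents by cosine similarity there:

```python
import numpy as np

# Document-term matrix for D1, D2 over the terms [I, like, hate, databases]
A = np.array([[1.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 2.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                # latent dimensions to keep

def search(query_counts):
    """Rank documents against a query term-count vector by cosine
    similarity in the k-dimensional latent space."""
    q = query_counts @ Vt[:k].T      # fold the query into the latent space
    docs = U[:, :k] * s[:k]          # latent coordinates of the documents
    return [float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))
            for d in docs]

# A query containing only "hate" should rank D2 above D1.
search(np.array([0.0, 0.0, 1.0, 0.0]))
```

On a realistic corpus one would keep k far smaller than the vocabulary size, which is where the disambiguation and synonym effects come from.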

Finding topics

Multivariate analysis of the document-term matrix can reveal topics/themes of the corpus. Specifically, latent semantic analysis and data clustering can be used, and more recently probabilistic latent semantic analysis and non-negative matrix factorization have been found to perform well for this task.
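As a sketch of the non-negative matrix factorization approach, the classic multiplicative update rules of Lee and Seung factor the matrix into a document-topic factor W and a topic-term factor H. This assumes NumPy, and the tiny corpus and iteration count are illustrative only:

```python
import numpy as np

def nmf(A, k, iters=500, seed=0):
    """Factor a nonnegative document-term matrix A (docs x terms) as
    A ~ W @ H, with W (docs x topics) and H (topics x terms), using
    multiplicative updates for the Frobenius objective."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + 1e-3
    H = rng.random((k, n)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + 1e-9)
        W *= (A @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Toy corpus: documents 0-1 share one pair of terms, documents 2-3
# another, so two clean topics should emerge.
A = np.array([[2, 1, 0, 0],
              [4, 2, 0, 0],
              [0, 0, 1, 3],
              [0, 0, 2, 6]], dtype=float)
W, H = nmf(A, k=2)
# Each row of W should load mainly on one topic, grouping
# documents 0-1 together and documents 2-3 together.
```

Reading off each document's dominant topic from W, and each topic's heaviest terms from H, is the basic recipe for topic discovery with this factorization.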

See also