Document-term matrix

From Wikipedia, the free encyclopedia

A document-term matrix is used in natural language processing to represent a collection of documents as a single mathematical object: a matrix whose entries record how often each term occurs in each document. Representing the collection this way makes it possible to process it as a whole with the tools of linear algebra.

General Concept

A document-term matrix describes the terms that appear in a set of documents: each row corresponds to a document and each column to a term. For instance, if one has the following two (short) documents:

  • D1 = "I like databases"
  • D2 = "I hate hate databases",

then the document-term matrix would be:

        I   like   hate   databases
  D1    1      1      0           1
  D2    1      0      2           1

which shows which documents contain which terms and how many times they appear.
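As a sketch of how such a matrix can be computed, the following Python snippet builds the count matrix for D1 and D2. Tokenizing by lower-casing and whitespace splitting is a simplifying assumption, and the vocabulary here comes out in alphabetical rather than first-occurrence order:

```python
from collections import Counter

def document_term_matrix(docs):
    """Build a document-term matrix as a list of count rows.

    Returns (terms, matrix), where terms is the sorted vocabulary and
    matrix[i][j] counts how often term j occurs in document i.
    """
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({t for doc in tokenized for t in doc})
    matrix = [[Counter(doc)[t] for t in terms] for doc in tokenized]
    return terms, matrix

terms, matrix = document_term_matrix(["I like databases",
                                      "I hate hate databases"])
# terms  -> ['databases', 'hate', 'i', 'like']
# matrix -> [[1, 0, 1, 1], [1, 2, 1, 0]]
```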

Note that more sophisticated weighting schemes can be used in place of raw counts; a typical example, among others, is tf-idf (term frequency-inverse document frequency).
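As an illustration, tf-idf weights could be computed over the same toy corpus as below. This is a minimal sketch using the unsmoothed formula idf = log(N / df); real implementations usually smooth the idf term:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Reweight raw term counts with tf-idf: tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({t for doc in tokenized for t in doc})
    n = len(tokenized)
    # df[t] = number of documents containing term t
    df = {t: sum(1 for doc in tokenized if t in doc) for t in terms}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[t] * math.log(n / df[t]) for t in terms])
    return terms, rows

terms, weights = tfidf_matrix(["I like databases",
                               "I hate hate databases"])
# Terms shared by every document ('i', 'databases') get weight 0,
# while 'hate' in D2 gets 2 * log(2) ~ 1.386.
```

Note how tf-idf downweights terms that appear everywhere and so carry little information about any one document.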

Choice of Terms

Each row of the matrix can be viewed as a vector representing a document. In the vector space model, which is normally the model underlying a document-term matrix, the goal is to represent the topic of a document by the frequencies of semantically significant terms. The terms are the semantic units of the documents. For Indo-European languages, it is often assumed that nouns, verbs and adjectives are the most significant categories, and that words from those categories should be kept as terms. Adding collocations as terms improves the quality of the vectors, especially when computing similarities between documents.
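One simple way to add collocations to the vocabulary is to treat adjacent word pairs (bigrams) as extra terms. The sketch below is a naive illustration, not a full collocation extractor, which would also filter candidates by part of speech and statistical association:

```python
def terms_with_bigrams(doc):
    """Tokenize a document into unigram terms plus adjacent-pair
    collocations, so multiword units become columns of the matrix."""
    tokens = doc.lower().split()
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

terms_with_bigrams("I like relational databases")
# -> ['i', 'like', 'relational', 'databases',
#     'i like', 'like relational', 'relational databases']
```

With "relational databases" as a single term, two documents about that topic look similar even if their other wording differs.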

Applications

Improving search results

Latent semantic analysis (performing eigenvalue decomposition on the document-term matrix) can improve search results by disambiguating polysemous words and searching for synonyms of the query. However, searching in the high-dimensional continuous space is much slower than searching the standard trie data structure of search engines.
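A minimal sketch of this idea, assuming NumPy: take the singular value decomposition of the document-term matrix, keep the top k singular values, fold a query into the latent space, and rank documents by cosine similarity there:

```python
import numpy as np

# Document-term matrix for D1, D2 over the terms [I, like, hate, databases]
A = np.array([[1.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 2.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                # latent dimensions to keep

def search(query_counts):
    """Rank documents against a query term-count vector by cosine
    similarity in the k-dimensional latent space."""
    q = query_counts @ Vt[:k].T      # fold the query into the latent space
    docs = U[:, :k] * s[:k]          # latent coordinates of the documents
    return [float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))
            for d in docs]

# A query containing only "hate" should rank D2 above D1.
search(np.array([0.0, 0.0, 1.0, 0.0]))
```

On a realistic corpus one would keep k far smaller than the vocabulary size, which is where the disambiguation and synonym effects come from.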

Finding topics

Multivariate analysis of the document-term matrix can reveal topics/themes of the corpus. Specifically, latent semantic analysis and data clustering can be used, and more recently probabilistic latent semantic analysis and non-negative matrix factorization have been found to perform well for this task.
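As a sketch of the non-negative matrix factorization approach, the classic multiplicative update rules of Lee and Seung factor the matrix into a document-topic factor W and a topic-term factor H. This assumes NumPy, and the tiny corpus and iteration count are illustrative only:

```python
import numpy as np

def nmf(A, k, iters=500, seed=0):
    """Factor a nonnegative document-term matrix A (docs x terms) as
    A ~ W @ H, with W (docs x topics) and H (topics x terms), using
    multiplicative updates for the Frobenius objective."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + 1e-3
    H = rng.random((k, n)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + 1e-9)
        W *= (A @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Toy corpus: documents 0-1 share one pair of terms, documents 2-3
# another, so two clean topics should emerge.
A = np.array([[2, 1, 0, 0],
              [4, 2, 0, 0],
              [0, 0, 1, 3],
              [0, 0, 2, 6]], dtype=float)
W, H = nmf(A, k=2)
# Each row of W should load mainly on one topic, grouping
# documents 0-1 together and documents 2-3 together.
```

Reading off each document's dominant topic from W, and each topic's heaviest terms from H, is the basic recipe for topic discovery with this factorization.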

See also