Document-term matrix
Document-term matrices are used in natural language processing. They represent natural-language documents as mathematical objects (matrices), making it possible to process a collection of documents as a whole.
General Concept
When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, if one has the following two (short) documents:
- D1 = "I like databases"
- D2 = "I hate hate databases",
then the document-term matrix would be:
|    | I | like | hate | databases |
|----|---|------|------|-----------|
| D1 | 1 | 1    | 0    | 1         |
| D2 | 1 | 0    | 2    | 1         |
which shows which documents contain which terms and how many times they appear.
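As an illustration, here is a minimal sketch in Python that builds this count matrix from the two toy documents above (whitespace tokenization only; a real pipeline would also normalize case and punctuation):

```python
from collections import Counter

# Toy documents from the example above.
docs = {"D1": "I like databases",
        "D2": "I hate hate databases"}

# Column order matching the table above.
terms = ["I", "like", "hate", "databases"]

# Raw term counts per document (whitespace tokenization).
counts = {name: Counter(text.split()) for name, text in docs.items()}

for name in docs:
    print(name, [counts[name][t] for t in terms])
# D1 [1, 1, 0, 1]
# D2 [1, 0, 2, 1]
```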
Note that more sophisticated weighting schemes can be used; one typical example, among others, is tf-idf.
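For instance, here is a hedged sketch of one common tf-idf variant, which weights each raw count by log(N / df), where df is the number of documents containing the term; libraries such as scikit-learn use smoothed variants of this formula:

```python
import math
from collections import Counter

docs = {"D1": "I like databases", "D2": "I hate hate databases"}
terms = ["I", "like", "hate", "databases"]
tf = {name: Counter(text.split()) for name, text in docs.items()}

# Document frequency: number of documents containing each term.
df = {t: sum(1 for c in tf.values() if c[t] > 0) for t in terms}
N = len(docs)

# tf-idf weight: raw count scaled by log(N / df). Terms that occur
# in every document ("I", "databases") receive weight 0.
for name, counts in tf.items():
    print(name, [round(counts[t] * math.log(N / df[t]), 2) for t in terms])
# D1 [0.0, 0.69, 0.0, 0.0]
# D2 [0.0, 0.0, 1.39, 0.0]
```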
Choice of Terms
Each row of the matrix can be viewed as representing a document. In the vector space model, which is normally the one used when computing a document-term matrix, the goal is to represent the topic of a document by the frequencies of semantically significant terms. The terms are semantic units of the documents. It is often assumed, for Indo-European languages, that nouns, verbs and adjectives are the more significant categories, and that words from those categories should be kept as terms. Adding collocations as terms improves the quality of the vectors, especially when computing similarities between documents.
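As a sketch of such term selection, the snippet below keeps only nouns, verbs and adjectives using NLTK's Penn Treebank part-of-speech tags; the tag prefixes and the example sentence are illustrative assumptions, and the tagger models must be downloaded once:

```python
import nltk

# One-time downloads (resource names may vary by NLTK version):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

KEEP = ("NN", "VB", "JJ")  # Penn Treebank prefixes: nouns, verbs, adjectives

def significant_terms(text):
    """Keep only tokens tagged as nouns, verbs, or adjectives."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [word.lower() for word, tag in tagged if tag.startswith(KEEP)]

print(significant_terms("I really like relational databases"))
# e.g. ['like', 'relational', 'databases']
```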
Applications
Improving search results
Latent semantic analysis (performing a rank-reduced singular value decomposition of the document-term matrix) can improve search results by disambiguating polysemous words and by finding synonyms of the query terms. However, searching in the high-dimensional continuous space is much slower than searching the inverted index that is the standard data structure of search engines.
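Here is a minimal sketch of the decomposition step, assuming the toy document-term matrix above and NumPy's SVD routine; the number of latent dimensions k is an arbitrary illustrative choice:

```python
import numpy as np

# Toy document-term matrix from above (rows D1, D2;
# columns I, like, hate, databases).
X = np.array([[1.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 2.0, 1.0]])

# LSA step: truncated singular value decomposition X ≈ U_k S_k V_k^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 1  # illustrative number of latent dimensions

# Document coordinates in the k-dimensional latent space;
# a query would be projected the same way and compared by cosine.
doc_vectors = U[:, :k] * s[:k]
print(doc_vectors)
```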
Finding topics
Multivariate analysis of the document-term matrix can reveal topics and themes of the corpus. Specifically, latent semantic analysis and data clustering can be used; more recently, probabilistic latent semantic analysis and non-negative matrix factorization have been found to perform well for this task.
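For example, here is a hedged sketch of topic finding via non-negative matrix factorization with scikit-learn, applied to the toy matrix above; the number of topics and the initialization are illustrative choices:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-term matrix (rows: documents, columns: terms).
X = np.array([[1.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 2.0, 1.0]])
terms = ["I", "like", "hate", "databases"]

# Factor X ≈ W @ H with non-negative entries; each row of H can be
# read as a topic, i.e. a weighting over the terms.
model = NMF(n_components=2, init="nndsvd", random_state=0)
W = model.fit_transform(X)  # document-topic weights
H = model.components_       # topic-term weights

for i, topic in enumerate(H):
    top = sorted(zip(terms, topic), key=lambda p: -p[1])[:2]
    print(f"topic {i}:", top)
```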