tf–idf

The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engines to score and rank a document's relevance given a user query.

The term frequency in the given document is simply the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term frequency regardless of the actual importance of that term in the document) to give a measure of the importance of the term $t_i$ within the particular document:

$\mathrm{tf_i} = \frac{n_i}{\sum_k n_k}$

where $n_i$ is the number of occurrences of the considered term in the document, and the denominator is the number of occurrences of all terms in that document.
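As a minimal sketch of the normalized term frequency above, the following Python computes $n_i / \sum_k n_k$ for every term of one document. The function name `term_frequencies` and the whitespace tokenization are assumptions made for illustration, not part of the definition.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return {term: n_i / sum_k n_k} for one tokenized document."""
    counts = Counter(tokens)          # n_i for every distinct term
    total = sum(counts.values())      # sum_k n_k, i.e. the document length
    if total == 0:
        return {}                     # empty document: no terms to weight
    return {term: n / total for term, n in counts.items()}


# Illustrative document, split on whitespace (an assumed tokenization).
doc = "the cow jumped over the moon".split()
tf = term_frequencies(doc)
print(tf["the"], tf["cow"])
```

Because each count is divided by the document length, the values for one document sum to 1, which is what removes the bias towards longer documents.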

The inverse document frequency is a measure of the general importance of the term. It is obtained by dividing the number of all documents by the number of documents containing the term, and then taking the logarithm of that quotient:

$\mathrm{idf_i} = \log \frac{|D|}{|\{d: d \ni t_{i}\}|}$

with

$|D|$ : total number of documents in the corpus
$|\{d : d \ni t_{i}\}|$ : number of documents where the term $t_i$ appears (that is, $n_{i} \neq 0$).
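The idf definition can be sketched in Python as follows. The function name `inverse_document_frequency`, the toy corpus, and the use of the natural logarithm are assumptions; the formula does not fix a logarithm base, and a term that appears in no document leaves the quotient undefined, which this sketch reports as an error.

```python
import math


def inverse_document_frequency(term, documents):
    """Return log(|D| / |{d : term in d}|) for a list of token lists."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        # The quotient is undefined when no document contains the term.
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(documents) / containing)


# Hypothetical four-document corpus, already tokenized.
corpus = [
    ["the", "cow", "jumped"],
    ["the", "moon"],
    ["the", "cow", "slept"],
    ["a", "quiet", "night"],
]
idf_the = inverse_document_frequency("the", corpus)    # log(4 / 3)
idf_moon = inverse_document_frequency("moon", corpus)  # log(4 / 1)
print(idf_the, idf_moon)
```

A term found in most documents, such as "the" here, gets a smaller idf than a term found in only one.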

Then

$\mathrm{tfidf_i} = \mathrm{tf_i} \cdot \mathrm{idf_i}$

A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms.
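To illustrate how the product of the two factors filters out common terms, here is a small self-contained Python sketch that weights every term of every document. The function name `tf_idf`, the toy corpus, and the natural logarithm are assumptions chosen for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: tf * idf} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())
        weights.append({
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        })
    return weights


# Hypothetical corpus; "the" occurs in every document.
corpus = [
    "the cow eats grass".split(),
    "the moon is bright".split(),
    "the cow sees the moon".split(),
]
weights = tf_idf(corpus)
print(weights[0])
```

In this corpus "the" occurs in all three documents, so its idf is log(3/3) = 0 and its weight vanishes, while "grass", which occurs in only one document, receives the largest weight in the first document.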

Numeric applications

There are many different formulas used to calculate tf–idf. In a simple variant, the term frequency (TF) is the number of times the word appears in a document divided by the total number of words in the document. If a document contains 100 words and the word cow appears 3 times, then the term frequency of cow in the document is 0.03 (3/100). One way of calculating document frequency (DF) is to divide the number of documents that contain the word cow by the total number of documents in the collection. So if cow appears in 1,000 documents out of a total of 10,000,000, then the document frequency is 0.0001 (1,000/10,000,000). The final tf–idf score is then calculated by dividing the term frequency by the document frequency. For this example, the tf–idf score for cow would be 300 (0.03/0.0001). An alternative is to multiply the term frequency by the logarithm of the inverse document frequency, as in the definition of idf above, which dampens the effect of very rare terms.
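The arithmetic of the cow example can be checked with a few lines of Python. The variable names are illustrative, and the base-10 logarithm in the second variant is an assumption; with the natural logarithm the same variant would give roughly 0.276 instead.

```python
import math

# Figures taken from the cow example.
occurrences, doc_length = 3, 100
docs_with_term, total_docs = 1_000, 10_000_000

tf = occurrences / doc_length        # 0.03
df = docs_with_term / total_docs     # 0.0001

# Variant 1: divide term frequency by document frequency.
ratio_score = tf / df                # about 300

# Variant 2: multiply tf by the log of the inverse document frequency.
# Base 10 is an assumption made here for round numbers.
idf = math.log10(total_docs / docs_with_term)   # log10(10,000) = 4
log_score = tf * idf                 # about 0.12

print(round(ratio_score, 6), round(log_score, 6))
```

The two variants rank terms in the same order within a single document, but the logarithm compresses the scale considerably: 300 versus about 0.12 for the same term.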

Applications in Vector Space Model

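The tf–idf weighting scheme is often used in the vector space model together with cosine similarity to determine the similarity between two documents: each document is represented as a vector of tf–idf weights, one per term, and the cosine of the angle between two such vectors serves as the similarity score.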
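A minimal sketch of cosine similarity over sparse tf–idf vectors might look as follows. The function name `cosine_similarity`, the dictionary representation, and the weights in the example are all hypothetical; returning 0 for an all-zero vector is a convention chosen here, not something the model prescribes.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors {term: weight}."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical tf-idf weights for three documents.
doc_a = {"cow": 0.30, "grass": 0.55}
doc_b = {"cow": 0.30, "moon": 0.40}
doc_c = {"moon": 0.80}

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_b, doc_c))
print(cosine_similarity(doc_a, doc_c))
```

Documents that share no weighted terms, such as `doc_a` and `doc_c`, score 0, and because the cosine depends only on direction, scaling every weight in a document by the same factor leaves its similarity scores unchanged.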
