tf–idf

tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.[1]:8 It is often used as a weighting factor in information retrieval and text mining. The tf–idf value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.

One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.

Motivation

Term frequency

Suppose we have a set of English text documents and wish to determine which document is most relevant to the query "the brown cow". A simple way to start out is by eliminating documents that do not contain all three words "the", "brown", and "cow", but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document and sum them all together; the number of times a term occurs in a document is called its term frequency.

The first form of term weighting is due to Hans Peter Luhn (1957) and is based on the Luhn Assumption:

    The weight of a term that occurs in a document is simply proportional to the term frequency.[2]

Inverse document frequency

Because the term "the" is so common, term frequency will tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms "brown" and "cow". The term "the" is not a good keyword to distinguish relevant from non-relevant documents and terms, unlike the less common words "brown" and "cow". Hence an inverse document frequency factor is incorporated, which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.

Karen Spärck Jones (1972) conceived a statistical interpretation of term specificity called IDF, which became a cornerstone of term weighting:

    The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.[3]

Definition

tf–idf is the product of two statistics, term frequency and inverse document frequency. Various ways for determining the exact values of both statistics exist.

Term frequency

Variants of TF weight

weighting scheme          | TF weight
binary                    | 0, 1
raw frequency             | f_{t,d}
log normalization         | 1 + \log(f_{t,d})
double normalization 0.5  | 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}}
double normalization K    | K + (1 - K) \cdot \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}}

In the case of the term frequency tf(t,d), the simplest choice is to use the raw frequency of a term in a document, i.e. the number of times that term t occurs in document d. If we denote the raw frequency of t by f_{t,d}, then the simple tf scheme is tf(t,d) = f_{t,d}. Other possibilities include[4]:128 Boolean "frequencies" (tf(t,d) = 1 if t occurs in d and 0 otherwise); logarithmically scaled frequency (1 + \log f_{t,d}, or zero if f_{t,d} is zero); and augmented frequency, the raw frequency divided by the maximum raw frequency of any term in the document, which prevents a bias towards longer documents:

\mathrm{tf}(t,d) = 0.5 + 0.5 \cdot  \frac{f_{t, d}}{\max\{f_{t', d}:t' \in d\}}
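
As an illustrative sketch (not taken from the cited sources; all function names are chosen here), the tf variants in the table above can be computed from raw counts as follows:

    from collections import Counter
    from math import log

    def term_counts(tokens):
        """Raw term frequency f_{t,d}: how often each term occurs in the document."""
        return Counter(tokens)

    def tf_raw(term, counts):
        return counts[term]

    def tf_log(term, counts):
        # Log normalization: 1 + log(f_{t,d}), taken as 0 when the term is absent.
        f = counts[term]
        return 1 + log(f) if f > 0 else 0.0

    def tf_double_norm(term, counts, k=0.5):
        # Double normalization K: K + (1 - K) * f_{t,d} / max_{t' in d} f_{t',d};
        # k = 0.5 gives the augmented frequency shown above.
        return k + (1 - k) * counts[term] / max(counts.values())

    counts = term_counts("the brown cow jumped over the brown fence".split())
    print(tf_raw("brown", counts), tf_log("brown", counts), tf_double_norm("brown", counts))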

Inverse document frequency

Variants of IDF weight

weighting scheme                          | IDF weight  (n_t = |\{d \in D: t \in d\}|)
unary                                     | 1
inverse document frequency                | \log \frac{N}{n_t}
inverse document frequency smooth         | \log \left(1 + \frac{N}{n_t}\right)
inverse document frequency max            | \log \left(1 + \frac{\max_{t' \in d} n_{t'}}{n_t}\right)
probabilistic inverse document frequency  | \log \frac{N - n_t}{n_t}

The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.

 \mathrm{idf}(t, D) =  \log \frac{N}{|\{d \in D: t \in d\}|}

with

N : total number of documents in the corpus, N = |D|
|\{d \in D: t \in d\}| : number of documents where the term t appears (i.e., \mathrm{tf}(t,d) \neq 0). If the term is not in the corpus, this leads to a division by zero; it is therefore common to adjust the denominator to 1 + |\{d \in D: t \in d\}|.

Mathematically, the base of the logarithm does not matter: changing it multiplies every idf value by the same constant factor and so does not affect the relative weighting of terms.
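
A minimal sketch of the plain and smoothed idf variants, assuming the corpus is given as a list of token lists (the names below are illustrative, not from the sources):

    from math import log

    def doc_frequency(term, corpus):
        """n_t = |{d in D : t in d}|: number of documents that contain the term."""
        return sum(1 for doc in corpus if term in doc)

    def idf(term, corpus):
        # Plain idf: log(N / n_t); assumes the term occurs in at least one document.
        return log(len(corpus) / doc_frequency(term, corpus))

    def idf_smooth(term, corpus):
        # "Inverse document frequency smooth" from the table: log(1 + N / n_t).
        return log(1 + len(corpus) / doc_frequency(term, corpus))

    corpus = [["this", "is", "a", "sample"],
              ["this", "is", "another", "example"]]
    print(idf("this", corpus))     # log(2/2) = 0.0
    print(idf("example", corpus))  # log(2/1) ≈ 0.693 (natural logarithm)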

Term frequency–Inverse document frequency

Then tf–idf is calculated as

\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)

A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf–idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf–idf closer to 0.
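
As a sketch of the full computation (raw term frequency times plain idf; all names here are illustrative), tf–idf weights for every term of every document in a small corpus could be obtained like this:

    from collections import Counter
    from math import log

    def tf_idf(corpus):
        """corpus: list of documents, each a list of tokens.
        Returns one dict per document mapping term -> tf-idf weight."""
        N = len(corpus)
        df = Counter(term for doc in corpus for term in set(doc))  # n_t for each term
        weights = []
        for doc in corpus:
            tf = Counter(doc)  # raw term frequency f_{t,d}
            weights.append({t: f * log(N / df[t]) for t, f in tf.items()})
        return weights

    docs = [["this", "is", "a", "sample"],
            ["this", "is", "another", "example", "example", "example"]]
    for w in tf_idf(docs):
        print(w)  # terms that occur in every document get weight 0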

Recommended tf–idf weighting schemes

weighting scheme | document term weight                         | query term weight
1                | f_{t,d} \cdot \log \frac{N}{n_t}             | \left(0.5 + 0.5 \frac{f_{t,q}}{\max_t f_{t,q}}\right) \cdot \log \frac{N}{n_t}
2                | 1 + \log f_{t,d}                             | \log \left(1 + \frac{N}{n_t}\right)
3                | (1 + \log f_{t,d}) \cdot \log \frac{N}{n_t}  | (1 + \log f_{t,q}) \cdot \log \frac{N}{n_t}
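
Document and query term weights such as these are typically combined in a vector-space score, for example a dot product over the query terms. The sketch below applies scheme 1 from the table; it is only an illustration under that assumption, and the function and variable names are not from the source:

    from collections import Counter
    from math import log

    def scheme1_score(query_tokens, doc_tokens, corpus):
        """Dot product of scheme-1 document and query term weights (illustrative)."""
        N = len(corpus)
        f_d = Counter(doc_tokens)
        f_q = Counter(query_tokens)
        max_fq = max(f_q.values())
        score = 0.0
        for t in f_q:
            n_t = sum(1 for d in corpus if t in d)
            if n_t == 0 or t not in f_d:
                continue  # the term contributes nothing to this document's score
            idf = log(N / n_t)
            doc_w = f_d[t] * idf                           # f_{t,d} * log(N / n_t)
            query_w = (0.5 + 0.5 * f_q[t] / max_fq) * idf  # augmented tf * idf
            score += doc_w * query_w
        return score

    corpus = [["the", "brown", "cow"], ["the", "brown", "fox"], ["the", "cat"]]
    print(scheme1_score(["brown", "cow"], corpus[0], corpus))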

Justification of idf

Idf was introduced, as "term specificity", by Karen Spärck Jones in a 1972 paper. Although it has worked well as a heuristic, its theoretical foundations have been troublesome for at least three decades afterward, with many researchers trying to find information theoretic justifications for it.[5]

Spärck Jones's own explanation did not propose much theory, aside from a connection to Zipf's law.[5] Attempts have been made to put idf on a probabilistic footing,[6] by estimating the probability that a given document d contains a term t as the relative document frequency,


P(t|d) = \frac{|\{d \in D: t \in d\}|}{N},

so that we can define idf as


\begin{align}
\mathrm{idf} & = -\log P(t|d) \\
             & = \log \frac{1}{P(t|d)} \\
             & = \log \frac{N}{|\{d \in D: t \in d\}|}
\end{align}

Namely, the inverse document frequency is the logarithm of the "inverse" relative document frequency.

This probabilistic interpretation in turn takes the same form as that of self-information. However, applying such information-theoretic notions to problems in information retrieval runs into difficulties when trying to define the appropriate event spaces for the required probability distributions: not only documents but also queries and terms need to be taken into account.[5]

Example of tf–idf

Suppose we have term frequency tables for a collection consisting of only two documents, as listed below. The calculation of tf–idf for the term "this" is performed as follows.

Document 1
Term    | Term Count
this    | 1
is      | 1
a       | 2
sample  | 1

Document 2
Term    | Term Count
this    | 1
is      | 1
another | 2
example | 3

Tf, in its basic form, is just the frequency that we look up in the appropriate table; in this case, it is one.

Idf is a bit more involved:

 \mathrm{idf}(\mathsf{this}, D) =  \log \frac{N}{|\{d \in D: t \in d\}|}

The numerator of the fraction is the number of documents, which is two. The number of documents in which "this" appears is also two, giving

 \mathrm{idf}(\mathsf{this}, D) =  \log \frac{2}{2} = 0

So tf–idf is zero for this term, and with the basic definition this is true of any term that occurs in all documents.

A slightly more interesting example arises from the word "example", which occurs three times but in only one document. For this document, tf–idf of "example" is:

\mathrm{tf}(\mathsf{example}, d_2) = 3
\mathrm{idf}(\mathsf{example}, D) = \log \frac{2}{1} \approx 0.3010
\mathrm{tfidf}(\mathsf{example}, d_2) = \mathrm{tf}(\mathsf{example}, d_2) \times \mathrm{idf}(\mathsf{example}, D) = 3 \times 0.3010 \approx 0.9030

(using the base 10 logarithm).
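
The arithmetic above can be checked with a few lines of code (base-10 logarithm, as in the example; the variable names are illustrative):

    from math import log10

    d1 = {"this": 1, "is": 1, "a": 2, "sample": 1}
    d2 = {"this": 1, "is": 1, "another": 2, "example": 3}
    corpus = [d1, d2]

    def idf(term):
        n_t = sum(1 for d in corpus if term in d)  # documents containing the term
        return log10(len(corpus) / n_t)

    print(idf("this"))                     # log10(2/2) = 0, so tf-idf("this") = 0
    print(d2["example"] * idf("example"))  # 3 * log10(2/1) ≈ 0.903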


References

  1. Rajaraman, A.; Ullman, J. D. (2011). "Data Mining". Mining of Massive Datasets. pp. 1–17. doi:10.1017/CBO9781139058452.002. ISBN 9781139058452.
  2. Luhn, Hans Peter (1957). "A Statistical Approach to Mechanized Encoding and Searching of Literary Information". IBM Journal of Research and Development 1 (4): 315. doi:10.1147/rd.14.0309. "There is also the probability that the more frequently a notion and combination of notions occur, the more importance the author attaches to them as reflecting the essence of his overall idea."
  3. Spärck Jones, K. (1972). "A Statistical Interpretation of Term Specificity and Its Application in Retrieval". Journal of Documentation 28: 11–21. doi:10.1108/eb026526.
  4. Manning, C. D.; Raghavan, P.; Schütze, H. (2008). "Scoring, term weighting, and the vector space model". Introduction to Information Retrieval. p. 100. doi:10.1017/CBO9780511809071.007. ISBN 9780511809071.
  5. Robertson, S. (2004). "Understanding inverse document frequency: On theoretical arguments for IDF". Journal of Documentation 60 (5): 503–520. doi:10.1108/00220410410560582.
  6. See also "Probability estimates in practice" in Introduction to Information Retrieval.

