Term Count Model

From Wikipedia, the free encyclopedia

Like the Vector Space Model (or term vector model) that followed it, the Term Count Model was in use before the advent of search engines and the Internet, supporting text search, indexing and relevancy ranking in information retrieval (IR) systems.

The Term Count Model is based on Salton's term weight equation from the Vector Space Model, but without the global term (D/d(n)), where:

D is the number of documents in the database or search set, and

d(n) is the number of documents containing the search term, n.

Thus, the Term Count Model reduces to the term weight equation:

w(n) = f(n)

where:

w(n) is the term weight for the search term, n, and

f(n) is the frequency of the term, n, that is, the number of times it occurs in the document.
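
As an illustration of w(n) = f(n), the following minimal Python sketch computes the term weights of a single document by counting. The function name term_counts and the simple lowercase, alphanumeric tokenizer are assumptions made for this example; the model itself does not prescribe how text is split into terms.

```python
import re
from collections import Counter


def term_counts(text):
    """Return a mapping term -> f(n), the number of occurrences in text.

    The tokenizer (lowercase, runs of letters and digits) is an assumption
    made for this sketch, not part of the Term Count Model.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


document = "The cat sat on the mat. The mat was red."
weights = term_counts(document)

# Under the Term Count Model, w(n) is simply the count f(n).
print(weights["the"])  # 3
print(weights["mat"])  # 2
print(weights["dog"])  # 0, the term does not occur in the document
```

A Counter is used here because it returns 0 for terms that are absent, which matches f(n) = 0 for a term that does not occur.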

The Term Count Model therefore requires essentially only:

(1) a collection of local documents forming the search set

(2) an index of words in all the documents in the search set

(3) the query terms

Otherwise, the vector mathematics is the same as in the Vector Space Model: the dot product of the query and document vectors is computed, and from it the cosine of the angle between the two vectors.
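
The ranking step can be sketched as follows, using raw term counts as the vector entries. The three documents, the query, and the helper names term_counts and cosine_similarity are hypothetical and chosen only for illustration; this is one straightforward way to compute the cosine, not a reference implementation.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Term -> count vector, using an assumed simple tokenizer."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(query_counts, doc_counts):
    """Cosine of the angle between two term-count vectors."""
    # Dot product over the query terms; absent terms count as 0.
    dot = sum(count * doc_counts[term] for term, count in query_counts.items())
    q_norm = math.sqrt(sum(c * c for c in query_counts.values()))
    d_norm = math.sqrt(sum(c * c for c in doc_counts.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)


# Hypothetical search set and query.
documents = {
    "doc1": "red car red paint",
    "doc2": "blue car",
    "doc3": "green bicycle",
}
query = term_counts("red car")

scores = {
    name: cosine_similarity(query, term_counts(text))
    for name, text in documents.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['doc1', 'doc2', 'doc3']
```

Here doc1 ranks first because it matches both query terms, doc2 matches only one, and doc3 shares no term with the query and scores 0.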

Weaknesses and Limitations of the Term Count Model

The Term Count Model is, however, susceptible to the following weaknesses:

(1) Term repetition (also known as term spamming). Because a term's weight is simply its count, a score can be inflated by repeating the term. Many search engines now penalise sites that engage in term spamming, the practice of intentionally increasing the frequency of important keywords within a website in order to obtain a higher search engine ranking.

(2) A tendency to favour long documents. Long documents contain more word repetitions and therefore larger term entries, which in the scalar product mathematics give them larger similarity scores.
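
Both weaknesses can be seen in a small sketch. It assumes that the score is the unnormalised scalar (dot) product of the query and document count vectors; the sample texts and the helper names term_counts and dot_score are hypothetical.

```python
import re
from collections import Counter


def term_counts(text):
    """Term -> count vector, using an assumed simple tokenizer."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def dot_score(query_counts, doc_counts):
    """Unnormalised dot product of two term-count vectors (an assumption)."""
    return sum(count * doc_counts[term] for term, count in query_counts.items())


query = term_counts("cheap flights")

short_doc = "cheap flights to rome"
# The same text three times over: longer, but no more relevant.
long_doc = " ".join([short_doc] * 3)
# A keyword repeated deliberately to raise its count.
spam_doc = "cheap cheap cheap cheap cheap flights"

short_score = dot_score(query, term_counts(short_doc))
long_score = dot_score(query, term_counts(long_doc))
spam_score = dot_score(query, term_counts(spam_doc))

print(short_score)  # 2
print(long_score)   # 6
print(spam_score)   # 6
```

The long document scores three times as high as the short one although it carries no extra information, and the spammed document reaches the same score purely through repetition.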
