Vector space model
The vector space model (or term vector model) is an algebraic model used for information filtering, information retrieval, indexing and relevancy ranking. It represents natural language documents (or any objects, in general) in a formal manner as vectors of identifiers (such as index terms) in a multi-dimensional linear space. It was first used in the SMART Information Retrieval System.
A vector is an object with a magnitude (length) and a direction. (See vectors.) The basic idea of the vector space model (VSM) is to represent each document as a vector of index terms (keywords). The relevance of a document to a keyword query can then be ranked by how much the document vector deviates in angle from the query vector: the cosine of that angle, computed from the scalar (dot) product of the two vectors, serves as the similarity measure under the document similarities theory. A cosine value of zero means that the query and document vectors are orthogonal, i.e. there is no match, or the query terms simply do not occur in the document being considered.
To determine the cosine of the angle between two vectors, use the following equation:
cos(θ) = (v1 · v2) / (||v1|| ||v2||)
where:
- θ is the angle between v1 and v2,
- v1 is the first vector,
- v2 is the second vector,
- · denotes the dot (scalar) product, and
- ||x|| denotes the magnitude (length) of vector x.
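To make the formula concrete, here is a minimal sketch in Python (not part of the original article; the function name cosine_similarity and the representation of vectors as plain lists of term weights are assumptions made here for illustration only):

    import math

    def cosine_similarity(v1, v2):
        # Cosine of the angle between two equal-length term-weight vectors.
        dot = sum(a * b for a, b in zip(v1, v2))       # v1 . v2
        norm1 = math.sqrt(sum(a * a for a in v1))      # ||v1||
        norm2 = math.sqrt(sum(b * b for b in v2))      # ||v2||
        if norm1 == 0 or norm2 == 0:                   # an all-zero vector matches nothing
            return 0.0
        return dot / (norm1 * norm2)

    # Example: a query vector and a document vector over the same three index terms
    query = [1.0, 0.0, 1.0]
    document = [0.5, 0.2, 0.8]
    print(cosine_similarity(query, document))          # 1.0 = same direction, 0.0 = orthogonal (no match)

A value close to 1 indicates a small angle between the vectors and hence a close match between query and document.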
The classic vector space model, as proposed by Salton, Wong and Yang, incorporates both local and global parameters in the term weight equation w(n) (known as tf-idf):
w(n) = f(n) × log(D / d(n))
where:
- w(n) is the weight of term n,
- f(n) is the frequency with which term n occurs in the document (representing the local parameter),
- d(n) is the number of documents containing the term n, and,
- D is the total number of documents in the set.
Note that the quotient d(n)/D is essentially the probability of finding a document containing term n in the document set being used, so the factor log(D/d(n)) represents the global parameter (compare with the term count model below, which considers only the local parameter).
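As a rough sketch of the weighting equation above (again assuming Python and an illustrative function name tfidf_weight; the article does not specify the base of the logarithm, so the natural logarithm is used here):

    import math

    def tfidf_weight(f_n, d_n, D):
        # w(n) = f(n) x log(D / d(n))
        # f_n: frequency of term n in the document (local parameter)
        # d_n: number of documents containing term n
        # D:   total number of documents in the set (global parameter)
        if d_n == 0:
            return 0.0                  # term occurs in no document at all
        return f_n * math.log(D / d_n)

    # Example: a term occurring 3 times in a document and appearing in 10 of 1000 documents
    print(tfidf_weight(3, 10, 1000))    # 3 * log(100) ≈ 13.8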
Assumptions and limitations of the vector space model
The vector space model has the following limitations:
- Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality).
- Documents with similar context but different term vocabulary are not associated, resulting in a "false negative match".
- Search keywords must match document terms exactly; word substrings (e.g. "key" vs. "keying", "para" vs. "parameter") can produce a "false positive match".
- Semantic sensitivity: the model only compares terms and does not capture their meaning.
Comparison with the term count model
The alternative term count model, an earlier model, considered only local parameters (raw term frequencies) and did not account for global parameters. See the separate article on the term count model.
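A small sketch of the difference (the corpus statistics and helper names below are invented purely for illustration): under the term count model a very common word can dominate a document vector, while tf-idf weighting suppresses it in favor of rarer, more discriminating terms.

    import math

    def term_count_weight(f_n):
        # Term count model: the weight is just the local frequency
        return f_n

    def tfidf_weight(f_n, d_n, D):
        # Vector space model with tf-idf: local frequency scaled by a global factor
        return f_n * math.log(D / d_n)

    # A word appearing in every document vs. a word appearing in 5 of 1000 documents
    print(term_count_weight(10), term_count_weight(2))              # 10 2    -> common word dominates
    print(tfidf_weight(10, 1000, 1000), tfidf_weight(2, 5, 1000))   # 0.0 ~10.6 -> rare word dominates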
Models based on and extending the vector space model
Models based on and extending the vector space model include:
- Generalized vector space model
- Topic-based vector space model (TVSM): extends the vector space model by removing the constraint that the term vectors be orthogonal. In contrast to the generalized vector space model, the topic-based vector space model does not depend on co-occurrence-based similarities between terms.
- Latent semantic analysis
- DSIR model
Further reading
- G. Salton, A. Wong, and C. S. Yang (1975), "A Vector Space Model for Automatic Indexing," Communications of the ACM, vol. 18, no. 11, pp. 613–620. (The article in which the vector space model was first presented.)
- Description of the vector space model
- Description of the topic-based vector space model
- Description of the classic vector space model by Dr. E. Garcia (Mi Islita website)