Vector space model

From Wikipedia, the free encyclopedia

The vector space model (or term vector model) is an algebraic model used for information filtering, information retrieval, indexing and relevancy ranking. It represents natural language documents (or, more generally, any objects) as vectors of identifiers, such as index terms, in a multi-dimensional linear space. It was first used in the SMART Information Retrieval System.

Documents are represented as vectors of index terms (keywords). The set of terms is a predefined collection, for example the set of all unique words occurring in the document corpus.

Relevancy rankings of documents in a keyword search can be calculated, using the assumptions of document similarities theory, by comparing the angular deviation between each document vector and the original query vector, where the query is represented as the same kind of vector as the documents.

In practice, it is easier to calculate the cosine of the angle between the vectors instead of the angle:

$\cos{\theta} = \frac{\mathbf{v_1} \cdot \mathbf{v_2}}{\left\| \mathbf{v_1} \right\| \left \| \mathbf{v_2} \right\|}$

A cosine value of zero means that the query and document vectors are orthogonal and have no match (i.e. no query term occurs in the document being considered).
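The ranking procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus, the whitespace tokenizer, the use of raw term counts as vector components, and the helper names `term_vector` and `cosine_similarity` are all hypothetical choices, not part of the model's definition.

```python
import math
from collections import Counter

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text (assumed tokenizer: lowercase + split)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors; defined here as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Hypothetical toy corpus and query.
documents = ["the cat sat on the mat", "dogs chase cats", "stock prices fell"]
query = "cat mat"

# The vocabulary is the set of all unique words in the corpus.
vocabulary = sorted({word for doc in documents for word in doc.lower().split()})

query_vec = term_vector(query, vocabulary)
scores = [cosine_similarity(query_vec, term_vector(doc, vocabulary)) for doc in documents]

# Document indices ordered from most to least similar to the query.
ranking = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
```

In this sketch the second document scores zero even though it contains "cats", because the term "cat" itself does not occur in it and the two vectors are therefore orthogonal.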

Example

In the classic vector space model proposed by Salton, Wong and Yang, the term-specific weights in the document vectors are products of local and global parameters. The model is known as the term frequency-inverse document frequency model. The weight vector for document d is $\mathbf{v}_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T$, where

$w_{t,d} = \mathrm{tf}_t \cdot \log{\frac{|D|}{|\{t \in d\}|}}$

and

$\mathrm{tf}_t$ is the term frequency of term t in document d (a local parameter)
$\log{\frac{|D|}{|\{t \in d\}|}}$ is the inverse document frequency (a global parameter). $|D|$ is the total number of documents in the document set; $|\{t \in d\}|$ is the number of documents containing the term t.

In a simpler Term Count Model the term-specific weights do not include the global parameter. Instead the weights are just the counts of term occurrences: $w_{t,d} = \mathrm{tf}_t$.
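As a rough illustration, the weighting scheme can be computed as follows. The sketch assumes raw occurrence counts for the term frequency, the natural logarithm for the inverse document frequency (the base is not fixed by the formula), and a whitespace tokenizer; the function name `tf_idf_vectors` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Return (vocabulary, vectors) where vectors[d][i] = tf * log(|D| / df) for vocabulary[i]."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)

    # Document frequency: the number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # local parameter: raw term counts (an assumption)
        vectors.append([tf[term] * math.log(n_docs / df[term]) for term in vocabulary])
    return vocabulary, vectors

# Hypothetical toy corpus.
docs = ["apple banana apple", "banana cherry", "banana apple"]
vocabulary, vectors = tf_idf_vectors(docs)
```

Here "banana" occurs in every document, so its inverse document frequency is log(3/3) = 0 and it contributes nothing to any weight vector, whereas "cherry", which occurs in only one document, receives the largest global weight. Dropping the logarithmic factor and keeping only `tf[term]` gives the Term Count Model.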

Assumptions and Limitations of The Vector Space Model

The Vector Space Model has the following limitations:

Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality).
Search keywords must precisely match document terms; word substrings might result in a "false positive match".
Semantic sensitivity: documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match".

Models based on and extending the vector space model

Models based on and extending the vector space model include:

Generalized vector space model
Topic-based vector space model (TVSM): extends the vector space model by removing the constraint that the term vectors be orthogonal. In contrast to the generalized vector space model, the topic-based vector space model does not depend on co-occurrence-based similarities between terms.
Latent semantic analysis
DSIR model