Factored language model


The factored language model (FLM) is an extension of a conventional language model. In an FLM, each word is viewed as a vector of k factors: w_i = \{f_i^1, ..., f_i^k\}. An FLM provides the probabilistic model P(f | f_1, ..., f_N), where the prediction of a factor f is based on N parents \{f_1, ..., f_N\}. For example, if w represents a word token and t represents a part-of-speech tag for English, the model P(w_i | w_{i-2}, w_{i-1}, t_{i-1}) predicts the current word token from the two preceding word tokens, as in a traditional N-gram (here, trigram) model, together with the part-of-speech tag of the previous word.
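As an illustration of the example model above, the following minimal sketch estimates P(w_i | w_{i-2}, w_{i-1}, t_{i-1}) by relative frequency. The toy corpus, the factor names "w" and "t", and the helper `estimate` are assumptions made for this illustration; they do not come from any particular FLM toolkit, and no smoothing is applied.

```python
from collections import Counter

# Hypothetical toy corpus: each word is a bundle of factors
# (word token "w" and part-of-speech tag "t").
corpus = [
    {"w": "the", "t": "DET"},
    {"w": "dog", "t": "NOUN"},
    {"w": "barks", "t": "VERB"},
    {"w": "the", "t": "DET"},
    {"w": "dog", "t": "NOUN"},
    {"w": "sleeps", "t": "VERB"},
]


def estimate(corpus):
    """Relative-frequency estimate of P(w_i | w_{i-2}, w_{i-1}, t_{i-1})."""
    joint = Counter()    # counts of (w_{i-2}, w_{i-1}, t_{i-1}, w_i)
    context = Counter()  # counts of (w_{i-2}, w_{i-1}, t_{i-1})
    for i in range(2, len(corpus)):
        parents = (corpus[i - 2]["w"], corpus[i - 1]["w"], corpus[i - 1]["t"])
        joint[parents + (corpus[i]["w"],)] += 1
        context[parents] += 1

    def prob(w, w_prev2, w_prev1, t_prev1):
        parents = (w_prev2, w_prev1, t_prev1)
        if context[parents] == 0:
            return 0.0  # unseen context: unsmoothed estimate has no mass here
        return joint[parents + (w,)] / context[parents]

    return prob


prob = estimate(corpus)
print(prob("barks", "the", "dog", "NOUN"))  # 0.5
```

The context ("the", "dog", NOUN) occurs twice in the toy corpus, once followed by "barks" and once by "sleeps", so each continuation receives probability 0.5. Any context that was never observed receives probability zero under this unsmoothed estimate.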

A main advantage of factored language models is that they allow users to incorporate linguistic knowledge, for example by explicitly modeling the relationship between word tokens and part-of-speech tags in English, or morphological information (stems, roots, etc.) in Arabic.

As with N-gram models, smoothing techniques are necessary in parameter estimation. In particular, generalized backing-off is used in training an FLM.
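The idea behind generalized backing-off can be sketched as follows: instead of dropping parents in one fixed order, as in a word N-gram, the model may drop any one parent at each step, which yields a graph of progressively smaller parent sets. The code below is a simplified sketch under stated assumptions: the parent names "w2", "w1" and "t1" (standing for w_{i-2}, w_{i-1} and t_{i-1}), the toy events, and the helpers `backoff_graph`, `children`, `train` and `score` are hypothetical; combining the backed-off estimates with max is only one possible choice; and no discounting or normalization is performed, so the scores are not a proper probability distribution.

```python
from collections import Counter
from itertools import combinations

PARENTS = ("w2", "w1", "t1")  # hypothetical names for w_{i-2}, w_{i-1}, t_{i-1}


def backoff_graph(parents):
    """Nodes of the backoff graph, level by level (all parents down to none)."""
    return [list(combinations(parents, k)) for k in range(len(parents), -1, -1)]


def children(node):
    """Nodes reachable from `node` by dropping exactly one parent."""
    return [tuple(p for p in node if p != dropped) for dropped in node]


def train(events, parents):
    """Count each (context, word) event at every node of the backoff graph."""
    joint, ctx = Counter(), Counter()
    for context, word in events:
        for level in backoff_graph(parents):
            for node in level:
                key = (node, tuple(context[p] for p in node))
                ctx[key] += 1
                joint[key + (word,)] += 1
    return joint, ctx


def score(word, context, node, joint, ctx, min_count=1):
    """Relative frequency at `node` if seen often enough, else back off."""
    key = (node, tuple(context[p] for p in node))
    if joint[key + (word,)] >= min_count:
        return joint[key + (word,)] / ctx[key]
    if not node:
        return 0.0  # word never seen, even with no parents
    # Simplification: take the best estimate among all one-parent-dropped nodes.
    return max(score(word, context, c, joint, ctx, min_count)
               for c in children(node))


# Hypothetical training events: (parent values, predicted word).
events = [
    ({"w2": "the", "w1": "dog", "t1": "NOUN"}, "barks"),
    ({"w2": "a", "w1": "cat", "t1": "NOUN"}, "sleeps"),
    ({"w2": "the", "w1": "cat", "t1": "NOUN"}, "sleeps"),
]
joint, ctx = train(events, PARENTS)
query = {"w2": "the", "w1": "dog", "t1": "NOUN"}
print(backoff_graph(PARENTS))
print(score("sleeps", query, PARENTS, joint, ctx))
```

In this toy data, "sleeps" was never observed after the full context ("the", "dog", NOUN), so the score comes from smaller parent sets, such as the part-of-speech tag alone. A real FLM implementation would additionally discount the higher-order counts and normalize the backed-off mass.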
