Bigram

A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. The frequency distribution of the bigrams in a string is commonly used for simple statistical analysis of text in many applications, including computational linguistics, cryptography, and speech recognition.
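
As an illustration of the definition, the following minimal Python sketch pairs each token with its successor and tallies the results. The helper name `bigrams` and the sample strings are assumptions made for this example only; they do not come from any particular library or corpus.

```python
from collections import Counter

def bigrams(tokens):
    """Return every pair of adjacent elements in a sequence of tokens."""
    # zip stops at the shorter input, so the last token has no pair of its own.
    return list(zip(tokens, tokens[1:]))

# Word bigrams: the tokens are words.
words = "the cat sat on the mat".split()
word_bigrams = bigrams(words)

# Letter bigrams: the tokens are the characters of a string.
letter_counts = Counter(bigrams("banana"))

print(word_bigrams)
print(letter_counts.most_common())
```

The same function works for letters, syllables, or words, because it only relies on the input being an ordered sequence.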

Gappy bigrams, or skipping bigrams, are word pairs that allow gaps between the two words, for example to skip over connecting words or to approximate the dependencies of a dependency grammar.
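
One possible reading of gappy bigrams is sketched below in Python. The function name `skip_bigrams` and the `max_gap` parameter (the largest number of tokens that may be skipped between the two words) are assumptions of this example; the text above does not fix a particular gap convention.

```python
def skip_bigrams(tokens, max_gap):
    """Return ordered pairs (tokens[i], tokens[j]), i < j,
    with at most max_gap tokens skipped between them."""
    pairs = []
    for i, left in enumerate(tokens):
        # j ranges from the immediate neighbour up to max_gap tokens further.
        last = min(i + 2 + max_gap, len(tokens))
        for j in range(i + 1, last):
            pairs.append((left, tokens[j]))
    return pairs

tokens = "the dog barks loudly".split()
pairs = skip_bigrams(tokens, max_gap=1)
print(pairs)
```

With `max_gap=0` the function reduces to ordinary bigrams; with `max_gap=1` it also yields pairs such as ("the", "barks"), which skip one intervening word.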

Head word bigrams are gappy bigrams with an explicit dependency relationship.

Bigrams help provide the conditional probability of a token given the preceding token, by applying the definition of conditional probability:

 P(W_n \mid W_{n-1}) = \frac{P(W_{n-1}, W_n)}{P(W_{n-1})}

That is, the probability P(W_n|W_{n-1}) of a token W_n given the preceding token W_{n-1} is equal to the probability of their bigram, P(W_{n-1},W_n), the co-occurrence of the two tokens, divided by the probability of the preceding token, P(W_{n-1}).
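
The relation above can be sketched in Python by estimating both probabilities from counts. This is a minimal illustration that assumes plain relative-frequency estimates with no smoothing; the function name `bigram_conditional_probability` and the sample sentence are hypothetical.

```python
from collections import Counter

def bigram_conditional_probability(tokens, prev, cur):
    """Estimate P(cur | prev) as count(prev, cur) / count(prev)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count prev only where it can start a bigram, so that the two
    # relative frequencies share the same denominator and it cancels.
    prev_counts = Counter(tokens[:-1])
    if prev_counts[prev] == 0:
        return 0.0  # prev was never observed as a preceding token
    return pair_counts[(prev, cur)] / prev_counts[prev]

tokens = "the cat sat on the mat".split()
p = bigram_conditional_probability(tokens, "the", "cat")
print(p)  # "the" starts two bigrams, one of which is ("the", "cat"): 0.5
```

Unseen pairs receive probability zero under this estimate, which is why practical language models usually add some form of smoothing.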

Applications

Bigrams are used in one of the most successful language models for speech recognition.[1] They are a special case of the n-gram.

Bigram frequency attacks can be used in cryptography to solve cryptograms. See frequency analysis.

Bigram frequency is one approach to statistical language identification.
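
A toy version of this idea can be sketched in Python by comparing the letter-bigram counts of an unknown text with a profile for each candidate language. Everything here is an assumption made for illustration: the sample sentences are far too small to be representative, cosine similarity is only one of several possible scoring functions, and the names `letter_bigram_counts`, `cosine_similarity`, and `identify_language` are hypothetical.

```python
import math
from collections import Counter

def letter_bigram_counts(text):
    """Count letter bigrams inside each whitespace-separated word."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(zip(word, word[1:]))
    return counts

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def identify_language(text, profiles):
    """Return the name of the profile most similar to the text."""
    query = letter_bigram_counts(text)
    return max(profiles, key=lambda name: cosine_similarity(query, profiles[name]))

# Tiny illustrative samples; a real system would use large corpora.
samples = {
    "english": "the quick brown fox jumps over the lazy dog and then the other thing",
    "turkish": "bir gün ben de oraya gideceğim ve seni göreceğim",
}
profiles = {name: letter_bigram_counts(sample) for name, sample in samples.items()}

guess = identify_language("the other dog", profiles)
print(guess)
```

In this toy setup the query shares bigrams such as "th" and "he" only with the English sample, so the English profile scores higher.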

Bigram frequency in the English language

The frequency of the most common letter bigrams in a small English corpus is:[2]

th 1.52       en 0.55       ng 0.18
he 1.28       ed 0.53       of 0.16
in 0.94       to 0.52       al 0.09
er 0.94       it 0.50       de 0.09
an 0.82       ou 0.50       se 0.08
re 0.68       ea 0.47       le 0.08
nd 0.63       hi 0.46       sa 0.06
at 0.59       is 0.46       si 0.05
on 0.57       or 0.43       ar 0.04
nt 0.56       ti 0.34       ve 0.04
ha 0.56       as 0.33       ra 0.04
es 0.56       te 0.27       ld 0.02
st 0.55       et 0.19       ur 0.02

Complete bigram frequencies for a larger corpus are available.[3]

Bigram frequency in the Turkish language

The frequencies of the most common letter bigrams in Turkish are:[4]

ar 0.0192        ya 0.0098         or 0.0064
la 0.0175        di 0.0093         nı 0.0063
an 0.0173        ma 0.0091         li 0.0063
er 0.0152        nd 0.0089         me 0.0062
in 0.0151        ra 0.0086         rı 0.0061
le 0.0134        al 0.0084         ta 0.0059
en 0.0132        ak 0.0079         ne 0.0058
de 0.0126        ri 0.0077         el 0.0058
ın 0.0121        il 0.0070         am 0.0058
da 0.0116        ni 0.0067         ek 0.0057
bi 0.0114        ba 0.0065         dı 0.0057
ir 0.0110        rd 0.0065         yo 0.0055
ka 0.0103        ay 0.0064         ki 0.0054

References

  1. Michael Collins. A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, CA. 1996. pp. 184–191.
  2. Cornell Math Explorer's Project. Substitution Ciphers.
  3. Jones, Michael N; D J K Mewhort (August 2004). "Case-sensitive letter and bigram frequency counts from large-scale English corpora". Behavior Research Methods, Instruments, and Computers 36 (3): 388–396. ISSN 0743-3808. PMID 15641428.
  4. Sefik Ilkin Serengil. Attacking Turkish Texts Encrypted by Homophonic Cipher. MSc thesis, Galatasaray University, 2011.
This article is issued from Wikipedia - version of the Wednesday, February 10, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.