Index of coincidence

From Wikipedia, the free encyclopedia

In cryptography, coincidence counting is the technique (invented by William F. Friedman^[1]) of putting two texts side-by-side and counting the number of times that a letter appears in the same position in both texts. This count, either as a ratio of the total or normalized by dividing by the expected count for a random source model, is known as the index of coincidence. The technique is used to cryptanalyze the Vigenère cipher, for example. Even though only the ciphertext is available for testing, so that plaintext letter identities are disguised, coincidences in the ciphertext can be caused by corresponding coincidences in the (unknown) plaintext. For a repeating-key polyalphabetic cipher arranged into a matrix, the coincidence rate within each column will usually be highest when the width of the matrix is a multiple of the key length, and this fact can be used to determine the key length, which is the first step in cracking the system.

Coincidence counting can help determine when two texts are written in the same language, using the same alphabet. For such texts, the "causal" coincidence count will be distinctly higher than the "accidental" coincidence count for texts in different languages, or texts using different alphabets, or gibberish texts. The technique has been applied to examine the purported Bible code.

To see why, imagine an "alphabet" of only the two letters A and B. Suppose that in our "language", the letter A is used 75% of the time, and the letter B is used 25% of the time. If two texts in this "language" are laid side-by-side, then the following pairs can be expected:

Pair	Probability
AA	56.25%
BB	6.25%
AB	18.75%
BA	18.75%

Overall, the probability of a "coincidence" is 62.5% (56.25% for AA + 6.25% for BB).

Now suppose that we place two messages side-by-side: one message in this "language"; one message encrypted using the substitution cipher which replaces A with B and vice versa. The following pairs can now be expected:

Pair	Probability
AA	18.75%
BB	18.75%
AB	56.25%
BA	6.25%

Now the probability of a coincidence is only 37.5% (18.75% for AA + 18.75% for BB). This is noticeably lower than the probability when same-language, same-alphabet texts were used. In effect, coincidences were more likely because the most frequent letters in each text were the same, so the odds that those letters would appear side-by-side are maximized.

The same principle applies to real languages like English. Certain letters, like E, occur much more frequently than other letters--a fact which is used in frequency analysis of substitution ciphers. Coincidences involving the letter E, for example, are rather likely. So when any two English texts are compared, the coincidence count will be higher than when an English text and a foreign-language text are used.

It can easily be imagined that this effect can be subtle. For example, similar languages will have a higher coincidence count than dissimilar languages. And it isn't hard to generate random text with a frequency distribution similar to real text, artificially raising the coincidence count. Nevertheless, this technique can be used effectively to identify when two texts contain meaningful information in the same language using the same alphabet, to discover periods for repeating keys, and to uncover many other kinds of nonrandom phenomena within or among ciphertexts.

The same idea can be applied to a single text, where the sample is in effect compared with itself. Mathematically we can compute the index of coincidence IC for a given letter-frequency distribution as

$\mathbf{IC} = \frac{\sum_{i=1}^{c}n_i(n_i -1)}{N(N-1)/c}$

where $N$ is the length of the text and $n 1$ through $n c$ are the frequencies (as integers) of the $c$ letters of the alphabet ( $c = 26$ for monocase English). The sum of the $n i$ is necessarily $N$ .

The products $n (n - 1)$ count the number of combinations of $n$ elements taken two at a time. (Actually this counts each pair twice; the extra factors of 2 occur in both numerator and denominator of the formula and thus cancel out.) Each of the $n i$ occurrences of the $i$ -th letter matches each of the remaining $n i - 1$ occurrences of the same letter. There are a total of $N (N - 1)$ letter pairs in the entire text, and $1 / c$ is the probability of a match for each pair, assuming a uniform random distribution of the characters (the "null model"; see below). Thus, this formula gives the ratio of the total number of coincidences observed to the total number of coincidences that one would expect from the null model.^[2]

If all $c$ letters of an alphabet were equally distributed, the expected index would be 1.0. The actual monographic I.C. for telegraphic English text is around 1.73, reflecting the unevenness of natural-language letter distributions. Values for various languages^[3] are:

Language	Index of Coincidence
English	1.73
French	2.02
German	2.05
Italian	1.94
Portuguese	1.94
Russian	1.76
Spanish	1.94

[edit] Generalization

The above description is only an introduction to use of the Index of Coincidence, which is related to the general concept of correlation. Various forms of Index of Coincidence have been devised; the "delta" I.C. (given by the formula above) in effect measures the autocorrelation of a single distribution, whereas a "kappa" I.C. is used when matching two text strings.^[4] Although in some applications constant factors such as $c$ and $N$ can be ignored, in more general situations there is considerable value in truly indexing each I.C. against the value to be expected for the null hypothesis (usually: no match and a uniform random symbol distribution), so that in every situation the expected value for no correlation is 1.0. Thus, any form of I.C. can be expressed as the ratio of the number of coincidences actually observed to the number of coincidences expected (according to the null model), using the particular test setup.

From the foregoing, it is easy to see that the formula for kappa I.C. is

$\mathbf{IC} = \frac{\sum_{j=1}^{N}[a_j=b_j]}{N/c}$

where $N$ is the common aligned length of the two texts A and B, and the bracketed term is defined as 1 if the $j$ -th letter of text A matches the $j$ -th letter of text B, otherwise 0.

A related concept, the "bulge" of a distribution, measures the discrepancy between the observed I.C. and the null value of 1.0.

[edit] References

^ Friedman, W.F. [1922]. The index of coincidence and its applications in cryptology, Department of Ciphers. Publ 22. Geneva, Illinois, USA: Riverbank Laboratories. OCLC 55786052. The original application ignored normalization.
^ Mountjoy, Marjorie (1963). "The Bar Statistics". NSA Technical Journal VII (2,4). Published in two parts.
^ Friedman, W.F. and Callimahos, L.D. [1956]. Military Cryptanalytics, Part I – Volume 2. Reprinted by Aegean Park Press. ISBN 0-89412-074-3.
^ Kahn, David [1967]. The Codebreakers - The Story of Secret Writing. New York: Macmillan. ISBN 0-684-83130-9.

[edit] See also

Topics in cryptography

Classical cryptography v • d • e
Ciphers: ADFGVX \| Affine \| Alberti \| Atbash \| Autokey \| Bifid \| Book \| Caesar \| Four-square \| Hill \| Keyword \| Nihilist \| Permutation \| Pigpen \| Playfair \| Polyalphabetic \| Polybius \| Rail Fence \| Reihenschieber \| Reservehandverfahren \| ROT13 \| Running key \| Scytale \| Smithy code \| Solitaire \| Straddling checkerboard \| Substitution \| Tap Code \| Transposition \| Trifid \| Two-square \| VIC cipher \| Vigenère
Cryptanalysis: Frequency analysis \| Index of coincidence
Misc: Cryptogram \| Bacon \| Polybius square \| Scytale \| Straddling checkerboard \| Tabula recta
Cryptography v • d • e
History of cryptography \| Cryptanalysis \| Cryptography portal \| Topics in cryptography
Symmetric-key algorithm \| Block cipher \| Stream cipher \| Public-key cryptography \| Cryptographic hash function \| Message authentication code \| Random numbers