Good-Turing frequency estimation
From Wikipedia, the free encyclopedia
Good–Turing frequency estimation is a statistical technique for predicting the probability of occurrence of objects belonging to an unknown number of species, given past observations of such objects and their species.
It was developed by Alan Turing and his assistant I.J. Good during World War II. It was part of their efforts at Bletchley Park to crack German ciphers for the Enigma machine during the war, and the process led to the creation of early machines that were the immediate ancestors of the modern digital computer. Turing at first modeled the frequencies as a binomial distribution, but found it inaccurate. Good developed smoothing algorithms to improve the estimator's accuracy. The case where the number of species is known in advance had been already solved by Bayes and Laplace and is called the Laplace-Bayes estimator or the rule of succession.
For example, if an urn contains two species of balls (red and black) and after sampling N of them we have found that r are black, then the current estimate for the probability of a black ball is (r + 1)/(N + 2). Good and Turing consider a more general problem where the number of species is not known in advance; they provide a probability estimate for each of the species known up to the present time, as well as an estimate of the probability that a new, previously unknown, species will be drawn.
[edit] Modern day
The process gained a bit of modern fame due to the Robert Harris novel Enigma. However, most versions of the Good–Turing technique required quite difficult and time consuming mathematical calculations, so it was not used as widely as it might have been.[1]
In the 1990s, Geoffrey Sampson worked with William Gale of AT&T, to create and implement a simplified and easier to use method of the Good-Turing method.[2]
[edit] References
[edit] See also
- Ewens sampling formula
- Probability estimation
- Pseudocount
- Probability smoothing
- Bayesian smoothing