Hypergeometric distribution

From Wikipedia, the free encyclopedia


Hypergeometric
Parameters: N \in \{0,1,2,\dots\},\ D \in \{0,1,\dots,N\},\ n \in \{0,1,\dots,N\}
Support: k \in \{0,1,\dots,n\}
Probability mass function (pmf): \frac{{D \choose k}{N-D \choose n-k}}{{N \choose n}}
Mean: \frac{nD}{N}
Variance: \frac{n(D/N)(1-D/N)(N-n)}{N-1}
Skewness: \frac{(N-2D)(N-1)^{1/2}(N-2n)}{[nD(N-D)(N-n)]^{1/2}(N-2)}
Excess kurtosis: \left[\frac{N^2(N-1)}{n(N-2)(N-3)(N-n)}\right]\cdot\left[\frac{N(N+1)-6N(N-n)}{D(N-D)}+\frac{3n(N-n)(N+6)}{N^2}-6\right]
Moment-generating function (mgf): \frac{{N-D \choose n}}{{N \choose n}}\,_2F_1(-n,-D;N-D-n+1;e^{t})
Characteristic function: \frac{{N-D \choose n}}{{N \choose n}}\,_2F_1(-n,-D;N-D-n+1;e^{it})

In probability theory and statistics, the hypergeometric distribution is a discrete probability distribution that describes the number of successes in a sequence of n draws from a finite population without replacement.

                drawn     not drawn        total
defective       k         D − k            D
nondefective    n − k     N + k − n − D    N − D
total           n         N − n            N

A typical example is illustrated by the contingency table above: there is a shipment of N objects in which D are defective. The hypergeometric distribution describes the probability that in a sample of n distinct objects drawn from the shipment exactly k objects are defective.

In general, if a random variable X follows the hypergeometric distribution with parameters N, D and n, then the probability of getting exactly k successes is given by

f(k;N,D,n) = {{{D \choose k} {{N-D} \choose {n-k}}}\over {N \choose n}}

The probability is positive when k is between max{ 0, D + n − N } and min{ n, D }.

The formula can be understood as follows: There are N \choose n possible samples (without replacement). There are D \choose k ways to obtain k defective objects and there are {N-D} \choose {n-k} ways to fill out the rest of the sample with non-defective objects.
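This counting argument translates directly into code. A minimal Python sketch using the standard library's `math.comb` (the function name `hypergeom_pmf` is my own):

```python
from math import comb

def hypergeom_pmf(k, N, D, n):
    """P(exactly k defectives in a sample of n drawn without
    replacement from N objects, D of which are defective)."""
    # D choose k ways to pick the defectives, times (N-D) choose (n-k)
    # ways to fill out the rest of the sample, over N choose n
    # equally likely samples.
    return comb(D, k) * comb(N - D, n - k) / comb(N, n)

print(hypergeom_pmf(1, 20, 4, 5))  # e.g. 1 defective in a sample of 5
```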

When the population size is large compared to the sample size (i.e., N is much larger than n) the hypergeometric distribution is approximated reasonably well by a binomial distribution with parameters n (number of trials) and p = D / N (probability of success in a single trial).
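This approximation can be checked numerically. A small Python sketch comparing the two pmfs (all names are my own; only the standard library is used):

```python
from math import comb

def hypergeom_pmf(k, N, D, n):
    return comb(D, k) * comb(N - D, n - k) / comb(N, n)

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Population much larger than the sample: N = 10000, D = 1000, n = 10.
N, D, n = 10000, 1000, 10
max_diff = max(abs(hypergeom_pmf(k, N, D, n) - binom_pmf(k, n, D / N))
               for k in range(n + 1))
print(max_diff)  # small: with/without replacement barely differs here
```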

The fact that the sum of the probabilities over all possible values of k equals 1 is essentially Vandermonde's identity from combinatorics.
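This normalization can be verified exactly with rational arithmetic; a quick Python sketch (variable names are my own):

```python
from fractions import Fraction
from math import comb

N, D, n = 50, 5, 10
# Exact rational arithmetic: the pmf sums to exactly 1 over k = 0..n
# (math.comb returns 0 when k exceeds D, so impossible terms vanish).
total = sum(Fraction(comb(D, k) * comb(N - D, n - k), comb(N, n))
            for k in range(n + 1))
print(total == 1)  # -> True, i.e. Vandermonde's identity
```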

Application and example

The classical application of the hypergeometric distribution is sampling without replacement. Think of an urn with two types of marbles, black ones and white ones. Define drawing a white marble as a success and drawing a black marble as a failure (analogous to the binomial distribution). If the variable N describes the number of all marbles in the urn (see contingency table above) and D describes the number of white marbles (called defective in the example above), then N − D corresponds to the number of black marbles.
Now assume that there are 5 white and 45 black marbles in the urn. Standing next to the urn, you close your eyes and draw 10 marbles without replacement. What is the probability Pr(k = 4) that exactly 4 of them are white (and, of course, 6 black)?

This problem is summarized by the following contingency table:

                drawn                not drawn                              total
white marbles   4 (k)                1 = 5 − 4 (D − k)                      5 (D)
black marbles   6 = 10 − 4 (n − k)   39 = 50 + 4 − 10 − 5 (N + k − n − D)   45 (N − D)
total           10 (n)               40 (N − n)                             50 (N)

The probability Pr (k = x) of drawing exactly x white marbles (= number of successes) can be calculated by the formula

\Pr(k=x) = f(k;N,D,n) = {{{D \choose k} {{N-D} \choose {n-k}}}\over {N \choose n}}.

Hence, for x = 4 in this example, we calculate

\Pr(k=4) = f(4;50,5,10) = {{{5 \choose 4} {{45} \choose {6}}}\over {50 \choose 10}} = 0.003964583\dots.

So the probability of drawing exactly 4 white marbles is quite low (approximately 0.004). If you repeated the random experiment (drawing 10 marbles from the urn of 50 marbles without replacement) 1000 times, you would expect to obtain such a result only about 4 times.
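The same arithmetic is easy to reproduce, for instance in Python with the standard library (a sketch; variable names are my own):

```python
from math import comb

# Marble example: N = 50 marbles in total, D = 5 white, n = 10 drawn.
N, D, n = 50, 5, 10
p4 = comb(D, 4) * comb(N - D, n - 4) / comb(N, n)
print(p4)  # ≈ 0.003964583
```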

But what about the probability of drawing all 5 white marbles? Intuitively, this is even less likely than drawing 4 white marbles. Let us calculate the probability of such an extreme event.

The contingency table is as follows:

                drawn                not drawn                              total
white marbles   5 (k)                0 = 5 − 5 (D − k)                      5 (D)
black marbles   5 = 10 − 5 (n − k)   40 = 50 + 5 − 10 − 5 (N + k − n − D)   45 (N − D)
total           10 (n)               40 (N − n)                             50 (N)

And we can calculate the probability as follows (notice that the denominator always stays the same):

\Pr(k=5) = f(5;50,5,10) = {{{5 \choose 5} {{45} \choose {5}}}\over {50 \choose 10}} = 0.0001189375\dots.

As expected, the probability of drawing 5 white marbles is even lower than that of drawing 4 white marbles.

Conclusion:
Consequently, one can expand the initial question: if you draw 10 marbles from an urn containing 5 white and 45 black marbles, what is the probability of drawing at least 4 white marbles? In other words, what is the probability of drawing 4 white marbles or a more extreme outcome (drawing 5)? This corresponds to the cumulative probability Pr(k ≥ 4) and can be calculated with the cumulative distribution function (cdf). Since the hypergeometric distribution is a discrete probability distribution, the cumulative probability is simply the sum of the corresponding individual probabilities.

In our example you just have to sum up Pr (k = 4) and Pr (k = 5):

Pr (k ≥ 4) = 0.003964583 + 0.0001189375 = 0.004083520
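This sum is again easy to verify numerically; a minimal Python sketch (names are my own):

```python
from math import comb

N, D, n = 50, 5, 10
denom = comb(N, n)
# Sum the individual probabilities for k = 4 and k = 5.
p_ge_4 = sum(comb(D, k) * comb(N - D, n - k) for k in (4, 5)) / denom
print(p_ge_4)  # ≈ 0.00408352
```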

You can easily recalculate the example above with a hypergeometric distribution calculator or with the free statistical programming language R (sometimes also called GNU S). The R code snippets are as follows:
Pr(k = 4): type

choose(5, 4) * choose(45, 6) / choose(50, 10)

Pr(k = 5): type

choose(5, 5) * choose(45, 5) / choose(50, 10)

or alternatively,

dhyper(5, m=5, n=45, k=10)

Pr(k ≥ 4): type

phyper(q=3, m=5, n=45, k=10, lower.tail=FALSE)


Explanation:
q : number of white marbles drawn [k in the contingency table]
m : total number of white marbles [D in the contingency table]
n : total number of black marbles [N − D in the contingency table]
k : total number of marbles drawn [n in the contingency table]
lower.tail = FALSE : if lower.tail is TRUE (the default), Pr(k ≤ q) is calculated; if it is FALSE, Pr(k > q) is calculated. Hence, to obtain Pr(k ≥ 4) you calculate Pr(k > 3).

In addition, the open-source spreadsheet program Gnumeric has the function HYPGEOMDIST to perform the same calculation (Microsoft Excel also has the HYPGEOMDIST function, but the Excel implementation does not support the cumulative option provided by Gnumeric).


If n is much smaller than min(D, N − D), the hypergeometric distribution approaches the binomial distribution. In other words, this is the case when the number of marbles drawn from the urn is much smaller than both the number of black marbles and the number of white marbles. Roughly speaking, sampling with and without replacement are almost identical in large populations.

Symmetries

f(k;N,D,n) = {{{D \choose k} {{N-D} \choose {n-k}}}\over {N \choose n}} = f(n-k;N,N-D,n)

This symmetry can be understood intuitively by repainting all the black marbles white and vice versa, so that the black and white marbles simply swap roles.

f(k;N,D,n) = f(k;N,n,D)

This symmetry can be understood intuitively if, instead of drawing marbles, you label the marbles that you would have drawn. Both expressions give the probability that exactly k marbles are "white" and labeled "drawn".
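Both symmetries are easy to check with exact rational arithmetic; a small Python sketch (the function name f mirrors the notation above, the rest is my own):

```python
from fractions import Fraction
from math import comb

def f(k, N, D, n):
    # exact hypergeometric pmf f(k; N, D, n) as a rational number
    return Fraction(comb(D, k) * comb(N - D, n - k), comb(N, n))

N, D, n = 50, 5, 10
for k in range(min(n, D) + 1):
    assert f(k, N, D, n) == f(n - k, N, N - D, n)  # repaint the marbles
    assert f(k, N, D, n) == f(k, N, n, D)          # swap "drawn" and "white"
print("both symmetries hold")
```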

Relationship to Fisher's exact test

The test based on the hypergeometric distribution (the hypergeometric test) is identical to the corresponding one-tailed version of Fisher's exact test. Conversely, the p-value of a two-sided Fisher's exact test can be calculated as the sum of two appropriate hypergeometric tests.
