Zipf-Mandelbrot law

From Wikipedia, the free encyclopedia

Zipf-Mandelbrot
Parameters: N \in \{1,2,3,\ldots\} (integer); q \in [0,\infty) (real); s > 0 (real)
Support: k \in \{1,2,\ldots,N\}
Probability mass function (pmf): \frac{1/(k+q)^s}{H_{N,q,s}}
Cumulative distribution function (cdf): \frac{H_{k,q,s}}{H_{N,q,s}}
Mean: \frac{H_{N,q,s-1}}{H_{N,q,s}}-q
Mode: 1

In probability theory and statistics, the Zipf-Mandelbrot law is a discrete probability distribution. Also known as the Pareto-Zipf law, it is a power-law distribution on ranked data. It is named after the linguist George Kingsley Zipf, who suggested a simpler distribution called Zipf's law, and the mathematician Benoît Mandelbrot, who subsequently generalized it.

The probability mass function is given by:

f(k;N,q,s)=\frac{1/(k+q)^s}{H_{N,q,s}}

where H_{N,q,s} is given by:

H_{N,q,s}=\sum_{i=1}^N \frac{1}{(i+q)^s}

which may be thought of as a generalization of a harmonic number. In the limit as N approaches infinity, and provided s > 1 so that the sum converges, this becomes the Hurwitz zeta function ζ(s, q+1), using the convention ζ(s,a) = \sum_{n=0}^\infty 1/(n+a)^s. For finite N and q = 0 the Zipf-Mandelbrot law becomes Zipf's law. For infinite N and q = 0 it becomes a zeta distribution.
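The formulas above translate directly into code. The following is a minimal Python sketch that evaluates H_{N,q,s} by direct summation and derives the pmf, cdf and mean from it. The function names are illustrative only and do not come from any library, and direct summation is assumed to be acceptable, which holds for moderate N.

```python
def generalized_harmonic(n, q, s):
    """H_{n,q,s} = sum_{i=1}^{n} 1 / (i + q)^s (0.0 for n = 0)."""
    return sum(1.0 / (i + q) ** s for i in range(1, n + 1))


def zipf_mandelbrot_pmf(k, N, q, s):
    """f(k; N, q, s) for integer k; zero outside the support {1, ..., N}."""
    if not 1 <= k <= N:
        return 0.0
    return (1.0 / (k + q) ** s) / generalized_harmonic(N, q, s)


def zipf_mandelbrot_cdf(k, N, q, s):
    """H_{k,q,s} / H_{N,q,s}, with k clamped to the range 0..N."""
    k = max(0, min(k, N))
    return generalized_harmonic(k, q, s) / generalized_harmonic(N, q, s)


def zipf_mandelbrot_mean(N, q, s):
    """Mean: H_{N,q,s-1} / H_{N,q,s} - q."""
    return generalized_harmonic(N, q, s - 1) / generalized_harmonic(N, q, s) - q
```

With q = 0 the same functions evaluate Zipf's law. For example, zipf_mandelbrot_pmf(1, 2, 0, 1) is 1 / (1 + 1/2) = 2/3.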

Applications

The distribution of words ranked by their frequency in a random text corpus is generally a power-law distribution, known as Zipf's law.

If one plots the frequency rank of words contained in a large corpus of text data versus the number of occurrences or actual frequencies, one obtains a power-law distribution, with exponent close to one (but see Gelbukh and Sidorov 2001).
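The rank-versus-frequency plot described here can be approximated numerically. The Python sketch below builds the rank-frequency table from a list of tokens and fits a straight line to the log-log points by ordinary least squares. The helper names are hypothetical, tokenization is assumed to have been done already, and the least-squares slope is only a rough diagnostic of the exponent rather than a rigorous estimator.

```python
import math
from collections import Counter


def rank_frequency(words):
    """Return (rank, count) pairs, with rank 1 for the most frequent word."""
    counts = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(counts, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two distinct ranks. A slope near -1 corresponds to
    a power law with exponent close to one.
    """
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A typical call would be loglog_slope(rank_frequency(tokens)), where tokens is a list of words from the corpus; the estimated exponent is the negative of the returned slope.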

References and links