Directional statistics

Directional statistics is the subdiscipline of statistics that deals with directions (unit vectors in Rⁿ), axes (lines through the origin in Rⁿ) or rotations in Rⁿ. More generally, directional statistics deals with observations on compact Riemannian manifolds.

The overall shape of a protein can be parameterized as a sequence of points on the unit sphere. Shown are two views of the spherical histogram of such points for a large collection of protein structures. The statistical treatment of such data is in the realm of directional statistics.^[1]

The fact that 0 degrees and 360 degrees are identical angles, so that for example 180 degrees is not a sensible mean of 2 degrees and 358 degrees, provides one illustration that special statistical methods are required for the analysis of some types of data (in this case, angular data). Other examples of data that may be regarded as directional include statistics involving temporal periods (e.g. time of day, week, month, year, etc.), compass directions, dihedral angles in molecules, orientations, rotations and so on.

Circular and higher-dimensional distributions

Any probability density function $p(x)$ on the line can be "wrapped" around the circumference of a circle of unit radius.^[2] That is, the pdf of the wrapped variable

\theta = x_w=x \mod 2\pi\ \ \in (-\pi,\pi]

is

p_w(\theta)=\sum_{k=-\infty}^{\infty}{p(\theta+2\pi k)}.

This concept can be extended to the multivariate case by replacing the single sum with $F$ nested sums, one for each dimension of the feature space:

p_w(\vec\theta)=\sum_{k_1=-\infty}^{\infty}\cdots \sum_{k_F=-\infty}^\infty{p(\vec\theta+2\pi k_1\mathbf{e}_1+\dots+2\pi k_F\mathbf{e}_F)}

where $\mathbf{e}_k=(0,\dots,0,1,0,\dots,0)^{\mathsf{T}}$ is the $k$ th Euclidean basis vector.
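The univariate wrapping sum can be approximated numerically by truncating it to a finite number of terms. The following minimal sketch wraps a normal density and checks that the result still integrates to one over the circle; the helper names (`normal_pdf`, `wrapped_pdf`, `integrate_over_circle`) and the truncation level `k_max` are illustrative assumptions, not part of any particular library.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the ordinary (linear) normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def wrapped_pdf(pdf, theta, k_max=50):
    """Approximate p_w(theta) by summing pdf(theta + 2*pi*k) for |k| <= k_max."""
    return sum(pdf(theta + 2.0 * math.pi * k) for k in range(-k_max, k_max + 1))

def integrate_over_circle(f, n=2000):
    """Midpoint rule over the interval (-pi, pi]."""
    h = 2.0 * math.pi / n
    return h * sum(f(-math.pi + (i + 0.5) * h) for i in range(n))

def linear_density(x):
    # Example linear density: normal with mean 0.3 and standard deviation 1.5
    return normal_pdf(x, 0.3, 1.5)

# The wrapped density should integrate to (approximately) one over the circle.
total = integrate_over_circle(lambda t: wrapped_pdf(linear_density, t))
```

The truncation is adequate here because the normal tails decay quickly; heavy-tailed densities need many more terms or a closed form.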

The following sections show some relevant circular distributions.

von Mises circular distribution

The von Mises distribution is a circular distribution which, like any other circular distribution, may be thought of as a wrapping of a certain linear probability distribution around the circle. The underlying linear probability distribution for the von Mises distribution is mathematically intractable; however, for statistical purposes, there is no need to deal with the underlying linear distribution. The usefulness of the von Mises distribution is twofold: it is the most mathematically tractable of all circular distributions, allowing simpler statistical analysis, and it is a close approximation to the wrapped normal distribution, which, analogously to the linear normal distribution, is important because it is the limiting case for the sum of a large number of small angular deviations. In fact, the von Mises distribution is often known as the "circular normal" distribution because of its ease of use and its close relationship to the wrapped normal distribution (Fisher, 1993).

The pdf of the von Mises distribution is:

f(\theta;\mu,\kappa)=\frac{e^{\kappa\cos(\theta-\mu)}}{2\pi I_0(\kappa)}

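where $\mu$ is the mean direction, $\kappa \ge 0$ is the concentration parameter, and $I_0$ is the modified Bessel function of the first kind of order 0.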
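As a rough illustration, the von Mises density can be evaluated with the standard library alone by computing $I_0$ from its power series. This is a sketch under stated assumptions: the function names `bessel_i0` and `von_mises_pdf` are hypothetical, and the fixed number of series terms is only suitable for moderate values of $\kappa$.

```python
import math

def bessel_i0(x, terms=60):
    """Modified Bessel function of the first kind, order 0, via its power series
    sum_k ((x/2)^2)^k / (k!)^2. Intended for moderate |x| only."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (x / 2.0) ** 2 / ((k + 1) ** 2)
    return total

def von_mises_pdf(theta, mu, kappa):
    """Von Mises density with mean direction mu and concentration kappa."""
    return math.exp(kappa * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(kappa))

def integrate_over_circle(f, n=2000):
    """Midpoint rule over the interval (-pi, pi]."""
    h = 2.0 * math.pi / n
    return h * sum(f(-math.pi + (i + 0.5) * h) for i in range(n))

# Normalisation check for an arbitrary choice of parameters.
vm_total = integrate_over_circle(lambda t: von_mises_pdf(t, 1.0, 2.5))
```

For $\kappa = 0$ the density reduces to the circular uniform density, and it becomes more sharply peaked around $\mu$ as $\kappa$ grows.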

Circular uniform distribution

The probability density function (pdf) of the circular uniform distribution is given by

U(\theta)=\frac{1}{2\pi}.\,

Wrapped normal distribution

The pdf of the wrapped normal distribution (WN) is:

WN(\theta;\mu,\sigma)=\frac{1}{\sigma \sqrt{2\pi}} \sum^{\infty}_{k=-\infty} \exp \left[\frac{-(\theta - \mu - 2\pi k)^2}{2 \sigma^2} \right]=\frac{1}{2\pi}\vartheta\left(\frac{\theta-\mu}{2\pi},\frac{i\sigma^2}{2\pi}\right)

where $\mu$ and $\sigma$ are the mean and standard deviation of the unwrapped distribution, respectively, and $\vartheta(\theta,\tau)$ is the Jacobi theta function:

\vartheta(\theta,\tau)=\sum_{n=-\infty}^\infty (w^2)^n q^{n^2}

where $w \equiv e^{i\pi \theta}$ and $q \equiv e^{i\pi\tau}$.
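Substituting the arguments into the theta-function form gives $w^2 = e^{i(\theta-\mu)}$ and $q = e^{-\sigma^2/2}$, so the density is the Fourier series $\frac{1}{2\pi}\left[1 + 2\sum_{n\ge 1} e^{-\sigma^2 n^2/2}\cos(n(\theta-\mu))\right]$. The sketch below compares this series with the direct wrapped sum; the function names and truncation levels are assumptions chosen for illustration.

```python
import math

def wn_direct(theta, mu, sigma, k_max=20):
    """Wrapped normal density as a truncated sum of shifted normal densities."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-((theta - mu - 2.0 * math.pi * k) ** 2) / (2.0 * sigma ** 2))
        for k in range(-k_max, k_max + 1)
    )

def wn_series(theta, mu, sigma, n_max=50):
    """Wrapped normal density from the theta-function (Fourier) series, real form."""
    s = 1.0 + 2.0 * sum(
        math.exp(-0.5 * sigma ** 2 * n ** 2) * math.cos(n * (theta - mu))
        for n in range(1, n_max + 1)
    )
    return s / (2.0 * math.pi)

# Largest disagreement between the two forms over a few test angles.
max_diff = max(
    abs(wn_direct(t, 0.5, 1.0) - wn_series(t, 0.5, 1.0))
    for t in (-3.0, -1.0, 0.0, 0.5, 2.0, 3.1)
)
```

In practice the direct sum converges fastest for small $\sigma$ and the series form for large $\sigma$, so an implementation might switch between them.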

Wrapped Cauchy distribution

The pdf of the wrapped Cauchy distribution (WC) is:

WC(\theta;\theta_0,\gamma)=\sum_{n=-\infty}^\infty \frac{\gamma}{\pi(\gamma^2+(\theta+2\pi n-\theta_0)^2)} =\frac{1}{2\pi}\,\,\frac{\sinh\gamma}{\cosh\gamma-\cos(\theta-\theta_0)}

where $\gamma$ is the scale factor and $\theta_0$ is the peak position.
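The equality between the wrapped sum and the closed form can be checked numerically. Because the Cauchy tails decay only like $1/x^2$, the truncated sum converges slowly, so the sketch below uses many terms and a loose tolerance; the function names and the value of `n_max` are illustrative assumptions.

```python
import math

def wc_closed(theta, theta0, gamma):
    """Closed form of the wrapped Cauchy density."""
    return math.sinh(gamma) / (
        2.0 * math.pi * (math.cosh(gamma) - math.cos(theta - theta0))
    )

def wc_truncated(theta, theta0, gamma, n_max=20000):
    """Wrapped Cauchy density as a truncated sum of shifted Cauchy densities."""
    return sum(
        gamma / (math.pi * (gamma ** 2 + (theta + 2.0 * math.pi * n - theta0) ** 2))
        for n in range(-n_max, n_max + 1)
    )

# Compare the two forms at a few angles for one choice of parameters.
wc_diffs = [
    abs(wc_closed(t, 0.4, 0.7) - wc_truncated(t, 0.4, 0.7))
    for t in (-2.0, 0.4, 2.5)
]
```

The closed form is the practical choice for evaluation; the truncated sum is shown only to illustrate the wrapping construction.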

Wrapped Lévy distribution

The pdf of the wrapped Lévy distribution (WL) is:

f_{WL}(\theta;\mu,c)=\sum_{n=-\infty}^\infty \sqrt{\frac{c}{2\pi}}\,\frac{e^{-\frac{c}{2(\theta+2\pi n-\mu)}}}{(\theta+2\pi n-\mu)^{3/2}}

where the value of the summand is taken to be zero when $\theta+2\pi n-\mu \le 0$, $c$ is the scale factor and $\mu$ is the location parameter.

Distributions on higher-dimensional manifolds

Three point sets sampled from different Kent distributions on the sphere.

There also exist distributions on the two-dimensional sphere (such as the Kent distribution^[3]), the N-dimensional sphere (the von Mises–Fisher distribution^[4]) or the torus (the bivariate von Mises distribution^[5]).

The matrix von Mises–Fisher distribution is a distribution on the Stiefel manifold, and can be used to construct probability distributions over rotation matrices.^[6]

The Bingham distribution is a distribution over axes in N dimensions, or equivalently, over points on the (N − 1)-dimensional sphere with the antipodes identified.^[7] For example, if N = 2, the axes are undirected lines through the origin in the plane. In this case, each axis cuts the unit circle in the plane (which is the one-dimensional sphere) at two points that are each other's antipodes. For N = 4, the Bingham distribution is a distribution over the space of unit quaternions. Since a unit quaternion corresponds to a rotation matrix, the Bingham distribution for N = 4 can be used to construct probability distributions over the space of rotations, just like the matrix von Mises–Fisher distribution.

These distributions are used, for example, in geology,^[8] crystallography^[9] and bioinformatics.^[1]^[10]^[11]

The fundamental difference between linear and circular statistics

A simple way to calculate the mean of a series of angles (in the interval [0°, 360°)) is to calculate the mean of the cosines and sines of each angle, and obtain the angle by calculating the inverse tangent. Consider the following three angles as an example: 10, 20, and 30 degrees. Intuitively, calculating the mean would involve adding these three angles together and dividing by 3, in this case indeed resulting in a correct mean angle of 20 degrees. By rotating this system anticlockwise through 15 degrees the three angles become 355 degrees, 5 degrees and 15 degrees. The naive mean is now 125 degrees, which is the wrong answer, as it should be 5 degrees. The vector mean $\scriptstyle\bar \theta$ can be calculated in the following way, using the mean sine $\scriptstyle\bar s$ and the mean cosine $\scriptstyle\bar c \not = 0$ :

\bar s = \frac{1}{3} \left( \sin (355^\circ) + \sin (5^\circ) + \sin (15^\circ) \right) = \frac{1}{3} \left( -0.087 + 0.087 + 0.259 \right) \approx 0.086

\bar c = \frac{1}{3} \left( \cos (355^\circ) + \cos (5^\circ) + \cos (15^\circ) \right) = \frac{1}{3} \left( 0.996 + 0.996 + 0.966 \right) \approx 0.986

\bar \theta = \left. \begin{cases} \arctan \left( \frac{\bar s}{ \bar c} \right) & \bar s > 0 ,\ \bar c > 0 \\ \arctan \left( \frac{\bar s}{ \bar c} \right) + 180^\circ & \bar c < 0 \\ \arctan \left (\frac{\bar s}{\bar c} \right)+360^\circ & \bar s <0 ,\ \bar c >0 \end{cases} \right\} = \arctan \left( \frac{0.086}{0.986} \right) = \arctan (0.087) = 5^\circ.

This may be more succinctly stated by realizing that directional data are in fact vectors of unit length. In the case of one-dimensional data, these data points can be represented conveniently as complex numbers of unit magnitude $z=\cos(\theta)+i\,\sin(\theta)=e^{i\theta}$ , where $\theta$ is the measured angle. The mean resultant vector for the sample is then:

\overline{\mathbf{\rho}}=\frac{1}{N}\sum_{n=1}^N z_n.

The sample mean angle is then the argument of the mean resultant:

\overline{\theta}=\mathrm{Arg}(\overline{\mathbf{\rho}}).

The length of the sample mean resultant vector is:

\overline{R}=|\overline{\mathbf{\rho}}|

and will have a value between 0 and 1. Thus the sample mean resultant vector can be represented as:

\overline{\mathbf{\rho}}=\overline{R}\,e^{i\overline{\theta}}.
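The worked example with 355, 5 and 15 degrees can be reproduced with complex arithmetic. This is a minimal sketch; the function names `mean_resultant` and `circular_mean_deg` are hypothetical, and the mean angle is undefined when the mean resultant has zero length.

```python
import cmath
import math

def mean_resultant(angles_deg):
    """Mean resultant vector (a complex number) of angles given in degrees."""
    z = [cmath.exp(1j * math.radians(a)) for a in angles_deg]
    return sum(z) / len(z)

def circular_mean_deg(angles_deg):
    """Sample mean angle in degrees, reduced to the interval [0, 360)."""
    rho = mean_resultant(angles_deg)
    return math.degrees(cmath.phase(rho)) % 360.0

angles = [355.0, 5.0, 15.0]
naive = sum(angles) / len(angles)       # 125.0, the misleading arithmetic mean
theta_bar = circular_mean_deg(angles)   # approximately 5.0
r_bar = abs(mean_resultant(angles))     # mean resultant length, between 0 and 1
```

Using the two-argument arctangent implicit in `cmath.phase` avoids the case distinction needed with the single-argument arctangent.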

Moments

The raw vector (or trigonometric) moments of a circular distribution are defined as

m_n=E(z^n)=\int_\Gamma P(\theta)z^n d\theta\,

where $\Gamma$ is any interval of length $2\pi$ and $P(\theta)$ is the PDF of the circular distribution. Since the integral of $P(\theta)$ over $\Gamma$ is unity, $|z^n|=1$, and the integration interval is finite, it follows that the moments of any circular distribution are always finite and well defined.

Sample moments are analogously defined:

\overline{m}_n=\frac{1}{N}\sum_{i=1}^N z_i^n.

The population resultant vector, length, and mean angle are defined in analogy with the corresponding sample parameters.

\rho=m_1\,

R=|m_1|\,

\theta_\mu=\mathrm{Arg}(m_1).\,

In addition, the lengths of the higher moments are defined as:

R_n=|m_n|\,

while the angular parts of the higher moments are just $(n \theta_\mu) \mod 2\pi$ . The lengths of the higher moments will all lie between 0 and 1.
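Sample trigonometric moments follow directly from the definition. The sketch below computes the first two sample moments for the angles 355, 5 and 15 degrees; the function name `sample_moment` is an assumption made for illustration. This sample is symmetric about 5 degrees, so the angular part of the second moment is twice the mean angle.

```python
import cmath
import math

def sample_moment(angles, n):
    """n-th sample trigonometric moment of angles given in radians."""
    return sum(cmath.exp(1j * n * a) for a in angles) / len(angles)

angles_rad = [math.radians(d) for d in (355.0, 5.0, 15.0)]
m1 = sample_moment(angles_rad, 1)
m2 = sample_moment(angles_rad, 2)

r1, r2 = abs(m1), abs(m2)                 # moment lengths, each between 0 and 1
arg1 = math.degrees(cmath.phase(m1))      # approximately 5 degrees
arg2 = math.degrees(cmath.phase(m2))      # approximately 10 degrees
```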

Measures of location and spread

Main article: Mean of circular quantities

Various measures of location and spread may be defined for both the population and a sample drawn from that population.^[12] The most common measure of location is the circular mean. The population circular mean is the argument of the first moment of the distribution, while the sample mean is the argument of the first sample moment. The sample mean will serve as an unbiased estimator of the population mean.

When data is concentrated, the median and mode may be defined by analogy to the linear case, but for more dispersed or multi-modal data, these concepts are not useful.

The most common measures of circular spread are:

The circular variance. For the sample the circular variance is defined as:

\overline{\mathrm{Var}(z)}=1-\overline{R}\,

and for the population

\mathrm{Var}(z)=1-R\,

Both will have values between 0 and 1.

The circular standard deviation

S(z)=\sqrt{\ln(1/R^2)}=\sqrt{-2\ln(R)}\,

\overline{S}(z)=\sqrt{\ln(1/{\overline{R}}^2)}=\sqrt{-2\ln({\overline{R}})}\,

with values between 0 and infinity. This definition of the standard deviation (rather than the square root of the variance) is useful because for a wrapped normal distribution, it is an estimator of the standard deviation of the underlying normal distribution. It will therefore allow the circular distribution to be standardized as in the linear case, for small values of the standard deviation. This also applies to the von Mises distribution, which closely approximates the wrapped normal distribution. Note that for small $S(z)$ we have approximately

S(z)^2 \approx 2\,\mathrm{Var}(z)

The circular dispersion

\delta=\frac{1-R_2}{2R^2}

\overline{\delta}=\frac{1-{\overline{R}_2}}{2{\overline{R}}^2}

with values between 0 and infinity. This measure of spread is found useful in the statistical analysis of variance.
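The three sample measures of spread can be computed together from the first two sample moments. The following is a sketch under stated assumptions: the function name `circular_spread` and the dictionary keys are hypothetical, and the sample must have a non-zero mean resultant length for the standard deviation and dispersion to be finite.

```python
import cmath
import math

def circular_spread(angles):
    """Sample circular variance, standard deviation and dispersion (angles in radians)."""
    n = len(angles)
    m1 = sum(cmath.exp(1j * a) for a in angles) / n
    m2 = sum(cmath.exp(2j * a) for a in angles) / n
    # Clamp to 1 to guard against rounding slightly above the theoretical bound.
    r1 = min(abs(m1), 1.0)
    r2 = min(abs(m2), 1.0)
    return {
        "variance": 1.0 - r1,
        "std": math.sqrt(-2.0 * math.log(r1)),
        "dispersion": (1.0 - r2) / (2.0 * r1 ** 2),
    }

spread = circular_spread([math.radians(d) for d in (355.0, 5.0, 15.0)])
```

For this tightly clustered sample the squared standard deviation is close to twice the variance, in line with the small-spread approximation.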

Distribution of the mean

Given a set of N measurements $z_n=e^{i\theta_n}$ the mean value of z is defined as:

\overline{z}=\frac{1}{N}\sum_{n=1}^N z_n

which may be expressed as

\overline{z} = \overline{C}+i\overline{S}

where

\overline{C} = \frac{1}{N}\sum_{n=1}^N \cos(\theta_n) \text{ and } \overline{S} = \frac{1}{N}\sum_{n=1}^N \sin(\theta_n)

or, alternatively as:

\overline{z} = \overline{R}e^{i\overline{\theta}}

where

\overline{R} = \sqrt{{\overline{C}}^2+{\overline{S}}^2}\,\,\,\mathrm{and}\,\,\,\,\overline{\theta} = \mathrm{ArcTan}(\overline{S},\overline{C}).

The distribution of the mean ($\overline{\theta}$) for a circular pdf $P(\theta)$ will be given by:

P(\overline{C},\overline{S}) \, d\overline{C} \, d\overline{S} = P(\overline{R},\overline{\theta}) \, d\overline{R} \, d\overline{\theta} = \int_\Gamma \cdots \int_\Gamma \prod_{n=1}^N \left[ P(\theta_n) \, d\theta_n \right]

where $\Gamma$ is over any interval of length $2\pi$ and the integral is subject to the constraint that $\overline{S}$ and $\overline{C}$ are constant, or, alternatively, that $\overline{R}$ and $\overline{\theta}$ are constant.

The calculation of the distribution of the mean for most circular distributions is not analytically possible, and in order to carry out an analysis of variance, numerical or mathematical approximations are needed.^[13]

The central limit theorem may be applied to the distribution of the sample means (main article: Central limit theorem for directional statistics). It can be shown^[14] that the distribution of $[\overline{C},\overline{S}]$ approaches a bivariate normal distribution in the limit of large sample size.
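The limiting behaviour can be explored by simulation. The sketch below draws repeated samples from the circular uniform distribution, chosen here only because its moments are simple: $\overline{C}$ and $\overline{S}$ then each have mean 0 and variance $1/(2N)$. The function names, the sample size, the number of replicates and the seed are all illustrative assumptions.

```python
import math
import random

def mean_cos_sin(angles):
    """Return the pair (C-bar, S-bar) for angles in radians."""
    n = len(angles)
    return (sum(math.cos(a) for a in angles) / n,
            sum(math.sin(a) for a in angles) / n)

def simulate_means(n, replicates, seed=0):
    """Simulate (C-bar, S-bar) for repeated circular uniform samples of size n."""
    rng = random.Random(seed)
    return [
        mean_cos_sin([rng.uniform(-math.pi, math.pi) for _ in range(n)])
        for _ in range(replicates)
    ]

def sample_variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

N = 25
means = simulate_means(N, replicates=4000)
var_c = sample_variance([c for c, _ in means])
var_s = sample_variance([s for _, s in means])
expected_var = 1.0 / (2.0 * N)   # theoretical variance of C-bar and S-bar
```

A scatter plot of the simulated pairs would show an approximately circular bivariate normal cloud centred at the origin.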

Goodness of fit and significance testing

For cyclic data, for example when testing whether the data are uniformly distributed around the circle:

Rayleigh test for a unimodal cluster
Kuiper's test for possible multimodal data.
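As an illustration of the first of these, the Rayleigh statistic is $Z = N\overline{R}^2$, and under uniformity $2Z$ is approximately chi-squared distributed with two degrees of freedom, which gives the first-order p-value $e^{-Z}$. The sketch below uses only that large-sample approximation, without the finite-sample corrections that statistical packages usually apply; the function name `rayleigh_test` is hypothetical.

```python
import cmath
import math

def rayleigh_test(angles):
    """Rayleigh test of uniformity for angles in radians.

    Returns (Z, p), where p = exp(-Z) is a first-order large-sample
    approximation and should be treated with caution for small samples.
    """
    n = len(angles)
    r_bar = abs(sum(cmath.exp(1j * a) for a in angles) / n)
    z = n * r_bar ** 2
    return z, math.exp(-z)

# Evenly spaced angles: no preferred direction, so p is close to 1.
z_even, p_even = rayleigh_test([2.0 * math.pi * k / 12 for k in range(12)])

# Tightly clustered angles: strong preferred direction, so p is very small.
z_clustered, p_clustered = rayleigh_test([0.02 * k - 0.2 for k in range(21)])
```

The test has little power against multimodal alternatives with cancelling directions, which is why it is described as a test for a unimodal cluster.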

Software

R has several packages devoted to circular statistics, including CircStats, circular, CircNNTSR and isocir (isotonic inference for circular data).^[15]^[16]
Circular Statistics, a MATLAB toolbox containing the essentials to work with circular data.
Mocapy: a dynamic Bayesian network software package implemented in Python and C++. Uses stochastic expectation maximization for parameter learning, and supports directional statistics.
Oriana, Windows software for directional statistics.
SPAK: MATLAB package dealing with Kent distributions for spherical data.

References

[1] Hamelryck, T., Kent, J., Krogh, A. (2006) Sampling realistic protein conformations using local structural bias. PLoS Comput. Biol., 2(9): e131. Public Library of Science (PLoS). Retrieved 2008-02-01.
[2] Bahlmann, C. (2006) Directional features in online handwriting recognition. Pattern Recognition, 39.
[3] Kent, J. (1982) The Fisher–Bingham distribution on the sphere. J Royal Stat Soc, 44, 71–80.
[4] Fisher, RA. (1953) Dispersion on a sphere. Proc. Roy. Soc. London Ser. A., 217, 295–305.
[5] Mardia, KV., Taylor, CC., Subramaniam, GK. (2007) Protein Bioinformatics and Mixtures of Bivariate von Mises Distributions for Angular Data. Biometrics, 63, 505–512.
[6] Downs (1972) Orientational statistics. Biometrika, 59, 665–676.
[7] Bingham, C. (1974) An Antipodally Symmetric Distribution on the Sphere. Ann. Statist., 2, 1201–1225.
[8] Peel, D., Whiten, WJ., McLachlan, GJ. (2001) Fitting mixtures of Kent distributions to aid in joint set identification. J. Am. Stat. Ass., 96, 56–63.
[9] Krieger Lassen, N. C., Juul Jensen, D., Conradsen, K. (1994) On the statistical analysis of orientation data. Acta Cryst., A50, 741–748.
[10] Kent, J.T., Hamelryck, T. (2005) Using the Fisher–Bingham distribution in stochastic models for protein structure. In S. Barber, P.D. Baxter, K.V. Mardia, & R.E. Walls (Eds.), Quantitative Biology, Shape Analysis, and Wavelets, pp. 57–60. Leeds, Leeds University Press.
[11] Boomsma, W., Mardia, KV., Taylor, CC., Ferkinghoff-Borg, J., Krogh, A., Hamelryck, T. (2008) A generative, probabilistic model of local protein structure. Proc. Natl. Acad. Sci. USA, 105(26), 8932–8937. Retrieved 2008-06-26.
[12] Fisher, NI. (1993) Statistical Analysis of Circular Data. Cambridge University Press. ISBN 0-521-35018-2.
[13] Jammalamadaka, S. Rao; Sengupta, A. (2001) Topics in Circular Statistics. World Scientific Publishing Company. ISBN 978-981-02-3778-3. Retrieved 2010-03-03.
[14] Jammalamadaka, S. Rao; SenGupta, A. (2001) Topics in Circular Statistics. New Jersey: World Scientific. ISBN 981-02-3778-2. Retrieved 2011-05-15.
[15] Fernandez, M., Rueda, C., Peddada, S. (2012) Identification of a core set of signature cell cycle genes whose relative order of time to peak expression is conserved across species. Nucl. Acids Res., 40(7), 2823–2832. URL http://nar.oxfordjournals.org/content/40/7/2823.
[16] Rueda, C., Fernandez, M., Peddada, S. (2009) Estimation of Parameters Subject to Order Restrictions on a Circle with Application to Estimation of Phase Angles of Cell-Cycle Genes. Journal of the American Statistical Association, 104(485), 338–347. URL http://amstat.tandfonline.com/doi/abs/10.1198/jasa.2009.0120

Books on directional statistics

Batschelet, E. Circular statistics in biology, Academic Press, London, 1981. ISBN 0-12-081050-6.
Fisher, NI., Statistical Analysis of Circular Data, Cambridge University Press, 1993. ISBN 0-521-35018-2
Fisher, NI., Lewis, T., Embleton, BJJ. Statistical Analysis of Spherical Data, Cambridge University Press, 1993. ISBN 0-521-45699-1
Mardia, KV. and Jupp P., Directional Statistics (2nd edition), John Wiley and Sons Ltd., 2000. ISBN 0-471-95333-4

External links

Directional Statistics, Concepts and Techniques in Modern Geography 25
CircStat: A MATLAB Toolbox for Circular Statistics, Journal of Statistical Software, Vol. 31, Issue 10, Sep 2009
Circular Values Math and Statistics with C++11, A C++11 infrastructure for circular values (angles, time-of-day, etc.) mathematics and statistics
