Anderson-Darling test

From Wikipedia, the free encyclopedia

The Anderson-Darling test, named after Theodore Wilbur Anderson, Jr. (1918–?) and Donald A. Darling (1915–?), who invented it in 1952,^[1] is one of the most powerful statistical tests for detecting most departures from normality. It may be used with small sample sizes (n ≤ 25). With very large samples, the test may reject the assumption of normality because of only slight imperfections, although industrial data sets with 200 or more observations have passed the Anderson-Darling test.^{[citation needed]}

The Anderson-Darling test assesses whether a sample comes from a specified distribution. The formula for the test statistic $A^2$, which assesses whether the ordered data $\{Y_1<\cdots <Y_N\}$ (note that the data must be sorted in increasing order) come from a distribution with cumulative distribution function (CDF) $F$, is

$A^2 = -N - S$

where

$S=\sum_{k=1}^N \frac{2k-1}{N}\left[\ln F(Y_k) + \ln\left(1-F(Y_{N+1-k})\right)\right].$

The test statistic can then be compared against the critical values of the theoretical distribution (dependent on which $F$ is used) to determine the P-value.
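As an illustration, the statistic defined above can be computed directly from a sorted sample and a CDF. The following is a minimal sketch, not a reference implementation: the function name `anderson_darling_statistic`, the helper `uniform_cdf`, and the five data values are assumptions made for the example, and no critical values are included because they depend on which $F$ is used.

```python
import math
from typing import Callable, Sequence


def anderson_darling_statistic(data: Sequence[float],
                               cdf: Callable[[float], float]) -> float:
    """Return A^2 = -N - S for `data` against a fully specified CDF."""
    y = sorted(data)  # the formula requires Y_1 <= ... <= Y_N
    n = len(y)
    s = 0.0
    for k in range(1, n + 1):
        f_low = cdf(y[k - 1])   # F(Y_k)
        f_high = cdf(y[n - k])  # F(Y_{N+1-k})
        # math.log raises ValueError if F is exactly 0 or 1 at a data point
        s += (2 * k - 1) / n * (math.log(f_low) + math.log(1.0 - f_high))
    return -n - s


def uniform_cdf(x: float) -> float:
    """CDF of the uniform distribution on (0, 1); chosen only for illustration."""
    return min(max(x, 0.0), 1.0)


# Hypothetical sample, evenly spread over (0, 1)
sample = [0.1, 0.3, 0.5, 0.7, 0.9]
a2 = anderson_darling_statistic(sample, uniform_cdf)
print(f"A^2 = {a2:.4f}")
```

Because the function sorts its input, the order in which the observations are supplied does not affect the result.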

The Anderson-Darling test for normality is a distance or empirical distribution function (EDF) test. It is based upon the concept that when given a hypothesized underlying distribution, the data can be transformed to a uniform distribution. The transformed sample data can be then tested for uniformity with a distance test (Shapiro 1980).
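The transformation to uniformity mentioned above can be sketched as follows. This example assumes a hypothesized normal distribution with mean 10 and standard deviation 2 (values chosen purely for illustration) and builds an idealized sample that sits exactly on the quantiles of that distribution, so that applying the CDF recovers an evenly spaced grid on (0, 1). Real data would only approximate such a grid.

```python
from statistics import NormalDist

# Assumed hypothesized distribution for this sketch: normal, mean 10, sd 2
hypothesized = NormalDist(mu=10.0, sigma=2.0)

n = 9
# Idealized sample placed exactly at the quantiles (k - 0.5) / n
sample = [hypothesized.inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]

# Transform each observation with the hypothesized CDF: U_k = F(Y_k)
u = [hypothesized.cdf(y) for y in sample]

# Largest deviation of the transformed values from the uniform grid
max_gap = max(abs(u_k - (k - 0.5) / n) for k, u_k in enumerate(u, start=1))
print(u)
print(max_gap)
```

A distance test then measures how far the transformed values depart from what a uniform sample would look like.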

In comparisons of power, Stephens (1974) found $A^2$ to be one of the best EDF statistics for detecting most departures from normality.^[2] The only statistic close was the $W^2$ (Cramér–von Mises test) statistic.

Procedure

(If testing for normal distribution of the variable X)

1) The data $X_i$, for $i=1,\ldots,n$, of the variable $X$ to be tested are sorted from low to high.

2) The mean $\bar{X}$ and standard deviation $s$ are calculated from the sample of $X$.

3) The values $X_i$ are standardized as

$Y_i=\frac{X_i-\bar{X}}{s}$

4) With the standard normal CDF $\Phi$, $A^2$ is calculated using

$A^2 = -n -\frac{1}{n} \sum_{i=1}^n (2i-1)(\ln \Phi(Y_i)+ \ln(1-\Phi(Y_{n+1-i})))$

or without repeating indices as

$A^2 = -n -\frac{1}{n} \sum_{i=1}^n\left[(2i-1)\ln\Phi(Y_i)+(2(n-i)+1)\ln(1-\Phi(Y_i))\right].$

5) $A^{*2}$, an approximate adjustment for sample size, is calculated using

$A^{*2}=A^2\left(1+\frac{0.75}{n}+\frac{2.25}{n^2}\right)$

6) If $A^{*2}$ exceeds 0.752, then the hypothesis of normality is rejected for a 5% level test.
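The six steps above can be sketched in Python using only the standard library. This is an illustrative sketch under stated assumptions rather than a definitive implementation: the function name `anderson_darling_normality` and the example data are hypothetical, and the standard deviation is computed with the n − 1 denominator, which the procedure above does not specify.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple

CRITICAL_VALUE_5_PERCENT = 0.752  # threshold quoted in step 6


def anderson_darling_normality(x: Sequence[float]) -> Tuple[float, float, bool]:
    """Return (A^2, adjusted A*^2, reject_at_5_percent) for a normality test."""
    n = len(x)
    xs = sorted(x)                     # step 1: sort from low to high
    x_bar = mean(xs)                   # step 2: sample mean
    s = stdev(xs)                      # step 2: sample sd (n - 1 denominator, an assumption)
    if s == 0:
        raise ValueError("A^2 is undefined when the standard deviation is zero")
    y = [(v - x_bar) / s for v in xs]  # step 3: standardize

    phi = NormalDist().cdf             # standard normal CDF
    total = 0.0
    for i in range(1, n + 1):          # step 4: form without repeated indices
        p = phi(y[i - 1])
        # math.log raises ValueError if p is exactly 0 or 1
        total += (2 * i - 1) * math.log(p) + (2 * (n - i) + 1) * math.log(1.0 - p)
    a2 = -n - total / n

    a2_star = a2 * (1 + 0.75 / n + 2.25 / n ** 2)   # step 5: sample-size adjustment
    return a2, a2_star, a2_star > CRITICAL_VALUE_5_PERCENT  # step 6


# Hypothetical data: strongly right-skewed values
skewed = [math.exp(k / 2) for k in range(20)]
a2, a2_star, reject = anderson_darling_normality(skewed)
print(a2, a2_star, reject)
```

For the skewed example data the adjusted statistic lies well above 0.752, so normality is rejected at the 5% level; a sample lying close to normal quantiles gives a much smaller value.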

Note:

1. If $s = 0$ or any $P_i = \Phi(Y_i)$ equals 0 or 1, then $A^2$ cannot be calculated and is undefined.

2. Above, it was assumed that the variable $X_i$ was being tested for normal distribution. Any other theoretical distribution can be assumed by using its CDF. Each theoretical distribution has its own critical values; examples include the lognormal, exponential, Weibull, extreme value type I and logistic distributions.

3. The null hypothesis is that the data follow the hypothesized distribution (in this case, that the standardized values follow N(0, 1)).

External links

US NIST Handbook of Statistics

References

[1] Anderson, T. W.; Darling, D. A. (1952). "Asymptotic theory of certain 'goodness-of-fit' criteria based on stochastic processes". Annals of Mathematical Statistics 23: 193–212. doi:10.1214/aoms/1177729437.
[2] Stephens, M. A. (1974). "EDF Statistics for Goodness of Fit and Some Comparisons". Journal of the American Statistical Association 69: 730–737. doi:10.2307/2286009.