Squared deviations

From Wikipedia, the free encyclopedia

Variance is defined as the expected value (for a theoretical distribution) or the average (for observed data) of squared deviations from the mean. Computations for analysis of variance involve partitioning a sum of squared deviations. These computations are much easier to follow after a close look at the quantity:

\operatorname{E}(  X ^ 2 ).

It is well known that for a random variable X with mean μ and variance σ²:[1]

\sigma^2 = \operatorname{E}(  X ^ 2 ) - \mu^2

Therefore

\operatorname{E}(  X ^ 2 ) = \sigma^2 + \mu^2.

From the above, the following are readily derived for a sample of n independent observations:

\operatorname{E}\left( \sum\left( X ^ 2\right) \right) = n\sigma^2 + n\mu^2
\operatorname{E}\left( \left(\sum X \right)^ 2 \right) = n\sigma^2 + n^2\mu^2
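
A short derivation of these two results, written out as a sketch. Independence of the observations is needed only for the second one, so that the variance of the sum is nσ²; the symbol \operatorname{Var} is introduced here for the variance and does not appear elsewhere in the article.

\operatorname{E}\left( \sum X^2 \right) = \sum \operatorname{E}\left( X^2 \right) = n\left(\sigma^2 + \mu^2\right)
\operatorname{E}\left( \left(\sum X\right)^2 \right) = \operatorname{Var}\left(\sum X\right) + \left(\operatorname{E}\left(\sum X\right)\right)^2 = n\sigma^2 + (n\mu)^2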

Sample variance

The sum of squared deviations needed to calculate variance (before deciding whether to divide by n or n − 1) is most easily calculated as

S = \sum x ^ 2 - \left(\sum x\right)^2/n

From the two derived expectations above the expected value of this sum is

\operatorname{E}(S) = n\sigma^2 + n\mu^2 - (n\sigma^2 + n^2\mu^2)/n

which implies

\operatorname{E}(S) = (n - 1)\sigma^2.

This justifies the divisor n − 1 in the calculation of an unbiased sample estimate of σ²: dividing S by n − 1 gives an estimator whose expected value is exactly σ².
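
The calculation can be sketched in a few lines of Python. The function names and the data values are illustrative assumptions, not part of the original text; the standard-library `statistics.variance` (which also divides by n − 1) is used only as a cross-check.

```python
from statistics import variance


def sum_squared_deviations(xs):
    """Return S = sum(x^2) - (sum(x))^2 / n for a sequence of numbers."""
    n = len(xs)
    return sum(x * x for x in xs) - sum(xs) ** 2 / n


def unbiased_variance(xs):
    """Divide S by n - 1, the divisor justified by E(S) = (n - 1) * sigma^2."""
    return sum_squared_deviations(xs) / (len(xs) - 1)


data = [1, 2, 3, 4, 6]  # illustrative values only
s = sum_squared_deviations(data)
# The last two printed values should agree up to floating-point rounding.
print(s, unbiased_variance(data), variance(data))
```

This shortcut form of S is convenient by hand, but in floating-point arithmetic it can lose precision when the mean is large compared with the spread of the data, so a library routine may be preferable in practice.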

Partition: analysis of variance

Where data are available for k different treatment groups of size n_i, with i running from 1 to k (so that there are n = Σ n_i observations in total), it is assumed that the expected mean of each group is

\operatorname{E}(\mu_i) = \mu + T_i

and the variance of each treatment group is unchanged from the population variance σ².

Under the null hypothesis that the treatments have no effect, each of the T_i will be zero.

It is now possible to calculate three sums of squares:

Individual
    I = \sum x^2
    \operatorname{E}(I) = n\sigma^2 + n\mu^2

Treatments
    T = \sum_{i=1}^k \left(\left(\sum x\right)^2/n_i\right)
    \operatorname{E}(T) = k\sigma^2 + \sum_{i=1}^k n_i(\mu + T_i)^2
    \operatorname{E}(T) = k\sigma^2 + n\mu^2 + 2\mu \sum_{i=1}^k (n_iT_i) + \sum_{i=1}^k n_i(T_i)^2
    Under the null hypothesis that the treatments cause no differences and all the T_i are zero, the expectation simplifies to
    \operatorname{E}(T) = k\sigma^2 + n\mu^2.

Combination
    C = \left(\sum x\right)^2/n
    \operatorname{E}(C) = \sigma^2 + n\mu^2

In T, the inner sum runs over the observations in group i. The expectations given for I and C likewise assume the null hypothesis.
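
As a sketch of where the expectation of T comes from: the earlier result \operatorname{E}((\sum X)^2) = n\sigma^2 + n^2\mu^2 can be applied to each group separately, with n_i in place of n and \mu + T_i in place of \mu, assuming the observations within each group are independent.

\operatorname{E}(T) = \sum_{i=1}^k \frac{n_i\sigma^2 + n_i^2(\mu + T_i)^2}{n_i} = k\sigma^2 + \sum_{i=1}^k n_i(\mu + T_i)^2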

Sums of squared deviations

Under the null hypothesis, the expected difference between any pair of I, T, and C does not depend on μ, only on σ²:

\operatorname{E}(I - C) = (n - 1)\sigma^2 \quad \text{(total squared deviations)}
\operatorname{E}(T - C) = (k - 1)\sigma^2 \quad \text{(treatment squared deviations)}
\operatorname{E}(I - T) = (n - k)\sigma^2 \quad \text{(residual squared deviations)}

The constants n − 1, k − 1, and n − k are normally referred to as the numbers of degrees of freedom.

Example

In a very simple example, 5 observations arise from two treatments. The first treatment gives three values, 1, 2, and 3, and the second treatment gives two values, 4 and 6.

I = \frac{1^2}{1} + \frac{2^2}{1} + \frac{3^2}{1} + \frac{4^2}{1} + \frac{6^2}{1} = 66
T = \frac{(1 + 2 + 3)^2}{3} + \frac{(4 + 6)^2}{2} = 12 + 50 = 62
C = \frac{(1 + 2 + 3 + 4 + 6)^2}{5} = 256/5 = 51.2

Giving

Total squared deviations = 66 − 51.2 = 14.8 with 4 degrees of freedom.
Treatment squared deviations = 62 − 51.2 = 10.8 with 1 degree of freedom.
Residual squared deviations = 66 − 62 = 4 with 3 degrees of freedom.
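
The arithmetic of this example can be checked with a minimal Python sketch. The function name `one_way_sums` and the variable names are assumptions made for illustration; the data are the five observations given above.

```python
def one_way_sums(groups):
    """Return (I, T, C) for a list of treatment groups (lists of numbers)."""
    all_values = [x for g in groups for x in g]
    n = len(all_values)
    i_sum = sum(x * x for x in all_values)             # I: sum of squared observations
    t_sum = sum(sum(g) ** 2 / len(g) for g in groups)  # T: squared group totals / group sizes
    c_sum = sum(all_values) ** 2 / n                   # C: squared grand total / n
    return i_sum, t_sum, c_sum


groups = [[1, 2, 3], [4, 6]]
I, T, C = one_way_sums(groups)
n = sum(len(g) for g in groups)  # total number of observations
k = len(groups)                  # number of treatment groups

# Each entry pairs a sum of squared deviations with its degrees of freedom.
partition = {
    "total": (I - C, n - 1),
    "treatment": (T - C, k - 1),
    "residual": (I - T, n - k),
}
for name, (ss, df) in partition.items():
    print(f"{name}: {ss:.1f} with {df} degrees of freedom")
```

Rounded to one decimal place, the printed values match the three lines above: 14.8, 10.8, and 4.0 with 4, 1, and 3 degrees of freedom.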

Two-way analysis of variance

The following hypothetical example gives the yields of 15 plants subject to two environmental variations, and three fertilisers.

              | Extra CO₂ | Extra humidity
No fertiliser | 7, 2, 1   | 7, 6
Nitrate       | 11, 6     | 10, 7, 3
Phosphate     | 5, 3, 4   | 11, 4

Five sums of squares are calculated:

Factor | Calculation | Sum | Coefficient of σ²
Individual | 7^2 + 2^2 + 1^2 + 7^2 + 6^2 + 11^2 + 6^2 + 10^2 + 7^2 + 3^2 + 5^2 + 3^2 + 4^2 + 11^2 + 4^2 | 641 | 15
Fertiliser × Environment | \frac{(7+2+1)^2}{3} + \frac{(7+6)^2}{2} + \frac{(11+6)^2}{2} + \frac{(10+7+3)^2}{3} + \frac{(5+3+4)^2}{3} + \frac{(11+4)^2}{2} | 556.1667 | 6
Fertiliser | \frac{(7+2+1+7+6)^2}{5} + \frac{(11+6+10+7+3)^2}{5} + \frac{(5+3+4+11+4)^2}{5} | 525.4 | 3
Environment | \frac{(7+2+1+11+6+5+3+4)^2}{8} + \frac{(7+6+10+7+3+11+4)^2}{7} | 519.2679 | 2
Composite | \frac{(7+2+1+11+6+5+3+4+7+6+10+7+3+11+4)^2}{15} | 504.6 | 1

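Finally, the sums of squared deviations required for the analysis of variance can be calculated. In the table below, each of the last five columns lists the coefficients (1 or −1) by which the values in the Sum column are combined; blank cells are zero. Applying the same coefficients to the σ² column gives the degrees of freedom.

Factor                   | Sum      | σ² | Total | Environment | Fertiliser | Fertiliser × Environment | Residual
Individual               | 641      | 15 |  1    |             |            |                          |  1
Fertiliser × Environment | 556.1667 |  6 |       |             |            |  1                       | −1
Fertiliser               | 525.4    |  3 |       |             |  1         | −1                       |
Environment              | 519.2679 |  2 |       |  1          |            | −1                       |
Composite                | 504.6    |  1 | −1    | −1          | −1         |  1                       |
Squared deviations       |          |    | 136.4 | 14.668      | 20.8       | 16.099                   | 84.833
Degrees of freedom       |          |    | 14    | 1           | 2          | 2                        | 9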
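
The two-way calculation can be reproduced with the following Python sketch. The helper `pooled_sum`, the dictionary layout, and the short labels are assumptions chosen for illustration; the yields are those of the 15 plants listed earlier in this section.

```python
# Yields keyed by (fertiliser, environment), copied from the example data.
cells = {
    ("none", "co2"): [7, 2, 1],
    ("none", "humidity"): [7, 6],
    ("nitrate", "co2"): [11, 6],
    ("nitrate", "humidity"): [10, 7, 3],
    ("phosphate", "co2"): [5, 3, 4],
    ("phosphate", "humidity"): [11, 4],
}


def pooled_sum(cells, key):
    """Sum of (group total)^2 / (group size), pooling cells that share key(label)."""
    groups = {}
    for label, values in cells.items():
        groups.setdefault(key(label), []).extend(values)
    return sum(sum(v) ** 2 / len(v) for v in groups.values())


individual = sum(x * x for v in cells.values() for x in v)
cell_sum = pooled_sum(cells, lambda label: label)       # one group per cell
fertiliser = pooled_sum(cells, lambda label: label[0])  # pool over environments
environment = pooled_sum(cells, lambda label: label[1]) # pool over fertilisers
composite = pooled_sum(cells, lambda label: None)       # all 15 plants in one group

# Combine the five sums with the coefficients (1 or -1) from the table.
deviations = {
    "total": individual - composite,
    "environment": environment - composite,
    "fertiliser": fertiliser - composite,
    "fertiliser x environment": cell_sum - fertiliser - environment + composite,
    "residual": individual - cell_sum,
}
for name, value in deviations.items():
    print(f"{name}: {value:.3f}")
```

Grouping by a key function keeps the five sums in one code path: the only thing that changes between rows of the table is which cells are pooled together before squaring the totals.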

References

  1. Mood & Graybill, An Introduction to the Theory of Statistics (McGraw-Hill)