Talk:Algorithms for calculating variance

Editorial comment: it is actually the first formula that has precision problems when dealing with limited-precision arithmetic. If the difference between measurements and the mean is very small, then the first formula will yield precision problems, as information will be lost in the (xi - µ) operation. There is no such loss of significance in the intermediate operations of the second formula. -- ScottMoonen
Editorial comment the second: in fact, the second formula is the one more commonly beset with problems. The first can have problems when the mean is very large relative to the variance, but this is relatively rare in practice and this problem also affects the second formula. Much more common is the situation where you have comparable mean and variance and a very large number of observations. In this case, the second formula will result in the subtraction of two very large numbers whose difference is relatively small (by a factor roughly equal to the number of observations). If you have one million observations, you lose roughly six significant figures with the second formula if you use ordinary floating point arithmetic. -- TedDunning
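
For illustration, here is the cancellation described above as a small Python sketch; the data set and the 1e9 shift are made up. The sample variance of {4, 7, 13, 16} is 30, and shifting every point by a constant must not change it:

# The "second formula": subtracts two nearly equal large numbers.
data = [1e9 + x for x in (4, 7, 13, 16)]
n = len(data)
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)
naive = (sum_x2 - sum_x * sum_x / n) / (n - 1)

# The "first formula": deviations from the mean stay small, so no cancellation.
mean = sum_x / n
two_pass = sum((x - mean) ** 2 for x in data) / (n - 1)

print(naive, two_pass)  # with IEEE doubles, naive is badly off; two_pass gives 30.0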
comment: The problem may occur
  1. when the deviations are very small relative to the mean or
  2. when they are small relative to the representational capacity of the arithmetic instrument (floating point computer, fixed point calculator, paper and pencil).

To be precise we have to specify the instrument and the nature of the data. -- DickBeldin


Ted, what do you mean by ordinary floating point arithmetic? Are you talking about precision with respect to register width, or is there an implicit reference to some "non-ordinary" scheme, that is, non-IEEE? Thanks.


I thought the reference was pretty explicit. I didn't mean, however, to imply the use of non-standard but still floating point arithmetic. I had in mind systems that are entirely different. A simple example is a system that uses rationals based on unlimited-precision integers. Another system might be fixed point arithmetic. The first would not lose precision in the computation, while the second would lose more or less precision than I quote, depending on the values in question and the characteristics of the fixed point system. You could also imagine an optimizer that noted you might be adding up large numbers of very small numbers and re-ordered the addition to avoid most of the round-off error by using log N temporary values to perform the addition. -- TedDunning
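
One concrete reading of the re-ordered addition Ted describes is pairwise summation; a minimal Python sketch, offered as an illustration rather than anything from Ted's comment:

def pairwise_sum(xs):
    # Sum by recursive halving: only about log2(N) partial sums are
    # combined on any path, so round-off grows like O(log N) rather
    # than the O(N) of a left-to-right running total.
    if len(xs) <= 2:
        return sum(xs)  # small base case: plain summation is fine
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])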


FWIW my favorite variance algorithm is the 'corrected two-pass algorithm' of Chan, Golub, and LeVeque (American Statistician, 1983, page 242):

avg=0
for all i
  avg=avg+data[i]
end for
avg=avg/n
sum=0
sum2=0
for all i
   sum2=sum2+(data[i]-avg)^2
   sum=sum+(data[i]-avg)
end for
var=(sum2-sum/n)/(n-1)
return var

AlexC

I added this algorithm (as Algorithm II), but without the sum in the second loop -- the sum of the variations from the mean is of course 0 (\sum_i (x_i-\mu)=\sum_i x_i-N\frac 1N \sum_i x_i=0). But perhaps this is a clever sort of compensation, where the errors in sum and sum2 are known to be (probably, perhaps) correlated, so that subtracting them is a benefit? Still, though, for all datasets where \sigma^2 \not\ll \mu/N, \left|s\right|\ll s_2 (as in sum and sum2), so the gain in accuracy might be irrelevant. And, if the second loop is meant to be the normal variance operation on preconditioned data, why isn't sum squared at the end? Anyway, if it's a Good Thing that I've omitted, please tell me and/or add it to the article (with explanation!). --Tardis 03:26, 9 June 2006 (UTC)

I removed the following comments from the article, as they do not belong there. I have posted them here so that somebody may investigate them further. It would be nice if users used the talk page for comments and not the article. --Richss 16:12, Nov 11, 2004 (UTC)

Why does the second equation have n(n-1)? I know two possibilities: n (biased) or n-1 (unbiased). See also the article Variance on Wikipedia. These equations do not look good... --132.231.54.1
The above algorithm does not seem to be correct. The resulting variance does not equal either one of the variances whose formulas are given at the top of the page. --137.226.12.196

--- I had a look at the source for AlexC's algorithm. I believe there is a mistake: var=(sum2-sum/n)/(n-1) should read var=(sum2-sum*sum/n)/(n-1)

The result of this algorithm agrees with my graphic calculator. I think the corrected algorithm should be added to the page

-- aj504@york.ac.uk who still hasn't got a wikipedia account :)
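
For illustration, here is the corrected two-pass algorithm with the sum*sum/n fix applied, as a minimal Python sketch. The squared term also speaks to the question above about why sum should be squared: s would be exactly zero in exact arithmetic, so s*s/n acts purely as a round-off correction.

def corrected_two_pass_variance(data):
    n = len(data)
    mean = sum(data) / n
    sum2 = 0.0  # sum of squared deviations from the mean
    s = 0.0     # sum of raw deviations: zero in exact arithmetic
    for x in data:
        d = x - mean
        sum2 += d * d
        s += d
    return (sum2 - s * s / n) / (n - 1)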

Incorrect Algorithm?

The second algorithm doesn't look right to me. At each step it's calculating the variance of data[i] with respect to the *average so far*, not the true average.

Can someone confirm that I'm not just reading it wrong?

I have derived and added the update formulas (only the results) for the unbiased/biased variance. They clearly show that the (old) second algorithm which was in dispute was wrong, so I removed it. I have also removed the example, which just showed some numbers (was it a weak numerical proof?) and did not provide any insight into how the formulas work. It was not clear whether the algorithms were for biased or unbiased estimators, so I added comments on it. BurkayDonderici
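
For reference, the standard one-pass update formulas, which I believe are what is meant above, are \mu_n=\mu_{n-1}+\frac{x_n-\mu_{n-1}}{n} for the mean, s_n^2=\frac{n-2}{n-1}s_{n-1}^2+\frac{(x_n-\mu_{n-1})^2}{n} (for n\ge 2) for the unbiased variance, and \sigma_n^2=\frac{n-1}{n}\left(\sigma_{n-1}^2+\frac{(x_n-\mu_{n-1})^2}{n}\right) for the biased one.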

Algorithm III does not compute

Hi. In the mathematical equation for m(new), I don't agree with the (n+1) in the denominator. I believe that the denominator should simply be (n).

I didn't want to edit the entry since I am not a mathematician. CAN SOMEONE WHO IS PLEASE MAKE THE APPROPRIATE COMMENTS.


If you plug some numbers into the equation shown and compare your results to what a spreadsheet gives, you will see that there is a significant error. I tested the shown equation in MS Excel, using a population of 404 values generated by the RANDBETWEEN(0,1000) function, and calculated the average at each point through the population by four methods:
  1. (Cumulative total)/(Number of samples), calculated for each sample for tracking purposes.
  2. Previous average plus new contribution: M(new) = M(old) + ((X(new) - M(old))/n).
  3. Previous average plus new contribution: M(new) = M(old) + ((X(new) - M(old))/(n+1)).
  4. The AVERAGE function on the total population.

Methods 1 and 2 produced the same results all the way through the population and matched Method 4's result for the whole population. However, Method 3 always produced a lower result, and the error (which started out significant) shrank as the number of samples grew.

Regards, Napoleon BlownApart
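
The same experiment as a small Python sketch, with random values standing in for RANDBETWEEN(0,1000) and n counting up to and including the new sample; only the /n form tracks the true running mean:

import random

data = [random.randint(0, 1000) for _ in range(404)]

mean_n = 0.0   # update divides by n, the count including the new sample
mean_n1 = 0.0  # update divides by n+1 -- the disputed form
for n, x in enumerate(data, start=1):
    mean_n += (x - mean_n) / n
    mean_n1 += (x - mean_n1) / (n + 1)

true_mean = sum(data) / len(data)
print(true_mean, mean_n, mean_n1)  # mean_n tracks the true mean; mean_n1 runs low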

Algorithm III

I'm pretty sure this is wrong. From Welford's paper:

S[n]=S[n-1]+(n-1)/n*(x[n]-mean[n-1])^2

--cfp 21:58, 15 August 2006 (UTC)

I'm talking rubbish; it's fine, since mean[n] = mean[n-1] + (x[n]-mean[n-1])/n gives x[n]-mean[n] = (n-1)/n*(x[n]-mean[n-1]). Sorry! --cfp 22:20, 15 August 2006 (UTC)
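
For anyone checking the algebra, here is Welford's recurrence as a minimal Python sketch:

def welford_variance(data):
    n = 0
    mean = 0.0
    s = 0.0  # running sum of squared deviations from the running mean
    for x in data:
        n += 1
        delta = x - mean         # x[n] - mean[n-1]
        mean += delta / n        # mean[n] = mean[n-1] + delta/n
        s += delta * (x - mean)  # equals ((n-1)/n) * delta^2, per the identity above
    return s / (n - 1)           # unbiased sample variance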

Online algorithm

I removed the line about Knuth/Welford being an online algorithm. All the algorithms in this article are online algorithms; it's not worth noting. —johndburger 12:24, 18 September 2007 (UTC)

Well, algorithms I and III only scan the list of data once, whereas both versions of algorithm II scan it twice. Hence, since I/O is several orders of magnitude slower than computation, unless the whole data set is read into memory, algorithm II will be twice as slow, and it won't work at all if the data file is not seekable. --Army1987 16:00, 28 October 2007 (UTC)



Algorithm IV

Doesn't the final step in West's algorithm have a typo? Instead of:

Variance = S * sumweight * n / (n-1)

Shouldn't it actually be:

Variance = S / sumweight * n / (n-1)

Looking at West's paper, he has a "mathematical description" and a "computational description" side by side. The math for the final line looks like:
  S^2 = \frac{T}{\frac{n-1}{n} \sum_i w_i}
And if you work through it, T should be the variance times the (incremental) weight for the recurrence to work. West's "computational description" is as it's presented here, but I think it's a typo in West.

Shauncutts (talk) 06:15, 29 January 2008 (UTC)

Actually, I'm pretty sure the last line should be: Variance = S * n / ((n-1) * sumweight)
(I just implemented this algorithm based on the original paper, and have been manually checking results - I'm going to edit the main page to reflect this)
Gorm (talk) 12:28, 19 February 2008 (UTC)


Incidentally, would it not make sense to alter algorithm 4 by adding the temporary variables used in the original paper? This would save at least three operations per loop, at the cost of making it less readable. Specifically, the algorithm would then look like this:

n = 0
foreach x, weight in the data:
  if n=0 then 
      n = 1
      mean = x
      S = 0
      sumweight = weight
  else
      n = n + 1
      temp = weight + sumweight
      Q = x-mean
      R = Q*weight/temp
      S = S + sumweight*R*Q
      mean = mean + R
      sumweight = temp
  end if
end for
Variance = S * n / ((n-1) * sumweight)  // if sample is the population, omit n/(n-1)

Gorm (talk) 12:54, 19 February 2008 (UTC)
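
For illustration, here is Gorm's version transcribed into Python. Note that with sumweight initialised to zero the n = 0 branch falls out automatically: on the first item temp = weight and R = Q, so mean becomes x and S stays 0.

def weighted_variance(pairs):
    # pairs yields (x, weight); returns the weighted sample variance
    # with the n/(n-1) correction, as in the pseudocode above.
    n = 0
    mean = 0.0
    s = 0.0
    sumweight = 0.0
    for x, weight in pairs:
        n += 1
        temp = weight + sumweight
        q = x - mean
        r = q * weight / temp
        s += sumweight * r * q
        mean += r
        sumweight = temp
    return s * n / ((n - 1) * sumweight)  # if sample is the population, omit n/(n-1)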

Minor spelling and grammar fixes

OSFTactical (talk) 22:59, 6 December 2007 (UTC)