Talk:Chi-square distribution


I did appreciate the way the chi-square distribution was discussed. This page will greatly help students of statistics. Really understanding statistics is very hard if one doesn't give the topic full concentration, but the subject can greatly help students and researchers in interpreting their research. I am, however, a little confused about making a table, for instance for the experimental and control groups in their oral presentation skills. I used rating scales with the help of a rubric I got from one of the authors. What I am after is how to make such a table and relate it to the statistical treatment, which is the chi-square. How should a researcher organize the data? Can you give us useful tips that will guide us toward a more comprehensive interpretation of our research? Knowing this will help the researcher get the degrees of freedom and eventually the interpretation of the study. (User-AL TAN... of Bicol University College of Education, English Major)

Thanks to the author for writing this article. However, although the article might be useful for students of maths, I can't understand it easily. I'm a researcher and use statistics for my work, but I need conceptual explanations; I don't care about the technicalities. User:xinelo

---

The phrase "The chi-square distribution has one parameter: k - a positive integer which specifies the number of degrees of freedom (i.e. the number of X_i)" is perhaps confusing... the number of degrees of freedom being mostly related to the model to fit and not to the number of data points. Meduz 14:18, 5 April 2006 (UTC)

That phrase is absolutely correct--and really is only confusing if one doesn't know what affects the shape of the distribution: its parameters. And in the case of chi-square, there is but one--degrees of freedom. User:pc 1:09, 18 April 2006 (UTC)


This is too tough for those who don't know much about maths. Kokiri 10:28, 3 Jul 2004 (UTC) 203.47.84.40 04:15, 9 Mar 2005 (UTC)

I've tried to give it a less technical introduction, though the gradient of difficulty is still pretty steep (and, later, bumpy). seglea 23:34, 14 July 2005 (UTC)

But, MarkSweep, I dispute that the title is wrong. It is perhaps not ideal - but the great majority of references to this distribution, both in Wikipedia and in other literature, use the written-out form of "chi-square". Many readers would not know to look it up under "χ²". It is our business as editors to be better informed than our readers, but not to say things that suggest they are stupid. Introducing the article by "The Chi-square distribution, or χ² distribution..." does all that is necessary to tell people that they could refer to it symbolically. seglea 23:34, 14 July 2005 (UTC)

I think Mark's point is that the title is wrong because of the capital "C". Wikipedia wants titles to start with capitals, and sometimes (as in this case) this is wrong, in which case a wrongtitle template reference is added. I think the title "Chi-square distribution" with the wrongtitle template, and then the reference to "the chi-square or χ2 distribution" is good. PAR 00:22, 15 July 2005 (UTC)
The upper/lower case distinction was indeed my point: The distribution is usually referred to as the "chi-square(d) distribution" with a lower-case c, or as the "χ² distribution" with a lower-case chi (never an upper-case Chi). If and when the MediaWiki software allows us to have lower-case for the first character of a title, this page should be called "chi-square distribution". Meanwhile, we can use the wrongtitle template. FWIW, I'm very much in favor of PAR's recent version. --MarkSweep 01:32, 15 July 2005 (UTC)
Oh, right, fair enough - I agree no-one ever uses upper case C for chi... though I think they would if it came at the beginning of a sentence, and it wouldn't look all that wrong. It's not a total solecism like talking about t-tests with a capital T (something students let their word processors do to them with tedious frequency). Really the standard wrongtitle template is too strong here - it risks misleading the reader; we need a more specific statement that explains what the title would be for preference. seglea 23:56, 15 July 2005 (UTC)
In my opinion, the disclaimer about the capital letter at the start of this page is unnecessary and looks silly. Many thousands of titles in Wikipedia are names that don't usually take a capital letter, yet they don't have this disclaimer. What about Digamma function and Beta distribution? For that matter, why is the title "Chi-square distribution" any worse than "Normal distribution", or "Statistics"? When "chi" is put at the start of a sentence, it is written "Chi". Same for titles. --Zero 09:16, 10 January 2006 (UTC)
Because for familiar words like "Normal", everyone can be expected to know that it would ordinarily (that is, not in titles and not at the beginning of a sentence) be written as "normal". But if you see "Chi" in a title, you may not be able to tell whether it should be "Chi" in ordinary contexts, or "chi". The disclaimer tells you that it's the latter, rather than the former. --MarkSweep (call me collect) 09:05, 13 February 2006 (UTC)

In most recent textbooks as well as lectures, I've seen "chi-squared" used more often than "chi-square". I also came across Wolfram's page that used the title "Chi-Squared." Perhaps someone should at least make sure that chi-squared gets redirected to this page. Iav 06:54, 23 September 2005 (UTC)



Motivation

This article was useful for me. However, the claim that "under reasonable assumptions, easily calculated quantities can be proved to have distributions that approximate to the chi-square distribution if the null hypothesis is true" could do with some sort of motivation further down the article (not the actual proof, that would be over the top, just some insightful commentary), for example expounding on the meaning of "easily calculated". 14:30, 28 September 2005 (UTC)

Noncentral chi-square distribution, and a question

The chi-square distribution is what you get if you take the sum of the squares of a bunch of normally distributed random variables with mean zero and standard deviation one. If the standard deviations are all the same but not 1, rescaling allows the chi-square to be used. If the means are not zero, however, you need a generalization, the noncentral chi-square distribution (see [1] for Wolfram's description). If the standard deviations differ, the result can supposedly still be expressed in terms of noncentral chi-square distributions ([2]), but how?
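For concreteness, here is a minimal Python sketch of that relationship, assuming scipy is available; the means below are made up. The sum of squares of unit-variance normals with nonzero means matches scipy's noncentral chi-square (ncx2) with noncentrality λ = Σμ_i²:

<pre>
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5, 2.0, 0.0, 0.3])   # hypothetical means, unit sd
k, lam = len(mu), np.sum(mu**2)             # dof and noncentrality

# Sum of squares of N(mu_i, 1) variables, many replications
x = rng.normal(loc=mu, scale=1.0, size=(100_000, k))
q = np.sum(x**2, axis=1)

# Empirical mean vs. the noncentral chi-square mean k + lambda
print(q.mean(), stats.ncx2.mean(k, lam))
</pre>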

Further, there is what Wolfram calls the "chi distribution" (but which is more or less absent elsewhere on the Web), which is what you get if you take the square root of a chi-square.

It might be valuable to mention these as generalizations.

Which brings me to my question: suppose you have a short list of numbers with uncertainties and you want to compute their root-mean-squared length. What is a good estimator? Put another way, suppose you have a small collection of normal random variables, with assorted means and standard deviations; what is a good estimator for the square root of the sum of the squares of the means?

The standard trick for estimating the sum of the squares is to take the sum of the squares of the values minus the sum of the squares of the standard deviations. This gives a probably unbiased but often negative answer. Taking its square root leads to problems.

If it helps, the uncertainties are probably about the same for all the random variables.

A few points:
  • It'd be handy to link to it (the chi distribution) from here!
  • If you have a set of random variates x_i, all drawn from a population with mean zero and standard deviation 1, then the sum of their squares will be chi-square distributed. Alternatively, if you have a set of random variates x_i such that x_i is drawn from a population with a different mean μ_i and a different standard deviation σ_i, then the sum of the (x_i − μ_i)/σ_i squared will be chi-square distributed (a quick simulation of this case appears below).
  • If you have a set of random variates, all drawn from a population with mean μ and standard deviation 1, then the sum of their squares will be non-central chi-square distributed. Alternatively, if you have a set of random variates x_i such that x_i is drawn from a population with the same mean μ and different standard deviations σ_i, then the sum of the x_i/σ_i squared will be non-central chi-square distributed.
All of the above assumes you know beforehand the mean and standard deviation of the population(s) from which you draw your sample. It's not clear whether that is the situation for your case.
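A quick way to check the first bullet numerically (a sketch only, with arbitrary made-up μ_i and σ_i; assumes scipy):

<pre>
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu = np.array([2.0, -1.0, 0.5, 3.0])        # arbitrary means
sigma = np.array([0.5, 1.0, 2.0, 1.5])      # arbitrary standard deviations

x = rng.normal(mu, sigma, size=(100_000, 4))
q = np.sum(((x - mu) / sigma) ** 2, axis=1) # standardized sum of squares

# Should be statistically indistinguishable from chi-square with 4 dof
print(stats.kstest(q, stats.chi2(df=4).cdf))
</pre>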

No; in fact, I'm trying to estimate the means (well, actually, the root of the sum of their squares).

What exactly is your data? Do you have a bunch of random variates x_i, each with a separate mean μ_i and standard deviation σ_i which you know beforehand? With maybe the σ_i all the same? Do you want to calculate an unbiased estimator of the square root of the sum of the x_i squared? If so, the above two distributions won't do it. There's probably a name for the one that will, but I don't know what it is; maybe we could calculate it. PAR 15:38, 6 October 2005 (UTC)

More or less, yes. If I knew the means and standard deviations beforehand, what would I be trying to estimate?

I have a collection of n random variables (n is around ten) X_i, each (approximately) normal with known standard deviation (not all equal) and unknown mean μ_i. I want to estimate \sqrt{\mu_1^2+\cdots+\mu_n^2}.

Put another way, I have a point in n-space whose coordinates are physical measurements of known uncertainty, and I want to estimate its distance from the origin. n is not large, and the uncertainties are not small compared to the values.

There is a standard estimator for \mu_1^2+\cdots+\mu_n^2 (the square of the distance from the origin): it is x_1^2+\cdots+x_n^2-\sigma_1^2-\cdots-\sigma_n^2. This estimator is, I think, unbiased; but unfortunately it frequently takes negative values, which means that you can't simply take its square root to get an estimator for the square root (even if you don't mind some bias).
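That behaviour is easy to reproduce in a quick Monte Carlo sketch (hypothetical numbers, n = 10, uncertainties comparable to the values):

<pre>
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.5, -0.2, 0.1, 0.3, 0.0, 0.4, -0.1, 0.2, 0.6, -0.3])
sigma = np.full(10, 1.0)                    # comparable to the values
target = np.sum(mu**2)                      # quantity being estimated

x = rng.normal(mu, sigma, size=(200_000, 10))
est = np.sum(x**2, axis=1) - np.sum(sigma**2)

print(est.mean(), target)                   # means agree: unbiased
print(np.mean(est < 0))                     # but a sizable fraction is negative
</pre>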

Ok, I don't know the answer to your question, but my first guess is that the best estimator is simply the square root of the sum of the squares of the x_i. I don't understand why anyone would want to subtract the variances from the squares.

The reason is, if you want to estimate the square of the quantity, well, the expected value of X_1^2+\cdots+X_n^2 is

\frac{1}{\sigma_1\sqrt{2\pi}}\cdots\frac{1}{\sigma_n\sqrt{2\pi}}\int_{\mathbb{R}^n}(x_1^2+\cdots+x_n^2)e^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}-\cdots-\frac{(x_n-\mu_n)^2}{2\sigma_n^2}}dx_1\cdots dx_n

which is equal to \mu_1^2+\sigma_1^2+\cdots+\mu_n^2+\sigma_n^2.

The question is then, is my simple guess an unbiased estimator? I don't know, but I do know how to set up the equations to derive the answer. To start out with, let's just try to do the case where you have two random variables. If we can't do that, then forget it. Assuming each is normally distributed with its own mean and variance, and that they are independent, then the probability that the first has value x_1 to x_1 + dx_1 AND the second has value x_2 to x_2 + dx_2 is P(x_1,x_2)\,dx_1\,dx_2 where:
P(x_1,x_2)\,dx_1\,dx_2=\frac{1}{2\pi\sigma_1\sigma_2}\exp\left( -\frac{1}{2}\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 -\frac{1}{2}\left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 \right)\,dx_1\,dx_2
Changing to circular coordinates with x_1=r\,\cos(\theta) and x_2=r\,\sin(\theta) gives
P(r,\theta)\,r\,dr\,d\theta=
\frac{1}{2\pi\sigma_1\sigma_2}\exp\left(  \!-\!\frac{r^2\cos^2(\theta)}{2\sigma_1^2} \!+\!\frac{r\mu_1\cos(\theta)}{\sigma_1^2} \!-\!\frac{\mu_1^2}{2\sigma_1^2} \!-\!\frac{r^2\sin^2(\theta)}{2\sigma_2^2} \!+\!\frac{r\mu_2\sin(\theta)}{\sigma_2^2} \!-\!\frac{\mu_2^2}{2\sigma_2^2}  \right) r\,dr\,d\theta
If you set the μ's to zero and the σ's to one, and integrate over θ from 0 to 2π, you get the chi-square distribution for 2 degrees of freedom, with r² = χ². If you set all the μ's equal and all the σ's equal and integrate over θ, you get the non-central chi-square distribution, again with r² = χ². For your problem, we have to integrate as it stands, or at best, set all the σ's equal. We will have to work on that. Anyway, once that is done and we have P(r), we then want to calculate the expectation of r, which is the integral of rP(r) from zero to infinity. If that turns out to be the square root of the sum of the squares of the μ's, then we have an unbiased estimator. On the other hand, we could seek a maximum likelihood estimator, which is usually easier to calculate. It's the value of r for which P(r) is maximum. PAR 23:50, 6 October 2005 (UTC)
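The first reduction can be checked numerically: integrating P(r,θ)·r over θ with the μ's zero and the σ's one reproduces the density of r when r² is chi-square with 2 degrees of freedom (a sketch, assuming scipy):

<pre>
import numpy as np
from scipy import stats, integrate

def p_joint(theta, r, mu1=0.0, mu2=0.0, s1=1.0, s2=1.0):
    # P(r, theta) * r from the polar form above
    x1, x2 = r * np.cos(theta), r * np.sin(theta)
    return (r / (2 * np.pi * s1 * s2)) * np.exp(
        -0.5 * ((x1 - mu1) / s1) ** 2 - 0.5 * ((x2 - mu2) / s2) ** 2)

r = 1.3
marginal, _ = integrate.quad(p_joint, 0, 2 * np.pi, args=(r,))

# Density of r when r^2 ~ chi-square(2): f(r) = 2r * f_chi2(r^2)
print(marginal, 2 * r * stats.chi2(df=2).pdf(r**2))
</pre>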

You can do better. Sort of. What we want is the expected value of a function of a random variable: if X has pdf f, then E(\phi(X))=\int \phi(x) f(x)\,dx, so it suffices to compute

\frac{1}{\sigma_1\sqrt{2\pi}}\cdots\frac{1}{\sigma_n\sqrt{2\pi}}\int_{\mathbb{R}^n}\sqrt{x_1^2+\cdots+x_n^2}e^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}-\cdots-\frac{(x_n-\mu_n)^2}{2\sigma_n^2}}dx_1\cdots dx_n

This is not an integral I know how to do, nor one that MAPLE can do.

It's possible that one could express this distribution in terms of some sort of "noncentral chi distribution" whose pdf we could actually calculate; then a maximum likelihood estimator would be a reasonable thing to obtain. But actually finding the pdf will be very difficult:

f(r)=\frac{1}{\sigma_1\sqrt{2\pi}}\cdots\frac{1}{\sigma_n\sqrt{2\pi}}\int_{x_1^2+\cdots+x_n^2=r^2} e^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}-\cdots-\frac{(x_n-\mu_n)^2}{2\sigma_n^2}} dm

where dm is the area measure on the sphere. I suppose we only need to find the derivative of this with respect to r and set that to zero, but it seems unlikely that that will be possible.

The problem arose because we noticed that our data was consistently biased upwards, especially when the variances were high. If we do the variance-subtraction trick, then the estimates of the square are no longer biased, but they are sometimes negative, which reminds you that simply taking the square root just won't cut it.

--Andrew 03:05, 7 October 2005 (UTC)


We are on exactly the same track. The integral that MAPLE can't do, if it is reduced to n=2, is just my rP(r) from zero to infinity, because by my definition r^2=x_1^2+x_2^2. Your dm is my r\,dr, the "area element" on the circle. What I want to do is convert from Cartesian coordinates x_1, x_2, ... to spherical coordinates, because they are the natural coordinates to use. If you make the right assumptions and integrate over all angles, you get the chi-square and the noncentral chi-square, but your full problem makes no assumptions.
Something I didn't realize before is that if you take your assumption that all std. deviations are the same, but the means are different, it does become the noncentral chi-square distribution with x = r²/σ² and λ = μ²/σ², where
\mu^2\equiv\sum_1^n \mu_i^2
r^2\equiv\sum_1^n x_i^2
That means you can solve the equation for the expectation of r=\sqrt{x\sigma^2}:
E(r)=\int_0^\infty \sqrt{x\sigma^2}~f(x;n,\mu^2/\sigma^2)dx
where f() is the noncentral chi-square. Mathematica solves this to be:
E(r)=\sigma\sqrt{\pi/2}~L_{1/2}^{n/2-1}(-\mu^2/2\sigma^2)
where L is the generalized Laguerre polynomial. I guess that proves that just taking the square root of the sum of the squares is not an unbiased estimator.
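A sanity check of that closed form against simulation (a sketch with made-up means; it assumes scipy's eval_genlaguerre accepts the half-integer degree via its hypergeometric extension):

<pre>
import numpy as np
from scipy.special import eval_genlaguerre

rng = np.random.default_rng(3)
n, sigma = 10, 1.0
mu = rng.normal(size=n)                     # hypothetical means
mu2 = np.sum(mu**2)

# Monte Carlo estimate of E(r), r = sqrt(sum x_i^2), x_i ~ N(mu_i, sigma)
x = rng.normal(mu, sigma, size=(200_000, n))
r_mc = np.sqrt(np.sum(x**2, axis=1)).mean()

# E(r) = sigma*sqrt(pi/2)*L_{1/2}^{(n/2-1)}(-mu^2/(2 sigma^2))
r_formula = (sigma * np.sqrt(np.pi / 2)
             * eval_genlaguerre(0.5, n / 2 - 1, -mu2 / (2 * sigma**2)))

print(r_mc, r_formula, np.sqrt(mu2))        # E(r) exceeds sqrt(mu^2): biased up
</pre>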

PAR 04:48, 7 October 2005 (UTC)

Lead paragraph

Hi

Sorry to revert those well-meaning edits... but the old version was correct, terse, and reasonably clear. The new version was incorrect and (IMO) confusing. Nevertheless, I see your point: the article is deficient in that the (o-e)^2/e formula we all learned in school is not mentioned. Remember that this statistic is only asymptotically chi-squared. I will add a section on this specific application of the chi-squared distribution today (if I get a minute).

best wishes, Robinh 09:00, 13 February 2006 (UTC)

I completely agree. Incidentally, there is a separate article on Pearson's chi-square test, which discusses the well-known chi-square test statistic. --MarkSweep (call me collect) 09:08, 13 February 2006 (UTC)
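For anyone following along, the (o-e)^2/e statistic under discussion is easy to compute; below is a minimal Python sketch with made-up counts (and note, per Robinh's comment, that the comparison against the chi-square distribution is only asymptotically valid):

<pre>
import numpy as np
from scipy import stats

# Made-up observed counts for a fair six-sided die (the null hypothesis)
observed = np.array([18, 22, 16, 25, 20, 19])
expected = np.full(6, observed.sum() / 6)

stat = np.sum((observed - expected) ** 2 / expected)
p = stats.chi2(df=5).sf(stat)               # df = categories - 1

print(stat, p)
print(stats.chisquare(observed))            # scipy's packaged equivalent
</pre>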

Removal of external link? Input please

Greetings all,

Recently User:128.135.17.105 removed an external link (Free Chi-Square Calculator) that I had placed on this page to an online Chi-square calculator that is available for free on my website. The reason given for this removal by User:128.135.17.105 is that the link "looks like an add" (sp). I believe that the free calculator adds a great deal of value to the page, and should therefore be reposted (perhaps in a less audacious "ad-like" form). Here's why:

1. The other external link to an online calculator (On-line calculator for the significance of chi-square) computes probabilities for Chi square values, but not the Chi-square values themselves.
2. The distribution calculator referenced by another external link (Distribution Calculator) is not online, but instead requires that the user download and install software on their computer in order to compute Chi-square values.
3. The form of the external link that User:128.135.17.105 removed (which supposedly looks like an ad) was modeled after another external link on the page that User:128.135.17.105 had no problem with. I invite you to compare for yourselves:
Existing external link: On-line calculator for the significance of chi-square, in Richard Lowry's statistical website at Vassar College.
Link removed by User:128.135.17.105: Free Chi-Square Calculator from Daniel Soper's Free Statistics Calculators website. Computes chi-square values given a probability value and the degrees of freedom.

Out of respect for the opinion of User:128.135.17.105, I will not repost the link right away. Also note that I wouldn't mind toning down the verbiage of the external link if it is restored. If anyone besides myself agrees that there is value in the external link that User:128.135.17.105 removed, please let the community know by posting your thoughts here. I would particularly enjoy discussing this issue further with User:128.135.17.105, as I believe that (in the spirit of Wikipedia) we can resolve this issue amicably.  :-)

--DanSoper 00:29, 23 June 2006 (UTC)

I agree with the author of the above message that the chi-square calculator holds significant value for the readers of the Chi-square distribution page. After reviewing the Wikipedia logs, I’ve discovered that the user who removed the link from this page also removed other links posted by the author of the above message on several additional statistics-related Wikipedia pages, perhaps indicating a personal issue between the two contributors. While the author of the above message makes solid arguments supporting the inclusion of his/her links on these pages, I feel that I should remind him/her that it is generally considered poor etiquette to post links on Wikipedia to one’s own web site. Nevertheless, I am of the opinion that the links would improve these pages, and as a neutral party, I will repost the links myself in the coming days if there are no objections. -J.K., Kings College

  • Note: Since writing this message, it looks like the anonymous user purporting to be "J.K., Kings College" has been blocked for one month for vandalism. Not sure if that makes sense, actually, but it's a shared IP address, maybe even an open proxy. See User talk:213.249.155.239 for the panoply of warnings, blocks and vandalism, though. Not a very credible source. · rodii · 01:54, 28 June 2006 (UTC)
The original text is, "Free Chi-Square Calculator from Daniel Soper's Free Statistics Calculators website. Computes chi-square values given a probability value and the degrees of freedom." This looks like an advertisement because it contains the name of the poster, claims to be free twice, is the only link in statistics to a .com page, and the page contains Google ads. Had a few of these been otherwise, I would have just started a discussion instead of removing the link.
While nice and perhaps useful, is the calculator encyclopedic? The external linking policy is that it should be linked externally if it would be linked internally. 128.135.226.222 00:00, 28 June 2006 (UTC)

Though I should let you guys know: I was using this as a reference to calculate a p-value (from a point x to infinity). I think the cumulative function listed here is actually calculating the area under the density curve from x to infinity and NOT from -infinity to x. Thought I'd let you know; maybe I'm just confused.

Citations and Historical Context

I appreciate the external links and support keeping them.

Some dates would be very useful in a historical context here. When did Fisher and Snedecor work together? When were the foundations laid for the chi-square test?

It would also be great to see some citations - for instance to books (possibly Statistical Methods, Snedecor GW and Cochran WG, Iowa State University Press, 1960) and also to peer-reviewed journals, especially in the life sciences.

Degrees of freedom

The current text reads: "If p independent linear homogeneous constraints are imposed on these variables, the distribution of X conditional on these constraints is \chi^2_{k-p}, justifying the term 'degrees of freedom'."

I don't really understand this. It seems to me that some correction factor would be needed. Can somebody explicate that section a bit? Perhaps even add an example or reference? (And shouldn't X be Q?) Thanks. Sander123 16:46, 13 March 2007 (UTC)

I've removed the part from the article. It is probably untrue and at least unsourced. As an example, let x_1 and x_2 be N(0,1), and let x_3 = (x_1+x_2)/\sqrt{2}. Then all x_i are N(0,1) and the x_i satisfy a homogeneous linear equation (x_1+x_2-\sqrt{2}\,x_3=0). As x_1^2+x_2^2 is chi-squared with 2 degrees of freedom, it cannot be that x_1^2+x_2^2+x_3^2 is also chi-squared with 2 degrees of freedom, since x_3^2 will make a positive contribution. Sander123 09:00, 26 March 2007 (UTC)
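That counterexample is easy to verify by simulation (a sketch; the mean of the sum comes out near 3, whereas a chi-squared variable with 2 degrees of freedom has mean 2):

<pre>
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.normal(size=500_000)
x2 = rng.normal(size=500_000)
x3 = (x1 + x2) / np.sqrt(2)     # N(0,1); satisfies x1 + x2 - sqrt(2)*x3 = 0

q = x1**2 + x2**2 + x3**2
print(q.mean())                 # about 3, not the 2 of chi-square(2)
</pre>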