Talk:Correlation

Algorithms

I recently checked the algorithm and it computes the correlation with a slightly different formula: \frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{n s_x s_y}. Is this on purpose (in which case some note in the text would be needed) or an error? --Tomas.hruz 09:53, 6 October 2006 (UTC)
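
A numerical check may help with this question. The sketch below assumes that the answer depends on how s_x and s_y are defined: with population standard deviations (divide by n) the quoted formula agrees with the usual coefficient, while with sample standard deviations (divide by n - 1) it comes out smaller by a factor of (n - 1)/n. The function names and the data are made up for illustration and are not taken from the article.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation straight from the sums of squares and cross-products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_formula(xs, ys, sd_ddof, outer_ddof):
    """sum((x - mean_x)(y - mean_y)) / ((n - outer_ddof) * s_x * s_y).

    sd_ddof selects the standard deviation: 0 = population, 1 = sample.
    outer_ddof selects the leading factor: 0 gives n, 1 gives n - 1.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    s_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - sd_ddof))
    s_y = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - sd_ddof))
    return sxy / ((n - outer_ddof) * s_x * s_y)

xs = [1, 2, 3, 4, 5]          # illustrative data
ys = [2, 1, 4, 3, 5]
print(pearson_r(xs, ys))                           # reference value
print(r_formula(xs, ys, sd_ddof=0, outer_ddof=0))  # n with population s: agrees
print(r_formula(xs, ys, sd_ddof=1, outer_ddof=0))  # n with sample s: too small
```

So if the algorithm's s_x and s_y are population standard deviations, the two formulas are the same thing written differently.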


--210.212.205.18 10:02, 8 August 2006 (UTC)

Could we see some account of this concept of "correlation ratio"? All I can find is on Eric Weisstein's site, and it looks like what in conventional nomenclature is called an F-statistic. Michael Hardy 21:02 Mar 19, 2003 (UTC)


it goes somewhat like:
correlation_ratio(Y|X) = 1 - E(var(Y|X))/var(Y)
I don't know the conventional nomenclature, but in the literature on similarity measures for image registration it is called just this...
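
To make the formula above concrete, here is a minimal sketch for the case where X takes a few discrete values, so that var(Y|X) is just the variance of Y within each group. It uses population variances throughout, and the function name and example data are assumptions for illustration only. Note that some texts reserve the name "correlation ratio" for the square root of this quantity.

```python
from collections import defaultdict

def correlation_ratio(xs, ys):
    """1 - E[var(Y|X)] / var(Y) for a discrete X, using population variances."""
    n = len(ys)
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)

    def pvar(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    # Expected conditional variance: each group's variance weighted by how
    # often that value of X occurs.
    expected_cond_var = sum(len(g) / n * pvar(g) for g in groups.values())
    return 1 - expected_cond_var / pvar(ys)

# Y fully determined by X, then Y unrelated to X (illustrative data).
print(correlation_ratio(["a", "a", "b", "b"], [1, 1, 3, 3]))
print(correlation_ratio(["a", "a", "b", "b"], [1, 3, 1, 3]))
```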


The relation between them is already there on the autocorrelation page. "...the autocorrelation is simply the correlation of the process against a time shifted version of itself." You can see this trivially by considering the equation for correlation if the series Y_t = X_{t-k}. --Richard Clegg 20:43, 7 Feb 2005 (UTC)
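
A small sketch of this point: take the ordinary correlation formula and feed it a series together with a copy of itself shifted by k steps. The function names and the test series are illustrative assumptions. Note also that this correlates only the overlapping parts, which differs slightly from the usual autocorrelation estimators that use the mean and variance of the whole series.

```python
import math

def pearson_r(xs, ys):
    """Ordinary Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def lag_correlation(series, k):
    """Correlation of X_t with Y_t = X_{t-k} over the overlapping range (k >= 1)."""
    return pearson_r(series[k:], series[:-k])

# A series with period 4: a shift by a full period lines up with itself,
# a shift by half a period is its mirror image.
series = [0, 1, 0, -1] * 5
print(lag_correlation(series, 4))
print(lag_correlation(series, 2))
```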


This page currently tells only the mathematical aspects of correlation. While it is, obviously, a mathematical concept, it is used in many areas of research such as Psychology (my own field; sort of) in ways that would be better defined by purpose than mathematical properties. What I mean is, I'm not sure how to add information about what correlation is used for into this article - I wanted to put in the "vicars and tarts" demonstration of "correlation doesn't prove causality", for instance. But that would require a rather different definition of correlation, in terms of "the relationship between two variables" or something. Any ideas on how to rewrite would be welcome - if not, of course, I'll do it myself at some point...

Oh, and I can't decide what to do about that ext. link - as is, it's rather useless, taking you to the homepage of a particular reference site (I suspect it of being "Wikispam"); but if you find the right page and break out of their frameset, there is actually some interesting info at http://www.statsoft.com/textbook/stbasic.html#Correlations. Ah well, maybe I'll come back to this after I've sorted out some of the memory-related pages... IMSoP 17:43, 20 May 2004 (UTC)

I have now partially addressed the concerns above by putting in a link to spurious relationship, which treats the "correlation does not imply causation" cliche. Michael Hardy 21:33, 20 May 2004 (UTC)

I thought that the deleted stuff about the sample for correlation was useful. Not enough stats people pay attention to the difference between a statistic and an estimator for that statistic. The Pearson product-moment correlation coefficient page does cover this but it would be nice to see the treatment for the standard correlation too (IMHO at least). --Richard Clegg 20:21, 10 Feb 2005 (UTC)

No -- you're wrong on two counts: There is no such thing as an estimator for a statistic; probably you mean an estimator of a population parameter; and statisticians pay a great deal of attention to such matters; it is non-statisticians who contribute to statistics articles on wikipedia and in journal articles who lose sight of the distinction. Michael Hardy 23:34, 10 Feb 2005 (UTC)
... but I agree with you that the material on sample correlation should be here. Michael Hardy 23:34, 10 Feb 2005 (UTC)
Apologies, I was writing in haste. You are correct here that my comments refer to a population parameter rather than a "statistic" in the formal sense of a function on the data. My comments were intended to refer to contributors to wikipedia articles on statistics (in the wider sense). --Richard Clegg 11:37, 11 Feb 2005 (UTC)

I think so too, but I was rushed. I will put the section back soon, but I will combine it with the Pearsons section. Paul Reiser 21:09, 10 Feb 2005 (UTC)

Thanks. I, for one, think it would help clarify this page.

--Richard Clegg 22:45, 10 Feb 2005 (UTC)


Cross-correlation in signal processing

what about the signal processing version of correlation? kind of the opposite of convolution, with one function not reversed. also autocorrelation. does it have an article under a different name? if so, there should be a link. after reading this article over again, i believe the two are related. i will research some and see, (and add them to my to do list) but please add a bit if you know the connection... Omegatron 20:10, Feb 13, 2004 (UTC)

This has been created under a separate article called cross-correlation, although they are clearly related. Merge? Or link to each other? - Omegatron 04:37, Mar 20, 2005 (UTC)
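
On the "convolution with one function not reversed" remark, a minimal pure-Python sketch for real-valued sequences is below. It assumes one common indexing convention for the lags (texts differ on sign and ordering), and the function names are made up for illustration.

```python
def convolve(f, g):
    """Full discrete convolution: out[n] = sum over m of f[m] * g[n - m]."""
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

def cross_correlate(f, g):
    """Full discrete cross-correlation of real sequences.

    Entry k + len(g) - 1 holds sum over j of f[j + k] * g[j]: a sliding dot
    product in which g is shifted but never flipped.
    """
    offset = len(g) - 1
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i - j + offset] += a * b
    return out

f = [1, 2, 3]
g = [0, 1, 2]
# Cross-correlation equals convolution with the second sequence reversed.
print(cross_correlate(f, g) == convolve(f, g[::-1]))
# Autocorrelation is the cross-correlation of a sequence with itself, and
# is symmetric about zero lag.
print(cross_correlate(f, f))
```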

Correlation matrix search redirects to this page but I can't find here what a correlation matrix is. I have some idea from http://www.vias.org/tmdatanaleng/cc_covarmat.html , but don't feel confident enough to write an entry, and I am not sure where to add it.

Covariance_matrix exists.

Scatter_matrix does not.

--Dax5 19:16, 7 May 2005 (UTC)
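
In case it helps whoever writes that entry: the usual definition is that entry (i, j) of a correlation matrix is the correlation between variable i and variable j, which is the covariance matrix with each entry divided by the two corresponding standard deviations. The sketch below assumes that definition; the function names and data are illustrative only.

```python
import math

def covariance_matrix(columns):
    """Sample covariance matrix of a list of equally long variables."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    return [
        [
            sum((a - mi) * (b - mj) for a, b in zip(ci, cj)) / (n - 1)
            for cj, mj in zip(columns, means)
        ]
        for ci, mi in zip(columns, means)
    ]

def correlation_matrix(columns):
    """Rescale the covariance matrix so that every diagonal entry becomes 1."""
    cov = covariance_matrix(columns)
    size = len(cov)
    return [
        [cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(size)]
        for i in range(size)
    ]

# Three illustrative variables: the second rises with the first,
# the third falls as the first rises.
data = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
for row in correlation_matrix(data):
    print([round(v, 3) for v in row])
```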

Correlation function in spreadsheets

The "Correlation function in spreadsheets" section looks very useless to me, and the information included is probably wrong since the correlation of two real numbers does not make sense. I will delete it, if you put it back can you tell me why?

Muzzle 12:44, 6 September 2006 (UTC)

I agree with your edit. Thanks. Chris53516 13:17, 6 September 2006 (UTC)

Random Variables

I was the one that put the disclaimer on "random" variables. If anybody would like to discuss, I'm all ears, so to speak. The preceding unsigned comment was added by Phili (talk • contribs) .

I reverted that note. You wrote:
Several places in this article refer to "random" variables. By definition a random variable has no correlation with anything else (if it does have a correlation the variable is either 1) not random, or 2) the correlation is a coincidence likely due to a small sample size). It is more accurate to think of these not as random variables, but simply as variables that have an undetermined relationship.
By definition, a random variable is just a measurable function on some probability space. And yes, two random variables can be very much correlated. :) Oleg Alexandrov (talk) 01:56, 30 November 2005 (UTC)
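
A quick simulation illustrates the point. The sketch assumes a simple made-up model, X standard normal and Y equal to X plus independent standard normal noise, for which the theoretical correlation is 1/sqrt(2), about 0.71. Both variables are random and they are strongly correlated. The seed and sample size are arbitrary choices.

```python
import math
import random

def pearson_r(xs, ys):
    """Ordinary Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)                          # fixed seed for repeatability
xs = [rng.gauss(0, 1) for _ in range(20000)]    # X: random
ys = [x + rng.gauss(0, 1) for x in xs]          # Y: also random, but built from X
r = pearson_r(xs, ys)
print(r, 1 / math.sqrt(2))                      # sample value next to the theoretical one
```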

The "unsigned" person wrote utter nonsense. This is a crackpot. Michael Hardy 02:38, 30 November 2005 (UTC)

That might be a bit harsh. User:Phli has exactly three edits so far. Let's assume he just isn't familiar with the technical notion of a random variable, until proved otherwise. --Trovatore 21:13, 5 December 2005 (UTC)

Disambiguation : geometry

Isn't correlation also a term in projective geometry? When PG(V) and PG(W) are projective spaces, a correlation α is a bijection from the subspaces of V to the subspaces of W, such that V\subset W is equivalent to W^{\alpha}\subset V^{\alpha}.

Diagram

Perhaps I'm being thick, but after a minute or two of scrutinising it I couldn't work out how to read the diagram on this article. Which scatter plot corresponds to which coefficient, and why are they arranged in that way? It is not clear. Ben Finn 22:12, 18 January 2006 (UTC)

You're right -- I'd never looked for that. I'll leave a note to the author of the illustration. Michael Hardy 00:33, 19 January 2006 (UTC)
Wait -- it's stated in the caption. You have to read it carefully. Michael Hardy 00:36, 19 January 2006 (UTC)
The figure is not very intuitive... --128.135.82.179 06:13, 6 February 2006 (UTC)

I actually thought the figure is awesome, but now that I consider it, I wonder if it is intuitive and informative only for those who understand correlation well enough not to really need the figure. Also, I think it would be instructive to show a high-correlation scatterplot where the variances of the two underlying series are in a ratio of, say, 1:6 rather than 1:1 in the plots shown. --Brianboonstra 15:53, 3 March 2006 (UTC)

I have no clue how that figure works, and I'm in a PhD program. --Alex Storer 22:50, 17 April 2006 (UTC)


I have added this sentence to the caption to try to clarify it, in case anyone is still confused:

Each square in the upper right corresponds to its mirror-image square in the lower left, the "mirror" being the diagonal of the whole array.

Michael Hardy 22:14, 22 April 2006 (UTC)


I understand the figure, but I think it's WAY too complicated, especially for someone who doesn't already know what it is. It's slightly neat to see that you have four different data sets generated, and you're looking at all pairs... but I think for most people it would be MUCH MUCH clearer if you just showed four examples in a row, with labels directly above: R^2 in {0, .5, .75, 1} or something. 24.7.106.155 09:27, 7 May 2006 (UTC)


I have to agree that the diagram is overcomplicated. It also doesn't show negative correlations. Would it be better to have a table with two rows? Each column could have a correlation coefficient as a number in the first row, and a scatter plot in the second row. The coefficients could range between -1 and 1. I think that this would also emphasise that a negative correlation is still a strong correlation. 80.176.151.208 07:49, 31 May 2006 (UTC)
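
If someone wants to produce such a row of scatter plots, one possible way to generate the data is sketched below. It assumes the standard construction y = rho*x + sqrt(1 - rho^2)*z with x and z independent standard normals, which gives a population correlation of rho. Plotting is left out, and the function names, seed, and sample size are illustrative choices.

```python
import math
import random

def pearson_r(xs, ys):
    """Ordinary Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlated_sample(rho, n, rng):
    """Draw n points (x, y) whose population correlation is rho."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    noise_scale = math.sqrt(1 - rho * rho)
    ys = [rho * x + noise_scale * rng.gauss(0, 1) for x in xs]
    return xs, ys

rng = random.Random(1)
results = {}
for rho in (-1.0, -0.5, 0.0, 0.5, 1.0):
    xs, ys = correlated_sample(rho, 10000, rng)
    results[rho] = pearson_r(xs, ys)   # one data set per column of the proposed table
    print(rho, round(results[rho], 3))
```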

Intercorrelation

Sometimes one sees the term "intercorrelation". What exactly does this signify? I associate "intercorrelation" with the correlation between two different variables - but that is what standard "correlation" is. It seems to me that "inter" is redundant... And the opposite of autocorrelation is not intercorrelation but cross-correlation... -fnielsen 15:43, 10 February 2006 (UTC)

I guess "intercorrelation" has some utility in multivariate analysis(?), see, e.g., supermatrix. - fnielsen 15:49, 10 February 2006 (UTC)

Algorithms

I just yesterday inserted a disclaimer about using the formula supplied as the basis for a one-pass algorithm, and included pseudocode for a stable single-pass algorithm in a separate section. For standard deviation, there is a separate page instead, at Algorithms_for_calculating_variance, but it seems to me that an analogous separate page should contain this algorithm only if a similar explication of the problems of numerical instability is included.--Brianboonstra 16:00, 3 March 2006 (UTC)

The last_x and last_y variables are unused in the pseudocode. They should probably be removed, no? -- 29 March 2006

Agreed, and done. Brianboonstra 18:31, 11 April 2006 (UTC)

The algorithm does not take into account the case when either pop_sd_x or pop_sd_y is zero, causing a divide by zero on the last line. holopoj 17:06, 5 August 2006 (UTC)

That's arguably correct since correlation would not be defined in this case -- the equation which we are calculating would also have a divide by zero. --Richard Clegg 19:48, 5 August 2006 (UTC)
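
For reference in this thread, here is a minimal sketch of a single-pass calculation of the kind being discussed. It is not the pseudocode from the article. It assumes the running-mean update commonly used for numerically stable variance calculations, extended to the cross-product term, and all names are illustrative. Returning None when either variable has zero spread is one possible way to handle the divide-by-zero case raised above, on the view that the correlation is undefined there.

```python
import math

def one_pass_correlation(pairs):
    """Pearson correlation in one pass over (x, y) pairs.

    Keeps running means and running sums of squared deviations and of
    cross-products, instead of accumulating raw sums of x*x, y*y and x*y.
    Returns None when the correlation is undefined.
    """
    n = 0
    mean_x = mean_y = 0.0
    m2_x = m2_y = c_xy = 0.0
    for x, y in pairs:
        n += 1
        dx = x - mean_x              # deviations from the means so far
        dy = y - mean_y
        mean_x += dx / n
        mean_y += dy / n
        m2_x += dx * (x - mean_x)    # old deviation times new deviation
        m2_y += dy * (y - mean_y)
        c_xy += dx * (y - mean_y)
    if m2_x == 0 or m2_y == 0:       # no data, or a constant variable
        return None
    return c_xy / math.sqrt(m2_x * m2_y)

print(one_pass_correlation(zip([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])))
print(one_pass_correlation(zip([7, 7, 7], [1, 2, 3])))
```

Raising an exception instead of returning None would be an equally reasonable choice.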

Table

The table at the beginning of the article is flawed in almost every regard. First, it is poorly designed. Suppose a reader wants to know what a low correlation is. He or she looks at the row, sees "low," and sees that the cell below it says "> -0.9." At first glance, this makes it sound as though ANY correlation that is greater than -0.9 is low, including 0, 0.9, etc. Then the next column says "low: < -0.4." It takes a moment to figure out that the author was actually intending to convey "low: -0.9 < r < -0.4." Something like this would be better:

Correlation coefficient
High correlation   Low correlation    No correlation (random)   Low correlation    High correlation
−1 < r < −0.9      −0.9 < r < −0.4    −0.4 < r < +0.4           +0.4 < r < +0.9    +0.9 < r < +1

though some letter other than r might be better, and less-than-or-equal-to signs belong in there somewhere. That brings up the second problem, though: Where on earth did these numbers come from? Cohen, for example, defines a "small" correlation as 0.10 <= |r| < 0.3, a "medium" correlation as 0.3 <= |r| < 0.5, and a "large" correlation as 0.5 <= |r| <=1. I know of no one who thinks that a correlation between -0.4 and 0.4 signifies no correlation.

Then there's the argument--made by Cohen himself, among others--that any such distinctions are potentially flawed and misleading, and that the "importance" of correlations depends on the context. No such disclaimer appears in the article, and the reader might take these values as dogma.

I suggest that the table be removed entirely. Failing that, it should at the very least be revised for clarity as described above, and a disclaimer should be added. The values in the table should be changed to Cohen's values, or else the source of these values should be mentioned somewhere.

I'd be happy to make all of the changes that I can, but as I'm new to Wikipedia I thought I'd defer to more experienced authors.

--Trilateral chairman 22:51, 22 March 2006 (UTC)
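
For anyone who wants the cut-offs quoted above in executable form, a small sketch follows. It uses only the ranges given in this comment. The label for values below 0.10 is my own placeholder, since no name for that range is given above, and the function name is made up. As noted above, such labels are conventions whose importance depends on context.

```python
def cohen_label(r):
    """Label a correlation by the ranges quoted above, ignoring its sign."""
    size = abs(r)
    if size > 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    if size >= 0.1:
        return "small"
    return "below small"   # placeholder label, not one of the quoted names

for r in (-0.95, 0.3, 0.1, 0.05):
    print(r, cohen_label(r))
```

The sign is dropped before classifying, so -0.95 is labelled the same way as +0.95.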

I agree with everything you say. My preference would be to have a statement to the effect that various descriptions of ranges of correlations have been offered. One such description could perhaps be included with a reference. The point should be made that whether or not a correlation is of consequence or sufficient depends on the context and purposes at hand. If you're confirming a physical law with pretty good equipment, 0.9 might be a poor correlation. If you're ascertaining whether a selection measurement correlates with later performance measurements, 0.9 would be excellent (you'd be lucky to do better!). The table with scatterplots corresponding with each (linear) correlation coefficient is excellent. The table to which you refer is not referenced and it is hardly a commonly agreed classification. On this basis alone it should not appear in the article. Be bold! Holon 01:15, 23 March 2006 (UTC)

Okay. I've removed the old table, added Cohen's table with a citation, and added the disclaimer with the explanation you suggested. Here is the old table if anyone wants it:

Correlation coefficient
High correlation  High    Low     Low     No      No correlation (random)  No      Low     Low     High    High correlation
−1                < −0.9  > −0.9  < −0.4  > −0.4  0                        < +0.4  > +0.4  < +0.9  > +0.9  +1

--Trilateral chairman 01:18, 24 March 2006 (UTC)

Formulas: What is E?

What is the E in the first equations? Why isn't the E replaced by capital sigma indicating the sum of?

I have seen E before in statistics texts. If it is some standard notation, it should be explained.

Gary 16:24, 28 March 2006 (UTC)


I think it is the expected value. It indeed is the expected value. Often calculated as (k/N) where k is the total number of objects and N is the number of intervals.

The article says explicitly that it is expected value, and gives a link to that article. Michael Hardy 00:12, 30 March 2006 (UTC)
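
For readers with the same question, a short note using the standard definitions (not quoted from the article): for a discrete random variable, E is a probability-weighted sum, so the capital sigma is already inside it, and the plain average over data is its sample counterpart. The symbols p_i, \mu and \sigma below are the usual ones and are assumed here.

```latex
% Expected value of a discrete random variable: a probability-weighted sum
\operatorname{E}[X] = \sum_i x_i \, p_i, \qquad p_i = P(X = x_i)

% Correlation written with E, where \mu and \sigma are the population
% means and standard deviations
\rho_{X,Y} = \frac{\operatorname{E}\big[(X - \mu_X)(Y - \mu_Y)\big]}{\sigma_X \, \sigma_Y}

% Sample mean: the estimate of E[X] computed from n observations
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```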

Clarification for non math-people?

I'd really appreciate if someone could expand the first couple paragraphs a bit to better explain correlation. While I'm sure that the rest of the article is correct, for me, as someone without a math background, it doesn't make much sense. I understand that by the very nature of the topic it is complicated, but I'd still like to have some sort of understanding of the text. Thank you! cbustapeck 17:16, 13 October 2006 (UTC)