Talk:Correlation

From Wikipedia, the free encyclopedia


[edit] Algorithms

I recently checked the algorithm and it computes the correlation with slightly different formula: \frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{n s_x s_y}, is this on purpose (in that case some note in the text would be needed) or an error? --Tomas.hruz 09:53, 6 October 2006 (UTC)

I just yesterday inserted a disclaimer about using the formula supplied as the basis for a one-pass algorithm, and included pseudocode for a stable single-pass algorithm in a separate section. For standard deviation, there is a separate page instead, at Algorithms_for_calculating_variance, but it seems to me that an analogous separate page should contain this algorithm only if a similar explication of the problems of numerical instability is included.--Brianboonstra 16:00, 3 March 2006 (UTC)
The last_x and last_y variables are unused in the pseudocode. They should probably be removed, no? -- 29 March 2006
Agreed, and done. Brianboonstra 18:31, 11 April 2006 (UTC)

The algorithm does not take into account the case when either pop_sd_x or pop_sd_y is zero, causing a divide by zero on the last line. holopoj 17:06, 5 August 2006 (UTC)

That's arguably correct since correlation would not be defined in this case -- the equation which we are calculating would also have a divide by zero. --Richard Clegg 19:48, 5 August 2006 (UTC)
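For reference, the degenerate case can be made explicit in code. Below is a minimal two-pass sketch (the name corr_safe and the choice of returning NaN are mine, not from the article) that refuses to divide by a zero standard deviation:

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Two-pass correlation: compute the means first, then the centered sums.
// Returns NaN when either variable is constant, since the correlation is
// undefined in that case (the denominator would be zero).
double corr_safe(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    double mx = 0.0, my = 0.0;
    for (std::size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n;
    my /= n;
    double sxx = 0.0, syy = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
        sxy += (x[i] - mx) * (y[i] - my);
    }
    if (sxx == 0.0 || syy == 0.0)
        return std::numeric_limits<double>::quiet_NaN();  // undefined correlation
    return sxy / std::sqrt(sxx * syy);
}
```

Note that no explicit division by n (or n-1) appears: the same count divides numerator and denominator, so it cancels.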

It seems like the algorithm is calculating something wrong (or maybe I just coded it wrong!). I wrote it in C++, and it does not calculate the correct covariance for the proposed example, which should be -841.667 as calculated in Excel and R. A straightforward two-pass version in C++ (without any optimization) gave the right answer. Could somebody tell me what my mistake was in coding it? Thanks in advance. Here is the code:

 double cov(double* x,double* y,int tamano,int tipo) {
   int i;
   double sumCuadX = 0.0, sumCuadY = 0.0, sumCoprod = 0.0,
          mediaX = x[0], mediaY = y[0],
          barre, deltaX, deltaY, pobSDX, pobSDY, covcor;
   for (i = 1; i < tamano; i++) {
       barre = ((double)i - 1.0)/(double)i;
       deltaX = x[i] - mediaX;
       deltaY = y[i] - mediaY;
       sumCuadX += deltaX*deltaX*barre;
       sumCuadY += deltaY*deltaY*barre;
       sumCoprod += deltaX*deltaY*barre;
       mediaX += deltaX/(double)i;
       mediaY += deltaY/(double)i;
   }
   pobSDX = sqrt(sumCuadX/(double)tamano);
   pobSDY = sqrt(sumCuadY/(double)tamano);
   covcor = sumCoprod/(double)tamano;
   if (tipo == CORRELACION)
      covcor /= pobSDX*pobSDY;
   return covcor;
 }

Paulrc 25 19:39, 27 December 2006 (UTC)

The problem lies in your incomplete translation of the one-based index code to your zero-based index version. There appear to be three changes, all within the loop and all dealing with adjustments to i:

 double cov(double* x,double* y,int tamano,int tipo) {
   int i;
   double sumCuadX = 0.0, sumCuadY = 0.0, sumCoprod = 0.0,
          mediaX = x[0], mediaY = y[0],
          barre, deltaX, deltaY, pobSDX, pobSDY, covcor;
   for (i = 1; i < tamano; i++) {
       barre = i/(1.0 + i);
       deltaX = x[i] - mediaX;
       deltaY = y[i] - mediaY;
       sumCuadX += deltaX*deltaX*barre;
       sumCuadY += deltaY*deltaY*barre;
       sumCoprod += deltaX*deltaY*barre;
       mediaX += deltaX/(1 + i);
       mediaY += deltaY/(1 + i);
   }
   pobSDX = sqrt(sumCuadX/(double)tamano);
   pobSDY = sqrt(sumCuadY/(double)tamano);
   covcor = sumCoprod/(double)tamano;
   if (tipo == CORRELACION)
      covcor /= pobSDX*pobSDY;
   return covcor;
 }
I do not believe Tomas [see above] is correct. Note that the std devs used in the denominator are population std devs. Brianboonstra 21:40, 24 January 2007 (UTC)
Your algorithm does not process the last element. Brianboonstra 21:41, 24 January 2007 (UTC)

[edit] Ratio

Could we see some account of this concept of "correlation ratio"? All I can find is on Eric Weisstein's site, and it looks like what in conventional nomenclature is called an F-statistic. Michael Hardy 21:02 Mar 19, 2003 (UTC)

it goes somewhat like:
correlation_ratio(Y|X) = 1 - E(var(Y|X))/var(Y)
I don't know the conventional nomenclature, but in the literature on similarity measures for image registration it is called just this...

The relation between them is already there on the autocorrelation page. "...the autocorrelation is simply the correlation of the process against a time-shifted version of itself." You can see this trivially by considering the equation for correlation if the series Y_t = X_{t-k}. --Richard Clegg 20:43, 7 Feb 2005 (UTC)

This page currently tells only the mathematical aspects of correlation. While it is, obviously, a mathematical concept, it is used in many areas of research such as Psychology (my own field; sort of) in ways that would be better defined by purpose than mathematical properties. What I mean is, I'm not sure how to add information about what correlation is used for into this article - I wanted to put in the "vicars and tarts" demonstration of "correlation doesn't prove causality", for instance. But that would require a rather different definition of correlation, in terms of "the relationship between two variables" or something. Any ideas on how to rewrite would be welcome - if not, of course, I'll do it myself at some point...

Oh, and I can't decide what to do about that ext. link - as is, it's rather useless, taking you to the homepage of a particular reference site (I suspect it of being "Wikispam"); but if you find the right page and break out of their frameset, there is actually some interesting info at http://www.statsoft.com/textbook/stbasic.html#Correlations. Ah well, maybe I'll come back to this after I've sorted out some of the memory-related pages... IMSoP 17:43, 20 May 2004 (UTC)

I have now partially addressed the concerns above by putting in a link to spurious relationship, which treats the "correlation does not imply causation" cliche. Michael Hardy 21:33, 20 May 2004 (UTC)

I thought that the deleted stuff about the sample for correlation was useful. Not enough stats people pay attention to the difference between a statistic and an estimator for that statistic. The Pearson product-moment correlation coefficient page does cover this but it would be nice to see the treatment for the standard correlation too (IMHO at least). --Richard Clegg 20:21, 10 Feb 2005 (UTC)

No -- you're wrong on two counts: There is no such thing as an estimator for a statistic; probably you mean an estimator of a population parameter; and statisticians pay a great deal of attention to such matters; it is non-statisticians who contribute to statistics articles on Wikipedia and in journal articles who lose sight of the distinction. Michael Hardy 23:34, 10 Feb 2005 (UTC)
... but I agree with you that the material on sample correlation should be here. Michael Hardy 23:34, 10 Feb 2005 (UTC)
Apologies, I was writing in haste. You are correct here that my comments refer to a population parameter rather than a "statistic" in the formal sense of a function on the data. My comments were intended to refer to contributors to Wikipedia articles on statistics (in the wider sense). --Richard Clegg 11:37, 11 Feb 2005 (UTC)

I think so too, but I was rushed. I will put the section back soon, but I will combine it with the Pearsons section. Paul Reiser 21:09, 10 Feb 2005 (UTC)

Thanks. I, for one, think it would help clarify this page.

--Richard Clegg 22:45, 10 Feb 2005 (UTC)

[edit] Cross-correlation in signal processing

what about the signal processing version of correlation? kind of the opposite of convolution, with one function not reversed. also autocorrelation. does it have an article under a different name? if so, there should be a link. after reading this article over again, i believe the two are related. i will research some and see, (and add them to my to do list) but please add a bit if you know the connection... Omegatron 20:10, Feb 13, 2004 (UTC)

This has been created under a separate article called cross-correlation, although they are clearly related. Merge? Or link to each other? - Omegatron 04:37, Mar 20, 2005 (UTC)

Correlation matrix search redirects to this page, but I can't find here what a correlation matrix is. I have some idea from http://www.vias.org/tmdatanaleng/cc_covarmat.html , but don't feel confident enough to write an entry, and I am not sure where to add it.

Covariance_matrix exists.

Scatter_matrix does not.

--Dax5 19:16, 7 May 2005 (UTC)

[edit] Correlation function in spreadsheets

The "Correlation function in spreadsheets" section looks quite useless to me, and the information included is probably wrong, since the correlation of two real numbers does not make sense. I will delete it; if you put it back, can you tell me why?

Muzzle 12:44, 6 September 2006 (UTC)

I agree with your edit. Thanks. Chris53516 13:17, 6 September 2006 (UTC)

[edit] Random Variables

I was the one that put the disclaimer on "random" variables. If anybody would like to discuss, I'm all ears, so to speak. The preceding unsigned comment was added by Phili (talkcontribs) .

I reverted that note. You wrote:
Several places in this article refer to "random" variables. By definition a random variable has no correlation with anything else (if it does have a correlation the variable is either 1) not random, or 2) the correlation is a coincidence likely due to a small sample size). It is more accurate to think of these not as random variables, but simply as variables that have an undetermined relationship.
By definition, a random variable is just a measurable function on some probability space. And yes, two random variables can be very much correlated. :) Oleg Alexandrov (talk) 01:56, 30 November 2005 (UTC)

The "unsigned" person wrote utter nonsense. This is a crackpot. Michael Hardy 02:38, 30 November 2005 (UTC)

That might be a bit harsh. User:Phili has exactly three edits so far. Let's assume he just isn't familiar with the technical notion of a random variable, until proved otherwise. --Trovatore 21:13, 5 December 2005 (UTC)

[edit] Disambiguation of geometry

Isn't correlation also a term in projective geometry? When PG(V) and PG(W) are projective spaces, a correlation α is a bijection from the subspaces of PG(V) to the subspaces of PG(W) that reverses inclusion, i.e. U \subset U' is equivalent with U'^{\alpha} \subset U^{\alpha}.

[edit] Diagram

Perhaps I'm being thick, but after a minute or two of scrutinising it I couldn't work out how to read the diagram on this article. Which scatter plot corresponds to which coefficient, and why are they arranged in that way? It is not clear. Ben Finn 22:12, 18 January 2006 (UTC)

You're right -- I'd never looked for that. I'll leave a note to the author of the illustration. Michael Hardy 00:33, 19 January 2006 (UTC)
Wait -- it's stated in the caption. You have to read it carefully. Michael Hardy 00:36, 19 January 2006 (UTC)
The figure is not very intuitive... --128.135.82.179 06:13, 6 February 2006 (UTC)

I actually thought the figure is awesome, but now that I consider it, I wonder if it is intuitive and informative only for those who understand correlation well enough not to really need the figure. Also, I think it would be instructive to show a high-correlation scatterplot where the variances of the two underlying series are in a ratio of, say, 1:6 rather than 1:1 in the plots shown. --Brianboonstra 15:53, 3 March 2006 (UTC)

I have no clue how that figure works, and I'm in a PhD program. --Alex Storer 22:50, 17 April 2006 (UTC)


I have added this sentence to the caption to try to clarify it, in case anyone is still confused:

Each square in the upper right corresponds to its mirror-image square in the lower left, the "mirror" being the diagonal of the whole array.

Michael Hardy 22:14, 22 April 2006 (UTC)


I understand the figure, but I think it's WAY too complicated, especially for someone who doesn't already know what it is. It's slightly neat to see that you have four different data sets generated, and you're looking at all pairs... but I think for most people it would be MUCH MUCH clearer if you just showed four examples in a row, with labels directly above: R2 in {0, .5, .75, 1 } or something. 24.7.106.155 09:27, 7 May 2006 (UTC)


I have to agree that the diagram is overcomplicated. It also doesn't show negative correlations. Would it be better to have a table with two rows? Each column could have a correlation coefficient as a number in the first row, and a scatter plot in the second row. The coefficients could range between -1 and 1. I think that this would also emphasise that a negative correlation is still a strong correlation. 80.176.151.208 07:49, 31 May 2006 (UTC)

[edit] Intercorrelation

Sometimes one sees the term "intercorrelation". What exactly does this signify? I associate "intercorrelation" with the correlation between two different variables - but that is what standard "correlation" is. It seems to me that "inter" is redundant... And the opposite of autocorrelation is not intercorrelation but cross-correlation... -fnielsen 15:43, 10 February 2006 (UTC)

I guess "intercorrelation" has some utility in multivariate analysis(?), see, e.g., supermatrix. - fnielsen 15:49, 10 February 2006 (UTC)

[edit] Table

The table at the beginning of the article is flawed in almost every regard. First, it is poorly designed. Suppose a reader wants to know what a low correlation is. He or she looks at the row, sees "low," and sees that the cell below it says "> -0.9." At first glance, this makes it sound as though ANY correlation that is greater than -0.9 is low, including 0, 0.9, etc. Then the next column says "low: < -0.4." It takes a moment to figure out that the author was actually intending to convey "low: -0.9 < r < -0.4." Something like this would be better:

Correlation coefficient
| High correlation | Low correlation  | No correlation (random) | Low correlation  | High correlation |
| −1 < r < −0.9    | −0.9 < r < −0.4  | −0.4 < r < +0.4         | +0.4 < r < +0.9  | +0.9 < r < +1    |

though some letter other than r might be better, and less-than-or-equal-to signs belong in there somewhere. That brings up the second problem, though: Where on earth did these numbers come from? Cohen, for example, defines a "small" correlation as 0.10 <= |r| < 0.3, a "medium" correlation as 0.3 <= |r| < 0.5, and a "large" correlation as 0.5 <= |r| <=1. I know of no one who thinks that a correlation between -0.4 and 0.4 signifies no correlation.

Then there's the argument--made by Cohen himself, among others--that any such distinctions are potentially flawed and misleading, and that the "importance" of correlations depends on the context. No such disclaimer appears in the article, and the reader might take these values as dogma.

I suggest that the table be removed entirely. Failing that, it should at the very least be revised for clarity as described above, and a disclaimer should be added. The values in the table should be changed to Cohen's values, or else the source of these values should be mentioned somewhere.

I'd be happy to make all of the changes that I can, but as I'm new to Wikipedia I thought I'd defer to more experienced authors.

--Trilateral chairman 22:51, 22 March 2006 (UTC)

I agree with everything you say. My preference would be to have a statement to the effect that various descriptions of ranges of correlations have been offered. One such description could perhaps be included with a reference. The point should be made that whether or not a correlation is of consequence or sufficient depends on the context and purposes at hand. If you're confirming a physical law with pretty good equipment, 0.9 might be a poor correlation. If you're ascertaining whether a selection measurement correlates with later performance measurements, 0.9 would be excellent (you'd be lucky to do better!). The table with scatterplots corresponding with each (linear) correlation coefficient is excellent. The table to which you refer is not referenced, and it is hardly a commonly agreed classification. On this basis alone it should not appear in the article. Be bold! Holon 01:15, 23 March 2006 (UTC)

Okay. I've removed the old table, added Cohen's table with a citation, and added the disclaimer with the explanation you suggested. Here is the old table if anyone wants it:

Correlation coefficient
| High correlation | High   | Low    | Low    | No     | No correlation (random) | No     | Low    | Low    | High   | High correlation |
| −1               | < −0.9 | > −0.9 | < −0.4 | > −0.4 | 0                       | < +0.4 | > +0.4 | < +0.9 | > +0.9 | +1               |

--Trilateral chairman 01:18, 24 March 2006 (UTC)

[edit] Formulas: What is E

What is the E in the first equations? Why isn't the E replaced by capital sigma indicating the sum of?

I have seen E before in statistics texts. If it is some standard notation, it should be explained.

Gary 16:24, 28 March 2006 (UTC)

I think it is the expected value. It indeed is the expected value, often calculated as (k/N), where k is the total number of objects and N is the number of intervals.

The article says explicitly that it is expected value, and gives a link to that article. Michael Hardy 00:12, 30 March 2006 (UTC)

[edit] Clarification for non math-people

I'd really appreciate it if someone could expand the first couple of paragraphs a bit to better explain correlation. While I'm sure that the rest of the article is correct, for me, as someone without a math background, it doesn't make much sense. I understand that by the very nature of the topic it is complicated, but I'd still like to have some sort of understanding of the text. Thank you! cbustapeck 17:16, 13 October 2006 (UTC)

What I find confusing is that it first defines correlation and then moves to the Pearson correlation coefficient without really explaining the relation between the two. The first section on correlation is quite clear, but the Sample correlation section is entirely confusing, and there is not a single mention of any relationship to what was said in the previous section. The formula is also different from the one in the first section: computing the expected value would mean dividing by n, but the formula in 'Sample correlation' divides by n-1. —The preceding unsigned comment was added by 67.93.205.78 (talk)
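Regarding the n versus n-1 point above: whichever divisor is used, it appears once in the covariance and once under each standard deviation, so it cancels in the correlation coefficient itself. A small sketch of my own (the name corr_ddof and the ddof convention are borrowed from common numerical libraries, not from the article) illustrates the cancellation:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Correlation computed with either population (ddof = 0) or sample
// (ddof = 1) moments. The divisor n - ddof appears once in the covariance
// and once under the square root of each variance, so it cancels and the
// result is the same for any valid ddof.
double corr_ddof(const std::vector<double>& x, const std::vector<double>& y, int ddof) {
    const std::size_t n = x.size();
    double mx = 0.0, my = 0.0;
    for (std::size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n;
    my /= n;
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
    }
    const double d = static_cast<double>(n) - ddof;
    return (sxy / d) / (std::sqrt(sxx / d) * std::sqrt(syy / d));
}
```

The n-1 choice still matters when the covariance or variance is quoted on its own; it is only their ratio that is insensitive to it.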

[edit] Restore Cohen et al's book as a Reference

Recently an editor removed a whole string of on-line publications by Herve Abdi, which did seem somewhat self-promotional to include here. However the textbook by Cohen et al. is the only major textbook (that people might use in a course) that was listed, and it also was removed. Does anyone object to restoring the Cohen book to the Reference list, or under the heading 'Further Reading' if you prefer? EdJohnston 19:07, 13 December 2006 (UTC)

Did I remove that? Sorry. Sure, go ahead and restore it. If you don't know where to find it, I can do it. — Chris53516 (Talk) 20:33, 13 December 2006 (UTC)
I am all for the book being placed there, so long as we vote - not a literal vote - but so long as there is an understanding among the majority of the population that the book is there for its merits, its widespread use, etc., and NOT for commercial purposes or with intent to benefit Cohen et al.--ToyotaPanasonic 13:36, 24 December 2006 (UTC)
I originally added the citation to the book (at least I think it was me). I included it only because it is a common reference in the behavioral sciences...and besides that, it was the textbook for my graduate stats course. :) I have never met the Cohens, have never corresponded with them, and have no financial interest whatsoever in their book. Heck, I'm not even interested in selling my copy. I support replacing the reference.Trilateral chairman 03:25, 7 February 2007 (UTC)

[edit] Ambiguity

Isn't there a difference between correlation and the coefficient of correlation? The coefficient lies between -1 and 1, while the general term 'correlation' can have any numerical value attached to it?

If so, you will find the introduction somewhat misleading: "In probability theory and statistics, correlation, also called correlation coefficient" - correlation and correlation coefficient are not quite the same thing, but are very similar.

Would someone who has a fresher statistics/econometrics background please confirm this and edit the main page accordingly. Cheers all, --ToyotaPanasonic 13:31, 24 December 2006 (UTC)

What would just a "correlation" be then? No, unfortunately, correlation cannot have any random number "attached" to it. The term "correlation" can be used in non-numerical senses, but when you're talking about its numerical form, it's always the correlation coefficient. Otherwise, what would the number be? How would you interpret it? (Rhetorical questions.) — Chris53516 (Talk) 22:16, 24 December 2006 (UTC)
Yes, quite brilliant. I have sourced my undergrad econometrics textbook from my basement - it turned out I was confusing covariance (any number, which has a unit attached to it) and correlation, which is standardised to lie between -1 and +1. Yes, quite brilliant. --ToyotaPanasonic 04:27, 26 December 2006 (UTC) [Feel free to delete this above topic - I have no further use for it, and neither would anyone else]

[edit] Correlation and Causation

Removed entry.

[A comment left here by User:Jjoffe was removed by EdJohnston. See my further note below. I left intact the response by User:Chris53516, who was responding to Joffe. -- EdJohnston 14:32, 8 January 2007 (UTC)]

This is an excellent example of original research. Please do not add this to the article. Furthermore, it is NOT a good idea to post your email address. — Chris53516 (Talk) 14:09, 8 January 2007 (UTC)
Per WP:REFACTOR, an editor may remove content from a talk page that is 'entirely and unmistakably irrelevant'. I did so with a recent posting by User:Jjoffe. You can still see the removed material here in the edit history. -- EdJohnston 14:32, 8 January 2007 (UTC)
I don't think I agree with your changes. What was written was not "entirely and unmistakably irrelevant." Please explain yourself. — Chris53516 (Talk) 14:47, 8 January 2007 (UTC)
Talk pages are for discussing article changes. Joffe's contribution looked like a literal reprint of material that had been (or was intended to be) published elsewhere. It was not clear that he was proposing anything specific for this article, though he is welcome to do so if he thinks it can be improved. The WP:REFACTOR strategy has been used elsewhere, for a submitter who spammed Talk:IEEE_754r. If you believe Joffe's comments are relevant, you are welcome to restore them, but please explain how you would want to change the article as a result. -- EdJohnston 15:44, 8 January 2007 (UTC)
I see what you mean. It appears that if he wants to make a response to the other article somewhere, he should find his own webspace to do so. — Chris53516 (Talk) 16:13, 8 January 2007 (UTC)

[edit] Misconceptions

I'm not crazy about this section under common misconceptions: An appropriately expanded expression may be "correlation is not causation, but it sure is a hint." I don't think this is an illuminating rephrasing, in part because the rationale behind neither dictum ("correlation is not causation" nor the one quoted above) is explained sufficiently. I'd be happier with:

The conventional dictum that "correlation does not imply causation" is a commonly used admonition against using correlation to support a direct causal relationship among the variables. However, this admonition should not be taken to mean that correlations are acausal, merely that the causes underlying the correlation may be indirect and unknown. A correlation between age and height is fairly causally transparent, but a correlation between mood and health might be less so. Does improved mood lead to improved health? Or does good health lead to good mood? Or does some other factor underlie both? In other words, a correlation can be taken as evidence for a causal relationship, but cannot indicate precisely what the causal relationship might be.

Comments? SJS1971 13:07, 24 January 2007 (UTC)

I like your rewrite. It sounds good. — Chris53516 (Talk) 14:41, 24 January 2007 (UTC)
Okay, with that positive feedback I made the edit. SJS1971 15:30, 24 January 2007 (UTC)
I changed the wording. I don't understand the meaning of "this admonition should not be taken to mean that correlations are acausal". I assume you are referring to whether the actual relation between two quantitative attributes is a causal relation (e.g. height and weight). Literally, correlation could be taken as a kind of relation, but the article defines it in purely algebraic terms. So it could be confusing. I'm also not sure whether admonition is the best term -- it seems a little emotive. Is there a more neutral term? The point is the nature of the logical argument, I would think. Holon 10:06, 4 March 2007 (UTC)