Pearson product-moment correlation coefficient

From Wikipedia, the free encyclopedia

In statistics, the Pearson product-moment correlation coefficient (r), sometimes known as the PMCC, is a measure of the correlation of two variables X and Y measured on the same object or organism, that is, a measure of the tendency of the variables to increase or decrease together. It is defined as the sum of the products of the standard scores of the two measures divided by the degrees of freedom, n − 1:

r = \frac {\sum z_x z_y}{n - 1}

Note that this formula assumes that the standard deviations on which the Z scores are based are calculated using n − 1 in the denominator.
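The definition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names sample_std and pearson_r and the data are assumptions made for the example, and the standard deviations use n − 1 in the denominator as the note above requires.

```python
import math

def sample_std(values):
    """Sample standard deviation with n - 1 in the denominator."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def pearson_r(xs, ys):
    """Sum of the products of the standard scores, divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x, sd_y = sample_std(xs), sample_std(ys)
    z_x = [(x - mean_x) / sd_x for x in xs]  # standard scores of X
    z_y = [(y - mean_y) / sd_y for y in ys]  # standard scores of Y
    return sum(a * b for a, b in zip(z_x, z_y)) / (n - 1)

# Illustrative data, not taken from the article.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(xs, ys)
print(r)
```

If the standard scores were instead computed with n in the denominator, the final division would also have to use n for the result to be the same.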

The result obtained is equivalent to dividing the covariance between the two variables by the product of their standard deviations. In general the correlation coefficient is one of the two square roots (either positive or negative) of the coefficient of determination (r^2), which is the ratio of explained variation to total variation:

r^2 = {\sum (Y' - \overline Y)^2 \over \sum (Y - \overline Y)^2}

where:

Y = a score on a random variable Y
Y' = corresponding predicted value of Y, given the correlation of X and Y and the value of X
\overline Y = sample mean of Y (i.e., the mean of a finite number of independent observed realizations of Y, not to be confused with the expected value of Y)

The correlation coefficient adds a sign to show the direction of the relationship. The formula for the Pearson coefficient conforms to this definition, and applies when the relationship is linear.
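The two equivalent routes to r described above can be compared numerically. The sketch below assumes that the predicted values Y′ come from an ordinary least-squares line and that the sign of r is taken from the slope of that line; the function names and the data are illustrative only.

```python
import math

def mean(values):
    return sum(values) / len(values)

def r_from_covariance(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)

def r_from_explained_variation(xs, ys):
    """Signed square root of explained variation over total variation."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx                  # least-squares slope (assumed model)
    intercept = my - slope * mx
    predicted = [intercept + slope * x for x in xs]    # the values Y'
    explained = sum((p - my) ** 2 for p in predicted)  # sum (Y' - mean Y)^2
    total = sum((y - my) ** 2 for y in ys)             # sum (Y - mean Y)^2
    r_squared = explained / total
    return math.copysign(math.sqrt(r_squared), slope)  # sign gives the direction

# Illustrative data with a decreasing trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [9.0, 7.0, 8.0, 4.0, 3.0]
r_cov = r_from_covariance(xs, ys)
r_var = r_from_explained_variation(xs, ys)
print(r_cov, r_var)
```

Both functions should agree up to floating-point rounding, and both return a negative value here because Y tends to fall as X rises.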

The coefficient ranges from −1 to 1. A value of 1 shows that a linear equation describes the relationship perfectly and positively, with all data points lying on the same line and with Y increasing with X. A value of −1 shows that all data points lie on a single line but that Y decreases as X increases. A value of 0 shows that a linear model is inappropriate, that is, that there is no linear relationship between the variables.
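The three boundary cases can be reproduced with small made-up samples. In this sketch the helper pearson_r and the data are assumptions for illustration. The third sample lies exactly on a parabola, which shows that a coefficient of 0 rules out a linear relationship but not a dependence of Y on X.

```python
import math

def pearson_r(xs, ys):
    """Pearson coefficient from sums of centered products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
r_up = pearson_r(xs, [3.0 + 2.0 * x for x in xs])    # points on a rising line
r_down = pearson_r(xs, [3.0 - 2.0 * x for x in xs])  # points on a falling line
r_curve = pearson_r(xs, [x * x for x in xs])         # points on a symmetric parabola
print(r_up, r_down, r_curve)
```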

The Pearson coefficient is a statistic which estimates the correlation of the two given random variables.

The linear equation that best describes the relationship between X and Y can be found by linear regression. This equation can be used to "predict" the value of one measurement from knowledge of the other. That is, for each value of X the equation calculates a value which is the best estimate of the values of Y corresponding to that specific value of X. We denote this predicted variable by Y′.
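A minimal sketch of this prediction step follows, assuming simple least-squares regression of Y on X. The names fit_line and predict and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting Y from X."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

def predict(x, slope, intercept):
    """Predicted value Y' for a given value of X."""
    return intercept + slope * x

# Illustrative data, not taken from the article.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
slope, intercept = fit_line(xs, ys)
y_pred = predict(6.0, slope, intercept)
print(slope, intercept, y_pred)
```

The fitted line always passes through the point formed by the two sample means, which is why the intercept is computed from the means and the slope.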

Any value of Y can therefore be defined as the sum of Y′ and the difference between Y and Y′:

Y = Y^\prime + (Y - Y^\prime)

The variance of Y is equal to the sum of the variances of its two components, the predicted values Y′ (variance s_{y′}^2) and the residuals Y − Y′ (variance s_{y.x}^2):

s_y^2 = s_{y^\prime}^2 + s^2_{y.x}\,

Since the coefficient of determination implies that s_{y.x}^2 = s_y^2 (1 − r^2), we can derive the identity

r^2 = {s_{y^\prime}^2 \over s_y^2}.
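The decomposition and the resulting identity can be checked numerically. The sketch below assumes a least-squares line and computes all three variances with the same n − 1 denominator, which is what makes the sum exact; the data and names are illustrative.

```python
def sample_variance(values):
    """Sample variance with n - 1 in the denominator."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / (n - 1)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# Least-squares line for predicting Y from X (assumed model).
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

predicted = [intercept + slope * x for x in xs]    # Y'
residuals = [y - p for y, p in zip(ys, predicted)] # Y - Y'

var_y = sample_variance(ys)             # s_y^2
var_pred = sample_variance(predicted)   # s_{y'}^2
var_resid = sample_variance(residuals)  # s_{y.x}^2
r_squared = var_pred / var_y

print(var_y, var_pred + var_resid, r_squared)
```

Some texts define the residual variance with n − 2 in the denominator; under that convention the decomposition holds for sums of squares rather than for the variances themselves.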

The square of r is conventionally used as a measure of the association between X and Y. For example, if the coefficient is 0.90, then 81% of the variance of Y can be "accounted for" by changes in X and the linear relationship between X and Y.
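As a worked check of this example, using only the relation between the residual variance and r^2 given above:

r = 0.90 \implies r^2 = 0.81, \qquad s_{y.x}^2 = (1 - 0.81)\, s_y^2 = 0.19\, s_y^2

The remaining 19% of the variance of Y is therefore not accounted for by the linear relationship.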

Trivia

  • The CORREL() function in many major spreadsheet packages, such as Microsoft Excel, OpenOffice.org Calc and Gnumeric, calculates Pearson's correlation coefficient.
  • The Pearson() function in Microsoft Excel calculates Pearson's correlation coefficient.
  • In MATLAB and Minitab, corr(X) calculates Pearson's correlation coefficient along with the p-value.
    • In MATLAB, corrcoef also calculates Pearson's correlation coefficient: R = corrcoef(X) returns a matrix R of correlation coefficients calculated from an input matrix X whose rows are observations and whose columns are variables.
  • In S-Plus, cor.test(X,Y) calculates Pearson's correlation coefficient.
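The matrix layout described for corrcoef, with rows as observations and columns as variables, can be imitated in plain Python. The function correlation_matrix below is a hypothetical helper written for illustration and is not the MATLAB implementation; the data are made up.

```python
import math

def correlation_matrix(rows):
    """Pearson correlation matrix; rows are observations, columns are variables."""
    n = len(rows)
    k = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(k)]
    means = [sum(col) / n for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
            for j in range(k)
        ]
        for i in range(k)
    ]

# Five observations of three variables (illustrative values).
data = [
    [1.0, 2.0, 10.0],
    [2.0, 4.0, 8.0],
    [3.0, 5.0, 9.0],
    [4.0, 4.0, 5.0],
    [5.0, 5.0, 3.0],
]
R = correlation_matrix(data)
for row in R:
    print(row)
```

The result is symmetric with ones on the diagonal, since each variable correlates perfectly with itself. A column with no variation would cause a division by zero in this sketch.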
