Pearson correlation coefficient

In statistics, the Pearson correlation coefficient (PCC, pronounced /ˈpɪərsən/), also referred to as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC), or the bivariate correlation,^[1] is a measure of the linear correlation between two variables X and Y. It takes values between −1 and +1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. It is widely used in the sciences. It was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s.

Definition

Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. The form of the definition involves a "product moment", that is, the mean (the first moment about the origin) of the product of the mean-adjusted random variables; hence the modifier product-moment in the name.

For a population

Pearson's correlation coefficient when applied to a population is commonly represented by the Greek letter ρ (rho) and may be referred to as the population correlation coefficient or the population Pearson correlation coefficient. The formula for ρ^[5] is:

\rho _{X,Y}={\frac {\operatorname {cov} (X,Y)}{\sigma _{X}\sigma _{Y}}}

where:

$\operatorname {cov}$ is the covariance
$\sigma _{X}$ is the standard deviation of $X$
$\sigma _{Y}$ is the standard deviation of $Y$

The formula for ρ can be expressed in terms of mean and expectation. Since

\operatorname {cov} (X,Y)=\operatorname {E} [(X-\mu _{X})(Y-\mu _{Y})]

^[5]

the formula for ρ can also be written as

\rho _{X,Y}={\frac {\operatorname {E} [(X-\mu _{X})(Y-\mu _{Y})]}{\sigma _{X}\sigma _{Y}}}

where:

$\operatorname {cov}$, $\sigma _{X}$ and $\sigma _{Y}$ are defined as above
$\mu _{X}$ is the mean of $X$, and $\mu _{Y}$ is the mean of $Y$
$\operatorname {E}$ is the expectation.
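As an illustration of the expectation form, the following minimal sketch treats a finite list of paired values as the entire population, so each expectation becomes a plain average with divisor n. The function name `population_pearson` and the example data are assumptions made for this sketch, not part of the source.

```python
import math

def population_pearson(xs, ys):
    """Population correlation of paired values, treating them as the whole population."""
    n = len(xs)
    mu_x = sum(xs) / n                      # mu_X = E[X]
    mu_y = sum(ys) / n                      # mu_Y = E[Y]
    # cov(X, Y) = E[(X - mu_X)(Y - mu_Y)]
    cov_xy = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    # Population standard deviations (divisor n, not n - 1)
    sigma_x = math.sqrt(sum((x - mu_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mu_y) ** 2 for y in ys) / n)
    return cov_xy / (sigma_x * sigma_y)

# Hypothetical data with an exact linear relation y = 2x
rho = population_pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(rho)
```

Because the example data lie exactly on an increasing line, the result should equal 1 up to floating-point rounding.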

The formula for ρ can be expressed in terms of uncentered moments. Since

$\mu _{X}=\operatorname {E} [X]$
$\mu _{Y}=\operatorname {E} [Y]$
$\sigma _{X}^{2}=\operatorname {E} [(X-\operatorname {E} [X])^{2}]=\operatorname {E} [X^{2}]-(\operatorname {E} [X])^{2}$
$\sigma _{Y}^{2}=\operatorname {E} [(Y-\operatorname {E} [Y])^{2}]=\operatorname {E} [Y^{2}]-(\operatorname {E} [Y])^{2}$
$\operatorname {E} [(X-\mu _{X})(Y-\mu _{Y})]=\operatorname {E} [(X-\operatorname {E} [X])(Y-\operatorname {E} [Y])]=\operatorname {E} [XY]-\operatorname {E} [X]\operatorname {E} [Y],\,$

the formula for ρ can also be written as

\rho _{X,Y}={\frac {\operatorname {E} [XY]-\operatorname {E} [X]\operatorname {E} [Y]}{{\sqrt {\operatorname {E} [X^{2}]-(\operatorname {E} [X])^{2}}}~{\sqrt {\operatorname {E} [Y^{2}]-(\operatorname {E} [Y])^{2}}}}}.
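The uncentered-moment form can be checked on a small discrete distribution. The sketch below assumes the joint distribution is given as a dictionary mapping (x, y) pairs to probabilities; the helper names `expectation` and `rho_from_moments` and the example probabilities are hypothetical.

```python
import math

def expectation(pmf, f):
    """E[f(X, Y)] for a joint pmf given as {(x, y): probability}."""
    return sum(p * f(x, y) for (x, y), p in pmf.items())

def rho_from_moments(pmf):
    """Correlation from the uncentered moments E[X], E[Y], E[XY], E[X^2], E[Y^2]."""
    e_x = expectation(pmf, lambda x, y: x)
    e_y = expectation(pmf, lambda x, y: y)
    e_xy = expectation(pmf, lambda x, y: x * y)
    e_xx = expectation(pmf, lambda x, y: x * x)
    e_yy = expectation(pmf, lambda x, y: y * y)
    cov = e_xy - e_x * e_y        # E[XY] - E[X]E[Y]
    var_x = e_xx - e_x ** 2       # E[X^2] - (E[X])^2
    var_y = e_yy - e_y ** 2       # E[Y^2] - (E[Y])^2
    return cov / (math.sqrt(var_x) * math.sqrt(var_y))

# Hypothetical joint distribution of two binary variables that usually agree
pmf = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
rho = rho_from_moments(pmf)
print(rho)
```

For this distribution E[XY] = 0.4, E[X] = E[Y] = 0.5 and both variances are 0.25, so ρ = 0.15 / 0.25 = 0.6 up to rounding.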

For a sample

Pearson's correlation coefficient when applied to a sample is commonly represented by the letter r and may be referred to as the sample correlation coefficient or the sample Pearson correlation coefficient. We can obtain a formula for r by substituting estimates of the covariance and variances based on a sample into the formula above. So if we have one dataset {x₁,...,x_n} containing n values and another dataset {y₁,...,y_n} containing n values, then the formula for r is:

r={\frac {\sum _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{{\sqrt {\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}{\sqrt {\sum _{i=1}^{n}(y_{i}-{\bar {y}})^{2}}}}}

where:

$n,x_{i},y_{i}$ are defined as above
${\bar {x}}={\frac {1}{n}}\sum _{i=1}^{n}x_{i}$ (the sample mean); and analogously for ${\bar {y}}$
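A direct transcription of this definition might look like the following sketch, which first computes the sample means and then the three sums of deviations. The function name `pearson_r` and the five data pairs are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation r computed from deviations about the sample means."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (math.sqrt(s_xx) * math.sqrt(s_yy))

# Hypothetical sample
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(xs, ys)
print(round(r, 4))  # 0.7746
```

Here the sample means are 3 and 4, the sum of cross-products is 6, and the sums of squares are 10 and 6, so r = 6 / √60 ≈ 0.7746.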

Rearranging gives us this formula for r:

r=r_{xy}={\frac {n\sum x_{i}y_{i}-\sum x_{i}\sum y_{i}}{{\sqrt {n\sum x_{i}^{2}-(\sum x_{i})^{2}}}~{\sqrt {n\sum y_{i}^{2}-(\sum y_{i})^{2}}}}}.

where:

$n,x_{i},y_{i}$ are defined as above

This formula suggests a convenient single-pass algorithm for calculating sample correlations, but, depending on the numbers involved, it can sometimes be numerically unstable.
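The instability can be observed by adding a large constant to both datasets, which leaves the true correlation unchanged. The sketch below compares a single-pass implementation of the rearranged formula with a two-pass implementation based on deviations from the means. The function names, the data, and the offset of 10⁹ are assumptions chosen for illustration. The single-pass function returns NaN when rounding makes a denominator term non-positive.

```python
import math

def pearson_single_pass(xs, ys):
    """r from running sums, following the rearranged single-pass formula."""
    n = 0
    sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
    for x, y in zip(xs, ys):
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y
    num = n * sum_xy - sum_x * sum_y
    den_x = n * sum_xx - sum_x * sum_x
    den_y = n * sum_yy - sum_y * sum_y
    if den_x <= 0.0 or den_y <= 0.0:
        return float("nan")  # cancellation destroyed the variance estimate
    return num / (math.sqrt(den_x) * math.sqrt(den_y))

def pearson_two_pass(xs, ys):
    """r from deviations about the means (one pass for means, one for sums)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (math.sqrt(s_xx) * math.sqrt(s_yy))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
offset = 1e9  # shifting both variables does not change the true correlation
xs_shifted = [x + offset for x in xs]
ys_shifted = [y + offset for y in ys]

r_small_single = pearson_single_pass(xs, ys)
r_small_two = pearson_two_pass(xs, ys)
r_shift_single = pearson_single_pass(xs_shifted, ys_shifted)
r_shift_two = pearson_two_pass(xs_shifted, ys_shifted)

print("unshifted:", r_small_single, r_small_two)
print("shifted:  ", r_shift_single, r_shift_two)
```

On the unshifted data the two functions agree. On the shifted data the squared values are near 10¹⁸, beyond the roughly 16 significant digits of double precision, so the single-pass differences lose their significance, while the two-pass version should still reproduce the unshifted value.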

Rearranging again gives us this^[5] formula for r:

r=r_{xy}={\frac {\sum x_{i}y_{i}-n{\bar {x}}{\bar {y}}}{{\sqrt {(\sum x_{i}^{2}-n{\bar {x}}^{2})}}~{\sqrt {(\sum y_{i}^{2}-n{\bar {y}}^{2})}}}}.

where:

$n,x_{i},y_{i},{\bar {x}},{\bar {y}}$ are defined as above

An equivalent expression gives the formula for r as the mean of the products of the standard scores as follows:

r=r_{xy}={\frac {1}{n-1}}\sum _{i=1}^{n}\left({\frac {x_{i}-{\bar {x}}}{s_{x}}}\right)\left({\frac {y_{i}-{\bar {y}}}{s_{y}}}\right)

where

$n,x_{i},y_{i},{\bar {x}},{\bar {y}}$ are defined as above, and $s_{x},s_{y}$ are defined below
$\left({\frac {x_{i}-{\bar {x}}}{s_{x}}}\right)$ is the standard score (and analogously for the standard score of y)
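The standard-score form can be sketched with Python's `statistics` module, whose `stdev` function uses the n − 1 divisor and therefore matches the sample standard deviation s used in this formula. The function name `pearson_from_standard_scores` and the data are assumptions for illustration.

```python
import statistics

def pearson_from_standard_scores(xs, ys):
    """r as the sum of products of standard scores, divided by n - 1."""
    n = len(xs)
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    s_x = statistics.stdev(xs)   # sample standard deviation (divisor n - 1)
    s_y = statistics.stdev(ys)
    z_x = [(x - x_bar) / s_x for x in xs]   # standard scores of x
    z_y = [(y - y_bar) / s_y for y in ys]   # standard scores of y
    return sum(a * b for a, b in zip(z_x, z_y)) / (n - 1)

# Hypothetical sample
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_from_standard_scores(xs, ys)
print(round(r, 4))  # 0.7746
```

For these data s_x² = 2.5 and s_y² = 1.5, and the sum of cross-products of deviations is 6, giving r = 6 / (4 · √3.75) ≈ 0.7746.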

Alternative formulae for r are also available. One can use the following formula for r:

r=r_{xy}={\frac {\sum x_{i}y_{i}-n{\bar {x}}{\bar {y}}}{(n-1)s_{x}s_{y}}}

where:

$n,x_{i},y_{i},{\bar {x}},{\bar {y}}$ are defined as above and:
$s_{x}={\sqrt {{\frac {1}{n-1}}\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}$ (the sample standard deviation); and analogously for s_y

Practical issues

Under heavy noise conditions, extracting the correlation coefficient between two sets of stochastic variables is nontrivial, in particular where Canonical Correlation Analysis reports degraded correlation values due to the heavy noise contributions. A generalization of the approach is given elsewhere.^[6]

In the case of missing data, Garren derived the maximum likelihood estimator.^[7]

Interpretation

The correlation coefficient ranges from −1 to 1. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear correlation between the variables.
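These three cases can be illustrated with a short sketch. It assumes a hypothetical helper `pearson_r` that implements the sample formula, and uses made-up data: an increasing line, a decreasing line, and a parabola that is symmetric about the mean of x. The parabola shows that a coefficient of 0 rules out a linear relationship only, since y is fully determined by x there.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation r from deviations about the sample means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (math.sqrt(s_xx) * math.sqrt(s_yy))

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]

r_up = pearson_r(xs, [2.0 * x + 1.0 for x in xs])   # points on an increasing line
r_down = pearson_r(xs, [-3.0 * x for x in xs])      # points on a decreasing line
r_parabola = pearson_r(xs, [x * x for x in xs])     # y = x^2, not linear

print(r_up, r_down, r_parabola)
```

Up to floating-point rounding, the first two values are +1 and −1, and the third is 0 because the positive and negative cross-products cancel exactly.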

More generally, note that (X_i − X̄)(Y_i − Ȳ) is positive if and only if X_i and Y_i lie on the same side of their respective means. Thus the correlation coefficient is positive if X_i and Y_i tend to be simultaneously greater than, or simultaneously less than, their respective means. The correlation coefficient is negative (anti-correlation) if X_i and Y_i tend to lie on opposite sides of their respective means. Moreover, the stronger either tendency is, the larger the absolute value of the correlation coefficient.

Rodgers and Nicewander^[8] cataloged thirteen ways of interpreting correlation:

Function of raw scores and means
Standardized covariance
Standardized slope of the regression line
Geometric mean of the two regression slopes
Square root of the ratio of two variances
Mean cross-product of standardized variables
Function of the angle between two standardized regression lines
Function of the angle between two variable vectors
Rescaled variance of the difference between standardized scores
Estimated from the balloon rule
Related to the bivariate ellipses of isoconcentration
Function of test statistics from designed experiments
Ratio of two means

[1]

[5]

[6]

[7]

[8]

ZEN CONSULTING

Tuesday, 4 July 2017