Pearson correlation coefficient
In statistics, the Pearson correlation coefficient (PCC, pronounced /ˈpɪərsən/), also referred to as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC) or the bivariate correlation,[1] is a measure of the linear correlation between two variables X and Y. It takes a value between +1 and −1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. It is widely used in the sciences. It was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s.
Definition
Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. The form of the definition involves a "product moment", that is, the mean (the first moment about the origin) of the product of the mean-adjusted random variables; hence the modifier product-moment in the name.
For a population
Pearson's correlation coefficient when applied to a population is commonly represented by the Greek letter ρ (rho) and may be referred to as the population correlation coefficient or the population Pearson correlation coefficient. The formula for ρ[5] is:
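- $\rho_{X,Y} = \dfrac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}$
- where:
- $\operatorname{cov}$ is the covariance
- $\sigma_X$ is the standard deviation of $X$
- $\sigma_Y$ is the standard deviation of $Y$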
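The formula for ρ can be expressed in terms of mean and expectation. Since
- $\operatorname{cov}(X,Y) = \operatorname{E}[(X - \mu_X)(Y - \mu_Y)]$,
the formula for ρ can also be written as
- $\rho_{X,Y} = \dfrac{\operatorname{E}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$
- where:
- $\sigma_X$ and $\sigma_Y$ are defined as above
- $\mu_X$ is the mean of $X$
- $\mu_Y$ is the mean of $Y$
- $\operatorname{E}$ is the expectation.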
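The formula for ρ can be expressed in terms of uncentered moments. Since
- $\mu_X = \operatorname{E}[X]$
- $\mu_Y = \operatorname{E}[Y]$
- $\sigma_X^2 = \operatorname{E}[(X - \operatorname{E}[X])^2] = \operatorname{E}[X^2] - (\operatorname{E}[X])^2$
- $\sigma_Y^2 = \operatorname{E}[(Y - \operatorname{E}[Y])^2] = \operatorname{E}[Y^2] - (\operatorname{E}[Y])^2$
- $\operatorname{E}[(X - \mu_X)(Y - \mu_Y)] = \operatorname{E}[XY] - \operatorname{E}[X]\operatorname{E}[Y]$,
the formula for ρ can also be written as
- $\rho_{X,Y} = \dfrac{\operatorname{E}[XY] - \operatorname{E}[X]\operatorname{E}[Y]}{\sqrt{\operatorname{E}[X^2] - (\operatorname{E}[X])^2}\,\sqrt{\operatorname{E}[Y^2] - (\operatorname{E}[Y])^2}}$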
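As an illustration of the uncentered-moment form, the following minimal sketch computes ρ for a small discrete joint distribution. The function name `population_correlation` and the example probabilities are hypothetical, chosen only for this illustration; they do not come from the cited sources.

```python
import math


def population_correlation(pmf):
    """Return rho for a discrete joint distribution.

    `pmf` maps (x, y) pairs to probabilities that are assumed to sum to 1.
    """
    # Uncentered moments E[X], E[Y], E[X^2], E[Y^2], E[XY].
    e_x = sum(p * x for (x, y), p in pmf.items())
    e_y = sum(p * y for (x, y), p in pmf.items())
    e_xx = sum(p * x * x for (x, y), p in pmf.items())
    e_yy = sum(p * y * y for (x, y), p in pmf.items())
    e_xy = sum(p * x * y for (x, y), p in pmf.items())

    covariance = e_xy - e_x * e_y
    sigma_x = math.sqrt(e_xx - e_x ** 2)
    sigma_y = math.sqrt(e_yy - e_y ** 2)
    return covariance / (sigma_x * sigma_y)


# Hypothetical joint distribution of two binary variables.
example_pmf = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
rho = population_correlation(example_pmf)  # approximately 0.6
```

Here E[X] = E[Y] = 0.5, E[XY] = 0.4 and both variances equal 0.25, so ρ = 0.15 / 0.25 = 0.6.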
For a sample
Pearson's correlation coefficient when applied to a sample is commonly represented by the letter r and may be referred to as the sample correlation coefficient or the sample Pearson correlation coefficient. A formula for r is obtained by substituting sample-based estimates of the covariances and variances into the formula above. Given one dataset {x1,...,xn} containing n values and another dataset {y1,...,yn} containing n values, the formula for r is:
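- $r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
- where:
- $n$, $x_i$, $y_i$ are defined as above
- $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (the sample mean); and analogously for $\bar{y}$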
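The definition of r as the sum of products of deviations from the sample means, divided by the product of the square roots of the summed squared deviations, can be sketched in a few lines of Python. The function name `pearson_r` and the sample data are assumptions made for this illustration only.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation computed directly from the definition."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two datasets of equal length with n >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Deviations of each observation from its sample mean.
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    numerator = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    denominator = (math.sqrt(sum(dx * dx for dx in dev_x))
                   * math.sqrt(sum(dy * dy for dy in dev_y)))
    if denominator == 0:
        raise ValueError("r is undefined when either dataset is constant")
    return numerator / denominator


# Hypothetical data: the deviation products sum to 6, the squared
# deviations sum to 10 and 6, so r = 6 / sqrt(60).
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

This version makes two passes over the data (one for the means, one for the deviations), which is generally the numerically safer choice.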
Rearranging gives us this formula for r:
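- $r = \dfrac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}$
- where:
- $n$, $x_i$, $y_i$ are defined as above
This formula suggests a convenient single-pass algorithm for calculating sample correlations but, depending on the numbers involved, it can sometimes be numerically unstable.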
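A single-pass computation of this kind only needs to accumulate n and the running sums Σx, Σy, Σx², Σy² and Σxy. The sketch below is one possible arrangement, not a reference implementation; the name `pearson_r_single_pass` and the data are hypothetical. The instability arises because the numerator and the terms under the square roots are differences of large, nearly equal quantities when the values are large relative to their spread.

```python
import math


def pearson_r_single_pass(pairs):
    """Sample correlation from running sums, reading each (x, y) pair once."""
    n = 0
    sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
    for x, y in pairs:
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y
    numerator = n * sum_xy - sum_x * sum_y
    spread_x = n * sum_xx - sum_x * sum_x
    spread_y = n * sum_yy - sum_y * sum_y
    # Rounding error can drive these differences to zero or below;
    # report NaN instead of taking the square root of a non-positive number.
    if spread_x <= 0 or spread_y <= 0:
        return float("nan")
    return numerator / (math.sqrt(spread_x) * math.sqrt(spread_y))


data = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]
r_small = pearson_r_single_pass(data)

# Adding a large constant offset leaves the true correlation unchanged,
# but the running sums lose precision, so this result may be inaccurate.
shifted = [(x + 1e9, y + 1e9) for x, y in data]
r_shifted = pearson_r_single_pass(shifted)
```

For well-scaled data such as `data`, the result agrees with the two-pass definition; comparing `r_small` with `r_shifted` on a given machine shows how much precision the offset costs.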
Rearranging again gives us this[5] formula for r:
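- $r = \dfrac{\sum x_i y_i - n \bar{x} \bar{y}}{\sqrt{\sum x_i^2 - n \bar{x}^2}\,\sqrt{\sum y_i^2 - n \bar{y}^2}}$
- where:
- $n$, $x_i$, $y_i$, $\bar{x}$, $\bar{y}$ are defined as above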
An equivalent expression gives the formula for r as the mean of the products of the standard scores as follows:
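- $r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left(\dfrac{x_i - \bar{x}}{s_x}\right) \left(\dfrac{y_i - \bar{y}}{s_y}\right)$
- where:
- $n$, $x_i$, $y_i$, $\bar{x}$, $\bar{y}$ are defined as above, and $s_x$, $s_y$ are defined below
- $\dfrac{x_i - \bar{x}}{s_x}$ is the standard score (and analogously for the standard score of $y$)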
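The standard-score form can be checked numerically with Python's standard library, whose `statistics.stdev` uses the n − 1 denominator of the sample standard deviation. This is a minimal sketch; the function name `pearson_r_from_standard_scores` and the data are illustrative assumptions.

```python
import statistics


def pearson_r_from_standard_scores(xs, ys):
    """Sample correlation as the sum of standard-score products over n - 1."""
    n = len(xs)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    z_x = [(x - mean_x) / s_x for x in xs]
    z_y = [(y - mean_y) / s_y for y in ys]
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)


# Hypothetical data; the result agrees with the deviation-based
# definition, 6 / sqrt(60), up to rounding.
r_z = pearson_r_from_standard_scores([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```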
Alternative formulae for r are also available. One can use the following formula for r:
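- $r = \dfrac{\sum x_i y_i - n \bar{x} \bar{y}}{(n-1)\, s_x s_y}$
- where:
- $n$, $x_i$, $y_i$, $\bar{x}$, $\bar{y}$ are defined as above and:
- $s_x = \sqrt{\dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$ (the sample standard deviation); and analogously for $s_y$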
Practical issues
Under heavy noise, extracting the correlation coefficient between two sets of stochastic variables is nontrivial, in particular where Canonical Correlation Analysis reports degraded correlation values owing to the heavy noise contributions. A generalization of the approach is given elsewhere.[6]
For the case of missing data, Garren derived the maximum likelihood estimator.[7]
Interpretation
The correlation coefficient ranges from −1 to 1. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear correlation between the variables.
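The three reference values can be reproduced on small synthetic datasets. The sketch below uses a hypothetical helper `pearson_r` and made-up data: an increasing line, a decreasing line, and a symmetric parabola. In the parabola case Y is completely determined by X, yet r is 0, which illustrates that a zero coefficient rules out only a linear relationship.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = (math.sqrt(sum((x - mean_x) ** 2 for x in xs))
                   * math.sqrt(sum((y - mean_y) ** 2 for y in ys)))
    return numerator / denominator


xs = [-2, -1, 0, 1, 2]
r_increasing = pearson_r(xs, [2 * x + 1 for x in xs])   # close to +1
r_decreasing = pearson_r(xs, [-3 * x + 2 for x in xs])  # close to -1
r_parabola = pearson_r(xs, [x * x for x in xs])         # 0: no linear trend
```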
More generally, note that $(X_i - \bar{X})(Y_i - \bar{Y})$ is positive if and only if $X_i$ and $Y_i$ lie on the same side of their respective means. Thus the correlation coefficient is positive if $X_i$ and $Y_i$ tend to be simultaneously greater than, or simultaneously less than, their respective means. The correlation coefficient is negative (anti-correlation) if $X_i$ and $Y_i$ tend to lie on opposite sides of their respective means. Moreover, the stronger either tendency is, the larger the absolute value of the correlation coefficient.
Rodgers and Nicewander[8] cataloged thirteen ways of interpreting correlation:
- Function of raw scores and means
- Standardized covariance
- Standardized slope of the regression line
- Geometric mean of the two regression slopes
- Square root of the ratio of two variances
- Mean cross-product of standardized variables
- Function of the angle between two standardized regression lines
- Function of the angle between two variable vectors
- Rescaled variance of the difference between standardized scores
- Estimated from the balloon rule
- Related to the bivariate ellipses of isoconcentration
- Function of test statistics from designed experiments
- Ratio of two means