
Pearson product-moment correlation coefficient

In mathematics, the Pearson product-moment correlation coefficient (r) is a measure of how well a linear equation describes the relation between two variables X and Y measured on the same object or organism. It is defined as the sum of the products of the standard scores of the two measures divided by the degrees of freedom:

<math> r = \frac {\sum z_x z_y}{N - 1}</math>
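The definition above can be sketched directly in Python. This is a minimal illustration rather than a reference implementation: the function name `pearson_r` and the sample data are hypothetical, and the standard scores are assumed to use the sample standard deviation (denominator N - 1), which is what makes division by N - 1 give a value between -1 and 1.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson r as the sum of products of standard scores, divided by N - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample standard deviations (N - 1 denominator)
    zx = [(x - mx) / sx for x in xs]  # standard scores of X
    zy = [(y - my) / sy for y in ys]  # standard scores of Y
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 4))  # 0.7746
```

The function raises a `ZeroDivisionError` if either variable is constant, since standard scores are undefined when the standard deviation is zero.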

The result obtained is equivalent to dividing the covariance between the two variables by the product of their standard deviations. In general, the magnitude of a correlation coefficient is the square root of the coefficient of determination (<math>r^2</math>), which is the ratio of explained variation to total variation:

<math> r^2 = {\sum (Y' - \overline Y)^2 \over \sum (Y - \overline Y)^2}</math>

where:

Y = a score on a random variable Y
Y' = corresponding predicted value of Y, given the correlation of X and Y and the value of X
<math>\overline Y</math> = mean of Y

The correlation coefficient adds a sign to show the direction of the relationship. The formula for the Pearson coefficient conforms to this definition, and applies when the relationship is linear.
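As a rough check of this definition, the sketch below computes the ratio of explained to total variation and then attaches a sign. It assumes that the predicted values Y' come from the least-squares line of Y on X and that the sign is taken from the slope of that line; the function name `r_from_determination` and the data are hypothetical.

```python
from math import copysign, sqrt
from statistics import mean

def r_from_determination(xs, ys):
    """Return (r, r_squared), with r_squared = explained / total variation."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx                 # least-squares slope of Y on X
    intercept = my - slope * mx
    y_pred = [intercept + slope * x for x in xs]       # the values Y'
    explained = sum((yp - my) ** 2 for yp in y_pred)   # sum of (Y' - mean)^2
    total = sum((y - my) ** 2 for y in ys)             # sum of (Y - mean)^2
    r_squared = explained / total
    # The sign of r is the direction of the relationship (sign of the slope).
    return copysign(sqrt(r_squared), slope), r_squared

# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
r_up, r2_up = r_from_determination(xs, [2, 4, 5, 4, 5])
r_down, r2_down = r_from_determination(xs, [5, 4, 5, 4, 2])
```

For these data both calls give a coefficient of determination of 0.6, while the two values of r differ only in sign.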

The coefficient ranges from -1 to 1. A value of 1 shows that a linear equation describes the relationship perfectly and positively, with all data points lying on the same line and with Y increasing with X. A value of -1 shows that all data points lie on a single line but that Y decreases as X increases. A value of 0 shows that a linear model is inappropriate – that there is no linear relationship between the variables, although a non-linear relationship may still exist.
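The three reference values can be illustrated with small made-up data sets. The sketch below uses the covariance form of the coefficient (covariance divided by the product of the standard deviations, with the common N - 1 factors cancelled); the function name and all data are hypothetical. The last data set follows Y = X², a case where Y depends entirely on X and yet no linear relationship exists.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson r as covariance over the product of standard deviations."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, chosen only for illustration.
r_positive = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])           # Y = 2X + 1
r_negative = pearson_r([1, 2, 3, 4], [7, 4, 1, -2])          # Y = 10 - 3X
r_zero = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])       # Y = X squared
```

Up to floating-point rounding, the three results are 1, -1 and 0 respectively.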

The Pearson coefficient is a statistic which estimates the correlation of the two given random variables.

The linear equation that best describes the relationship between X and Y can be found by linear regression. If X and Y are both normally distributed, this can be used to "predict" the value of one measurement from knowledge of the other. That is, for each value of X the equation calculates a value which is the best estimate of the values of Y corresponding to that specific value of X. We denote this predicted variable by Y'.
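A minimal sketch of such a prediction follows, assuming an ordinary least-squares fit of Y on X. The helper name `fit_line` and the data are hypothetical and serve only to show how a predicted value is obtained from a given value of X.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line of Y on X."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("X must not be constant")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept = fit_line(xs, ys)

# Predicted value of Y for a new observation X = 6.
y_pred_at_6 = intercept + slope * 6
```

For these data the fitted line has slope 0.6 and intercept 2.2, so the predicted value at X = 6 is about 5.8.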

Any value of Y can therefore be defined as the sum of Y' and the difference between Y and Y':

<math>Y = Y^\prime + (Y - Y^\prime)</math>

The variance of Y is equal to the sum of the variances of the two components of Y, namely the predicted values Y' and the residuals Y - Y':

<math>s_y^2 = s_{y^\prime}^2 + s^2_{y.x}</math>

Since the coefficient of determination implies that <math>s^2_{y.x} = s_y^2 (1 - r^2)</math>, we can derive the identity

<math>r^2 = {s_{y^\prime}^2 \over s_y^2}</math>
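The decomposition and the resulting identity can be checked numerically. The sketch below is illustrative only: it assumes a least-squares fit and, so that the three variances add up exactly, uses the same N - 1 denominator for all of them (the residual variance is sometimes defined with a different denominator, in which case the sum holds only approximately). All names and data are hypothetical.

```python
from statistics import mean

def variance_components(xs, ys):
    """Return (var_y, var_pred, var_resid), each with an N - 1 denominator."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    y_pred = [intercept + slope * x for x in xs]            # Y'
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)         # variance of Y
    var_pred = sum((yp - my) ** 2 for yp in y_pred) / (n - 1)  # variance of Y'
    var_resid = sum((y - yp) ** 2 for y, yp in zip(ys, y_pred)) / (n - 1)
    return var_y, var_pred, var_resid

# Hypothetical data, chosen only for illustration.
var_y, var_pred, var_resid = variance_components([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
r_squared = var_pred / var_y
```

For these data the variances are 1.5, 0.9 and 0.6, so the two components sum to the variance of Y and the ratio gives a coefficient of determination of 0.6.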

The square of r is conventionally used as a measure of the strength of the association between X and Y. For example, if the coefficient is 0.90, then 81% of the variance of Y (0.90 squared is 0.81) is said to be explained by the changes in X and the linear relation between X and Y.

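r is a parametric statistic. It assumes that the variables being assessed are normally distributed. If this assumption is violated, a non-parametric alternative such as Spearman's ρ, which is computed from the ranks of the observations and is sensitive to any monotonic relationship rather than only a linear one, may be more successful in detecting an association.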
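The difference between the two coefficients can be sketched as follows. Spearman's ρ is obtained here by applying the Pearson formula to the ranks of the observations; for brevity the hypothetical `ranks` helper assumes there are no tied values (ties would normally receive averaged ranks). The data, which follow Y = X³, are made up for illustration.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson r as covariance over the product of standard deviations."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank values from 1 (smallest) to N; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's rho as the Pearson coefficient of the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

# Hypothetical data: Y increases with X, but not linearly.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]
r = pearson_r(xs, ys)
rho = spearman_rho(xs, ys)
```

Here ρ equals 1 because Y increases whenever X does, while r is below 1 (roughly 0.94) because the points do not lie on a straight line.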

wikipedia.org dumped 2003-03-17 with terodump