CORRELATION COEFFICIENT: FORMULAS, CALCULATION, INTERPRETATION, EXAMPLE - DUDAS

The correlation coefficient in statistics is an indicator that measures how strongly two quantitative variables X and Y tend to follow a linear relationship.

Generally, the pairs of variables X and Y are two characteristics of the same population. For example, X could be a person's height and Y that person's weight.

Figure 1. Correlation coefficient for four data pairs (X, Y). Source: F. Zapata.

In this case, the correlation coefficient would indicate whether or not there is a trend towards a linear relationship between height and weight in a given population.

Pearson's linear correlation coefficient is denoted by the lowercase letter r and its minimum and maximum values are -1 and +1 respectively.

A value r = +1 would indicate that the pairs (X, Y) lie exactly on a straight line and that when X grows, Y grows in the same proportion. If instead r = -1, the pairs would also lie exactly on a straight line, but in this case when X increases, Y decreases in the same proportion.

Figure 2. Different values of the linear correlation coefficient. Source: Wikimedia Commons.

A value r = 0, on the other hand, would indicate that there is no linear correlation between the variables X and Y, whereas a value of r = +0.8 would indicate that the pairs (X, Y) tend to cluster closely on either side of a line with positive slope.
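As a quick check of these limiting cases, the following Python sketch computes r for three small made-up data sets: one lying exactly on an increasing line, one on a decreasing line, and one on a symmetric parabola. The helper name `pearson_r` and the data are illustrative assumptions, not part of the original article.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The (N-1) factors cancel in the ratio, so plain sums are enough here.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-2, -1, 0, 1, 2]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])     # points on an increasing line
r_down = pearson_r(xs, [-3 * x + 5 for x in xs])  # points on a decreasing line
r_curve = pearson_r(xs, [x ** 2 for x in xs])     # symmetric parabola

print(round(r_up, 6), round(r_down, 6), round(r_curve, 6))
```

The parabola case is worth noting: Y is completely determined by X, yet r is zero, because r only measures the linear part of a relationship.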

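The formula to calculate the correlation coefficient r is as follows:

r = Sxy / (Sx * Sy)

where Sxy is the covariance of X and Y, and Sx and Sy are the standard deviations of X and Y respectively.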

How to calculate the correlation coefficient?

The linear correlation coefficient is a statistical quantity that is built into scientific calculators, most spreadsheets, and statistical programs.

However, it is useful to know how the formula that defines it is applied, so a detailed calculation on a small data set is shown below.

As stated in the previous section, the correlation coefficient is the covariance Sxy divided by the product of the standard deviation Sx of the variable X and the standard deviation Sy of the variable Y.

Covariance and variance

The covariance Sxy is:

Sxy = [Σ (Xi - X̄)(Yi - Ȳ)] / (N-1)

where the sum runs over the N pairs of data (Xi, Yi), with i from 1 to N, and X̄ and Ȳ are the arithmetic means of the data Xi and Yi respectively.

For its part, the standard deviation of the variable X is the square root of the variance of the data set Xi, with i from 1 to N:

Sx = √[Σ (Xi - X̄)² / (N-1)]

Similarly, the standard deviation of the variable Y is the square root of the variance of the data set Yi, with i from 1 to N:

Sy = √[Σ (Yi - Ȳ)² / (N-1)]
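These definitions translate almost line by line into code. The following is a minimal Python sketch, assuming the N-1 (sample) denominator used above; the function names and the small data set at the end are illustrative choices, not part of the original text.

```python
import math

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def sample_covariance(xs, ys):
    """Sxy: sum of products of deviations from the means, divided by N-1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def sample_std(values):
    """Sx: square root of the variance, using the N-1 denominator."""
    if len(values) < 2:
        raise ValueError("need at least 2 items")
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation_coefficient(xs, ys):
    """r = Sxy / (Sx * Sy)."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))

# Made-up data, only to show how the functions are called
xs = [1, 2, 3]
ys = [2, 4, 7]
print(sample_covariance(xs, ys), sample_std(xs), correlation_coefficient(xs, ys))
```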

Illustrative case

In order to show in detail how to calculate the correlation coefficient, we will take the following set of four pairs of data:

(X, Y): {(1, 1); (2, 3); (3, 6); (4, 7)}.

First we calculate the arithmetic mean for X and Y, as follows:

X̄ = (1 + 2 + 3 + 4) / 4 = 2.5

Ȳ = (1 + 3 + 6 + 7) / 4 = 4.25

Then the remaining parameters are calculated:

Covariance Sxy

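Sxy = [(1 - 2.5)(1 - 4.25) + (2 - 2.5)(3 - 4.25) + (3 - 2.5)(6 - 4.25) + (4 - 2.5)(7 - 4.25)] / (4-1)

Sxy = [(-1.5)(-3.25) + (-0.5)(-1.25) + (0.5)(1.75) + (1.5)(2.75)] / 3 = 10.5 / 3 = 3.5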

Standard deviation Sx

Sx = √[((-1.5)² + (-0.5)² + (0.5)² + (1.5)²) / (4-1)] = √(5 / 3) = 1.29

Standard deviation Sy

Sy = √[((-3.25)² + (-1.25)² + (1.75)² + (2.75)²) / (4-1)] = √(22.75 / 3) = 2.75

Correlation coefficient r

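r = 3.5 / (1.291 * 2.754) ≈ 0.98

(The standard deviations are taken here to three decimal places; with the two-decimal values 1.29 and 2.75 the quotient would round to 0.99.)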
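The whole calculation can be reproduced step by step in Python, which is a convenient way to check the intermediate sums. This is only a sketch of the hand calculation above; the variable names are illustrative.

```python
import math

pairs = [(1, 1), (2, 3), (3, 6), (4, 7)]
n = len(pairs)

mean_x = sum(x for x, _ in pairs) / n  # 2.5
mean_y = sum(y for _, y in pairs) / n  # 4.25

# Deviations of each value from its mean
dx = [x - mean_x for x, _ in pairs]    # [-1.5, -0.5, 0.5, 1.5]
dy = [y - mean_y for _, y in pairs]    # [-3.25, -1.25, 1.75, 2.75]

sum_dxdy = sum(a * b for a, b in zip(dx, dy))  # 10.5
sum_dx2 = sum(a * a for a in dx)               # 5.0
sum_dy2 = sum(b * b for b in dy)               # 22.75

sxy = sum_dxdy / (n - 1)            # covariance
sx = math.sqrt(sum_dx2 / (n - 1))   # standard deviation of X
sy = math.sqrt(sum_dy2 / (n - 1))   # standard deviation of Y
r = sxy / (sx * sy)                 # correlation coefficient

print(f"Sxy = {sxy:.2f}, Sx = {sx:.2f}, Sy = {sy:.2f}, r = {r:.2f}")
# prints: Sxy = 3.50, Sx = 1.29, Sy = 2.75, r = 0.98
```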

Interpretation

In the data set of the previous case, a strong linear correlation is observed between the variables X and Y, which is manifested both in the scatter plot (shown in Figure 1) and in the correlation coefficient, which yielded a value quite close to unity.

The closer the correlation coefficient is to +1 or to -1, the more sense it makes to fit the data with a straight line, which is obtained by linear regression.

Linear regression

The linear regression line is obtained by the least squares method, in which the parameters of the line are chosen to minimize the sum of the squared differences between the estimated Y values and the observed Yi of the N data points.

The parameters a and b of the regression line y = a + bx, obtained by the method of least squares, are:

* b = Sxy / (Sx²) for the slope

* a = Ȳ - b X̄ for the intersection of the regression line with the Y axis.

Recall that Sxy is the covariance defined above and Sx² is the variance, that is, the square of the standard deviation defined above. X̄ and Ȳ are the arithmetic means of the data X and Y respectively.
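Applied to the four data pairs of the illustrative case, these formulas give b = 3.5 / (5/3) = 2.1 and a = 4.25 - 2.1 × 2.5 = -1. The Python sketch below repeats that calculation and, as a sanity check on the least squares idea, confirms that slightly changing a or b only increases the sum of squared residuals. The function name and the size of the perturbations are illustrative assumptions.

```python
pairs = [(1, 1), (2, 3), (3, 6), (4, 7)]
n = len(pairs)

mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)  # covariance
sx2 = sum((x - mean_x) ** 2 for x, _ in pairs) / (n - 1)            # variance of X

b = sxy / sx2               # slope
a = mean_y - b * mean_x     # intercept with the Y axis

def sum_squared_residuals(intercept, slope):
    """Sum of squared differences between observed and estimated Y values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in pairs)

best = sum_squared_residuals(a, b)

# Nudging either parameter away from (a, b) should make the fit worse.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
all_worse = all(sum_squared_residuals(a + da, b + db) > best for da, db in nudges)

print(f"y = {a:.2f} + {b:.2f} x, sum of squared residuals = {best:.2f}")
print("every nudged line fits worse:", all_worse)
```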

Example

The correlation coefficient is used to determine whether there is a linear correlation between two variables. It is applicable when the variables to be studied are quantitative and, furthermore, are assumed to follow a normal distribution.

An illustrative example follows. A measure of the degree of obesity is the body mass index, which is obtained by dividing a person's weight in kilograms by the square of that person's height in meters, so it is expressed in kg/m².

Suppose we want to know whether there is a strong correlation between the body mass index and the concentration of HDL cholesterol in the blood, measured in millimoles per liter. For this purpose, a study was carried out with 533 people; it is summarized in the following graph, in which each point represents the data of one person.
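The body mass index definition can be written as a one-line function. The sketch below is only an illustration; the function name and the sample person (70 kg, 1.75 m) are hypothetical and do not come from the study.

```python
def body_mass_index(weight_kg, height_m):
    """BMI in kg/m²: weight divided by the square of the height."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg and 1.75 m tall
print(round(body_mass_index(70, 1.75), 1))
```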

Figure 3. Study of BMI and HDL cholesterol in 533 patients. Source: Aragonese Institute of Health Sciences (IACS).

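Careful observation of the graph shows a weak linear trend between the HDL cholesterol concentration and the body mass index. The quantitative measure of this trend is the correlation coefficient, which in this case turned out to be r = -0.276. The negative sign indicates that HDL cholesterol tends to decrease as the body mass index increases, while the small absolute value indicates that the linear relationship is weak.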

References

González C. General Statistics. Retrieved from: tarwi.lamolina.edu.pe
IACS. Aragonese Institute of Health Sciences. Retrieved from: ics-aragon.com
Salazar C. and Castillo S. (2018). Basic principles of statistics. Retrieved from: dspace.uce.edu.ec
Superprof. Correlation coefficient. Retrieved from: superprof.es
USAC. (2011). Descriptive statistics manual. Retrieved from: statistics.ingenieria.usac.edu.gt
Wikipedia. Pearson's correlation coefficient. Retrieved from: es.wikipedia.com
