
Data Engineering and MLOps specialist: Streamlining EDW & Data Pipelines for ML & AI products.

What is Correlation Coefficient

Correlation means a mutual relationship or connection between two or more things (variables).  The correlation coefficient is a numeric measure to quantify this relationship. The coefficient describes two aspects of a relationship.

  1. Strength of the relationship
  2. Direction of the relationship.

Correlation is used to quantify how two variables are related, which helps when predicting one from the other. It is also used for concurrent validity (the correlation between a new measure and an established measure). Another use of correlation is in reliability testing, e.g. test-retest reliability (whether a measure gives consistent results).

Strength of relationship

For the strength of the relationship, the magnitude of the coefficient ranges from 0 to 1. A perfectly strong relationship between variables will be 1, whereas no relationship between two variables will be 0. Values in the vicinity of 0.5 (the middle of 0 and 1) will represent a medium correlation.

Perfect Correlation

A perfect correlation has a value of 1. The scatter plot below shows a perfect correlation: as one variable increases, the other variable increases as well.

 

Strong & Weak Correlation

A higher correlation value means a stronger relationship between the two variables under study. There is no rule for determining what’s considered strong, medium or weak. The interpretation of the coefficient depends on the subject of study.

In one area of study, 0.4 and above could be considered relatively strong, whereas in another subject 0.4 can be considered relatively weak, and only values of 0.75 and up are considered a strong correlation.
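Because the cutoffs depend on the field, one way to make that explicit is to pass them in as parameters. The sketch below is only an illustration: the function name `describe_strength` is hypothetical, and the default cutoffs of 0.4 and 0.75 are simply the example figures from the paragraph above, not a standard.

```python
def describe_strength(r, medium=0.4, strong=0.75):
    """Label the strength of r using caller-supplied cutoffs.

    The default cutoffs are illustrative assumptions, not a standard.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    size = abs(r)  # strength ignores the sign of the coefficient
    if size >= strong:
        return "strong"
    if size >= medium:
        return "medium"
    return "weak"


# The same coefficient gets a different label under different cutoffs.
label_default = describe_strength(0.5)
label_lenient = describe_strength(0.5, medium=0.2, strong=0.4)
print(label_default, label_lenient)
```

The point of the design is that the label is a convention layered on top of the number, so the number itself should always be reported alongside it.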

The scatter plot below has a correlation value of 0.944. Its direction is positive. It represents a very strong correlation, indicating that as X increases, Y strongly increases with it.

 

The scatter plot below has a correlation value of 0.0478. Its direction is positive; however, it’s a weak correlation.

 

Direction of relationship

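The direction of the relationship is given by the sign of the coefficient. A positive value means the variables move together: as one increases, so does the other. A negative value indicates an inverse relationship between the variables: an increase in one variable is associated with a decrease in the other variable. An example of negative correlation is Boyle’s law: at constant temperature, as the volume of a container increases, the pressure of the gas decreases.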

The scatter plot below shows a perfect negative correlation of -1: as one variable increases, the other variable decreases with it.

 

Methods of calculating Correlation Coefficient

There are multiple methods for calculating a correlation coefficient, e.g.

  1. Pearson’s method
  2. Kendall’s method
  3. Spearman’s method

For this article, I will focus on Pearson’s correlation coefficient.

Pearson’s Correlation Coefficient

Let’s begin by looking at the formula for Pearson’s correlation coefficient:

 

r = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})} {\sqrt{\sum_{i=1}^{n} (x_i - \overline{x})^2 \sum_{i=1}^{n} (y_i - \overline{y})^2}}

Where,

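x_i is an individual value of x, and y_i is the y value paired with it

\overline{x} is the mean of all x values, and \overline{y} is the mean of all y values

n is the number of (x, y) pairs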
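As a minimal sketch, Pearson's coefficient can be written in plain Python term by term: the sum of products of paired deviations, divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the sample data (`hours`, `scores`) are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences of numbers."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two paired observations")

    x_bar = sum(xs) / n  # mean of all x values
    y_bar = sum(ys) / n  # mean of all y values

    # Numerator: sum of products of paired deviations from the means.
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    # Denominator: square root of (sum of squared x deviations
    # times sum of squared y deviations).
    sum_sq_x = sum((x - x_bar) ** 2 for x in xs)
    sum_sq_y = sum((y - y_bar) ** 2 for y in ys)
    denominator = sqrt(sum_sq_x * sum_sq_y)
    if denominator == 0:
        raise ValueError("r is undefined when either variable is constant")

    return numerator / denominator


# Made-up illustrative data.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
r = pearson_r(hours, scores)
print(f"r = {r:.4f}")
```

For this data the numerator is 6 and the two sums of squares are 10 and 6, so r = 6 / sqrt(60), roughly 0.77. The explicit check for a zero denominator is a design choice: when one variable never changes, there is nothing to correlate, so raising an error seems safer than returning a number.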

Numerator in Pearson’s equation

\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})

All we are doing here is finding the deviation of each observed x and y value from its respective mean. For example, (x_i - \overline{x}) finds the difference of each x value from its mean. We do the same for each y_i, multiply the two deviations for each pair, and sum the resulting products. In more statistical language, we are essentially centering the data around the mean.
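A short worked sketch may make the numerator more concrete. It uses a made-up five-point data set and prints each pair's deviations and their product before summing them.

```python
# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar = sum(xs) / len(xs)  # 3.0
y_bar = sum(ys) / len(ys)  # 4.0

products = []
for x, y in zip(xs, ys):
    dx = x - x_bar  # deviation of this x from the mean of x
    dy = y - y_bar  # deviation of the paired y from the mean of y
    products.append(dx * dy)
    print(f"x={x}  y={y}  dx={dx:+.1f}  dy={dy:+.1f}  dx*dy={dx * dy:+.1f}")

numerator = sum(products)
print("numerator =", numerator)
```

A product is positive when x and y sit on the same side of their means and negative when they sit on opposite sides. That is why the sign of the sum, here 6, gives the direction of the relationship.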

Denominator in Pearson’s equation

\sqrt{\sum_{i=1}^{n} (x_i - \overline{x})^2 \sum_{i=1}^{n} (y_i - \overline{y})^2}

\sum_{i=1}^{n} (x_i - \overline{x})^2 is the sum of squared deviations for x. It measures the squared distance between each individual x value and the mean of all x values. Similarly