Understanding the correlation between variables is an essential part of understanding your data before applying any machine learning algorithm. Some algorithms, such as Naive Bayes, make assumptions about how the variables depend on one another, so if you don't understand that interdependence you may not be able to apply the algorithm correctly. The goal of this article is to introduce some techniques you can use to understand the underlying correlation between the variables in your data.
Why understand correlation?
Correlation means the interdependence between the variables of a dataset. There are three main types of interdependence:
- Positive interdependence: both variables move in the same direction, so when one grows the other grows too.
- Negative interdependence: when one value goes up the other goes down, and vice versa.
- Neutral interdependence: there is no dependence between the variables.
So, if you understand the dependence in your data, you will be able to apply an algorithm more accurately. Consider Naive Bayes: the algorithm assumes the features are conditionally independent given the class, so if you apply it to highly interdependent data it may not give the best results.
How to study correlation?
There are many statistical techniques for studying correlation. In this article we will cover three techniques with their Python implementation:
- Covariance
- Pearson's correlation
- Spearman rank correlation
Covariance
This is an extension of the concept of variance. You can think of covariance as a measure of how much two variables co-vary, that is, how the values of one change along with the values of the other.
Calculating covariance is very simple:-
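As a minimal sketch of that calculation, assuming NumPy is available, the snippet below computes the covariance with `np.cov` and checks it against the definition written out by hand. The data is made up for illustration (`y` is simply `x` plus noise), and the names `x`, `y`, `cov_xy` and `cov_manual` are not from the original article.

```python
import numpy as np

# Illustrative data (an assumption of this sketch): y grows with x, plus noise.
rng = np.random.default_rng(seed=0)
x = rng.normal(loc=50.0, scale=10.0, size=1000)
y = x + rng.normal(loc=0.0, scale=5.0, size=1000)

# np.cov returns the 2x2 covariance matrix:
# [[var(x),    cov(x, y)],
#  [cov(y, x), var(y)   ]]
cov_matrix = np.cov(x, y)
cov_xy = cov_matrix[0, 1]

# The same quantity by hand, using the sample (n - 1) denominator
# that np.cov applies by default.
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

print(cov_matrix)
print(cov_xy, cov_manual)
```

The off-diagonal entries hold the covariance between the two variables, and a positive value here reflects that `y` tends to increase with `x`.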
Although covariance tells us about interdependence, it has a couple of disadvantages:
- It has no finite range of values, and its magnitude depends on the units of the variables, so there is no straightforward way to judge the degree of interdependence.
- The output is a matrix, so interpreting it is not very beginner-friendly.
Pearson's correlation
To overcome these problems we use Pearson's correlation coefficient. It is a measure of the linear relationship between two variables, and its value always lies between -1 and 1. This property makes the Pearson coefficient much easier to interpret; for a complete understanding, refer to the figure below.
Pearson coefficient = covariance(X, Y) / (stdv(X) * stdv(Y))
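The formula above can be sketched in Python as follows, assuming NumPy. The data is again synthetic and only for illustration, and the variable names are my own. The hand-written version is compared with `np.corrcoef`, which returns the matrix of Pearson coefficients.

```python
import numpy as np

# Illustrative data (an assumption of this sketch): a noisy linear relationship.
rng = np.random.default_rng(seed=1)
x = rng.normal(loc=50.0, scale=10.0, size=1000)
y = x + rng.normal(loc=0.0, scale=5.0, size=1000)

# Formula from the text: covariance(X, Y) / (stdv(X) * stdv(Y)).
# ddof=1 keeps the standard deviations consistent with np.cov's default.
cov_xy = np.cov(x, y)[0, 1]
pearson_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# np.corrcoef returns the 2x2 matrix of Pearson coefficients.
pearson_numpy = np.corrcoef(x, y)[0, 1]

print(pearson_manual, pearson_numpy)
```

Both values should agree, and because the result is unit-free it can be compared across pairs of variables, which raw covariance does not allow. If SciPy is installed, `scipy.stats.pearsonr` returns the same coefficient along with a p-value.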
Spearman rank Coefficient
Useful as Pearson's coefficient is, it has a couple of disadvantages:
- Pearson's coefficient only measures linear relationships, so it can understate a relationship that is monotonic but nonlinear.
- It fails to give a useful measure of correlation for complex, nonlinear functions like some of those given below.
To overcome these problems we use the Spearman rank correlation coefficient. Its big advantage is that it considers the monotonic relationship between the variables (through their ranks) rather than their linear behavior. A Spearman correlation of 1 results when the two variables increase together in a perfectly monotonic way, even if their relationship is not linear; a value of -1 results when one decreases perfectly monotonically as the other increases.
Spearman rank coefficient = covariance(rank(X), rank(Y)) / (stdv(rank(X)) * stdv(rank(Y)))
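A rough sketch of this idea, assuming NumPy: rank both variables, then apply the Pearson formula to the ranks. The helper names `rank`, `pearson` and `spearman` are hypothetical, the data is synthetic, and this simple ranking assumes there are no tied values. In practice `scipy.stats.spearmanr` handles ties by assigning average ranks.

```python
import numpy as np

def rank(values):
    """Return 1-based ranks. Assumes no tied values (a simplification)."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

def pearson(a, b):
    """Pearson coefficient: covariance(a, b) / (stdv(a) * stdv(b))."""
    return np.cov(a, b)[0, 1] / (np.std(a, ddof=1) * np.std(b, ddof=1))

def spearman(a, b):
    """Spearman coefficient: the Pearson coefficient of the ranks."""
    return pearson(rank(a), rank(b))

# Illustrative data (an assumption of this sketch): y is a monotonic
# but strongly nonlinear function of x.
rng = np.random.default_rng(seed=2)
x = rng.uniform(0.0, 10.0, size=500)
y = np.exp(x)

print("Pearson: ", pearson(x, y))
print("Spearman:", spearman(x, y))
```

Because `y` always increases with `x`, the ranks match exactly and the Spearman coefficient comes out at 1, while the Pearson coefficient is noticeably lower because the relationship is far from a straight line.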
Understanding the data before applying any machine learning algorithm is very important, both for good model performance and for good data preprocessing. The relationships between variables play an important role in how a model performs, because some algorithms make assumptions about those relationships.