Cross-validation is a technique most commonly used to choose the hyper-parameters of supervised machine learning algorithms.
It is generally used both to estimate good hyper-parameter values and to test a model before deploying it to production. Before jumping into the discussion of cross-validation, here is a simple definition of hyper-parameters.
Hyper-parameter – a parameter of the model that is set by the developer and is not learned during the “optimization” of the algorithm's parameters, e.g. the value of K in KNN.
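To make the definition concrete, here is a minimal sketch using sklearn's KNeighborsClassifier. The toy data is made up for illustration; the point is only that K (the n_neighbors argument) is chosen by us before training and is not changed by fitting.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data, made up for illustration: one feature, two classes.
X = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
y = [0, 0, 0, 1, 1, 1]

# n_neighbors is the hyper-parameter K: we pick it before training.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# Fitting the model does not alter K; it is still the value we set.
print("K after training:", model.n_neighbors)

# The three nearest training points to 1.5 all belong to class 0.
prediction = model.predict([[1.5]])
print("predicted class:", prediction[0])
```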
After completing this article, you will know:
- What k-fold cross-validation is
- How to select the value of k
Before explaining k-fold cross-validation, I would like to spend some time on simple cross-validation. Simple cross-validation splits the whole dataset into a train set, a cross-validation set and a test set. The reason for this is simple:
First we train the model on the training set, then validate its performance and its hyper-parameter selection on the cross-validation set. If its performance there is good, we test it on the test set. The main difference between the cross-validation set and the test set is that we use the ‘y’ labels of the cross-validation set to tweak the model when its performance is not satisfactory, whereas the ‘y’ labels of the test set are never used for tweaking. They are only used to calculate the test error, i.e. how well the model is likely to perform in real situations.
A commonly used split ratio is:
- First we split the whole data into 30% for test and 70% for a further split.
- The 70% is then split again into 70% for train and 30% for cross-validation.
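The two-step split above can be sketched with sklearn's train_test_split. This is only an illustration under assumptions that are not part of the article: the iris dataset, a KNN model, and the candidate values of K are all picked just to have something to run.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris is used only as a small example dataset.
X, y = load_iris(return_X_y=True)

# Step 1: hold out 30% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Step 2: split the remaining 70% into 70% train and 30% cross-validate.
X_train, X_cv, y_train, y_cv = train_test_split(
    X_rest, y_rest, test_size=0.3, random_state=0, stratify=y_rest)

# Use the cross-validation labels to pick the hyper-parameter K.
best_k, best_score = None, -1.0
for k in (1, 3, 5, 7, 9):  # hypothetical candidate values
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = model.score(X_cv, y_cv)
    if score > best_score:
        best_k, best_score = k, score

# The test labels are used only once, to measure the final error.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print("chosen K:", best_k, "test accuracy:", test_score)
```

Note that the test set plays no part in choosing K; it is touched only after the choice is made.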
The simplicity of this technique comes with a disadvantage: we cannot use the whole of the first 70% split for training, because part of it is needed as the cross-validation set, so only about 49% of the original data (70% of 70%) is used for training. This is not a big problem when we have a lot of data, but large datasets are not always available, and having a small dataset that cannot even be fully used for training is an important problem that must be addressed.
The solution to the above problem is K-fold cross-validation, which besides solving this problem also has advantages of its own.
The rest of the tutorial is organized as follows:
- What is k-fold cross validation
- How to determine K
- Sklearn implementation of K-fold.
What is K-fold Cross Validation
The idea of K-fold cross-validation is simple: first divide the whole dataset, excluding the test set, into K groups, with K being, say, 10 or 11. Once a value of K is chosen, say 11, the procedure is called 11-fold cross-validation.
Algorithm for K-fold cross-validation:
- Randomly shuffle the data and divide it into ‘K’ groups.
- Repeat the following steps for each group:
  - Select that group as the CV set and the rest as training data.
  - Train the model on the training data and use the CV set to test its performance.
  - Record its score and hyper-parameters, then discard the model.
- Summarize the results across all the models, e.g. by taking the mean of their scores.
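The steps above can be written out by hand in a few lines. The following is a rough sketch, not a replacement for a library routine; the helper name k_fold_scores, the iris dataset and the KNN model are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier


def k_fold_scores(X, y, k, n_neighbors, seed=0):
    """Hypothetical helper: return one CV accuracy per fold for a KNN model."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))    # randomly shuffle the data
    folds = np.array_split(indices, k)   # divide it into K groups
    scores = []
    for i in range(k):
        # Select one group as CV and the rest as training data.
        cv_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Train on the training data, then score on the CV group.
        model = KNeighborsClassifier(n_neighbors=n_neighbors)
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[cv_idx], y[cv_idx]))
        # Only the score is kept; the model is replaced on the next pass.
    return scores


X, y = load_iris(return_X_y=True)  # assumption: example dataset
scores = k_fold_scores(X, y, k=5, n_neighbors=3)

# Summarize all the models by the mean of their scores.
mean_score = float(np.mean(scores))
print("fold scores:", scores)
print("mean score:", mean_score)
```

To compare hyper-parameter values, one would call the helper once per candidate value and keep the one with the best mean score.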
This is quite a good hack, as every group gets a turn as the CV set and every sample is used for training in K-1 of the K rounds, rather than a fixed part of the data being permanently set aside for validation.
How to choose the value of K
The value of K must be chosen carefully, as a poorly chosen value can misrepresent the performance of the model, giving a performance estimate with high variance or high bias. Some common tactics for choosing the value of K are:
- Choose according to size: choose a value of K that results in groups large enough to represent the patterns in the whole data correctly.
- Take K=10: K=10 has been found in practice to be a good default value of K.
- Take K=n: choosing K=n, where n is the dataset size, gives every sample a turn as the held-out sample in model evaluation (this is known as leave-one-out cross-validation). Although this approach sometimes performs exceptionally well, it requires training n models and is still not widely preferred.
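To get a feel for these options, the sketch below runs the same model with K=5, K=10 and K=n using sklearn's cross_val_score, KFold and LeaveOneOut. The dataset (iris) and the model (KNN with 3 neighbours) are assumptions chosen only for illustration, and the scores it prints say nothing general about which K is best.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)            # assumption: example dataset
model = KNeighborsClassifier(n_neighbors=3)  # assumption: example model

results = {}
for k in (5, 10):
    # shuffle=True because the samples in iris are ordered by class.
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    results[f"K={k}"] = cross_val_score(model, X, y, cv=cv)

# K=n: each sample is held out exactly once, so n models are trained.
results["K=n"] = cross_val_score(model, X, y, cv=LeaveOneOut())

for name, scores in results.items():
    print(name, "| models trained:", len(scores),
          "| mean accuracy:", round(float(scores.mean()), 3))
```

The number of models trained grows with K, which is the practical cost of choosing K=n on anything but a small dataset.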
Implementation of K-fold using Sklearn
Implementing K-fold using sklearn is quite simple and takes only three lines of code:
Line 1: importing the KFold class from sklearn's model_selection module.
Line 2: creating a KFold object with K = 2; the object is referred to by the variable kfold.
Line 3: iterating over the groups and printing the splits for every group.
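A minimal sketch of the three lines described above could look like this. The array named data and its values are made up for the example, and the numpy import is there only to build that array.

```python
import numpy as np
from sklearn.model_selection import KFold  # Line 1: import KFold

# Made-up sample data, assumed only for this illustration.
data = np.array([10, 20, 30, 40, 50, 60])

kfold = KFold(n_splits=2)  # Line 2: KFold object with K = 2

# Line 3: iterate over the groups and print the split for each one.
splits = list(kfold.split(data))
for train_idx, cv_idx in splits:
    print("train:", data[train_idx], "cv:", data[cv_idx])
```

By default KFold does not shuffle the data; passing shuffle=True (optionally with random_state) randomizes the groups.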
K-fold cross-validation is a very effective method that can be used when there is little data available to train the model. We will be covering many such topics in future articles, so stay connected.