How Does Regularization Work?

When a model tries too hard to fit the noise in the data, i.e. the model overfits the data, its accuracy on new data drops by a huge margin. To avoid this, we use regularization.

Let’s first understand why overfitting is bad:

  1. The model learns from noise in the data.
  2. It loses its ability to generalize.

It is a must that you prevent your model from overfitting the data. Overfitting occurs when the model fails to capture the underlying structure of the data. This happens for various reasons:

  1. The criterion used to select the model is not the same as the one used to judge its performance.
  2. Overtraining the model.
  3. Too many non-linear polynomial terms in the model.
  4. No regularization.
  5. High variance in the model (see the figure for an illustration).
The green line represents an overfitted model and the black line represents a regularized model. While the green line best follows the training data, it is too dependent on that data and is likely to have a higher error rate on new, unseen data compared to the black line. (Source: Wikipedia)
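The gap between fitting the training data and generalizing can be seen in a few lines of NumPy. This is a minimal sketch with made-up toy data: a degree-9 polynomial chases the noise in ten points drawn from a line, while a degree-1 fit captures the underlying structure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noisy line y = 2x + noise (invented for illustration).
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.2, size=10)
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2 * x_test + rng.normal(0, 0.2, size=10)

def mse(coeffs, x, y):
    # Mean squared error of a fitted polynomial on (x, y).
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

simple = np.polyfit(x_train, y_train, deg=1)    # captures the trend
complex_ = np.polyfit(x_train, y_train, deg=9)  # interpolates the noise

# The overfit model has (near-zero) training error...
print(mse(complex_, x_train, y_train), mse(simple, x_train, y_train))
# ...but tends to do worse on unseen data.
print(mse(complex_, x_test, y_test), mse(simple, x_test, y_test))
```

The degree-9 fit passes through every training point, yet that is exactly why it tracks the noise rather than the line.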

Before moving ahead in this article, it is recommended that you first understand at least one regression model.

How Does Regularization Avoid Overfitting?

To continue, let’s take the example of our logistic regression model and try to understand what it is trying to do. The loss function of logistic regression is given as follows:

\min \sum_{i=0}^{n} \log\left(1 + e^{-y_{i} W^{T} x_{i}}\right)

Let’s simplify our loss function by substituting z_{i} = y_{i} W^{T} x_{i}. Now our loss function becomes

\min \sum_{i=0}^{n} \log\left(1 + e^{-z_{i}}\right)

So, for our model to be as accurate as possible, we want z_{i} to be large and positive, i.e. the model should predict the class label correctly and confidently. If z_{i} is large and positive, e^{-z_{i}} becomes close to zero, and \log(x) for x \approx 1 is close to 0. This all makes sense, as it is exactly our minimization objective.
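As a quick sanity check, here is a minimal sketch of the per-example loss (the helper name `logistic_loss` is mine) showing how it behaves for a positive versus a negative margin z:

```python
import numpy as np

def logistic_loss(z):
    # Per-example logistic loss log(1 + e^{-z}),
    # where z = y_i * W^T x_i is the signed margin.
    return np.log1p(np.exp(-z))

# Large positive margin (confident, correct prediction) -> loss near 0.
print(logistic_loss(10.0))
# Negative margin (wrong prediction) -> loss grows roughly like |z|.
print(logistic_loss(-10.0))
```

A correctly classified, confident point contributes almost nothing to the sum, while a misclassified point is penalized heavily, which is what drives the model to make z positive.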

But before claiming that this model is correct, let’s examine some cases.

  1. What if some data points are noise, i.e. points that do not follow the underlying structure of the data?
  2. What if some of your data points are corrupted, i.e. someone entered wrong values or mishandled the data?

To avoid the problems mentioned above, it is a must that your model generalizes well rather than focusing on fitting one dataset exactly. That’s where regularization comes into the equation. The idea of regularization is simple: it prevents the weights W from becoming too large or too small. Let’s understand it with the help of the logistic loss function:

\min \left( \sum_{i=0}^{n} \log\left(1 + e^{-z_{i}}\right) + \lambda \lVert W \rVert^{2} \right)

This is formally called L2 regularization. Here \lambda is the L2 regularization hyperparameter, which decides how much to penalize the weights in order to prevent overfitting. Let’s use two cases to understand how it controls the magnitude of the weights.

Case 1: if W \rightarrow +\infty or -\infty, the loss term becomes very small, but the square of W becomes very large. This works against the minimization objective, so the penalty pushes the magnitude of the weights back down.

Case 2: if W \rightarrow 0, i.e. the weights become very small, the loss term becomes large, which is also against the minimization objective.

So there is a tug-of-war that prevents the weights from becoming either too large or too small.
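This tug-of-war can be made concrete on a tiny one-dimensional, linearly separable toy set (data invented for illustration). Without the penalty, the loss keeps improving as w grows, so a grid search runs to the edge of its range; with the \lambda w^2 term, a finite, moderate w wins:

```python
import numpy as np

# Toy 1-D, linearly separable data: labels in {-1, +1}.
x = np.array([1.0, 2.0, -1.5, -0.5])
y = np.array([1.0, 1.0, -1.0, -1.0])

def data_loss(w):
    z = y * (w * x)  # margins y_i * w * x_i, all positive for w > 0
    return np.sum(np.log1p(np.exp(-z)))

lam = 0.1
ws = np.linspace(0.0, 10.0, 1001)
w_unreg = ws[np.argmin([data_loss(w) for w in ws])]
w_reg = ws[np.argmin([data_loss(w) + lam * w ** 2 for w in ws])]

print(w_unreg)  # hits the edge of the grid: larger w always helps the data loss
print(w_reg)    # settles at a moderate weight where the two forces balance
```

On separable data the unregularized loss is minimized only as w \rightarrow \infty (ever more confident predictions), which is exactly the runaway behavior the penalty term stops.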

There is also another version of regularization known as L1 regularization. The main difference between L1 and L2 regularization is that L1 regularization introduces sparsity, which is sometimes desirable and sometimes not. The whole concept of regularization is a kind of hack: by applying it, we prevent the weights from becoming either too large or too small. It is a must that you always check for overfitting in your model before deploying it to production. We will be covering many more such interesting topics in machine learning, so stay tuned.
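The sparsity difference can be illustrated with a single weight — a deliberately simplified one-dimensional stand-in (values a and \lambda chosen for illustration, not the full logistic model): minimize a squared-error pull toward a plus either an L1 or an L2 penalty.

```python
import numpy as np

a, lam = 0.3, 1.0  # pull the weight toward a; penalty strength lam

# Minimize (w - a)^2 + penalty on a fine grid around 0.
ws = np.linspace(-1.0, 1.0, 200001)
w_l1 = ws[np.argmin((ws - a) ** 2 + lam * np.abs(ws))]
w_l2 = ws[np.argmin((ws - a) ** 2 + lam * ws ** 2)]

print(w_l1)  # driven all the way to 0: L1 zeroes the weight out (sparsity)
print(w_l2)  # only shrunk toward 0: closed form gives a / (1 + lam) = 0.15
```

L1’s constant pull of \lambda per unit of |w| overwhelms a weak data signal and sets the weight exactly to zero, effectively performing feature selection; L2 only shrinks every weight proportionally.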

Happy Machine Learning!
