Optimizers are core part of any machine learning system as the final goal of any machine learning system is to optimize the loss function i.e to minimize the loss function . There are many Optimizers which are developed during this entire 50 yrs time of research , Comparing all the optimizers is beyond the scope of this article , so we will cover some of the most popular optimizers.

if You are unfamiliar with optimization or want to refresh some of the basic concepts click here

The goal of this article to cover the mathematical intuition of following Optimizers:

- Gradient Descent
- Stochastic gradient descent(SGD)
- SGD + momentum
- Nesterov accelerated gradient(Nag)
- Adagrad

Gradient Descent

One the oldest and most popular optimizer, the working of gradient descent is as follows ,

- Take the whole dataset at once
- Update the gradients over the entire dataset

The mathematical intuition behind gradient descent is quite simple ,

- Check the rate of change of the function at a point by calculating its derivative at that point
- update by multiplying the change with a learning rate (the jump over the function curve) and then by subtracting it by its previous value .

Consider the image below for better understanding.

Stochastic Gradient descent

This is a variation of Gradient Desent and Stochastic itself has two variants:

- Point SGD.
- Mini batch SGD.

**Point SGD**

The difference between regular Gradient descent and Point SGD is ,in GD we update the gradient over our entire dataset so the update gives us a very confident gradient i.e the direction in which we update our weights but in case of point SGD we update our weights at each point in our dataset.

**Mini batch SGD**

The difference between mini batch sgd and point sgd is simple , in mini batch Sgd we take data in form of batches , like first 100 or 200 to update the weights.

This has its advantage too,

- In GD we have to put all our dataset at once in memory to calculate the gradients over it ,this most often gets unfeasible as our dataset size grows larger.

But we cannot ignore the side effects of this approach, as if we are not taking our whole dataset at once this means the direction of the gradient which we get is not always directed towards global minima ,as a result the convergence of SGD is very slow when compared to GD and also the convergence has great amount noise.

For more clear understanding consider the figure below

SGD + Momentum

Before jumping to algorithm lets first understand the concept of Momentum . In Momentum simply bring the concept of cumulative sum of all previous gradients in our SGD Equation which directly affects our convergence of Our Loss function .

This Results in massive changes when it comes to convergence.Let’s understand why this happens

In Simple Point SGD Let’s suppose we get the direction Say s’ and With only momentum(Which is never used in practical sanerios) we get Direction m’ , so our final direction will be s’ + v’ i.e a direction achieved by vector addition of both the vectors which will gets better and better as the cumulative sum gets bigger ,this approach gives us better idea of where our next gradient will lie but keep it in mind this is only a calculative approximation.

For Geometric understanding consider the image below,

Nesterov accelerated gradient(

The concept behind NAG and SGD+momentum is very similar but both these algorithms were developed independently, In this first we calculate derivative at a point and to get a magnitude and direction and then after reaching that point we update our weights using same cumulative sum of gradients.

There overall result of NAG and SGD + momentum is same but there is huge difference in geometric interpretation of both the algorithm.

For Complete understanding consider the image below.

If you observe carefully all the above equations , our learning rate is constant and not adaptive .Let’s first understand why we adaptive Learning rate ,Consider the image below

In this we are updating the weights but not changing the learning rate as a result we come closer to global minima but cannot touch the global minima because due to non adaptive learning rate we are constantly jumping to and fro off the global minima but not landing on it. This problem must be addressed as the function mentioned above is very simple and can be visualized but in real life machine learning application we have highly complex function which cannot be visualized.Now Lets continue our discussion.

Adagrad

This is a adaptive rate learning rate algorithm . It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. This makes it very useful for sparse data. Adagrad greatly imporves the stability of SGD and it is well suited for training large neural networks with tons of weights to be updated .The update equation of Adagrad is given below,

Here **Gt **sum of square of all previous gradients and epsilon is a very small positive value to prevent division by zero exception , and as the gradient grows larger for that particular weight the overall result of division becomes small which result in small learning rate for that particular weight.The main disadvantage of Adagrad is its sum of the squared gradients in the denominator: Since every added term is positive, the sum keeps growing larger during training. Thismakes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge.

The Above mentioned algorithms are arrange in sequence of their development.In the part 2 of Optimizers we will cover some very advance and new optimization technique so stay tuned.