# Part 1 – Comparing Optimizers

Optimizers are a core part of any machine learning system, since the final goal of such a system is to optimize the loss function, i.e. to minimize it. Many optimizers have been developed over roughly 50 years of research; comparing all of them is beyond the scope of this article, so we will cover some of the most popular ones.

If you are unfamiliar with optimization or want to refresh some of the basic concepts, click here.

The goal of this article is to cover the mathematical intuition behind the following optimizers:

1. Gradient descent (GD)
2. Stochastic gradient descent (SGD)
3. SGD + momentum
4. Nesterov accelerated gradient (NAG)

## Gradient Descent

One of the oldest and most popular optimizers, gradient descent works as follows:

1. Take the whole dataset at once.
2. Update the weights using gradients computed over the entire dataset.

The mathematical intuition behind gradient descent is quite simple:

1. Check the rate of change of the function at a point by calculating its derivative at that point.
2. Multiply that derivative by a learning rate (the size of the jump over the function curve) and subtract the result from the previous value of the weights.
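The two steps above can be sketched in code. This is a minimal illustration on a hypothetical one-dimensional quadratic loss (the function and names are illustrative, not from the article):

```python
import numpy as np

def gradient_descent(grad_fn, w0, lr=0.1, steps=100):
    """Full-batch gradient descent: w <- w - lr * grad(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - lr * grad_fn(w)  # step against the gradient, scaled by the learning rate
    return w

# Example: minimize f(w) = (w - 3)^2, whose derivative is 2 * (w - 3)
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_min)  # converges toward 3.0
```

Each iteration moves the weight a fraction `lr` of the (negative) derivative, which is exactly the "multiply by a learning rate and subtract" step described above.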

Consider the image below for a better understanding.

## Stochastic Gradient Descent (SGD)

This is a variation of gradient descent, and SGD itself has two variants:

1. Point SGD
2. Mini-batch SGD

### Point SGD

The difference between regular gradient descent and Point SGD is this: in GD we compute the gradient over the entire dataset, so each update gives us a very confident gradient, i.e. a reliable direction in which to update our weights. In Point SGD, by contrast, we update our weights at each individual point in the dataset.

### Mini-batch SGD

The difference between mini-batch SGD and Point SGD is simple: in mini-batch SGD we take the data in batches, say the first 100 or 200 points at a time, to update the weights.

This approach has its advantages too:

1. In GD we have to load the whole dataset into memory at once to compute the gradients over it, which often becomes infeasible as the dataset grows larger.

But we cannot ignore the side effects of this approach: since we are not taking the whole dataset at once, the gradient direction we get is not always aimed at the global minimum. As a result, the convergence of SGD is slow compared to GD, and it carries a great amount of noise.
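A mini-batch SGD loop can be sketched as follows. The data, model (a simple linear fit), and hyperparameters here are all hypothetical, chosen only to make the batching logic concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression data: y = 2x + 1 plus a little noise
X = rng.uniform(-1, 1, size=(1000, 1))
y = 2 * X[:, 0] + 1 + 0.01 * rng.normal(size=1000)

w, b, lr, batch = 0.0, 0.0, 0.1, 100
for epoch in range(50):
    idx = rng.permutation(len(X))            # shuffle the data each epoch
    for start in range(0, len(X), batch):
        sel = idx[start:start + batch]       # one mini-batch of 100 points
        err = (w * X[sel, 0] + b) - y[sel]
        # Gradients of the mean squared error over this batch only
        w -= lr * 2 * np.mean(err * X[sel, 0])
        b -= lr * 2 * np.mean(err)

print(round(w, 2), round(b, 2))  # approaches 2.0 and 1.0
```

Each update sees only 100 of the 1000 points, which is exactly why the per-step direction is noisy even though the overall trajectory still heads toward the minimum.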

For a clearer understanding, consider the figure below.

## SGD + Momentum

Before jumping to the algorithm, let's first understand the concept of momentum. Momentum simply brings a (decaying) cumulative sum of all previous gradients into our SGD equation, which directly affects the convergence of our loss function.

This results in massive changes in convergence behaviour. Let's understand why this happens.

In simple Point SGD, suppose the current gradient gives us a direction, say s'. With momentum alone (which is never used in practice) we get a direction m'. Our final direction is then s' + m', i.e. the vector addition of the two, and it gets better and better as the cumulative sum grows. This approach gives us a better idea of where our next gradient will lie, but keep in mind that it is only an approximation.
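The momentum update can be sketched as follows, again on a hypothetical quadratic loss (the function, `mu`, and the step counts are illustrative assumptions):

```python
import numpy as np

def sgd_momentum(grad_fn, w0, lr=0.01, mu=0.9, steps=200):
    """SGD with momentum: a velocity term accumulates past gradients."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = mu * v + grad_fn(w)  # decaying cumulative sum of past gradients (m')
        w = w - lr * v           # step in the combined direction
    return w

# Example: minimize f(w) = (w - 3)^2; its gradient is 2 * (w - 3)
print(sgd_momentum(lambda w: 2 * (w - 3), w0=0.0))  # approaches 3.0
```

The velocity `v` plays the role of the accumulated direction described above: the current gradient is added to a decayed sum of all previous ones before the weight is moved.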

For a geometric understanding, consider the image below.

## Nesterov Accelerated Gradient (NAG)

The concepts behind NAG and SGD + momentum are very similar, but the two algorithms were developed independently. In NAG we first move in the direction of the accumulated (momentum) gradients, and then, after reaching that look-ahead point, we calculate the derivative there and use it to update our weights.

The overall result of NAG and SGD + momentum is similar, but there is a big difference in the geometric interpretation of the two algorithms.
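The look-ahead idea can be sketched like this; only the placement of the gradient evaluation differs from the momentum version (the loss function and hyperparameters are again illustrative):

```python
import numpy as np

def nag(grad_fn, w0, lr=0.01, mu=0.9, steps=200):
    """Nesterov accelerated gradient: evaluate the gradient at the
    look-ahead point the momentum step would reach, then update."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w - lr * mu * v       # peek where momentum alone would take us
        v = mu * v + grad_fn(lookahead)   # correct using the look-ahead gradient
        w = w - lr * v
    return w

# Example: minimize f(w) = (w - 3)^2; its gradient is 2 * (w - 3)
print(nag(lambda w: 2 * (w - 3), w0=0.0))  # approaches 3.0
```

Comparing this with the momentum sketch makes the geometric difference concrete: momentum evaluates the gradient before the momentum step, NAG evaluates it after.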

For a complete understanding, consider the image below.

If you observe all the above equations carefully, our learning rate is constant, not adaptive. Let's first understand why we need an adaptive learning rate; consider the image below.

Here we keep updating the weights without changing the learning rate. As a result, we get closer to the global minimum but never quite land on it: because the learning rate is not adaptive, we keep jumping back and forth across the minimum. This problem must be addressed, because while the function above is simple enough to visualize, real-life machine learning applications involve highly complex functions that cannot be visualized. Now let's continue our discussion.
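The to-and-fro behaviour is easy to reproduce numerically. In this small illustration (the loss and the deliberately large fixed learning rate are assumptions for demonstration), gradient descent on f(w) = w² overshoots the minimum at 0 on every step:

```python
# With a fixed, too-large learning rate the iterate overshoots the
# minimum of f(w) = w^2 at w = 0 and oscillates around it.
w, lr = 1.0, 0.95
for _ in range(5):
    w = w - lr * 2 * w  # gradient of w^2 is 2w; each step flips the sign
    print(w)            # sign alternates: roughly -0.9, 0.81, -0.729, ...
```

The magnitude shrinks only slowly while the sign flips every step, which is precisely the "jumping to and fro off the global minimum" described above; shrinking the learning rate as training progresses would let the iterate settle.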