Part 2: Comparing Optimizers

This is part 2 of the Optimizers series; for part 1, click here. In this article we continue our discussion of optimizers and cover some of the most advanced ones in use. Without wasting any more time, let’s jump straight into the discussion.

The goal of this article is to cover the mathematical intuition behind the following optimizers:

  1. Adadelta
  2. RMSProp
  3. Adam
  4. Adamax


Adadelta

This algorithm is an extension of Adagrad. Recall from our discussion of Adagrad that its main disadvantage is how it shrinks the learning rate: as G_t (the accumulated sum of squared gradients) grows, the effective learning rate becomes extremely small, and the model stops learning new things. Adadelta aims to solve this problem by replacing the ever-growing sum with an exponentially decaying average. In simple terms, older gradients are weighted less and the most recent gradients are weighted more. Let’s understand this in terms of an equation:

 E[g^{2}]_{t} = \gamma * E[g^{2}]_{t-1} + (1 - \gamma) * g_{t}^{2}

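In the above equation, all we are doing is introducing an exponentially decaying average in place of Adagrad's plain running sum. Here g_{t} is the gradient \nabla J(x,y)_{w_{t}}, and \gamma controls how quickly old gradients are forgotten.
So, our equation for Adadelta becomes,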

  w_{t+1} = w_{t} - \frac{\eta}{\sqrt{E[g^{2}]_{t} + \epsilon}} * \nabla J(x,y)_{w_{t}}

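This change to Adagrad solves the major problem of the diminishing learning rate. (The full Adadelta method goes one step further and also replaces \eta with a decaying average of past parameter updates; that refinement is not covered here.)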
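To see why the decaying average helps, here is a minimal sketch that compares the two accumulators on a made-up stream of gradients. The function name, the constant gradient stream, and the values of eta, gamma and eps are assumptions chosen only for illustration.

```python
import math

def effective_step_sizes(grads, eta=0.01, gamma=0.9, eps=1e-8):
    """Return the multiplier eta / sqrt(accumulator + eps) at every step,
    once for Adagrad's running sum and once for the decaying average."""
    g_sum = 0.0  # Adagrad: sum of all squared gradients seen so far
    g_eda = 0.0  # exponentially decaying average E[g^2]_t
    adagrad_steps, eda_steps = [], []
    for g in grads:
        g_sum += g * g
        g_eda = gamma * g_eda + (1.0 - gamma) * g * g
        adagrad_steps.append(eta / math.sqrt(g_sum + eps))
        eda_steps.append(eta / math.sqrt(g_eda + eps))
    return adagrad_steps, eda_steps

# Hypothetical input: 1000 steps with a constant gradient of 1.0
adagrad_steps, eda_steps = effective_step_sizes([1.0] * 1000)
print("Adagrad first/last:", adagrad_steps[0], adagrad_steps[-1])
print("EDA     first/last:", eda_steps[0], eda_steps[-1])
```

With these inputs the Adagrad multiplier keeps shrinking (from 0.01 down to about 0.0003), while the decaying-average multiplier settles near 0.01 and stays there.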


RMSProp

The idea behind RMSProp and Adadelta is the same: both use an exponentially decaying average of squared gradients. RMSProp was never formally published; it was proposed by Geoffrey Hinton in one of his lectures, where he suggested 0.9 as a good default for \gamma and 0.001 for \eta. Everything else is the same; even the update equation of RMSProp is the same as the Adadelta equation given above.
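As a rough sketch of that update rule in code, the loop below applies it to a toy one-dimensional problem. The objective J(w) = (w - 3)^2, the helper name rmsprop_minimize, the step count and eps are assumptions made for this example; gamma and eta use the defaults mentioned above.

```python
import math

def rmsprop_minimize(grad_fn, w0, eta=0.001, gamma=0.9, eps=1e-8, steps=5000):
    """Minimize a 1-D function given its gradient, using the RMSProp update."""
    w = w0
    avg_sq = 0.0  # E[g^2]_t, starts at zero
    for _ in range(steps):
        g = grad_fn(w)
        avg_sq = gamma * avg_sq + (1.0 - gamma) * g * g
        w -= eta / math.sqrt(avg_sq + eps) * g
    return w

# Hypothetical objective J(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w_final = rmsprop_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)  # ends close to the minimum at w = 3
```

Because the gradient is divided by its own recent magnitude, each step has a size of roughly eta, so the number of steps needed depends mostly on how far the starting point is from the minimum.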

Adam (Adaptive Moment Estimation)

This is one of the most widely used optimizers in deep learning applications as of 2018. The idea behind Adam is that, in addition to storing the EDA (exponentially decaying average) of the squared gradient g_{t}^{2}, we also store the EDA of the gradient g_{t} itself. The ‘Moment’ in the name comes from statistics:

Mean: the first-order moment
Variance: the second-order (central) moment

So the EDA of g_{t} can be roughly thought of as the mean, and the EDA of g_{t}^{2} can be roughly thought of as the variance. This is not an exact explanation, because this variance is uncentered (the mean is not subtracted), and going through the full statistics would take us away from our topic. Coming back to our discussion, we have introduced a mean and a variance, which are used in our update equations:

m_{t} = \beta_{1} * m_{t-1} + (1 - \beta_{1}) * g_{t}  mean at time t

v_{t} = \beta_{2} * v_{t-1} + (1 - \beta_{2}) * g_{t}^{2}  variance at time t
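The two running averages above can be sketched in a few lines of Python. The decay rates 0.9 and 0.999 are the defaults commonly used with Adam and are an assumption here, as are the function name and the constant gradient stream.

```python
def running_moments(grads, beta1=0.9, beta2=0.999):
    """Track the decaying averages of g (mean) and g^2 (uncentered variance)."""
    m, v = 0.0, 0.0  # both averages start at zero
    history = []
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        history.append((m, v))
    return history

# Hypothetical input: five steps with a constant gradient of 2.0
history = running_moments([2.0] * 5)
for t, (m, v) in enumerate(history, start=1):
    print(t, m, v)
```

With a constant gradient of 2, the first values are about m = 0.2 and v = 0.004, far below the true mean of 2 and the true mean square of 4.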

Both averages start at zero, so they are biased toward zero during the first steps. Dividing by 1 - \beta^{t} corrects this bias and gives our moment estimates,

m^{'} = \frac{m_{t}}{1 - \beta_{1}^{t}}  first order moment

v^{'} = \frac{v_{t}}{1 - \beta_{2}^{t}}  second order moment

So, our update equation for Adam is,

   w_{t+1} = w_{t} - \frac{\eta}{\sqrt{v^{'}} + \epsilon} * m^{'}

This algorithm estimates the first and second moments of the gradients and uses them to scale each step of the optimization.
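Putting the pieces together, a minimal sketch of an Adam loop might look like the following. It follows the standard published form of the algorithm, where the bias correction divides by 1 - beta^t. The toy objective J(w) = (w - 3)^2, the helper name adam_minimize, the step count and the hyperparameter defaults are assumptions for illustration.

```python
import math

def adam_minimize(grad_fn, w0, eta=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=10000):
    """Minimize a 1-D function given its gradient, using the Adam update."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g        # decaying average of g
        v = beta2 * v + (1.0 - beta2) * g * g    # decaying average of g^2
        m_hat = m / (1.0 - beta1 ** t)           # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)           # bias-corrected second moment
        w -= eta * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Hypothetical objective J(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)  # ends close to the minimum at w = 3
```

Note that t starts at 1, so the very first step already divides by 1 - beta, which undoes the shrinkage caused by starting m and v at zero.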


If the above equations seem too difficult to code, don’t worry: all major deep learning frameworks, such as TensorFlow, Keras, and PyTorch, ship with implementations of all these optimizers, and you only have to select the one you want. Still, understanding the math behind these optimizers is important, because only then can you harness their full power.
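As one example of how little code this takes, here is a small sketch using PyTorch's torch.optim module. The tiny linear model, the data, and the hyperparameter values are made up for illustration; the example assumes PyTorch is installed.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy problem: fit y = 2x with a single linear layer
model = torch.nn.Linear(1, 1)
x = torch.tensor([[1.0], [2.0], [3.0]])
y = 2.0 * x

# To try another optimizer, swap this line for torch.optim.Adadelta,
# torch.optim.RMSprop or torch.optim.Adamax (each has its own defaults).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)

loss_fn = torch.nn.MSELoss()
initial_loss = loss_fn(model(x), y).item()
for _ in range(200):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(x), y)
    loss.backward()                # compute gradients
    optimizer.step()               # apply the Adam update
final_loss = loss_fn(model(x), y).item()
print(initial_loss, final_loss)
```

The training loop stays the same whichever optimizer is chosen; only the line that constructs the optimizer changes.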



“Happy Machine Learning”


