The aim of this article is to provide a comprehensive understanding of some of the popular activation functions in use today.
In the pre-deep-learning era, roughly the 1990s and early 2000s, the two most popular activation functions were sigmoid and tanh.
But since the deep learning era began, somewhere around 2006, ReLU (Rectified Linear Unit) has become the most used activation function in almost any neural network architecture, whether it is a CNN, an RNN, a capsule network, etc.
In this article we will cover the mathematical intuition, advantages, and disadvantages of the following activation functions:
- Sigmoid
- Tanh
- ReLU and some variations of ReLU
If you have seen or used logistic regression, chances are you are already familiar with the sigmoid function, but if not, don't worry. Sigmoid activation is nothing but the sigmoid function itself applied to a linear combination of the inputs. There are many mathematical reasons why we use sigmoid as our non-linear activation function.
Consider a binary classification problem where the decision plane is defined by the weights w, the two classes lie on either side of the plane, and our task is to tell which class a new point belongs to. We can do this easily: with Z = w^T x + b, if Z > 0 the point belongs to the positive class, if Z < 0 it belongs to the negative class, and if Z = 0 the point lies on the decision plane itself. But there is a catch: if a point in the training data is very far away from the actual decision boundary, it can shift the entire decision plane from the right position to a wrong one. So there must be some way to reduce the effect of distance on the decision plane, and the answer is the sigmoid function, sigmoid(Z) = 1 / (1 + e^(-Z)), whose plot is an S-shaped curve.
As the value on the x axis moves away from 0, the function value at first changes almost linearly, but after a certain distance the corresponding y value gets squashed, i.e. it no longer increases linearly but levels off. This property of sigmoid is very helpful: we pass Z through the sigmoid function and it outputs a value between 0 and 1, so a far-away point can no longer contribute an arbitrarily large value. Another reason for using the sigmoid function is that its output can be given a probabilistic interpretation.
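To make the squashing concrete, here is a minimal sketch in plain Python. The weights `w`, bias `b`, the two sample points, and the helper name `linear_score` are made up for illustration and are not from any particular dataset or library.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z is negative here, so exp(z) is at most 1
    return ez / (1.0 + ez)


def linear_score(w, x, b):
    """Z = w^T x + b: the signed, unnormalised distance from the decision plane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Hypothetical weights, bias, and points, chosen only for illustration.
w, b = [1.0, -2.0], 0.5
points = {"near": [1.0, 0.5], "far": [100.0, -50.0]}

for name, x in points.items():
    z = linear_score(w, x, b)
    print(f"{name}: Z = {z:.1f}, sigmoid(Z) = {sigmoid(z):.4f}")
```

The far point has a raw score of Z = 200.5, roughly 400 times the near point's Z = 0.5, yet after the sigmoid both values lie between 0 and 1, so the far point can no longer dominate.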
Advantages:
- Provides a probabilistic interpretation
- The derivative of sigmoid can be written in terms of sigmoid itself: sigmoid'(Z) = sigmoid(Z) * (1 - sigmoid(Z))

Disadvantages:
- Vanishing gradient problem when used in a neural network
- Very hard to train in a neural network with more than 3 or 4 hidden layers
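A rough way to see the vanishing gradient problem: the derivative of sigmoid is never larger than 0.25, and backpropagation multiplies one such factor per layer. The sketch below looks only at the activation's contribution and ignores the weights, which is a simplifying assumption; `best_case_gradient_factor` is a hypothetical helper name.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sigmoid_grad(z: float) -> float:
    """Derivative of sigmoid, expressed through sigmoid itself."""
    s = sigmoid(z)
    return s * (1.0 - s)


def best_case_gradient_factor(num_layers: int) -> float:
    """Product of the largest possible sigmoid derivative (at z = 0) over num_layers layers.

    Weights are deliberately ignored; this is only the activation's share of the gradient.
    """
    return sigmoid_grad(0.0) ** num_layers


for n in (1, 3, 5, 10, 20):
    print(f"{n:2d} layers: factor <= {best_case_gradient_factor(n):.3e}")
```

Even in this best case the factor shrinks by 4x per layer, and for inputs far from 0 the derivative is much smaller still, so the early layers of a deep sigmoid network receive almost no gradient.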
The tanh function, also called the hyperbolic tangent, is used as an activation function in many algorithms. The reason for using tanh is the same: it also squashes its input. But in the case of tanh the output ranges from -1 to 1 and the curve passes through 0 at Z = 0, whereas the output of sigmoid ranges from 0 to 1 and passes through 0.5 at Z = 0.
Advantages:
- The derivative of tanh can be written in terms of tanh itself: tanh'(Z) = 1 - tanh(Z)^2

Disadvantages:
- Same as those of sigmoid
- No probabilistic interpretation
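The comparison between tanh and sigmoid can be checked numerically. This is a small sketch using only Python's `math` module; the sample values of Z are arbitrary.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def tanh_grad(z: float) -> float:
    """Derivative of tanh, expressed through tanh itself: 1 - tanh(z)^2."""
    t = math.tanh(z)
    return 1.0 - t * t


# Arbitrary sample inputs, chosen only for illustration.
for z in (-3.0, 0.0, 3.0):
    print(f"Z = {z:+.1f}: tanh = {math.tanh(z):+.4f}, "
          f"sigmoid = {sigmoid(z):.4f}, tanh' = {tanh_grad(z):.4f}")
```

tanh is a rescaled sigmoid, tanh(Z) = 2 * sigmoid(2Z) - 1, which is why it saturates in the same way and inherits the same vanishing gradient problem.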
There are a couple of disadvantages with sigmoid and tanh, but the main one is the vanishing gradient problem: because of it, for almost 20 years we were unable to train neural networks with more than 3-4 hidden layers. On top of that, evaluating these functions and their derivatives requires computing exponentials, which is not cheap. So to overcome these problems, a fairly straightforward activation function was introduced, called the ReLU activation function.
It works like this: relu(Z) = max(0, Z). The value of ReLU is 0 for all values of Z less than or equal to 0, and Z itself for all positive values, so its plot is flat at 0 on the left and a straight line of slope 1 on the right.
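In code, ReLU and its derivative are each a one-liner. The sketch below is plain Python; returning 0 for the derivative at exactly Z = 0 is a convention assumed here (the true derivative does not exist at that point).

```python
def relu(z: float) -> float:
    """ReLU: 0 for z <= 0, z itself for z > 0."""
    return max(0.0, z)


def relu_grad(z: float) -> float:
    """Derivative of ReLU: 1 for z > 0, 0 for z < 0.

    At z == 0 the derivative is undefined; returning 0 there is an assumed convention.
    """
    return 1.0 if z > 0 else 0.0


# Arbitrary sample inputs, chosen only for illustration.
for z in (-2.0, 0.0, 3.5):
    print(f"Z = {z:+.1f}: relu = {relu(z):.1f}, relu' = {relu_grad(z):.1f}")
```

No exponentials are involved, only a comparison, and for positive inputs the derivative is exactly 1 no matter how large Z gets.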
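Advantages:
- Largely solves the problem of vanishing gradients, since the gradient is 1 for every positive input
- Calculating the derivative of ReLU is extremely easy

Disadvantages:
- The derivative of ReLU at Z = 0 is undefined
- Dead activations: a unit whose input is always negative outputs 0, receives zero gradient, and stops learning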
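A dead activation can be illustrated with a single ReLU unit whose bias has drifted far into the negative range. The weight, bias, inputs, and the helper name `unit_gradients` below are hypothetical, and the upstream gradient is simply taken to be 1.

```python
def relu(z: float) -> float:
    return max(0.0, z)


def relu_grad(z: float) -> float:
    # Assumed convention: derivative taken as 0 at z == 0.
    return 1.0 if z > 0 else 0.0


def unit_gradients(w: float, b: float, inputs, upstream: float = 1.0):
    """For each input x, return (output, dL/dw, dL/db) of one ReLU unit a = relu(w*x + b)."""
    results = []
    for x in inputs:
        z = w * x + b
        local = upstream * relu_grad(z)  # chain rule through the activation
        results.append((relu(z), local * x, local))
    return results


# Hypothetical unit: the large negative bias keeps Z below 0 for every input shown.
for out, dw, db in unit_gradients(w=0.5, b=-10.0, inputs=[0.0, 1.0, 2.0, 5.0]):
    print(f"output = {out}, dL/dw = {dw}, dL/db = {db}")
```

Every output and every gradient is 0, so gradient descent has nothing to update the unit with and it stays stuck.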
To overcome the disadvantages of ReLU, we sometimes use either the softplus function (given in the image above), softplus(Z) = ln(1 + e^Z), or noisy ReLU, and in the case of dead activations we use leaky ReLU. Finding the best activation function is not an easy task, and it is also a very active area of research in machine learning. Apart from the ones covered here, there are many other activation functions. All you have to remember is
“input times weight, add a bias, activate”
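That one-liner maps directly onto code. Below is a minimal sketch of a single neuron in plain Python with a pluggable activation, including the leaky ReLU and softplus variants mentioned above. The slope `alpha=0.01` is a commonly used default assumed here, and the inputs, weights, and bias are made up for illustration.

```python
import math


def relu(z: float) -> float:
    return max(0.0, z)


def leaky_relu(z: float, alpha: float = 0.01) -> float:
    """Like ReLU, but negative inputs keep a small slope alpha (assumed default 0.01)."""
    return z if z > 0 else alpha * z


def softplus(z: float) -> float:
    """Smooth approximation of ReLU: ln(1 + e^z), computed in an overflow-safe form."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def neuron(inputs, weights, bias, activation):
    """Input times weight, add a bias, activate."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)


# Hypothetical inputs, weights, and bias; here Z = 0.5 - 2.0 - 0.5 = -2.0.
x, w, b = [1.0, 2.0], [0.5, -1.0], -0.5

for name, fn in (("relu", relu), ("leaky_relu", leaky_relu), ("softplus", softplus)):
    print(f"{name}: {neuron(x, w, b, fn):.4f}")
```

With a negative Z, plain ReLU outputs exactly 0, while leaky ReLU and softplus both return a small non-zero value, which is what keeps some gradient flowing through the unit.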