Weight initialization is one of the major factors that directly affects the convergence of a model, i.e. how well the optimizer minimizes the loss function. Before we get started, I am assuming the reader knows what neural networks are, what hidden layers are, how they are connected, and so on. Still, I will provide an abstract overview of how a neural network works. We will only be considering the MLP architecture.
The basic workflow of a neural network is:
- Initialize all the weights and biases of the network
- Forward pass
- Compute the loss
- Calculate the gradients of the loss function
- Backpropagate the gradients
- Update the weights
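The steps above can be sketched end to end with plain NumPy. Below is a minimal illustration on the XOR toy problem; the layer sizes, learning rate, and MSE loss are arbitrary choices made for the sketch, not recommendations:

```python
import numpy as np

# Toy data: XOR, a classic non-linearly-separable problem.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

rng = np.random.default_rng(0)

# Step 1: initialize weights and biases (small random values here).
W1 = rng.normal(0.0, 0.5, (2, 4)); b1 = np.zeros((1, 4))
W2 = rng.normal(0.0, 0.5, (4, 1)); b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
losses = []
for _ in range(2000):
    # Step 2: forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Step 3: compute the loss (MSE here, for simplicity).
    losses.append(np.mean((out - y) ** 2))
    # Steps 4-5: gradients of the loss, backpropagated layer by layer.
    d_out = 2.0 * (out - y) / len(X) * out * (1.0 - out)
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0, keepdims=True)
    d_h = (d_out @ W2.T) * h * (1.0 - h)
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0, keepdims=True)
    # Step 6: update the weights.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```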
Before we discuss techniques for initializing the weights, let's first look at what happens if we don't initialize them properly.
- Initializing all weights to zero: This is one of the worst mistakes you can make. With every weight at zero, all neurons in a layer compute the same output and receive the same gradient, so they can never learn distinct features; effectively, only the biases influence the decision boundary, and the network behaves like a linear model rather than a non-linear one.
- Very small weights with a sigmoid activation: This is one cause of the vanishing gradients problem, i.e. the weights are so small that the gradients become almost negligible, making it difficult for the network to learn anything and eventually stalling the minimization of the loss function.
- Exploding gradients problem: Here the gradients become so large that the model never converges to the global minimum; it keeps oscillating around it, or sometimes the updates even point away from the minimum, making it impossible to reach.
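Both failure modes can be seen in a rough NumPy sketch (the function name `grad_scale` and all sizes here are illustrative, not from any library): in a deep linear chain, the backpropagated gradient picks up one weight-matrix factor per layer, so its norm scales roughly like (sigma * sqrt(width)) ** depth and either collapses toward zero or blows up.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_scale(sigma, depth=50, width=256):
    """Norm of a gradient backpropagated through `depth` linear layers
    whose weights are drawn from N(0, sigma)."""
    g = rng.normal(0.0, 1.0, (1, width))
    for _ in range(depth):
        W = rng.normal(0.0, sigma, (width, width))
        # Backprop through a linear layer multiplies the gradient by W^T.
        g = g @ W.T
    return np.linalg.norm(g)

print(grad_scale(0.01))  # vanishing: the norm collapses toward 0
print(grad_scale(0.1))   # exploding: the norm blows up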
Understanding the need for smart initialization is important: once you do, you will know that careless weight initialization can cause serious problems.
Techniques of weight initialization
Random initialization from a normal distribution:
This one is most commonly followed by beginners: generating random numbers from a normal distribution, typically with mean 0 and standard deviation 0.2, works well in most cases. In fact, you will usually not face any problems, except that with this technique some networks may converge slowly. The weights can be initialized quite easily using just NumPy:
weights = sigma * np.random.randn(Size_L, Size_L_prev) + mean
The downside of this technique is that if the values of mean and sigma are not chosen wisely, it can sometimes cause the vanishing or exploding gradients problem.
Initializing weights with the ReLU activation in mind:
Choosing the initialization with your activation function in mind is a good start, since how well a non-linear activation behaves is greatly influenced by the weights feeding into it.
For ReLU you can use weights drawn from a normal distribution with mean 0 and standard deviation sqrt(2 / size of the previous hidden layer), known as He initialization.
sigma = np.sqrt(2 / Size_L_prev)
weights = sigma * np.random.randn(Size_L, Size_L_prev)
If using tanh activation:
For tanh you can use Xavier initialization, which is similar to the above except that the 2 in the numerator is replaced by 1.
sigma = np.sqrt(1 / Size_L_prev)
weights = sigma * np.random.randn(Size_L, Size_L_prev)
Another initialization, a modification of Xavier that uses both the previous and current layer sizes, is:
sigma = np.sqrt(2 / (Size_L_prev + Size_L))
weights = sigma * np.random.randn(Size_L, Size_L_prev)
These initializations keep the weights in a range that is neither very small nor very large: small enough to avoid exploding gradients, yet large enough to avoid the slow convergence that comes with vanishing gradients.
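As a rough sanity check of that claim, the sketch below (the function name and sizes are illustrative) pushes a batch through a stack of ReLU layers and compares the activation scale under a fixed small sigma versus the ReLU-aware sqrt(2 / fan_in) rule described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_activation_std(sigma_fn, depth=20, width=512):
    """Std of activations after `depth` ReLU layers whose weights are
    drawn from N(0, sigma_fn(fan_in))."""
    x = rng.normal(0.0, 1.0, (64, width))
    for _ in range(depth):
        W = rng.normal(0.0, sigma_fn(width), (width, width))
        x = np.maximum(0.0, x @ W)  # ReLU
    return x.std()

naive = relu_activation_std(lambda fan_in: 0.01)                 # fixed small sigma
he = relu_activation_std(lambda fan_in: np.sqrt(2.0 / fan_in))   # He initialization

print(naive, he)  # naive collapses toward 0; He stays near the input scale
```

With the fixed small sigma, the activation scale shrinks by a constant factor per layer and is essentially zero after 20 layers, while the He rule keeps it of order one: exactly the "neither very small nor very large" behavior described above.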