The pursuit of perfection in the business world is never-ending; if anything, it is becoming more aggressive. Reaching an optimal level of perfection in the workflow is possible with a well-structured system, and this well-organized company workflow is known as lean development.
Google Search Engine Is Hackers' Next Favorite Tool
Well, there isn't any doubt that the Google Search Engine is the king of all the other search engines available out there. There are numerous search engines available these days, but 30 million people daily trust Google to search for information. That's because Google offers dynamic search results plus the best web-crawling service.
This week I was running some simple experiments with MLPs, trying out different architectures at different learning rates, sometimes with batch normalization and sometimes with dropout. There was no definite goal for this experiment other than to observe the behaviour of neural networks when we change the learning rate on different architectures.
The goal of this article is to share some simple findings from my experiments. There is no particular sequence to these tests; they are completely based on intuition. I have only used Keras, to make the implementation much easier.
This was the first architecture, so I decided not to use any batch normalization or dropout. The architecture was quite simple; let's look at the implementation,
import keras
from keras.datasets import mnist
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(x_train.shape[0], 784)
x_test = x_test.reshape(x_test.shape[0], 784)
y_train = keras.utils.to_categorical(y_train, num_classes=10)
y_test = keras.utils.to_categorical(y_test, num_classes=10)

n_hidden_1 = 10
n_hidden_2 = 7
n_hidden_3 = 3

model = Sequential()
model.add(Dense(784, input_shape=(784,), activation='relu'))
model.add(Dense(n_hidden_1, activation='relu'))
model.add(Dense(n_hidden_2, activation='relu'))
model.add(Dense(n_hidden_3, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(lr=0.01),
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=4)
score = model.evaluate(x_test, y_test)
This is the beauty of Keras: the code is self-explanatory. The output of this code,
Epoch 1/10 60000/60000 [==============================] - 5s 87us/step - loss: 2.3320 - acc: 0.1110
Epoch 2/10 60000/60000 [==============================] - 5s 79us/step - loss: 2.3024 - acc: 0.1099
Epoch 3/10 60000/60000 [==============================] - 5s 80us/step - loss: 2.3024 - acc: 0.1095
Epoch 4/10 60000/60000 [==============================] - 5s 81us/step - loss: 2.3024 - acc: 0.1104
As you can see, after the loss decreases from 2.33 to 2.30 it remains constant and does not decrease further. The only mathematical reason I can find is that, because we have taken the learning rate quite high, the optimizer has been driven to a saddle point or a local minimum from which it cannot move further in any direction. The results were quite similar when using SGD, which is a weaker optimizer than Adam.
Epoch 1/10 60000/60000 [==============================] - 5s 79us/step - loss: 2.3052 - acc: 0.1119
Epoch 2/10 60000/60000 [==============================] - 4s 74us/step - loss: 2.3013 - acc: 0.1124
Epoch 3/10 60000/60000 [==============================] - 4s 74us/step - loss: 2.3013 - acc: 0.1124
Epoch 4/10 60000/60000 [==============================] - 5s 77us/step - loss: 2.3013 - acc: 0.1124
In this one I tried something unusual: rather than having a large first hidden layer and shrinking it in subsequent layers, I started with a small first hidden layer and increased the number of neurons in the further hidden layers,
import keras
from keras.datasets import mnist
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(x_train.shape[0], 784)
x_test = x_test.reshape(x_test.shape[0], 784)
y_train = keras.utils.to_categorical(y_train, num_classes=10)
y_test = keras.utils.to_categorical(y_test, num_classes=10)

n_hidden_1 = 32
n_hidden_2 = 64
n_hidden_3 = 128

model = Sequential()
model.add(Dense(784, input_shape=(784,), activation='relu'))
model.add(Dense(n_hidden_1, activation='relu'))
model.add(Dense(n_hidden_2, activation='relu'))
model.add(Dense(n_hidden_3, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.SGD(lr=0.01),
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10)
score = model.evaluate(x_test, y_test)
Epoch 1/10 60000/60000 [==============================] - 6s 96us/step - loss: 14.4698 - acc: 0.1022
Epoch 2/10 60000/60000 [==============================] - 5s 89us/step - loss: 14.4711 - acc: 0.1022
Epoch 3/10 60000/60000 [==============================] - 5s 88us/step - loss: 14.4711 - acc: 0.1022
Epoch 4/10 60000/60000 [==============================] - 5s 90us/step - loss: 14.4711 - acc: 0.1022
This output was not surprising, even though the cost increased rather than decreasing. There are a couple of reasons for this,
- The learning rate is high, so the optimizer might be stuck at some local minimum or saddle point.
- The cost increased for more than one reason,
- SGD does not take all the points into consideration to find the next best direction.
- Hidden-layer neurons are responsible for finding hidden patterns in the data, so fewer neurons mean fewer hidden patterns recognized. Since the findings of subsequent layers depend upon previous layers, not including a large number of neurons in the first layers hurts performance, though this might not be the full explanation for the result above.
In this one I have used only 2 hidden layers with the same number of neurons in each, a dropout of 0.5, and a learning rate of 0.0001.
import keras
from keras.datasets import mnist
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(x_train.shape[0], 784)
x_test = x_test.reshape(x_test.shape[0], 784)
y_train = keras.utils.to_categorical(y_train, num_classes=10)
y_test = keras.utils.to_categorical(y_test, num_classes=10)

n_hidden_1 = 10
n_hidden_2 = 10

model = Sequential()
model.add(Dense(784, input_shape=(784,), activation='relu'))
model.add(Dense(n_hidden_1, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(n_hidden_2, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.SGD(lr=0.0001),
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=6)
The output of this is quite as expected,
Epoch 1/10 60000/60000 [==============================] - 22s 366us/step - loss: 2.4244 - acc: 0.1302
Epoch 2/10 60000/60000 [==============================] - 22s 361us/step - loss: 2.1681 - acc: 0.1766
Epoch 3/10 60000/60000 [==============================] - 22s 364us/step - loss: 2.1075 - acc: 0.2006
Epoch 4/10 60000/60000 [==============================] - 22s 361us/step - loss: 2.0722 - acc: 0.2170
Epoch 5/10 60000/60000 [==============================] - 21s 358us/step - loss: 2.0553 - acc: 0.2234
Epoch 6/10 60000/60000 [==============================] - 22s 359us/step - loss: 2.0408 - acc: 0.2253
The loss is decreasing with each epoch. As the learning rate is low, the network is learning and not skipping the global minimum (just an observation). This works quite similarly with Adam.
Epoch 1/10 60000/60000 [==============================] - 41s 679us/step - loss: 2.4292 - acc: 0.1399
Epoch 2/10 60000/60000 [==============================] - 40s 660us/step - loss: 2.2645 - acc: 0.1314
Epoch 3/10 60000/60000 [==============================] - 40s 662us/step - loss: 2.2330 - acc: 0.1459
Epoch 4/10 60000/60000 [==============================] - 40s 663us/step - loss: 2.1963 - acc: 0.1633
Epoch 5/10 60000/60000 [==============================] - 40s 671us/step - loss: 2.1695 - acc: 0.1762
Epoch 6/10 60000/60000 [==============================] - 40s 664us/step - loss: 2.1702 - acc: 0.1813
The observations from the above experiments,
- A good learning rate lies between 0.001 and 0.00001.
- Use Adam for quite stable results.
- The size of the first hidden layer should be large; decrease it as the network grows deeper.
- Use ReLU to avoid the problem of vanishing gradients in deep MLPs.
- Use dropout to avoid overfitting.
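Putting these observations together, here is a minimal Keras sketch of the kind of configuration I would start with; the exact layer sizes and dropout rate are my own illustrative choices, not tuned values,

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam

model = Sequential()
# first hidden layer large, later layers progressively smaller
model.add(Dense(256, input_shape=(784,), activation='relu'))
model.add(Dropout(0.5))                      # dropout to reduce overfitting
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))
# Adam with a learning rate inside the recommended range
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=Adam(lr=0.001),
              metrics=['accuracy'])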
“How well your machine learning model predicts depends upon what data you use to train it.”
If the data itself is bad, don't expect good results. The immediate question that arises is “How do you define bad data?” The answer is quite difficult, as the quality of data depends upon many factors, but among all of them the choice of features used to create the data is the most important. The choice of features must align with the objective of the problem. Choosing features is not a task where one should depend only upon intuition; feature selection is a statistical technique used to find the relevant features.
The goal of this article is to answer the following questions.
- What is feature selection ?
- What are filter methods for feature selection ?
- What are wrapper methods for feature selection ?
- What are embedded methods for feature selection ?
What is feature selection ?
Feature selection refers to attribute or variable selection for a problem. It must be performed before feeding data to an algorithm for the following reasons,
- The higher the dimension of the data, the slower the process of training the model.
- Some features may be completely irrelevant to predicting the output.
- Irrelevant features can hurt the accuracy of the model.
- In some cases overfitting can also be avoided.
What are filter methods for feature selection ?
Filter methods are those in which we use statistical tests to find the relevance of the features with respect to the output. These tests are performed without considering the model's algorithm; the features that are selected depend purely on their statistical test scores.
Some of the most popular statistical tests performed are,
- Covariance: Measuring the covariance of a subset of features is often a good start, as it gives us a fairly good idea of how the features behave with respect to each other.
- Pearson correlation: Though covariance provides good information about the relationship between variables, the disadvantage is that there is no upper bound on the covariance score, which makes it a little difficult to judge how strongly the features are aligned with each other. For that we use the Pearson correlation, which rescales the covariance to the range -1 to 1.
- LDA: Linear discriminant analysis is a technique where we score features based on how well they separate two or more classes.
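As a rough sketch of how a filter method looks in code, scikit-learn's SelectKBest scores each feature with a statistical test (here an ANOVA F-test) and keeps the best ones; the dataset and the choice of k=2 are purely illustrative,

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
# score every feature with a statistical test, keep the k best
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.scores_)     # per-feature test scores
print(X_selected.shape)     # only the selected features remain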
What are wrapper methods for feature selection ?
The feature selection techniques that come under this heading select features in a way that depends upon what type of algorithm you use for your model. The idea is simple: select a single feature or a subset of features, feed it into the model, and record the accuracy; after trying a certain number of combinations, choose whichever subset performs best.
The techniques that come under this heading are,
- Forward feature selection: First choose the single feature that gives the best performance, then keep adding features on top of it, checking the performance after every addition; stop when the performance no longer improves, or improves only insignificantly.
- Backward elimination: The idea here is the opposite of forward feature selection; we start with all the features and eliminate them one by one as long as there is no significant drop in the performance of the model.
- Recursive feature elimination: This is not a very computationally friendly technique. At each stage it identifies the best and worst features, then finds the best features among those left over and checks the performance, repeating this process recursively. Once all the features have been processed, it ranks them according to the order in which they were eliminated, and the top-ranked features are chosen.
There is no all-in-one package that implements all of these techniques, but some packages such as sklearn and boruta can help; the rest of the techniques you will have to implement yourself.
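For instance, a backward-elimination-style wrapper is available in scikit-learn as recursive feature elimination (RFE); a minimal sketch, where the estimator and the number of features to keep are illustrative choices,

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
# repeatedly fit the model and drop the weakest feature
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.ranking_)   # rank 1 marks the selected features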
Embedded feature selection
These methods of feature selection are built on top of the ideas behind filter and wrapper feature selection. One of the most popular examples of embedded feature selection is regularization: it places a constraint (penalty) on the objective function that keeps the weight coefficients small, or drives the least useful ones to zero, which helps avoid overfitting of the model. These methods are completely algorithm dependent.
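As a hedged example, L1 (lasso) regularization performs embedded feature selection by driving the coefficients of the least useful features to exactly zero; the dataset and the alpha value below are only illustrative,

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
# features whose coefficients were driven to zero are effectively dropped
selected = np.where(lasso.coef_ != 0)[0]
print("selected feature indices:", selected)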
The goal of this article is to answer three simple questions,
- What is batch normalization ?
- Why we need batch normalization ?
- How to use it in a neural network ?
What is batch normalization ?
Batch normalization refers to the normalization performed over a batch of data in between hidden layers. Consider the diagram below for a complete understanding.
The normalization is performed on the activation values of the previous hidden layer, and the result is then passed to the next hidden layer. The normalization performed in this step is the same as the normalization performed before the input layer: subtract the mean, then divide by the standard deviation.
Why do we need batch normalization ?
Before reading the answer to this question you must be familiar with the working of a neural network.
So before explaining batch normalization, let's first understand why we normalize before the input layer. Consider a situation where we have a dataset of numeric features with the scales given below,
f1 : 0-1
f2 : 200-1000
f3 : -4 to 200
Now if we try to train the network on this data without normalizing, there will be the following implications,
- The features with high values will affect the output the most, irrespective of how important they actually are.
- Coefficients trained on this data will not be able to generalize very well.
- If a query arrives with a different scale for these features, chances are the output for that query will be wrong.
By observing the implications mentioned above, we can say the following about normalization,
- Normalization brings the scale of every feature into a comparable range (roughly -1 to 1).
- After normalization it is very unlikely that the output is affected by the scale of the features.
- From the second point we can also infer that it makes the prediction depend more on feature importance than on feature scale.
- This also enhances the ability of the network to generalize well.
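As a quick illustration of this input normalization, here is a small NumPy sketch; the feature values are made up to mimic the scales of f1, f2, and f3 above,

import numpy as np

# columns: f1 (0-1), f2 (200-1000), f3 (-4 to 200)
X = np.array([[0.2, 250.0,  10.0],
              [0.9, 800.0, 150.0],
              [0.5, 400.0,  -2.0]])

# standard normalization: subtract the mean, divide by the standard deviation
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_norm)   # every column now has mean 0 and a comparable scale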
Now let's dive into batch normalization. In the example network above there are 3 hidden layers, one input layer, and one output layer, with a batch normalization layer between the 2nd and 3rd hidden layers. So why do we need to normalize the output of hidden layer 2? Consider this explanation: as we build the neural network deeper and deeper, the amount of variance experienced by each layer increases, so the variance experienced by the input layer is less than that experienced by the 3rd hidden layer. Because of this, training the neural network tends to become very difficult. Some other factors are also affected by this,
- Due to this high variance we have to set the learning rate low; if we choose a large learning rate, chances are the neural network will skip patterns during training.
- This high variance also affects the performance of the network on test data.
In order to minimize the effect of this variance, which builds up as the network grows deeper, we use batch normalization. Does this mean we can use batch normalization multiple times? The answer is yes.
How to use it in a neural network ?
Any deep-learning framework can be used to implement a batch normalization layer; we will use TensorFlow for this explanation.
Calculate the mean and variance of the current batch,
batch_mean, batch_var = tf.nn.moments(hidden_layer_2, [0])  # [0] is the axis over which the mean and variance are computed (the batch dimension)
gamma = tf.Variable(tf.ones([n_hidden_2]))   # scale parameter, one per unit; n_hidden_2 = number of units in hidden layer 2
beta = tf.Variable(tf.zeros([n_hidden_2]))   # shift parameter, one per unit
hidden_layer_2 = tf.nn.batch_normalization(hidden_layer_2, batch_mean, batch_var, beta, gamma, variance_epsilon=1e-05)
You must be wondering what all these parameters are. They are needed because the batch normalization function in TensorFlow does something slightly more general; internally it computes,
y = gamma * (x - mean) / sqrt(variance + variance_epsilon) + beta
So, to keep the normalization simple, I set gamma to one and beta to zero, which reduces the equation to,
y = (x - mean) / sqrt(variance + variance_epsilon)
which is the plain normalization equation. Now you have a complete picture of batch normalization and how to implement it.
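If you prefer Keras (which we imported in the earlier experiments), the same idea is available as a single layer; a minimal sketch of placing it between two hidden layers, with illustrative layer sizes,

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization

model = Sequential()
model.add(Dense(128, input_shape=(784,), activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(BatchNormalization())   # normalizes the activations of the previous layer
model.add(Dense(32, activation='relu'))
model.add(Dense(10, activation='softmax'))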
Setting up an environment can sometimes be a difficult process, and choosing between the different options can be confusing. In this article we will cover how to set up your own data science environment on your machine. The article is not operating-system dependent; the procedure described here can be used on any operating system.
Instead of comparing all the different environments for data science purposes, we will cover the complete setup of the Anaconda environment, which is by far the most used environment among data scientists; that said, it does not mean other environments are not powerful.
It is recommended that you download only the 64-bit version.
There are two popular major versions of Python, Python 2 and Python 3; for data science purposes we mostly use Python 3. It is recommended that you install 3.6 with any patch number for now.
Downloading Python 3.6 for Windows: https://www.python.org/downloads/
Downloading Python 3.6 for Linux: https://www.python.org/downloads/source/, though Linux users have many choices for the type of installation, like rpm, tar, etc. Choose according to your own distribution.
Downloading Python 3.6 for macOS: https://www.python.org/downloads/mac-osx/
While installing, make sure you check the 'add to PATH' option,
Downloading and installing Anaconda is quite an easy process: after downloading it from https://www.anaconda.com/download/, just double-click on the setup and you are done.
After installation, chances are it will become the default Python interpreter on macOS and Linux, so it is recommended that you install it in a separate directory.
After installing, launch Anaconda Navigator by clicking Anaconda Navigator in the start menu.
After Anaconda Navigator launches you will see a dashboard.
There are a couple of ways of launching these applications, but the simplest is to run them from Anaconda Navigator.
As a beginner, most of your time will be spent in Jupyter Notebook. Jupyter is an IPython notebook, i.e. an editor with HTML features in it. Go ahead and launch Jupyter Notebook.
Jupyter Notebook will launch in a browser window; its default workspace is the user's home directory. Working with Jupyter Notebook is extremely easy, just try the different options present.
Another important application present is Spyder; it is more of a MATLAB-like tool for Python with a simple design.
Anaconda comes with almost all the basic data science libraries pre-installed, but if you wish to install some additional libraries, click on Anaconda Prompt in the start menu.
It will open a command prompt window with the Anaconda default environment, or whichever environment you have chosen. Using the Anaconda Prompt you can install any library you require with the conda package manager.
To see the currently installed libraries, just type conda list.
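For example, to install an additional library and then confirm it is present, you can run the following in the Anaconda Prompt (the package name here is just an example),

conda install seaborn
conda list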
After following the above procedure you should be able to start working on data science problems using Anaconda. There might be instances where you face difficulty installing it, like a .NET framework error, missing C++ libraries, or missing root permissions. In those situations don't get frustrated; just leave a comment below or Google it. In the end, tackling errors is what programmers do their entire lives.
Random numbers are a very important part of machine learning; they are the starting point for predicting the output given the input data. In order to understand the working of coefficients correctly you must understand the importance of randomness and how to generate it.
After reading this article you will know,
- Importance of randomness.
- Its use in machine learning.
- How to generate it.
Random numbers are robust against any kind of bias, which makes them very useful in cases where we want the whole system to be robust and free of bias; though it is extremely difficult to eliminate human bias from an algorithm, using random numbers makes it less prone to data bias.
If you are unaware of what a bias is, it can be thought of as a distraction or a force that tends to deviate the algorithm from its actual goal, or that sometimes forces the algorithm to reach the goal only from a certain direction rather than allowing it to cover all possible paths, i.e. to fully develop its power of generalization.
How does randomness help ?
Consider a situation where we initialize all the coefficients of a neural network to zero rather than initializing them with random numbers; in this case the network will behave as a linear model instead of a non-linear one. This is just one of many scenarios where random numbers are of great importance.
Another use case of randomness is when we shuffle the train and test data with the help of a random shuffler. It is extremely important to shuffle the data, as data arranged in some order on the basis of some feature makes the algorithm perform poorly.
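A minimal sketch of such a random shuffle with NumPy; the arrays here are synthetic stand-ins for any feature matrix and label vector,

import numpy as np

# x_data / y_data stand in for a feature matrix and its labels
x_data = np.arange(20).reshape(10, 2)
y_data = np.arange(10)

indices = np.random.permutation(len(x_data))   # a random ordering of the row indices
x_data, y_data = x_data[indices], y_data[indices]
print(y_data)   # labels are now in random order, rows still aligned with features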
How to inject randomness in our model ?
Injecting randomness into the model, whether in the form of random initialization of weights or a random shuffle, is done using a pseudorandom number generator.
A random number generator can be thought of as a function that generates numbers. Generating true random numbers is beyond the scope of this article and is not what is used in machine learning; instead we use pseudorandom numbers, random numbers generated by a deterministic process.
Randomness can also be generated by sampling a distribution; more on this later.
Pseudorandom number generators are functions that you call with a seed value from which they start generating the sequence; if the seed value is not provided by the caller, the current time or date is used as the seed. Though the random numbers are generated by a deterministic process and in sequence, that does not mean the sequence is easy to predict.
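A small sketch of providing the seed explicitly, which makes the "random" sequence reproducible; the seed value 7 is arbitrary,

import random
import numpy as np

random.seed(7)
np.random.seed(7)
print(random.random(), np.random.rand())   # same output on every run with the same seed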
Generating random numbers in Python
There is a module named random in Python which is used to generate random numbers; see the example below,
import random
print("A random integer between 1-100:", random.randint(1, 100))
This code will generate a random integer within the range 1-100.
If you are into machine learning then you must have used one of the most popular numerical computation libraries, NumPy. It contains a module, random, which is responsible for generating random numbers. You might be wondering why to use NumPy when Python itself has a random number generator; the answer is that NumPy is an extremely well-tuned library for almost all numerical computation, whether matrix multiplication, broadcasting operations, etc., so using NumPy for numerical work has its own benefits. Generating random numbers using NumPy is a straightforward process.
import numpy as np
print('A random integer between 0-100:', np.random.randint(0, 101))
This generates a random integer between 0 and 100.
import numpy as np
print('A random number from a normal distribution with mean 0 and std dev 1:', np.random.normal(loc=0, scale=1, size=1))
This draws a random number from a normal distribution with mean 0 and standard deviation 1.
Random numbers are extremely important when working on machine learning problems, as they increase the generalization ability of the model. We will be covering many more such interesting machine learning topics, so stay tuned.
Hypothesis testing is probably the most confusing topic in the whole of statistics; even experienced machine learning engineers suffer from a lack of crisp understanding of it. In this tutorial we will try to answer the following questions,
- What is hypothesis testing ?
- What is it used for ?
What is hypothesis testing ?
We assume that the input data possesses a specific structure and, based on that, we predict the outcome. Then we test the outcome: if it aligns with our assumption, the assumption is accepted, otherwise it is rejected. The assumption here is the hypothesis, and the procedure of testing it is hypothesis testing. The whole idea is to test a given quantity under the assumption that the quantity follows a particular structure; hypothesis testing is essentially based upon the method of proof by contradiction.
One concrete example of an assumption can be,
- The assumption that the data follows a normal distribution.
“The assumption that the data follows a normal distribution is called the null hypothesis, or H0. H0 is the assumption that the statistical test holds initially; it can be interpreted as the default assumption.”
“The alternative assumption, or violation of the null hypothesis, is called the first hypothesis, the alternative hypothesis, or H1.”
Null hypothesis (H0): The default assumption of the statistical test; it is not rejected as long as it holds at some level of significance.
First hypothesis (H1): The alternative to the default assumption; it is accepted when the default assumption does not hold and is rejected.
It is recommended that you understand the definitions of H0 and H1 correctly.
This is one of the most misunderstood parts of hypothesis testing. After applying a statistical test to the data under an assumption, the test returns what we call a p-value; on the basis of this p-value we either reject or fail to reject our null hypothesis. This is done by comparing the p-value with the significance level.
p-value: The probability of observing a test result at least as extreme as the one obtained, given that our null hypothesis is true.
I am quite sure you haven't fully grasped the above definition, so let's take an example. Suppose we have two groups of 20 people, group A and group B, and we want to know whether there is a difference between the weights of the people in the two groups. One simple statistical test may be the difference between the mean weights of the two groups. Here,
H0 : There is no difference between the weights of the people of the two groups.
H1 : There is a difference between the weights of the people of the two groups.
So our p-value can be defined as “the probability of observing a difference at least this large, given that our null hypothesis is true.”
p-value > significance level (threshold): We accept our null hypothesis, or more precisely, we fail to reject it.
p-value <= significance level (threshold): We reject our null hypothesis, or more precisely, we fail to accept it.
A common misinterpretation of the p-value is that it is the probability of the null hypothesis being true or false; it is not.
So, according to the above example of the weights of people in the two groups,
let's say the p-value is 0.8 for a difference of 12 kg. If my null hypothesis is true, i.e. there is no difference between the weights of the two groups of people, then a p-value of 0.8 tells us that observing a difference of 12 kg is quite likely under the null hypothesis (perhaps simply because of the small number of people), so we accept our null hypothesis, or rather we have failed to reject it.
Understanding the above statement is slightly tricky, so read it very carefully.
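To make this concrete in code, a two-sample t-test returns exactly this kind of p-value; the sketch below uses synthetic weights drawn at random, not real measurements, and 0.05 is just a commonly used significance level,

import numpy as np
from scipy import stats

# synthetic weights (kg) for two groups of 20 people each
rng = np.random.RandomState(0)
group_a = rng.normal(loc=70, scale=10, size=20)
group_b = rng.normal(loc=70, scale=10, size=20)

# H0: the mean weights of the two groups are equal
t_stat, p_value = stats.ttest_ind(group_a, group_b)
if p_value > 0.05:
    print("Fail to reject H0 (p = %.3f)" % p_value)
else:
    print("Reject H0 (p = %.3f)" % p_value)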
What is it used for ?
Sometimes during statistical analysis of the data we make assumptions before applying any algorithm; to make sure we are applying the right algorithm to the data, it is very important that our assumptions are correct. A simple example of an assumption is 'the data follows a binomial distribution'; if this is true then we can confidently apply naive Bayes to the given data, as naive Bayes assumes that the data follows either a binomial or a multinomial distribution.
Understanding hypothesis testing is very important, and in many places it is explained incorrectly, so if you have any doubts please post them in the comment section below.
Weight initialization is one of the major factors that directly affects the convergence of the model, i.e. how well the optimizer minimizes the loss function. Before we get started, I am assuming that the reader is aware of what neural networks are, what hidden layers are, how they are connected, etc., though I will provide an abstract overview of the working of a neural network. We will only be considering the MLP architecture.
The basic workflow of a neural network is,
- Initialize all the weights and biases of the network
- Forward pass
- Compute the loss
- Calculate the gradients of the loss function
- Backpropagate the loss
- Update the weights.
Before we even discuss techniques to initialize the weights let’s first discuss what happens if we don’t initialize the weights properly.
- Initializing all weights to zero: This is one of the worst mistakes you can possibly make while initializing, as it will make your neural network work only as a linear model rather than a non-linear one; with all weights at 0, only the biases are responsible for the direction of the decision boundary.
- Very small weight values with sigmoid activation: This is one of the causes of the vanishing gradients problem, i.e. the weights themselves are so small that the gradients become almost negligible, making it difficult for the network to learn anything and eventually halting the minimization of the loss function.
- Exploding gradients problem: In this situation the value of the gradients becomes so large that the model never converges to the global minimum; it keeps oscillating around it, or sometimes the direction of the gradients points away from the minimum, making it impossible to reach.
Understanding the need for smart initialization is important, as you now know that not initializing the weights smartly can cause serious problems.
Techniques of weight initialization
This one is most commonly followed by beginners: generating random numbers from a normal distribution, typically with a mean of 0 and a standard deviation of 0.2, works well in most cases, and in fact you will usually not face any problem except that some networks might converge slowly with this technique. Following this technique, the weights can be initialized quite easily using just NumPy,
# Size_L: number of units in the current layer, Size_L_prev: number of units in the previous layer
weights = sigma * np.random.randn(Size_L, Size_L_prev) + mean
The downside of this technique is that if the values of mean and sigma are not chosen wisely, it can sometimes cause the problem of vanishing or exploding gradients.
Initializing weights keeping ReLU activation in mind:
Initializing the weights while keeping in mind which activation you are using is a good start, as how the non-linear activation behaves is greatly influenced by the weights you initialize.
For ReLU you can use weights drawn from a normal distribution with a mean of 0 and a standard deviation of sqrt(2 / size of the previous hidden layer), also known as He initialization.
sigma = np.sqrt(2.0 / Size_L_prev)
weights = sigma * np.random.randn(Size_L, Size_L_prev)
If using tanh activation:
In the case of tanh you can use Xavier initialization; this is similar to the above initialization except that in place of 2 it uses 1.
sigma = np.sqrt(1.0 / Size_L_prev)
weights = sigma * np.random.randn(Size_L, Size_L_prev)
Another initialization, which is a modification of Xavier, is,
sigma = np.sqrt(2.0 / (Size_L_prev + Size_L))
weights = sigma * np.random.randn(Size_L, Size_L_prev)
These initializations keep the weights in a range that is neither very small nor very big, which avoids the possibility of slow convergence, and at the same time the weights are not big enough to create the problem of exploding gradients.
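Putting the three schemes together, here is a small helper sketch; the function name and interface are my own, purely for illustration,

import numpy as np

def init_weights(size_l, size_l_prev, activation='relu'):
    """Draw a (size_l, size_l_prev) weight matrix from a zero-mean normal distribution."""
    if activation == 'relu':                   # He initialization
        sigma = np.sqrt(2.0 / size_l_prev)
    elif activation == 'tanh':                 # Xavier initialization
        sigma = np.sqrt(1.0 / size_l_prev)
    else:                                      # normalized Xavier variant
        sigma = np.sqrt(2.0 / (size_l_prev + size_l))
    return sigma * np.random.randn(size_l, size_l_prev)

W1 = init_weights(128, 784, activation='relu')
print(W1.shape, W1.std())   # the std should be close to sqrt(2/784)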