Effects of hyperparameters on MLPs
This week I ran some simple experiments with MLPs, trying out different architectures and different learning rates, sometimes with batch normalization and sometimes with dropout. There was no definite goal beyond observing how neural networks behave when the learning rate changes across different architectures.
The goal of this article is to share some simple findings from these experiments. The tests follow no particular sequence and are based purely on intuition. I have used only Keras, to keep the implementation simple.
First Architecture
Since this was the first architecture, I decided not to use any batch normalization or dropout. The architecture is quite simple; let's look at the implementation:
import keras
from keras.datasets import mnist
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(x_train.shape[0], 784)
x_test = x_test.reshape(x_test.shape[0], 784)
y_train = keras.utils.to_categorical(y_train, num_classes=10)
y_test = keras.utils.to_categorical(y_test, num_classes=10)

n_hidden_1 = 10
n_hidden_2 = 7
n_hidden_3 = 3

model = Sequential()
model.add(Dense(784, input_shape=(784,), activation='relu'))
model.add(Dense(n_hidden_1, activation='relu'))
model.add(Dense(n_hidden_2, activation='relu'))
model.add(Dense(n_hidden_3, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(lr=0.01),
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=4)
score = model.evaluate(x_test, y_test)
This is the beauty of Keras: the code is pretty much self-explanatory. The output of this code:
Epoch 1/10
60000/60000 [==============================] - 5s 87us/step - loss: 2.3320 - acc: 0.1110
Epoch 2/10
60000/60000 [==============================] - 5s 79us/step - loss: 2.3024 - acc: 0.1099
Epoch 3/10
60000/60000 [==============================] - 5s 80us/step - loss: 2.3024 - acc: 0.1095
Epoch 4/10
60000/60000 [==============================] - 5s 81us/step - loss: 2.3024 - acc: 0.1104
As you can see, after the loss drops from 2.33 to 2.30 it stays constant and does not decrease any further. The only explanation I can offer is that the learning rate is quite high, which drives the optimizer into a saddle point or a local minimum from which it cannot move any further in any direction. The results were quite similar with SGD, which is a weaker optimizer than Adam.
Epoch 1/10
60000/60000 [==============================] - 5s 79us/step - loss: 2.3052 - acc: 0.1119
Epoch 2/10
60000/60000 [==============================] - 4s 74us/step - loss: 2.3013 - acc: 0.1124
Epoch 3/10
60000/60000 [==============================] - 4s 74us/step - loss: 2.3013 - acc: 0.1124
Epoch 4/10
60000/60000 [==============================] - 5s 77us/step - loss: 2.3013 - acc: 0.1124
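Since the most likely culprit here is the learning rate, one quick check is to rebuild the same model with a smaller learning rate and compare the loss curves. This is only a sketch, not one of the runs above; the build_model helper and the value 0.001 are purely illustrative.

# Sketch: compare a high and a low learning rate on the same architecture.
# Assumes x_train and y_train from the listing above are still in scope;
# build_model and lr=0.001 are illustrative, not part of the original runs.
import keras
from keras.models import Sequential
from keras.layers import Dense

def build_model(lr):
    model = Sequential()
    model.add(Dense(784, input_shape=(784,), activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(7, activation='relu'))
    model.add(Dense(3, activation='relu'))
    model.add(Dense(10, activation='softmax'))
    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer=keras.optimizers.Adam(lr=lr),
                  metrics=['accuracy'])
    return model

history_high = build_model(0.01).fit(x_train, y_train, epochs=4, verbose=0)
history_low = build_model(0.001).fit(x_train, y_train, epochs=4, verbose=0)
print('lr=0.01 ', history_high.history['loss'])
print('lr=0.001', history_low.history['loss'])

If the plateau really comes from the step size, the lr=0.001 run should keep reducing the loss where the lr=0.01 run flattens out.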
Second Architecture
In this one I tried something unusual: rather than the usual pattern of a large first hidden layer whose size shrinks in later layers, I started with a small first hidden layer and increased the number of neurons in each subsequent hidden layer.
The code is below:
import keras
from keras.datasets import mnist
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(x_train.shape[0], 784)
x_test = x_test.reshape(x_test.shape[0], 784)
y_train = keras.utils.to_categorical(y_train, num_classes=10)
y_test = keras.utils.to_categorical(y_test, num_classes=10)

n_hidden_1 = 32
n_hidden_2 = 64
n_hidden_3 = 128

model = Sequential()
model.add(Dense(784, input_shape=(784,), activation='relu'))
model.add(Dense(n_hidden_1, activation='relu'))
model.add(Dense(n_hidden_2, activation='relu'))
model.add(Dense(n_hidden_3, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.SGD(lr=0.01),
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10)
score = model.evaluate(x_test, y_test)
Output:
Epoch 1/10
60000/60000 [==============================] - 6s 96us/step - loss: 14.4698 - acc: 0.1022
Epoch 2/10
60000/60000 [==============================] - 5s 89us/step - loss: 14.4711 - acc: 0.1022
Epoch 3/10
60000/60000 [==============================] - 5s 88us/step - loss: 14.4711 - acc: 0.1022
Epoch 4/10
60000/60000 [==============================] - 5s 90us/step - loss: 14.4711 - acc: 0.1022
This output was not surprising; the cost even increased rather than decreasing. There are a couple of possible reasons for this:
- The learning rate is high, so the optimizer may be stuck at a local minimum or a saddle point.
- The increase in cost is probably not down to a single cause:
- SGD does not take all the training points into account when picking the next update direction.
- The neurons in the hidden layers are responsible for finding hidden patterns in the data, so fewer neurons means fewer hidden patterns get captured. Each layer builds on what the previous layers found, so a small first hidden layer can hurt the performance of the whole network. That said, this may not fully explain the result above (a sketch of the more usual wide-to-narrow layout follows this list).
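For reference, here is a minimal sketch of the conventional wide-to-narrow layout that the last bullet describes. The sizes 128/64/32 are illustrative; this configuration was not benchmarked in this article.

# Sketch of the conventional layout: a large first hidden layer that shrinks
# with depth (illustrative only; not one of the runs in this article).
import keras
from keras.models import Sequential
from keras.layers import Dense

n_hidden_1 = 128
n_hidden_2 = 64
n_hidden_3 = 32

model = Sequential()
model.add(Dense(784, input_shape=(784,), activation='relu'))
model.add(Dense(n_hidden_1, activation='relu'))
model.add(Dense(n_hidden_2, activation='relu'))
model.add(Dense(n_hidden_3, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.SGD(lr=0.01),
              metrics=['accuracy'])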
Third Architecture
In this one I used only two hidden layers, both with the same number of neurons, with a dropout of 0.5 and a learning rate of 0.0001.
import keras
from keras.datasets import mnist
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(x_train.shape[0], 784)
x_test = x_test.reshape(x_test.shape[0], 784)
y_train = keras.utils.to_categorical(y_train, num_classes=10)
y_test = keras.utils.to_categorical(y_test, num_classes=10)

n_hidden_1 = 10
n_hidden_2 = 10

model = Sequential()
model.add(Dense(784, input_shape=(784,), activation='relu'))
model.add(Dense(n_hidden_1, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(n_hidden_2, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.SGD(lr=0.0001),
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=6)
The output of this is pretty much what I expected:
Epoch 1/10
60000/60000 [==============================] - 22s 366us/step - loss: 2.4244 - acc: 0.1302
Epoch 2/10
60000/60000 [==============================] - 22s 361us/step - loss: 2.1681 - acc: 0.1766
Epoch 3/10
60000/60000 [==============================] - 22s 364us/step - loss: 2.1075 - acc: 0.2006
Epoch 4/10
60000/60000 [==============================] - 22s 361us/step - loss: 2.0722 - acc: 0.2170
Epoch 5/10
60000/60000 [==============================] - 21s 358us/step - loss: 2.0553 - acc: 0.2234
Epoch 6/10
60000/60000 [==============================] - 22s 359us/step - loss: 2.0408 - acc: 0.2253
The loss decreases with each epoch: because the learning rate is low, the network keeps learning and does not skip past the global minimum (just an observation). It works quite similarly with Adam:
Epoch 1/10
60000/60000 [==============================] - 41s 679us/step - loss: 2.4292 - acc: 0.1399
Epoch 2/10
60000/60000 [==============================] - 40s 660us/step - loss: 2.2645 - acc: 0.1314
Epoch 3/10
60000/60000 [==============================] - 40s 662us/step - loss: 2.2330 - acc: 0.1459
Epoch 4/10
60000/60000 [==============================] - 40s 663us/step - loss: 2.1963 - acc: 0.1633
Epoch 5/10
60000/60000 [==============================] - 40s 671us/step - loss: 2.1695 - acc: 0.1762
Epoch 6/10
60000/60000 [==============================] - 40s 664us/step - loss: 2.1702 - acc: 0.1813
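To make the "works quite similarly with Adam" comparison explicit, here is a sketch that trains the same dropout model twice, once with SGD and once with Adam, both at lr=0.0001. The build_dropout_model helper is hypothetical, introduced only for this comparison; x_train and y_train come from the listing above.

# Sketch: same model, two optimizers, same low learning rate.
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_dropout_model():
    model = Sequential()
    model.add(Dense(784, input_shape=(784,), activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(10, activation='softmax'))
    return model

for name, opt in [('SGD ', keras.optimizers.SGD(lr=0.0001)),
                  ('Adam', keras.optimizers.Adam(lr=0.0001))]:
    model = build_dropout_model()
    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer=opt, metrics=['accuracy'])
    history = model.fit(x_train, y_train, epochs=6, verbose=0)
    print(name, [round(l, 4) for l in history.history['loss']])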
Observations from the experiments above
- A good learning rate lies somewhere between 0.001 and 0.00001.
- Use Adam for fairly stable results.
- The first hidden layer should be large, and the layer sizes should shrink as the network grows deeper.
- Use ReLU to avoid vanishing gradients in deep MLPs.
- Use dropout to avoid overfitting. A sketch that puts these points together follows below.
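To close, here is a minimal sketch that combines these observations: Adam with a learning rate in the suggested range, a large first hidden layer that shrinks with depth, ReLU activations, and dropout. The specific sizes (256/128/64), the dropout rate, and lr=0.001 are illustrative choices and were not benchmarked above.

# Sketch combining the observations above (illustrative configuration only).
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(x_train.shape[0], 784)
x_test = x_test.reshape(x_test.shape[0], 784)
y_train = keras.utils.to_categorical(y_train, num_classes=10)
y_test = keras.utils.to_categorical(y_test, num_classes=10)

model = Sequential()
model.add(Dense(256, input_shape=(784,), activation='relu'))  # large first hidden layer
model.add(Dropout(0.5))                                        # dropout against overfitting
model.add(Dense(128, activation='relu'))                       # shrink as the network deepens
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(lr=0.001),  # lr within the 0.001-0.00001 range
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=6)
score = model.evaluate(x_test, y_test)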