Effects of Hyperparameters on MLPs

This week I ran some simple experiments with MLPs, trying different architectures and learning rates, occasionally adding batch normalization and sometimes dropout. There was no definite goal for these experiments other than to observe how neural networks behave when we change the learning rate across different architectures.

The goal of this article is to share some simple findings from these experiments. There is no particular sequence to the tests; they are based entirely on intuition. I used only Keras, to keep the implementation as easy as possible.

First Architecture

Since this was the first architecture, I decided not to use any batch normalization or dropout. The architecture is quite simple; let's look at the implementation:


import keras
from keras.datasets import mnist
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Dropout , BatchNormalization

# Load MNIST and flatten each 28x28 image into a 784-dimensional vector
(x_train,y_train), (x_test,y_test) = mnist.load_data()
x_train = x_train.reshape(x_train.shape[0],784)
x_test = x_test.reshape(x_test.shape[0],784)
# One-hot encode the digit labels
y_train = keras.utils.to_categorical(y_train,num_classes=10)
y_test = keras.utils.to_categorical(y_test, num_classes=10)

# Hidden layer sizes shrink as the network gets deeper
n_hidden_1 = 10
n_hidden_2 = 7
n_hidden_3 = 3

model = Sequential()
model.add(Dense(784,input_shape=(784,),activation='relu'))
model.add(Dense(n_hidden_1,activation='relu'))
model.add(Dense(n_hidden_2,activation='relu'))
model.add(Dense(n_hidden_3,activation='relu'))
model.add(Dense(10,activation='softmax'))

model.compile(loss = keras.losses.categorical_crossentropy,optimizer=keras.optimizers.Adam(lr=0.01),metrics=['accuracy'])
model.fit(x_train,y_train,epochs=4)

score = model.evaluate(x_test,y_test)

This is the beauty of Keras: the code is largely self-explanatory. The output of this code:


Epoch 1/10
60000/60000 [==============================] - 5s 87us/step - loss: 2.3320 - acc: 0.1110
Epoch 2/10
60000/60000 [==============================] - 5s 79us/step - loss: 2.3024 - acc: 0.1099
Epoch 3/10
60000/60000 [==============================] - 5s 80us/step - loss: 2.3024 - acc: 0.1095
Epoch 4/10
60000/60000 [==============================] - 5s 81us/step - loss: 2.3024 - acc: 0.1104

As you can see, after the loss drops from 2.33 to 2.30 it stays constant and does not decrease any further. I don't have a rigorous mathematical explanation for this other than that the learning rate is quite high, which can push the optimizer to a saddle point or a local minimum from which it cannot move in any direction. The results were quite similar with SGD, which is generally a weaker optimizer than Adam; a quick sketch of retrying the same model with a smaller learning rate follows the SGD log below.


Epoch 1/10
60000/60000 [==============================] - 5s 79us/step - loss: 2.3052 - acc: 0.1119
Epoch 2/10
60000/60000 [==============================] - 4s 74us/step - loss: 2.3013 - acc: 0.1124
Epoch 3/10
60000/60000 [==============================] - 4s 74us/step - loss: 2.3013 - acc: 0.1124
Epoch 4/10
60000/60000 [==============================] - 5s 77us/step - loss: 2.3013 - acc: 0.1124
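
As a quick sanity check on the learning-rate explanation, here is a minimal sketch (not part of the original runs) of recompiling the same architecture with a smaller Adam learning rate; the value 0.001 is an assumption chosen only to illustrate the idea.

# Sketch only: same layers as above, recompiled with an assumed smaller learning rate
model_small_lr = Sequential()
model_small_lr.add(Dense(784, input_shape=(784,), activation='relu'))
model_small_lr.add(Dense(n_hidden_1, activation='relu'))
model_small_lr.add(Dense(n_hidden_2, activation='relu'))
model_small_lr.add(Dense(n_hidden_3, activation='relu'))
model_small_lr.add(Dense(10, activation='softmax'))

model_small_lr.compile(loss=keras.losses.categorical_crossentropy,
                       optimizer=keras.optimizers.Adam(lr=0.001),
                       metrics=['accuracy'])
model_small_lr.fit(x_train, y_train, epochs=4)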

Second Architecture

In this one I tried something unusual: rather than the usual pattern of a wide first hidden layer that narrows with depth, I start with a small first hidden layer and increase the number of neurons in the later layers (32 → 64 → 128).

The code:


import keras
from keras.datasets import mnist
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Dropout , BatchNormalization

(x_train,y_train), (x_test,y_test) = mnist.load_data()
x_train = x_train.reshape(x_train.shape[0],784)
x_test = x_test.reshape(x_test.shape[0],784)

y_train = keras.utils.to_categorical(y_train,num_classes=10)
y_test = keras.utils.to_categorical(y_test, num_classes=10)
# This time the hidden layers get wider with depth
n_hidden_1 = 32
n_hidden_2 = 64
n_hidden_3 = 128

model = Sequential()
model.add(Dense(784,input_shape=(784,),activation='relu'))
model.add(Dense(n_hidden_1,activation='relu'))
model.add(Dense(n_hidden_2,activation='relu'))
model.add(Dense(n_hidden_3,activation='relu'))
model.add(Dense(10,activation='softmax'))

model.compile(loss = keras.losses.categorical_crossentropy,optimizer=keras.optimizers.SGD(lr=0.01),metrics=['accuracy'])
model.fit(x_train,y_train,epochs=10)

score = model.evaluate(x_test,y_test)

The output:


Epoch 1/10
60000/60000 [==============================] - 6s 96us/step - loss: 14.4698 - acc: 0.1022
Epoch 2/10
60000/60000 [==============================] - 5s 89us/step - loss: 14.4711 - acc: 0.1022
Epoch 3/10
60000/60000 [==============================] - 5s 88us/step - loss: 14.4711 - acc: 0.1022
Epoch 4/10
60000/60000 [==============================] - 5s 90us/step - loss: 14.4711 - acc: 0.1022

This output was not surprising, even though the cost increased rather than decreasing. There are a couple of reasons for this:

  • The learning rate is high, so the optimizer might get stuck at a local minimum or saddle point.
  • The increase in cost probably has more than one cause:
    • SGD does not take all of the data points into consideration when finding the next best direction.
    • Hidden-layer neurons are responsible for finding hidden patterns in the data, so fewer neurons means fewer hidden patterns recognized. Because each layer builds on what the previous layers found, too few neurons in the first layers can hurt performance, although this may not fully explain the result above (a sketch of possible tweaks follows this list).
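
Based on those two points, here is a minimal sketch of what one might try next: a smaller SGD learning rate and a larger mini-batch so each update sees more points. The 0.001 learning rate and batch size of 256 are assumptions for illustration, not values from the original experiment.

# Sketch only: same widening architecture, smaller SGD step and a bigger mini-batch
# (both values are illustrative assumptions, not from the original runs)
model_tweaked = Sequential()
model_tweaked.add(Dense(784, input_shape=(784,), activation='relu'))
model_tweaked.add(Dense(n_hidden_1, activation='relu'))
model_tweaked.add(Dense(n_hidden_2, activation='relu'))
model_tweaked.add(Dense(n_hidden_3, activation='relu'))
model_tweaked.add(Dense(10, activation='softmax'))

model_tweaked.compile(loss=keras.losses.categorical_crossentropy,
                      optimizer=keras.optimizers.SGD(lr=0.001),
                      metrics=['accuracy'])
model_tweaked.fit(x_train, y_train, epochs=10, batch_size=256)
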
Third Architecture

In this one I used only 2 hidden layers with the same number of neurons in each, a dropout of 0.5 between them, and a learning rate of 0.0001.


import keras
from keras.datasets import mnist
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Dropout , BatchNormalization

(x_train,y_train), (x_test,y_test) = mnist.load_data()
x_train = x_train.reshape(x_train.shape[0],784)
x_test = x_test.reshape(x_test.shape[0],784)
y_train = keras.utils.to_categorical(y_train,num_classes=10)
y_test = keras.utils.to_categorical(y_test, num_classes=10)

# Two hidden layers of equal width
n_hidden_1 = 10
n_hidden_2 = 10


model = Sequential()
model.add(Dense(784,input_shape=(784,),activation='relu'))
model.add(Dense(n_hidden_1,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(n_hidden_2,activation='relu'))
model.add(Dense(10,activation='softmax'))
model.compile(loss = keras.losses.categorical_crossentropy,optimizer=keras.optimizers.SGD(lr=0.0001),metrics=['accuracy'])
model.fit(x_train,y_train,epochs=6)

The output of this is pretty much what I expected:


Epoch 1/10
60000/60000 [==============================] - 22s 366us/step - loss: 2.4244 - acc: 0.1302
Epoch 2/10
60000/60000 [==============================] - 22s 361us/step - loss: 2.1681 - acc: 0.1766
Epoch 3/10
60000/60000 [==============================] - 22s 364us/step - loss: 2.1075 - acc: 0.2006
Epoch 4/10
60000/60000 [==============================] - 22s 361us/step - loss: 2.0722 - acc: 0.2170
Epoch 5/10
60000/60000 [==============================] - 21s 358us/step - loss: 2.0553 - acc: 0.2234
Epoch 6/10
60000/60000 [==============================] - 22s 359us/step - loss: 2.0408 - acc: 0.2253

The loss decreases with each epoch: since the learning rate is low, the network keeps learning and does not skip past the global minimum (just an observation). It behaves quite similarly with Adam; one way to automate this learning-rate tuning is sketched after the Adam log below.


Epoch 1/10
60000/60000 [==============================] - 41s 679us/step - loss: 2.4292 - acc: 0.1399
Epoch 2/10
60000/60000 [==============================] - 40s 660us/step - loss: 2.2645 - acc: 0.1314
Epoch 3/10
60000/60000 [==============================] - 40s 662us/step - loss: 2.2330 - acc: 0.1459
Epoch 4/10
60000/60000 [==============================] - 40s 663us/step - loss: 2.1963 - acc: 0.1633
Epoch 5/10
60000/60000 [==============================] - 40s 671us/step - loss: 2.1695 - acc: 0.1762
Epoch 6/10
60000/60000 [==============================] - 40s 664us/step - loss: 2.1702 - acc: 0.1813
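
One way to get the benefit of a small learning rate without picking it by hand is to let Keras shrink the rate whenever the loss stops improving. Below is a minimal sketch using the built-in ReduceLROnPlateau callback; the starting rate, factor, and patience values are assumptions, not settings from the runs above.

from keras.callbacks import ReduceLROnPlateau

# Sketch only: halve the learning rate whenever training loss stalls for 2 epochs
# (starting rate and callback settings are illustrative assumptions)
reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.5, patience=2, min_lr=0.00001)

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(lr=0.001),
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=6, callbacks=[reduce_lr])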

Observations from the above experiments

  • A good learning rate lies somewhere between 0.001 and 0.00001.
  • Use Adam for fairly stable results.
  • The first hidden layer should be large, with layer sizes decreasing as the network grows deeper.
  • Use ReLU to avoid vanishing-gradient problems in deep MLPs.
  • Use dropout to avoid overfitting.

A sketch that puts these points together is shown below.
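
Here is a minimal sketch of the kind of MLP I would start from next time; the exact layer widths (256/128/64) and the 0.001 learning rate are assumptions based on the points above, not a configuration I tested.

# Sketch only: wide-to-narrow hidden layers, ReLU activations, dropout for
# regularization, and Adam with a learning rate inside the suggested range
model = Sequential()
model.add(Dense(784, input_shape=(784,), activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(lr=0.001),
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10)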