Dropout Regularization in Deep Learning Models With Keras

A simple and powerful regularization technique for neural networks and deep learning models is dropout.

In this post you will discover the dropout regularization technique and how to apply it to your models in Python with Keras.

After reading this post you will know:

  • How the dropout regularization technique works.
  • How to use dropout on your input layers.
  • How to use dropout on your hidden layers.
  • How to tune the dropout level on your problem.

Let’s get started.

  • Update Oct/2016: Updated examples for Keras 1.1.0, TensorFlow 0.10.0 and scikit-learn v0.18.
  • Update Mar/2017: Updated example for Keras 2.0.2, TensorFlow 1.0.1 and Theano 0.9.0.
Photo by Trekking Rinjani, some rights reserved.

Dropout Regularization For Neural Networks

Dropout is a regularization technique for neural network models proposed by Srivastava, et al. in their 2014 paper Dropout: A Simple Way to Prevent Neural Networks from Overfitting (download the PDF).

Dropout is a technique where randomly selected neurons are ignored during training. They are “dropped out” randomly. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass and no weight updates are applied to the neuron on the backward pass.

As a neural network learns, neuron weights settle into their context within the network. Weights of neurons are tuned for specific features, providing some specialization. Neighboring neurons come to rely on this specialization, which, if taken too far, can result in a fragile model that is too specialized to the training data. This reliance on context for a neuron during training is referred to as complex co-adaptation.

You can imagine that if neurons are randomly dropped out of the network during training, other neurons will have to step in and handle the representation required to make predictions for the missing neurons. This is believed to result in multiple independent internal representations being learned by the network.

The effect is that the network becomes less sensitive to the specific weights of neurons. This in turn results in a network that is capable of better generalization and is less likely to overfit the training data.


Dropout Regularization in Keras

Dropout is easily implemented by randomly selecting nodes to be dropped out with a given probability (e.g. 20%) on each weight update cycle. This is how Dropout is implemented in Keras. Dropout is only used during the training of a model and is not used when evaluating the skill of the model.
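
As an illustration only (a minimal sketch of the idea, not the Keras internals), the effect of dropout on one layer's activations during a single training update can be mimicked with NumPy:

import numpy as np

rate = 0.2                              # probability of dropping a unit
activations = np.random.rand(10)        # pretend outputs of one layer
mask = np.random.rand(10) >= rate       # keep each unit with probability 1 - rate
dropped = activations * mask            # dropped units contribute nothing downstream
# (Keras also rescales the kept activations during training so their expected
# sum is unchanged; you never need to write this yourself.)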

Next we will explore a few different ways of using Dropout in Keras.

The examples will use the Sonar dataset. This is a binary classification problem where the objective is to correctly discriminate rocks from mock mines (metal cylinders) based on sonar chirp returns. It is a good test dataset for neural networks because all of the input values are numerical and have the same scale.

The dataset can be downloaded from the UCI Machine Learning repository. You can place the sonar dataset in your current working directory with the file name sonar.csv.

We will evaluate the developed models using scikit-learn with 10-fold cross validation, in order to better tease out differences in the results.

There are 60 input values and a single output value and the input values are standardized before being used in the network. The baseline neural network model has two hidden layers, the first with 60 units and the second with 30. Stochastic gradient descent is used to train the model with a relatively low learning rate and momentum.

The full baseline model is listed below.
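
A minimal sketch of such a baseline is shown below, following the description above (60 and 30 hidden units, standardized inputs in a Pipeline, SGD with a low learning rate and momentum, 10-fold stratified cross-validation); the epoch count, batch size and exact SGD values are assumptions rather than the original listing.

import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.pipeline import Pipeline

# load the Sonar dataset: 60 numeric inputs and a single class label
dataframe = pandas.read_csv("sonar.csv", header=None)
dataset = dataframe.values
X = dataset[:, 0:60].astype(float)
Y = dataset[:, 60]
# encode the string class values as integers
encoder = LabelEncoder()
encoded_Y = encoder.fit_transform(Y)

# baseline network: 60 -> 30 -> 1, no dropout
def create_baseline():
    model = Sequential()
    model.add(Dense(60, input_dim=60, activation='relu'))
    model.add(Dense(30, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    sgd = SGD(lr=0.01, momentum=0.8)  # assumed "relatively low" values
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model

numpy.random.seed(7)
# standardize inputs and evaluate with 10-fold stratified cross-validation
estimators = [('standardize', StandardScaler()),
              ('mlp', KerasClassifier(build_fn=create_baseline, epochs=300,
                                      batch_size=16, verbose=0))]
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))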

Running the example generates an estimated classification accuracy of 86%.

Using Dropout on the Visible Layer

Dropout can be applied to the input neurons, called the visible layer.

In the example below we add a new Dropout layer between the input (or visible) layer and the first hidden layer. The dropout rate is set to 20%, meaning one in five inputs will be randomly excluded from each update cycle.

Additionally, as recommended in the original paper on Dropout, a constraint is imposed on the weights for each hidden layer, ensuring that the maximum norm of the weights does not exceed a value of 3. This is done by setting the kernel_constraint argument on the Dense class when constructing the layers.

The learning rate was lifted by one order of magnitude and the momentum was increased to 0.9. These increases were also recommended in the original Dropout paper.

Continuing on from the baseline example above, the code below exercises the same network with input dropout.
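
A sketch of the change, continuing from the baseline sketch above: only the model-building function differs, with the Dropout layer, max-norm constraint and higher learning rate following the description; everything else is assumed unchanged.

from keras.layers import Dropout
from keras.constraints import maxnorm

# dropout on the visible (input) layer, plus a max-norm weight constraint
def create_model():
    model = Sequential()
    model.add(Dropout(0.2, input_shape=(60,)))
    model.add(Dense(60, activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dense(30, activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dense(1, activation='sigmoid'))
    # learning rate lifted by one order of magnitude, momentum raised to 0.9
    sgd = SGD(lr=0.1, momentum=0.9)
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model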

Running the example provides a small drop in classification accuracy, at least on a single test run.

Using Dropout on Hidden Layers

Dropout can be applied to hidden neurons in the body of your network model.

In the example below, Dropout is applied between the two hidden layers and between the last hidden layer and the output layer. Again, a dropout rate of 20% is used, as is a weight constraint on those layers.
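
A sketch of the corresponding model-building function, mirroring the description above and reusing the imports from the sketches earlier; the training settings are assumed to be the same as in the input-dropout example.

# dropout between the hidden layers and before the output layer
def create_model():
    model = Sequential()
    model.add(Dense(60, input_dim=60, activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    model.add(Dense(30, activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))
    model.add(Dense(1, activation='sigmoid'))
    sgd = SGD(lr=0.1, momentum=0.9)
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model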

We can see that for this problem and for the chosen network configuration that using dropout in the hidden layers did not lift performance. In fact, performance was worse than the baseline.

It is possible that additional training epochs are required or that further tuning is required to the learning rate.

Tips For Using Dropout

The original paper on Dropout provides experimental results on a suite of standard machine learning problems. As a result, the authors provide a number of useful heuristics to consider when using dropout in practice; a short sketch combining them follows the list below.

  • Generally, use a small dropout value of 20%-50% of neurons with 20% providing a good starting point. A probability too low has minimal effect and a value too high results in under-learning by the network.
  • Use a larger network. You are likely to get better performance when dropout is used on a larger network, giving the model more of an opportunity to learn independent representations.
  • Use dropout on incoming (visible) as well as hidden units. Application of dropout at each layer of the network has shown good results.
  • Use a large learning rate with decay and a large momentum. Increase your learning rate by a factor of 10 to 100 and use a high momentum value of 0.9 or 0.99.
  • Constrain the size of network weights. A large learning rate can result in very large network weights. Imposing a constraint on the size of network weights such as max-norm regularization with a size of 4 or 5 has been shown to improve results.
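
One way these heuristics might combine in Keras is sketched below; the layer sizes, dropout rates and decay value are illustrative assumptions, not values from the paper.

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.constraints import maxnorm
from keras.optimizers import SGD

model = Sequential()
model.add(Dropout(0.2, input_shape=(60,)))                               # dropout on the visible layer
model.add(Dense(120, activation='relu', kernel_constraint=maxnorm(4)))   # larger hidden layer, constrained weights
model.add(Dropout(0.2))                                                  # dropout on the hidden layer
model.add(Dense(1, activation='sigmoid'))
sgd = SGD(lr=0.1, momentum=0.9, decay=1e-6)                              # large learning rate with decay and high momentum
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])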

More Resources on Dropout

Below are some resources that you can use to learn more about dropout in neural network and deep learning models.

Summary

In this post, you discovered the dropout regularization technique for deep learning models. You learned:

  • What dropout is and how it works.
  • How you can use dropout on your own deep learning models.
  • Tips for getting the best results from dropout on your own models.

Do you have any questions about dropout or about this post? Ask your questions in the comments and I will do my best to answer.


47 Responses to Dropout Regularization in Deep Learning Models With Keras

  1. sheinis September 7, 2016 at 6:27 am #

    Hi,

    thanks for the very useful examples!!
    Question: the goal of the dropouts is to reduce the risk of overfitting right?
    I am wondering whether the accuracy is the best way to measure this; here you are already doing a cross validation, which is by itself a way to reduce overfitting. As you are performing cross-validation *and* dropout, isn’t this somewhat overkill? Maybe the drop in accuracy is actually a drop in amount of information available?

    • Jason Brownlee September 7, 2016 at 10:31 am #

      Great question.

      Yes dropout is a technique to reduce the overfitting of the network to the training data.

      k-fold cross-validation is a robust technique to estimate the skill of a model. It is well suited to determine whether a specific network configuration has over or under fit the problem.

      You could also look at diagnostic plots of loss over epoch on the training and validation datasets to determine how overlearning has been affected by different dropout configurations.
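
      A minimal sketch of such a diagnostic plot, assuming a compiled Keras model and X_train, y_train, X_val, y_val arrays are already defined:

      # plot training vs. validation loss over epochs to diagnose overlearning
      import matplotlib.pyplot as plt

      history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                          epochs=300, verbose=0)
      plt.plot(history.history['loss'], label='train')
      plt.plot(history.history['val_loss'], label='validation')
      plt.xlabel('epoch')
      plt.ylabel('loss')
      plt.legend()
      plt.show()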

  2. Aditya September 23, 2016 at 8:44 am #

    Very good post. Just one question: why the need for increasing the learning rate in combination with setting a max norm value?

    • Jason Brownlee September 24, 2016 at 8:01 am #

      Great question. Perhaps fewer nodes being updated with dropout requires more change/update each batch.

  3. Star October 13, 2016 at 1:17 pm #

    The LSTM performs well on the training dataset but does not do well on the testing dataset, i.e. prediction. Could you give me some advice for this problem?

    • Jason Brownlee October 14, 2016 at 8:57 am #

      It sounds like overlearning.

      Consider using a regularization technique like dropout discussed in this post above.

  4. Yuanliang Meng November 2, 2016 at 7:06 am #

    Some people mentioned that applying dropout on the LSTM units often leads to bad results. An example is here: https://arxiv.org/abs/1508.03720

    I wonder if anyone has any comment on this.

    • Jason Brownlee November 2, 2016 at 9:10 am #

      Thanks for the link Yuanliang. I do often see worse results. Although, I often see benefit of dropout on the dense layer before output.

      My advice is to experiment on your problem and see.

  5. Happy December 6, 2016 at 1:50 am #

    Hi,
    First of all, thanks to you for making machine learning fun to learn.

    I have a query related to drop-outs.
    Can we use drop-out even in case we have selected the optimizer as adam and not sgd?
    In the examples, sgd is being used, and in the tips section it is mentioned to "Use a large learning rate with decay and a large momentum." As far as I can see, adam does not have momentum. So what should the parameters to adam be if we use dropouts?

    keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)

    • Jason Brownlee December 6, 2016 at 9:52 am #

      Yes, you can use dropout with other optimization algorithms. I would suggest experimenting with the parameters and see how to balance learning and regularization provided by dropout.

  6. dalila January 27, 2017 at 7:58 am #

    In addition to sampling, how does one deal with rare events when building a Deep Learning Model?
    For shallow Machine Learning, I can add a utility or a cost function, but here I'm trying to see if a more elegant approach has been developed.

  7. Junaid Effendi January 30, 2017 at 8:06 am #

    Hi Jason,
    Why Dropout will work in these cases? It can lower down the computations but how will it impact the increase… i have been experimenting with the dropout on ANN and now on RNN (LSTM), I am using the dropout in LSTM only in the input and output not between the recurrent layers.. but the accuracy remains the same for both validation and training data set…

    Any comments ?

    • Jason Brownlee February 1, 2017 at 10:19 am #

      Hi Junaid,

      I have not used dropout on RNNs myself.

      Perhaps you need more drop-out and less training to impact the skill or generalization capability of your network.

  8. Shin April 15, 2017 at 11:48 pm #

    Is the dropout layer stored in the model when it is stored?..

    If so why?.. it doesn’t make sense to have a dropout layer in a model? besides when training?

    • Jason Brownlee April 16, 2017 at 9:28 am #

      It is a function with no weights. There is nothing to store, other than the fact that it exists in a specific point in the network topology.

  9. QW April 21, 2017 at 12:53 am #

    Is it recommended in real practice to fix a random seed for training?

  10. Punit Singh June 11, 2017 at 4:21 am #

    I am a beginner in Machine Learning and trying to learn neural networks from your blog.
    In this post, I understand all the concepts of dropout, but the use of:
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import StratifiedKFold
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline
    is making code difficult for me to understand. Would you suggest any way so that I can code without using these?

    • Jason Brownlee June 11, 2017 at 8:28 am #

      Yes, you can use keras directly. I offer many tutorials on the topic, try the search feature at the top of the page.

  11. Yitzhak June 18, 2017 at 8:23 pm #

    thanks Jason, this is so useful !!

  12. Corina July 10, 2017 at 6:18 pm #

    Hi Jason,

    Thanks for the awesome materials you provide!
    I have a question, I saw that when using dropout for the hidden layers, you applied it for all of them.
    My question is, if dropout is applied to the hidden layers then, should it be applied to all of them? Or better yet how do we choose where to apply the dropout?
    Thanks ! 🙂

    • Jason Brownlee July 11, 2017 at 10:28 am #

      Great question. I would recommend testing every variation you can think of for your network and see what works best on your specific problem.

  13. Dikshika July 10, 2017 at 8:28 pm #

    My cat dog classifier with Keras is over-fitting for Dog. How do I make it unbiased?

    • Jason Brownlee July 11, 2017 at 10:30 am #

      Consider augmentation on images in the cat class in order to fit a more robust model.

      • Dikshika July 17, 2017 at 2:54 pm #

        I have already augmented the train data set. But it’s not helping. Here is my code snippet.
        It classifies appx 246 out of 254 dogs and 83 out of 246 cats correctly.

  14. Dikshika July 17, 2017 at 2:55 pm #

    from keras.preprocessing.image import ImageDataGenerator
    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D
    from keras.layers import Activation, Dropout, Flatten, Dense
    from keras import backend as K

    # dimensions of our images.
    img_width, img_height = 150, 150

    train_data_dir = r'E:\\Interns ! Projects\\Positive Integers\\CatDogKeras\\data\\train'
    validation_data_dir = r'E:\\Interns ! Projects\\Positive Integers\\CatDogKeras\\data\\validation'
    nb_train_samples = 18000
    nb_validation_samples = 7000
    epochs = 20
    batch_size = 144

    if K.image_data_format() == 'channels_first':
        input_shape = (3, img_width, img_height)
    else:
        input_shape = (img_width, img_height, 3)

    model = Sequential()
    model.add(Conv2D(32, (3, 3), input_shape=input_shape))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(64, (3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(64, (3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Flatten())
    model.add(Dense(128))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))

    model.compile(loss='binary_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])

    # this is the augmentation configuration we will use for training
    train_datagen = ImageDataGenerator(
        rescale=1. / 255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

    # this is the augmentation configuration we will use for testing:
    # only rescaling
    test_datagen = ImageDataGenerator(rescale=1. / 255)

    train_generator = train_datagen.flow_from_directory(
        train_data_dir,
        target_size=(img_width, img_height),
        batch_size=batch_size,
        class_mode='binary')

    validation_generator = test_datagen.flow_from_directory(
        validation_data_dir,
        target_size=(img_width, img_height),
        batch_size=batch_size,
        class_mode='binary')

    model.fit_generator(
        train_generator,
        steps_per_epoch=nb_train_samples // batch_size,
        epochs=epochs,
        validation_data=validation_generator,
        validation_steps=nb_validation_samples // batch_size)

    input("Press enter to exit")

    model.save_weights('first_try_v2.h5')

    model.save('DogCat_v2.h5')

  15. Dikshika July 17, 2017 at 6:17 pm #

    Also, is it possible to get the probability of each training sample after the last epoch?

    • Jason Brownlee July 18, 2017 at 8:42 am #

      Yes, make a probability prediction for each sample at the end of each epoch:
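
      A minimal sketch of one way to do this with a callback, assuming model, X_train and y_train are already defined and the output layer is sigmoid or softmax:

      from keras.callbacks import Callback

      class EpochProbabilities(Callback):
          def on_epoch_end(self, epoch, logs=None):
              # probability predictions for every training sample after this epoch
              probs = self.model.predict(X_train)
              print(epoch, probs[:5].ravel())

      model.fit(X_train, y_train, epochs=20, callbacks=[EpochProbabilities()], verbose=0)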

      • Dikshika July 18, 2017 at 2:48 pm #

        Thanks a lot. This blog and your suggestions have been really helpful.

        • Jason Brownlee July 18, 2017 at 5:01 pm #

          You’re welcome, I’m glad to hear that.

          • Dikshika July 18, 2017 at 9:05 pm #

            I am stuck here. I am using binary cross entropy. I want to see the actual probabilities between 0 and 1, but I am getting only hard 0 or 1 predictions, for both test and training samples.

          • Jason Brownlee July 19, 2017 at 8:22 am #

            Use softmax on the output layer and call predict_proba() to get probabilities.

  16. Thong Bui August 17, 2017 at 4:09 am #

    Thanks for the great insights on how dropout works. I have 1 question: what is the difference between adding a dropout layer (like your examples here) and setting the dropout parameter of a layer, for example:

    model.add(SimpleRNN(…, dropout=0.5))
    model.add(LSTM(…, dropout=0.5))

    Thanks again for sharing your knowledge with us.

    Thong Bui

  17. Guillaume August 31, 2017 at 11:29 pm #

    Hello Jason,

    Thanks for your tutorials 🙂

    I have a question considering the implementation of the dropout.

    I am using an LSTM to predict values of a sine wave. Without dropout, the NN is able to capture the frequency and the amplitude of the signal quite correctly.

    However, implementing dropout like this:

    model = Sequential()
    model.add(LSTM(neuron, input_shape=(1,1)))
    model.add(Dropout(0.5))
    model.add(Dense(1))

    does not lead to the same results as with:

    model = Sequential()
    model.add(LSTM(neuron, input_shape=(1,1), dropout=0.5))
    model.add(Dense(1))

    In the first case, the results are also great. But in the second, the amplitude is reduced by 1/4 of its original value.
    Any idea why?

    Thank you !

    • Jason Brownlee September 1, 2017 at 6:48 am #

      I would think that they are the same thing, I guess my intuition is wrong.

      I’m not sure what is going on.

  18. James September 27, 2017 at 3:15 am #

    Hi Jason,

    Thanks for all your posts, they are great!

    My main question is a general one about searching for optimal hyper-parameters; is there a methodology you prefer (i.e. sklearn’s grid/random search methods)? Or do you generally just plug and chug?

    In addition, I found this code online and had a number of questions on best practices that I think everyone here could benefit from:

    '''
    model = Sequential()

    # Input layer with dimension 1 and hidden layer i with 128 neurons.
    model.add(Dense(128, input_dim=1, activation='relu'))
    # Dropout of 20% of the neurons and activation layer.
    model.add(Dropout(.2))
    model.add(Activation("linear"))
    # Hidden layer j with 64 neurons plus activation layer.
    model.add(Dense(64, activation='relu'))
    model.add(Activation("linear"))
    # Hidden layer k with 64 neurons.
    model.add(Dense(64, activation='relu'))
    # Output Layer.
    model.add(Dense(1))

    # Model is derived and compiled using mean square error as loss
    # function, accuracy as metric and gradient descent optimizer.
    model.compile(loss='mse', optimizer='adam', metrics=["accuracy"])

    # Training model with train data. Fixed random seed:
    numpy.random.seed(3)
    model.fit(X_train, y_train, nb_epoch=256, batch_size=2, verbose=2)
    '''

    1) I was under the impression that the input layer should be the number of features (i.e. columns – 1) in the data, but this code defines it as 1.

    2) Defining the activation function twice for each layer seems odd to me; maybe I am misunderstanding the code, but doesn't this just overwrite the previously defined activation function?

    3) For regression problems, shouldn’t the last activation function (before the output layer) be linear?

    Source: http://gonzalopla.com/deep-learning-nonlinear-regression/#comment-290

    Thanks again for all the great posts!
    James

  19. Azim September 28, 2017 at 2:11 pm #

    Hi Jason, Thanks for the nicely articulated blog. I have a question. Is it that dropout is not applied on the output layer, where we have used softmax function? If so, what is the rationale behind this?

    Regards,

    Azim

    • Jason Brownlee September 28, 2017 at 4:46 pm #

      No, we don't use dropout on the output layer, only on the input and hidden layers.

      The rationale is that we do not want to corrupt the output from the model and in turn the calculation of error.

  20. Alex October 2, 2017 at 7:53 am #

    great explanation, Jason!

  21. Animesh October 6, 2017 at 10:36 am #

    How do I plot this code. I have tried various things but get a different error each time.
    What is the correct syntax to plot this code?

    • Jason Brownlee October 6, 2017 at 11:05 am #

      What do you mean plot the code?

      You can run the code by copying it and pasting it into a new file, saving it with a .py extension and running it with the Python interpreter.

      If you are new to Python, I recommend learning some basics of the language first.
