Dropout is a simple and powerful regularization technique for neural networks and deep learning models.
In this post, you will discover the Dropout regularization technique and how to apply it to your models in Python with Keras.
After reading this post, you will know:
- How the Dropout regularization technique works
- How to use Dropout on your input layers
- How to use Dropout on your hidden layers
- How to tune the dropout level on your problem
Kick-start your project with my new book Deep Learning With Python, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
- Jun/2016: First published
- Update Oct/2016: Updated for Keras 1.1.0, TensorFlow 0.10.0 and scikit-learn v0.18
- Update Mar/2017: Updated for Keras 2.0.2, TensorFlow 1.0.1 and Theano 0.9.0
- Update Sep/2019: Updated for Keras 2.2.5 API
- Update Jul/2022: Updated for TensorFlow 2.x API and SciKeras
Dropout Regularization for Neural Networks
Dropout is a regularization technique for neural network models proposed by Srivastava et al. in their 2014 paper “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” (download the PDF).
Dropout is a technique where randomly selected neurons are ignored during training. They are “dropped out” randomly. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and any weight updates are not applied to the neuron on the backward pass.
As a neural network learns, neuron weights settle into their context within the network. Weights of neurons are tuned for specific features, providing some specialization. Neighboring neurons come to rely on this specialization, which, if taken too far, can result in a fragile model too specialized for the training data. This reliance on context for a neuron during training is referred to as complex co-adaptations.
You can imagine that if neurons are randomly dropped out of the network during training, other neurons will have to step in and handle the representation required to make predictions for the missing neurons. This is believed to result in multiple independent internal representations being learned by the network.
The effect is that the network becomes less sensitive to the specific weights of neurons. This, in turn, results in a network capable of better generalization and less likely to overfit the training data.
Need help with Deep Learning in Python?
Take my free 2-week email course and discover MLPs, CNNs and LSTMs (with code).
Click to sign-up now and also get a free PDF Ebook version of the course.
Dropout Regularization in Keras
Dropout is easily implemented by randomly selecting nodes to be dropped out with a given probability (e.g., 20%) in each weight update cycle. This is how Dropout is implemented in Keras. Dropout is only used during the training of a model and is not used when evaluating the skill of the model.
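To make the mechanics concrete, below is a minimal NumPy sketch (not the actual Keras implementation) of what a single dropout step does to a vector of activations. Keras uses the “inverted dropout” scheme: the surviving activations are scaled up by 1/(1 - rate) during training so that nothing needs to change at prediction time.

import numpy as np

rng = np.random.default_rng(42)

def dropout_forward(activations, rate=0.2):
    # keep each unit with probability (1 - rate)
    mask = rng.random(activations.shape) >= rate
    # zero the dropped units and scale the survivors by 1 / (1 - rate)
    # so the expected sum of the activations is unchanged ("inverted dropout")
    return activations * mask / (1.0 - rate)

a = np.array([0.5, 1.2, 0.3, 0.8, 1.0])
print(dropout_forward(a, rate=0.2))  # roughly 1 in 5 values zeroed, the rest scaled up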
Next, let’s explore a few different ways of using Dropout in Keras.
The examples will use the Sonar dataset. This is a binary classification problem that aims to correctly discriminate rocks from mock-mines (metal cylinders) based on sonar chirp returns. It is a good test dataset for neural networks because all the input values are numerical and have the same scale.
The dataset can be downloaded from the UCI Machine Learning repository. You can place the sonar dataset in your current working directory with the file name sonar.csv.
You will evaluate the developed models using scikit-learn with 10-fold cross validation in order to better tease out differences in the results.
There are 60 input values and a single output value. The input values are standardized before being used in the network. The baseline neural network model has two hidden layers, the first with 60 units and the second with 30. Stochastic gradient descent is used to train the model with a relatively low learning rate and momentum.
The full baseline model is listed below:
# Baseline Model on the Sonar Dataset
from pandas import read_csv
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# load dataset
dataframe = read_csv("sonar.csv", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:60].astype(float)
Y = dataset[:,60]
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

# baseline
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(60, input_shape=(60,), activation='relu'))
    model.add(Dense(30, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    sgd = SGD(learning_rate=0.01, momentum=0.8)
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model

estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(model=create_baseline, epochs=300, batch_size=16, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Running the example generates an estimated classification accuracy of 86%.
Baseline: 86.04% (4.58%)
Using Dropout on the Visible Layer
Dropout can be applied to the input neurons, also called the visible layer.
In the example below, a new Dropout layer is added between the input (or visible) layer and the first hidden layer. The dropout rate is set to 20%, meaning one in five inputs will be randomly excluded from each update cycle.
Additionally, as recommended in the original paper on Dropout, a constraint is imposed on the weights for each hidden layer, ensuring that the maximum norm of the weights does not exceed a value of 3. This is done by setting the kernel_constraint argument on the Dense class when constructing the layers.
The learning rate was lifted by one order of magnitude, and the momentum was increased to 0.9. Both increases were recommended in the original Dropout paper.
Continuing from the baseline example above, the code below exercises the same network with input dropout:
# Example of Dropout on the Sonar Dataset: Visible Layer
from pandas import read_csv
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.constraints import MaxNorm
from tensorflow.keras.optimizers import SGD
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# load dataset
dataframe = read_csv("sonar.csv", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:60].astype(float)
Y = dataset[:,60]
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

# dropout in the input layer with weight constraint
def create_model():
    # create model
    model = Sequential()
    model.add(Dropout(0.2, input_shape=(60,)))
    model.add(Dense(60, activation='relu', kernel_constraint=MaxNorm(3)))
    model.add(Dense(30, activation='relu', kernel_constraint=MaxNorm(3)))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    sgd = SGD(learning_rate=0.1, momentum=0.9)
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model

estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(model=create_model, epochs=300, batch_size=16, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Visible: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Running the example provides a slight drop in classification accuracy, at least on a single test run.
Visible: 83.52% (7.68%)
Using Dropout on Hidden Layers
Dropout can be applied to hidden neurons in the body of your network model.
In the example below, Dropout is applied between the two hidden layers and between the last hidden layer and the output layer. Again, a dropout rate of 20% is used, as is a weight constraint on those layers.
# Example of Dropout on the Sonar Dataset: Hidden Layer
from pandas import read_csv
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.constraints import MaxNorm
from tensorflow.keras.optimizers import SGD
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# load dataset
dataframe = read_csv("sonar.csv", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:60].astype(float)
Y = dataset[:,60]
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

# dropout in hidden layers with weight constraint
def create_model():
    # create model
    model = Sequential()
    model.add(Dense(60, input_shape=(60,), activation='relu', kernel_constraint=MaxNorm(3)))
    model.add(Dropout(0.2))
    model.add(Dense(30, activation='relu', kernel_constraint=MaxNorm(3)))
    model.add(Dropout(0.2))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    sgd = SGD(learning_rate=0.1, momentum=0.9)
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model

estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(model=create_model, epochs=300, batch_size=16, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Hidden: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
You can see that for this problem and the chosen network configuration, using Dropout in the hidden layers did not lift performance. In fact, performance was worse than the baseline.
It is possible that additional training epochs are required or that further tuning is required to the learning rate.
Hidden: 83.59% (7.31%)
Dropout in Evaluation Mode
Dropout will randomly set some of its inputs to zero. If you wonder what happens after you have finished training, the answer is: nothing. In Keras, a layer can tell whether the model is running in training mode or not, and the Dropout layer zeroes out inputs only during training. Because Keras uses the “inverted dropout” scheme, the surviving inputs are scaled up by a factor of $1/(1-r)$ during training, where $r$ is the dropout rate, so that the next layer sees inputs of a similar expected scale. At prediction time, the Dropout layer simply passes its input through unchanged.
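You can verify this behavior with a quick check (a small sketch, assuming TensorFlow 2.x): the same Dropout layer zeroes and rescales its input when called with training=True, and acts as the identity otherwise.

import numpy as np
import tensorflow as tf

layer = tf.keras.layers.Dropout(0.5)
x = np.ones((1, 10), dtype="float32")

print(layer(x, training=False).numpy())  # inference: the input passes through unchanged
print(layer(x, training=True).numpy())   # training: about half the values are zeroed, the rest scaled to 2.0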
Tips for Using Dropout
The original paper on Dropout provides experimental results on a suite of standard machine learning problems. From these results, the authors offer a number of useful heuristics to consider when using Dropout in practice.
- Generally, use a small dropout value of 20%-50% of neurons, with 20% providing a good starting point. A probability too low has minimal effect, and a value too high results in under-learning by the network.
- Use a larger network. You are likely to get better performance when Dropout is used on a larger network, giving the model more of an opportunity to learn independent representations.
- Use Dropout on incoming (visible) as well as hidden units. Application of Dropout at each layer of the network has shown good results.
- Use a large learning rate with decay and a large momentum. Increase your learning rate by a factor of 10 to 100 and use a high momentum value of 0.9 or 0.99.
- Constrain the size of network weights. A large learning rate can result in very large network weights. Imposing a constraint on the size of network weights, such as max-norm regularization, with a size of 4 or 5 has been shown to improve results.
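The sketch below pulls several of these heuristics together in one place. The layer sizes and decay schedule are hypothetical; tune them for your own problem.

# dropout on visible and hidden layers, max-norm constraint, large decaying learning rate, high momentum
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import MaxNorm
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.optimizers.schedules import ExponentialDecay

model = Sequential()
model.add(Dropout(0.2, input_shape=(60,)))                              # dropout on the visible layer
model.add(Dense(128, activation='relu', kernel_constraint=MaxNorm(3)))  # larger layer with a max-norm constraint
model.add(Dropout(0.2))                                                 # dropout on the hidden layer
model.add(Dense(1, activation='sigmoid'))

# large initial learning rate with decay, plus high momentum
lr_schedule = ExponentialDecay(initial_learning_rate=0.1, decay_steps=1000, decay_rate=0.96)
sgd = SGD(learning_rate=lr_schedule, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])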
More Resources on Dropout
Below are resources you can use to learn more about Dropout in neural networks and deep learning models.
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting (original paper)
- Improving neural networks by preventing co-adaptation of feature detectors
- How does the dropout method work in deep learning? on Quora
- Keras Training and Evaluation with Built-in Methods from TensorFlow documentation
Summary
In this post, you discovered the Dropout regularization technique for deep learning models. You learned:
- What Dropout is and how it works
- How you can use Dropout on your own deep learning models.
- Tips for getting the best results from Dropout on your own models.
Do you have any questions about Dropout or this post? Ask your questions in the comments, and I will do my best to answer.
Hi,
thanks for the very useful examples!!
Question: the goal of the dropouts is to reduce the risk of overfitting right?
I am wondering whether the accuracy is the best way to measure this; here you are already doing a cross validation, which is by itself a way to reduce overfitting. As you are performing cross-validation *and* dropout, isn’t this somewhat overkill? Maybe the drop in accuracy is actually a drop in amount of information available?
Great question.
Yes dropout is a technique to reduce the overfitting of the network to the training data.
k-fold cross-validation is a robust technique to estimate the skill of a model. It is well suited to determine whether a specific network configuration has over or under fit the problem.
You could also look at diagnostic plots of loss over epoch on the training and validation datasets to determine how overlearning has been affected by different dropout configurations.
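For example, a minimal sketch of such a diagnostic plot (it assumes a compiled Keras model and pre-split train/validation arrays; the names here are hypothetical):

from matplotlib import pyplot

history = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=300, verbose=0)
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='validation')
pyplot.title('Loss on the training and validation sets')
pyplot.legend()
pyplot.show()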
Very good post . Just one question why the need for increasing the learning rate in combination with setting max norm value?
Great question. Perhaps fewer nodes being updated with dropout requires more change/update each batch.
The LSTM performs well on the training dataset but does not do well on the testing dataset, i.e., prediction. Could you give me some advice for this problem?
It sounds like overlearning.
Consider using a regularization technique like dropout discussed in this post above.
Some people mentioned that applying dropout on the LSTM units often leads to bad results. An example is here: https://arxiv.org/abs/1508.03720
I wonder if anyone has any comment on this.
Thanks for the link Yuanliang. I do often see worse results. Although, I often see benefit of dropout on the dense layer before output.
My advice is to experiment on your problem and see.
Hi,
First of all, thanks to you for making machine learning fun to learn.
I have a query related to drop-outs.
Can we use drop-out even in case we have selected the optimizer as adam and not sgd?
In the examples, sgd is being used, and in the tips section it is mentioned to “Use a large learning rate with decay and a large momentum.” As far as I can see, adam does not have momentum. So what should the parameters to adam be if we use dropouts?
keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
Yes, you can use dropout with other optimization algorithms. I would suggest experimenting with the parameters and see how to balance learning and regularization provided by dropout.
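For example, a quick sketch (with hypothetical layer sizes) of combining a Dropout layer with the Adam optimizer:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Dense(60, input_shape=(60,), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
# Adam's adaptive learning rates take the place of hand-tuned momentum; the defaults are a reasonable start
model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])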
In addition to sampling, how does one deal with rare events when building a Deep Learning Model?
For shallow Machine Learning, I can add a utility or a cost function, but here I'm trying to see if a more elegant approach has been developed.
You may also oversample the rare events, perhaps with data augmentation (randomness) to make them look different.
There’s some discussion of handling imbalanced datasets here that might inspire:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Hi Jason,
Why will Dropout work in these cases? It can lower the computation, but how will it impact accuracy? I have been experimenting with dropout on an ANN and now on an RNN (LSTM). I am using dropout in the LSTM only on the input and output, not between the recurrent layers, but the accuracy remains the same for both the validation and training datasets…
Any comments ?
Hi Junaid,
I have not used dropout on RNNs myself.
Perhaps you need more drop-out and less training to impact the skill or generalization capability of your network.
Is the dropout layer stored in the model when it is saved?
If so, why? It doesn't make sense to have a dropout layer in a model except during training.
It is a function with no weights. There is nothing to store, other than the fact that it exists at a specific point in the network topology.
Is it recommended in real practice to fix a random seed for training?
I see two options, to fix the random seed and make the code reproducible or to run the model n times (30?) and take the mean performance from all runs.
The latter is more robust.
For more on the stochastic nature of machine learning algorithms like neural nets see this post:
https://machinelearningmastery.com/randomness-in-machine-learning/
I am a beginner in Machine Learning and trying to learn neural networks from your blog.
In this post, I understand all the concepts of dropout, but the use of:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
is making code difficult for me to understand. Would you suggest any way so that I can code without using these?
Yes, you can use keras directly. I offer many tutorials on the topic, try the search feature at the top of the page.
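For example, a rough sketch of training the same baseline model directly with Keras, using a simple shuffled train/test split and manual standardization instead of the scikit-learn pipeline and cross validation (it reuses X, encoded_Y, and create_baseline from the code above):

import numpy as np

# shuffle, then use a simple train/test split instead of the pipeline + cross validation
idx = np.random.permutation(len(X))
X_shuffled, y_shuffled = X[idx], encoded_Y[idx]
n_train = int(0.8 * len(X_shuffled))
X_train, X_test = X_shuffled[:n_train], X_shuffled[n_train:]
y_train, y_test = y_shuffled[:n_train], y_shuffled[n_train:]

# standardize using statistics from the training data only
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train, X_test = (X_train - mean) / std, (X_test - mean) / std

model = create_baseline()
model.fit(X_train, y_train, epochs=300, batch_size=16, verbose=0)
_, acc = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (acc * 100))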
thanks Jason, this is so useful !!
You’re welcome!
Hi Jason,
Thanks for the awesome materials you provide!
I have a question. I saw that when using dropout for the hidden layers, you applied it to all of them.
My question is, if dropout is applied to the hidden layers, should it be applied to all of them? Or better yet, how do we choose where to apply dropout?
Thanks ! 🙂
Great question. I would recommend testing every variation you can think of for your network and see what works best on your specific problem.
My cat dog classifier with Keras is over-fitting for Dog. How do I make it unbiased?
Consider augmentation on images in the cat class in order to fit a more robust model.
I have already augmented the train data set. But it's not helping. Here is my code snippet.
It classifies approximately 246 out of 254 dogs and 83 out of 246 cats correctly.
Sorry, I don’t have good advice off the cuff.
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Activation, Dropout, Flatten, Dense
from keras import backend as K

# dimensions of our images.
img_width, img_height = 150, 150

train_data_dir = r'E:\\Interns ! Projects\\Positive Integers\\CatDogKeras\\data\\train'
validation_data_dir = r'E:\\Interns ! Projects\\Positive Integers\\CatDogKeras\\data\\validation'
nb_train_samples = 18000
nb_validation_samples = 7000
epochs = 20
batch_size = 144

if K.image_data_format() == 'channels_first':
    input_shape = (3, img_width, img_height)
else:
    input_shape = (img_width, img_height, 3)

model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# this is the augmentation configuration we will use for training
train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)
# this is the augmentation configuration we will use for testing:
# only rescaling
test_datagen = ImageDataGenerator(rescale=1. / 255)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode='binary')
validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode='binary')

model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_samples // batch_size)

input("Press enter to exit")
model.save_weights('first_try_v2.h5')
model.save('DogCat_v2.h5')
Also, is it possible to get the probability of each training sample after the last epoch?
Yes, make a probability prediction for each sample at the end of each epoch:
Thanks a lot. This blog and your suggestions have been really helpful.
You’re welcome, I’m glad to hear that.
I am stuck here. I am using binary cross entropy. I want to see the actual probabilities between 0 and 1. But I am getting only extreme probabilities, i.e., 0 or 1, for both test and training samples.
Use softmax on the output layer and call predict_proba() to get probabilities.
Thanks for the great insights on how dropout works. I have 1 question: what is the difference between adding a dropout layer (like your examples here) and setting the dropout parameter of a layer, for example:
model.add(SimpleRNN(…, dropout=0.5))
model.add(LSTM(…, dropout=0.5))
Thanks again for sharing your knowledge with us.
Thong Bui
Nothing really, they are equivalent.
Hello Jason,
Thanks for your tutorials 🙂
I have a question considering the implementation of the dropout.
I am using an LSTM to predict values of a sine wave. Without dropout, the NN is able to capture the frequency and the amplitude of the signal quite correctly.
However, implementing dropout like this:
model = Sequential()
model.add(LSTM(neuron, input_shape=(1,1)))
model.add(Dropout(0.5))
model.add(Dense(1))
does not lead to the same results as with:
model = Sequential()
model.add(LSTM(neuron, input_shape=(1,1), dropout=0.5))
model.add(Dense(1))
In the first case, the results are also great. But in the second, the amplitude is reduced by 1/4 of its original value.
Any idea why ?
Thank you !
I would think that they are the same thing, I guess my intuition is wrong.
I’m not sure what is going on.
Hi Jason,
Thanks for all your posts, they are great!
My main question is a general one about searching for optimal hyper-parameters; is there a methodology you prefer (i.e. sklearn’s grid/random search methods)? Or do you generally just plug and chug?
In addition, I found this code online and had a number of questions on best practices that I think everyone here could benefit from:
'''
model = Sequential()
# Input layer with dimension 1 and hidden layer i with 128 neurons.
model.add(Dense(128, input_dim=1, activation='relu'))
# Dropout of 20% of the neurons and activation layer.
model.add(Dropout(.2))
model.add(Activation("linear"))
# Hidden layer j with 64 neurons plus activation layer.
model.add(Dense(64, activation='relu'))
model.add(Activation("linear"))
# Hidden layer k with 64 neurons.
model.add(Dense(64, activation='relu'))
# Output Layer.
model.add(Dense(1))
# Model is derived and compiled using mean square error as loss
# function, accuracy as metric and gradient descent optimizer.
model.compile(loss='mse', optimizer='adam', metrics=["accuracy"])
# Training model with train data. Fixed random seed:
numpy.random.seed(3)
model.fit(X_train, y_train, nb_epoch=256, batch_size=2, verbose=2)
'''
1) I was under the impression that the input layer should be the number of features (i.e. columns – 1) in the data, but this code defines it as 1.
2) defining the activation function twice for each layer seems odd to me; maybe I am misunderstanding the code, but doesn't this just overwrite the previously defined activation function?
3) For regression problems, shouldn’t the last activation function (before the output layer) be linear?
Source: http://gonzalopla.com/deep-learning-nonlinear-regression/#comment-290
Thanks again for all the great posts!
James
I use grid searching myself. I try to be as systematic as possible, see here:
https://machinelearningmastery.com/plan-run-machine-learning-experiments-systematically/
Yes, the code you pasted has many problems.
Hi Jason, Thanks for the nicely articulated blog. I have a question. Is it that dropout is not applied on the output layer, where we have used softmax function? If so, what is the rationale behind this?
Regards,
Azim
No, we don't use dropout on the output layer, only on the input and hidden layers.
The rationale is that we do not want to corrupt the output from the model and in turn the calculation of error.
great explanation, Jason!
Thanks Alex.
How do I plot this code? I have tried various things but get a different error each time.
What is the correct syntax to plot this code?
What do you mean plot the code?
You can run the code by copying it and pasting it into a new file, saving it with a .py extension and running it with the Python interpreter.
If you are new to Python, I recommend learning some basics of the language first.
hi
i want to know that emerging any two kernels in convolutional layer is dropout technique?
Sorry, I don’t follow, can you please restate your question?
I ran your code on my laptop, but every time the result changes. The difference is about 15%.
This is a common question that I answer here:
https://machinelearningmastery.com/faq/single-faq/why-do-i-get-different-results-each-time-i-run-the-code
Hi Jason,
I’ve read a lot of your articles and they are generally pretty good.
I question your statements here:
“You can imagine that if neurons are randomly dropped out of the network during training, that other neurons will have to step in and handle the representation required to make predictions for the missing neurons. This is believed to result in multiple independent internal representations being learned by the network.
The effect is that the network becomes less sensitive to the specific weights of neurons.”
I don't think it is correct. The goal is not to create MORE representations but a smaller number of robust representations.
(I’ve never really seen a specific plausible explanation of co-adaptation. It’s all hand-waving.)
Small note: The paper you cite as the “original” paper on dropout is not; it is their 2nd paper. The original one is the one with “co-adaptation” in the title.
Craig Will
Thanks for the note Craig.
Jason, thanks for the example. Apparently it has been working for everyone. However, I get the following error when I run your code
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, epochs=300, batch_size=16, verbose=0)))
NameError: name ‘create_baseline’ is not defined
I was hoping you could help me figure this out, as I haven’t been able to find anything online nor solve it myself
Ensure you have copied all of the code from the tutorial and that the indenting matches.
This might help:
https://machinelearningmastery.com/faq/single-faq/how-do-i-copy-code-from-a-tutorial
Thanks for the reply! I’m going to try to use it with my own data now
Glad to hear that.
Simple and clearly explained …..Thanks for such articles
I’m glad it helped.
Hi Jason,
Thanks for your articles. I am learning a lot from them.
Btw, I ran your code on the same dataset and I got 81.66% (6.89%) accuracy without the dropout layer and a whopping increase to 87.54% (7.07%) accuracy with just dropout at the input layer. What I am not able to understand is why the accuracy increased for the same dataset and same model for me and not for you. Is it overfitting in my case? And how do I test for it?
Thank you in advance.
Nice work.
Dropout reduces overfitting.
When using dropout, are results reproducible? Can the randomness of which nodes get dropped each time be seeded to produce the same results?
The idea is to evaluate a model many times and calculate the average performance, you can learn more here:
https://machinelearningmastery.com/evaluate-skill-deep-learning-models/
Hi Jason, very good topic!
About the dropout on the visible layer, in your example does it mean that in one batch of 60 images, 12 of them (20%) will be set to zero?
Thank you!
It will be input-variable-wise, e.g. per-pixel, not the whole sample.
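A quick check of this (a small sketch, assuming TensorFlow 2.x) shows that the mask is drawn per input value rather than per sample:

import numpy as np
import tensorflow as tf

x = np.ones((2, 8), dtype="float32")  # a batch of 2 samples with 8 input values each
out = tf.keras.layers.Dropout(0.2)(x, training=True)
print(out.numpy())  # zeros appear in individual positions, not in whole rows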
Hi Jason,
I have observed on my dataset that when I use dropout to reduce the overfitting of the deep learning model, it does reduce the overfitting, but it also decreases the accuracy of my model. So how can I increase the accuracy of the model while reducing overfitting?
Perhaps try different dropout levels?
Hi Jason,
If I get 96% training accuracy and 86% testing accuracy, I want to know whether it is overfitting or not.
Look at learning curves on train/validation sets
Can someone please explain the kernel_constraint to me in layman's terms?
I get that it works as some kind of regularization.
According to the Keras docs, maxnorm(m) will, if the L2 norm of your weights exceeds m, scale your whole weight matrix by a factor that reduces the norm to m.
So, how does it differ from regular normalisation?
From http://cs231n.github.io/neural-networks-2/#reg:
Max norm constraints. Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint.
How does this bound / scaling on the weight matrix work?
Good question.
There is weight decay, which pushes all weights in a node to be small, e.g. using the L1 or L2 vector norm (magnitude). Keras calls this kernel regularization, I think.
Then there is a weight constraint, which imposes a hard rule on the weights. A common example is max norm, which forces the vector norm of the weights to stay below a value, like 1, 2, or 3. Once exceeded, all weights in the node are scaled down enough to meet the constraint.
It's a subtle difference. Decay is a penalty in the loss function (soft); a constraint is an if-then statement in the optimization procedure (hard).
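For example, a minimal sketch of the two approaches on a single Dense layer (the layer size is hypothetical):

from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2
from tensorflow.keras.constraints import MaxNorm

# weight decay: a soft penalty added to the loss when the weights grow large
decayed = Dense(64, activation='relu', kernel_regularizer=l2(0.01))

# weight constraint: a hard rule that rescales the weights whenever their norm exceeds 3
constrained = Dense(64, activation='relu', kernel_constraint=MaxNorm(3))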
Does that help?
Thank you for posting, this is very helpful
I’m happy that it helped.
Hi Jason
Thanks for these awesome tutorials, they are just unbelievable!!
Quick question: when importing the libraries that you need, why do you import in the form
“from [moduleX.submoduleY] import [functionalityZ]”
Will “import [moduleX]” not just import the entire library, or are the submodules not accessible in this way? Or are we just doing this to save memory?
As an engineer, I was taught to only import what is required.
Hi Jason,
you mentioned that dropout is not used while predicting, but that's not quite so; dropout is accounted for while predicting as well.
We simply multiply each output of the activation function by the probability rate that we used in every layer.
Dropout is only used during training.
Hi Jason,
I applied the above technique and other techniques mentioned in your book to reduce the overfitting, which worked fine. However, the model performed very poorly on a dataset that it hadn't seen before.
Do you think, based on your experience, that not applying any overfitting-reducing methods might be the optimal way to train on a dataset? I know this might not be possible to generalize, but do you also think this has something to do with the dataset we are dealing with? In my case, I am working with a healthcare dataset.
Thank you so much for these wonderful tutorials and the books.
Sinan
Some ideas:
Perhaps the out of sample dataset is not representative?
Perhaps other regularization methods are needed?
Perhaps the training process needs tuning too?
Let me know how you go.
Hello Dear Jason,
Thanks a lot for your great post.
Just I have a question regarding over-fitting.
Could you please let me know how can I know my CNN over fitted?
To the best of my knowledge, when there is a considerable difference between training accuracy and validation accuracy, over-fitting has happened.
I am asking this because I designed a CNN and I set the dropout to 0.5.
My training accuracy was around 99% and my maximum validation accuracy was 89%, and when I reduced the dropout to 0.3, my training accuracy stayed the same but the validation accuracy surprisingly increased to 95%.
I don't know if I can trust this accuracy or not.
Do you think over-fitting happened to my CNN or not?
You can review the learning curves during training on the training dataset and validation dataset.
If validation continues to get worse and training continues to get better, you are overfit.
Hi Jason,
Can we use drop out before the first convolution layer? In the examples you have shown here, drop out is done prior to dense layer. But I found a piece of code where they have used drop out before the 1st convolution layer.
https://github.com/tzirakis/Multimodal-Emotion-Recognition/blob/master/models.py
If I did not interpret this wrong, in this code, inside the ‘audio_model’, ‘net’ is having first layer as drop out followed by a conv2D layer.
Perhaps try it and compare results?
Hi Jason, does setting dropout to zero have any impact on our neural network or not? I mean, does just adding the layer with the value set to zero make any impact?
A dropout rate of 0.0 will have no effect in Keras.
Hi, I have trained a sequential model with layers LSTM, Dropout, LSTM, Dropout and a Dense layer. While tuning the hyperparameters, I got the drop_out value to be 1. What does it mean? I couldn't make sense of it.
In Keras, a dropout of 1 is full/always dropout, which might not work:
https://machinelearningmastery.com/how-to-reduce-overfitting-with-dropout-regularization-in-keras/
Hi Jason!
I come back to your wise, wide and deep machine learning knowledge!
My questions are related to understanding the limit values of the dropout rate, so:
1) when the dropout rate is set = 0., is it equivalent to not adding the dropout layer? Is that correct?
2) when the dropout rate is set = 1., is it equivalent to breaking the whole network so the model is no longer able to learn? Is that correct?
Thank you for your job and time!
JG
In the paper dropout is a “retain” percentage, e.g. 80% means dropout 20%.
In Keras, dropout means dropout, and 80% retain is implemented as 20% dropout (0.2).
Therefore, as you say, in keras 0.0 means no dropout, 1.0 means drop everything.
Thks Jason ! it is more clear now!
I’m happy to hear that.
HI Jason,
Where can I get the data you used?
Or could you show an example of the type of data, as I'm getting some errors and I don't know why. Especially in the way I provide the y to the model, it's complaining about dimensions.
All datasets are available here:
https://github.com/jbrownlee/Datasets
Hi, I have used ModelCheckpoint with the same epochs and batch size. The best configuration found is at the 86th epoch.
I have plotted the loss function and accuracy, and I see that there is an important gap between the train and the test curve. The loss curve corresponding to the test data increases.
What does it mean?
Thanks!!
This will help:
https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
Hi Jason,
Thank you for your quick help! According to the loss function, it seems that the problem could be an unrepresentative data set.
I have reduced the train-test Split from 30% to 20%. Accuracy has increased significantly (90%) but the gap between the training loss curve and the validation loss curve remains.
Perhaps try 50/50 split?
Hi Jason,
first of all, thank you for your patience. The results remain the same: the gap between the train and test loss functions remains, and there is a small drop in accuracy (now it is 86%).
If the training set is smaller (drop from 80% to 50%), how can it be more representative?
Perhaps the model is slightly overfitting the training dataset?
Hi Jason,
This is the model that I’m using:
# create model
model = Sequential()
model.add(Dense(60, input_dim=59, kernel_initializer='uniform', kernel_constraint=max_norm(3.), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(30, kernel_initializer='uniform', kernel_constraint=max_norm(1.), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, kernel_initializer='uniform', kernel_constraint=max_norm(1.), activation='sigmoid'))
# Compile model
epochs = 150
learning_rate = 0.1
decay_rate = learning_rate / epochs
momentum = 0.9
sgd = SGD(lr=learning_rate, momentum=momentum, decay=decay_rate, nesterov=False)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
filepath = "weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
# Fit the model
history = model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=300, batch_size=16, callbacks=callbacks_list, verbose=0)
Well done!
Hi Jason,
Can you please let us know, once we have trained a neural network with dropout, how that is taken care of when we use the same model weights for prediction (serving the model)?
Dropout is only used during training, as far as I recall.
Hi Jason, thanks for the reply, but let's say we have 6 neurons with dropout and a probability of 0.5, so the weighting of the signals going down to the next layer will be at reduced strength; but the prediction network will have all 6 neurons passing on the signal, which can cause an additional increase in input magnitude, which in turn will have an impact on subsequent layers. So will it not get adjusted accordingly?
No, training under dropout causes the nodes to share the load and balance out.
Recall that dropout is probabilistic.
You can multiply each weight of your dropped-out layer by the retention probability (1 - dropout rate) while testing.
How do I implement dropout on the test set during prediction?
Basically, I want to model the uncertainty in the model prediction on the test set, so any advice on how to best do that too.
Dropout is only used during training.
I believe you can force dropout during testing. I might have an example on the blog, I don’t recall sorry.
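One common way to do this (a sketch, not something from the post above) is “Monte Carlo dropout”: call the model with training=True so the Dropout layers stay active at prediction time, repeat the prediction many times, and use the spread of the predictions as a rough uncertainty estimate (assumes TensorFlow 2.x).

import numpy as np

def mc_dropout_predict(model, X, n_samples=50):
    # each call draws a different dropout mask, so repeated predictions vary
    preds = np.stack([model(X, training=True).numpy() for _ in range(n_samples)])
    return preds.mean(axis=0), preds.std(axis=0)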
Hi Jason, how do I use dropout with 3 different datasets?
Sorry, I don’t understand your question, can you please elaborate?
Hi Jason, can we build a customized dropout layer? If we build a customized dropout layer, can we add or change functions in it so that we can compare it with the regular dropout layer? Is that practically possible? Or is it better to use a regular dropout layer? Can you give me a good explanation? It would help a lot.
I don’t see why not.
Sorry, I don't have examples of creating custom dropout layers. I recommend checking the API and experimenting/prototyping.
Here's a customized dropout, very simple to implement. See the Stanford CS231n lectures online and GitHub for more specifics.
It’s for torch, not keras.
Why are you sharing it?
hi jason,
I'm trying to test some of your code with my data. I tried to use dropout in my code, but unfortunately my validation loss is lower than my training loss, even though the MSE for train and test seems about the same, 0.007 and 0.008 (without dropout).
Also, the MSE with dropout is higher than without dropout, as shown in the figures below.
https://ibb.co/ZKxB7fL
https://ibb.co/N3vpDCZ
I tried to find the reasons, as mentioned here:
https://www.pyimagesearch.com/2019/10/14/why-is-my-validation-loss-lower-than-my-training-loss/
reason1: Regularization applied during training, but not during validation/testing (normally and by default the dropout is not used in validation/testing)
reason2: Training loss is measured during each epoch while validation loss is measured after each epoch
reason3: The validation set may be easier than the training set (or there may be leaks)
I think in my case reason 2 is the most logical.
X_train=174200 samples
X_test=85800
y_train=174200
y_test=85800
the code:
X = dataset[:,0:20].astype(float)
y = dataset[:,20:22]
scaler = StandardScaler()
X = scaler.fit_transform(X)
y = scaler.fit_transform(y)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# define the keras model
model = Sequential()
model.add(Dense(20, input_dim=20, kernel_initializer='normal'))
model.add(LeakyReLU(alpha=0.1))
#model.add(Dropout(0.2))
model.add(Dense(7, kernel_initializer='normal'))
model.add(LeakyReLU(alpha=0.1))
#model.add(Dropout(0.2))
model.add(Dense(2, activation='linear'))
opt = SGD(lr=0.01, momentum=0.9)
# compile the keras model
model.compile(loss='mean_squared_error', optimizer=opt, metrics=['mse'])
# fit the keras model on the dataset
history = model.fit(X, y, validation_data=(X_test, y_test), epochs=25, verbose=0)
# evaluate the model
_, train_mse = model.evaluate(X_train, y_train, verbose=0)
_, test_mse = model.evaluate(X_test, y_test, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))
# plot loss during training
pyplot.title('Loss / Mean Squared Error')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
This is common when your validation dataset is too small or not representative of the training dataset.
thank you for your answer.
I increased the test_size to 50%, but it's the same thing. The printed train_mse and test_mse are identical (seems good). Why is the plotted test loss lower than the train loss?
what do you advise me?
thanks
Perhaps confirm you are plotting what you believe you are plotting (debug).
Perhaps try an alternate sample of your dataset.
Perhaps try an alternate model.
Thank you, Jason, for your answer.
I tried to change the dataset.
I tried to increase the number of nodes in the hidden layer to 25.
I also reduced the model; I kept just one hidden layer:
model = Sequential()
model.add(Dense(20, input_dim=20, kernel_constraint=max_norm(3.), kernel_initializer='normal'))
model.add(LeakyReLU(alpha=0.1))
model.add(Dropout(0.2))
model.add(Dense(2, activation='linear'))
opt = Adadelta(lr=0.01)
# compile the keras model
model.compile(loss='mean_squared_error', optimizer=opt, metrics=['mse'])
….
The obtained result: Train: 0.002, Test: 0.003.
But the graph of the loss function is unchanged; the validation loss is always lower than my training loss.
the code to plot the loss function is:
pyplot.title('Loss / Mean Squared Error')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
Nice work. Perhaps continue to try additional changes listed here:
https://machinelearningmastery.com/start-here/#better
Thank you, Jason.
Your codes are very helpful.
I found the error.
It's here in my code:
history=model.fit(X, y, validation_data=(X_test, y_test), epochs=25, verbose=0)
X, y should be X_train, y_train.
Nice work!
Hi Jason,
Thanks for the clear explanation. I do have a doubt though. In one of the models that I have created, I’m getting pretty good (~99%) validation accuracy with a minimalistic baseline CNN (just 4 layers of conv+maxpool). However, when I increase it even by 1 layer, the validation does an early-stopping as it tends to plateau. Does this mean the network is going deeper and studying things that aren’t positively contributing to the model? Since it isn’t a case of overfitting (validation and training goes pretty much hand in hand), I’m not inclined to use dropout as well. Does it make sense to augment the data in such cases to see if the accuracy increases? With an accuracy of 99.xxx, I’m not sure if there is a real need to do so. Would like to hear your thoughts on this.
Regards,
James
You’re welcome.
Changing the capacity of the model will often require a corresponding adjustment of the learning hyperparameters (learning rate, batch size, etc.)
Try dropout and see. Try augmentation and see. Experiments are cheap.
An accuracy of 99% on a hold out dataset might suggest your prediction task is trivial and might be solved using simpler methods.
Thanks Jason. Are there simpler methods when it comes to running classification for images? I typically opt for CNNs when it comes to images. I could try doing the plain regression or Random Forests to see how it fares though.
Yes, try a simple ML algorithm fit on standardized pixels.
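For example, a rough sketch of such a baseline (hypothetical names; it assumes the images are loaded into an array X and the labels into y): flatten each image into a row of pixel values, standardize, and cross validate a simple linear model.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_flat = X.reshape(len(X), -1)  # flatten each image into a single row of pixel values
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(baseline, X_flat, y, cv=5)
print("Accuracy: %.2f%%" % (scores.mean() * 100))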
Thanks for the prompt response, Jason. Will definitely try that out.
Hi Jason,
Is there a way to implement dropout in a pretrained network? Thanks!
Sure, you can add layers to the network, but I think you will have to re-connect all layers in the network (a pain).
Hi All,
Does anyone know of specific examples where use of Dropout actually improved test scores of a neural network model? I would imagine authors of the original publication may have provided example(s) [I admit to not having read it], but does anyone know of cases outside of the original publication when it actually improved test scores?
Thanks,
Steve
Hi Steven…The following resource and the results within may be of interest to you:
https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
How do I build an LSTM and GRU model without overfitting for a dataset of 7000 samples in total?
Hi Yihun…Building an LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) model without overfitting when you have a relatively small dataset, such as 7000 samples, involves careful model design and regularization strategies. Here are some steps and tips on how to approach this:
### 1. Simplify the Model Architecture
– **Reduce Complexity**: Start with a simple model. For a small dataset, a complex model with too many parameters can easily overfit. Use one or two recurrent layers initially, with a moderate number of units (e.g., 50-100 units per layer).
– **Experiment with Both LSTM and GRU**: GRUs are generally simpler (fewer parameters) than LSTMs and might be less prone to overfitting. Compare their performance for your specific dataset.
### 2. Implement Regularization Techniques
– **Dropout**: Apply dropout to the inputs and recurrent connections of the LSTM or GRU layers. This randomly drops (sets to zero) a fraction of the output units during training, helping to prevent overfitting.
– For example, use Dropout layers between recurrent layers, and potentially a SpatialDropout1D if your input data is sequential (like a time series).
– **L2 Regularization**: Add L2 regularization to the recurrent layers, which adds a penalty on layer parameters based on their squared magnitude.
### 3. Optimize the Training Process
– **Early Stopping**: Use early stopping during training. This technique halts training when the performance on a validation set starts to deteriorate, preventing overtraining on the training data.
– **Reduce Learning Rate on Plateau**: Lower the learning rate if the validation loss stops improving. This can lead to more refined training steps that might bypass overfitting pitfalls.
### 4. Data Management
– **Cross-Validation**: Instead of a simple train-test split, use k-fold cross-validation to maximize your model’s exposure to different data subsets. This improves the generalization ability of your model.
– **Data Augmentation**: For sequential data, you can implement techniques like windowing, noise injection, or time shifting if appropriate for your problem domain.
– **Feature Selection**: Limit the number of features to only the most informative ones, reducing model complexity and the likelihood of overfitting.
### 5. Adjust Model Inputs
– **Normalization/Standardization**: Scale your input features so that they contribute equally to the learning process, which can help the model learn more effectively and potentially reduce overfitting.
### Example Code
Here’s a basic example of building a GRU model with Keras for a sequence classification problem, implementing some of these strategies:
from keras.models import Sequential
from keras.layers import GRU, Dropout, Dense
from keras.callbacks import EarlyStopping, ReduceLROnPlateau
from keras.regularizers import l2
model = Sequential()
model.add(GRU(50, return_sequences=True, input_shape=(timesteps, features), kernel_regularizer=l2(0.001)))
model.add(Dropout(0.2))
model.add(GRU(50, kernel_regularizer=l2(0.001)))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid')) # Adjust the activation and number of neurons based on your output needs
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
early_stopping = EarlyStopping(monitor='val_loss', patience=10)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5)
model.fit(x_train, y_train, epochs=100, batch_size=32, validation_split=0.2, callbacks=[early_stopping, reduce_lr])
### Final Thoughts
The key to preventing overfitting, especially with smaller datasets, is to start simple and gradually increase complexity only if needed and justified by improved validation performance. Monitoring, testing, and iterating based on performance metrics are critical steps in fine-tuning your model.