Using Dropout Regularization in PyTorch Models

By Adrian Tam on April 8, 2023 in Deep Learning with PyTorch

Dropout is a simple and powerful regularization technique for neural networks and deep learning models.

In this post, you will discover the Dropout regularization technique and how to apply it to your models in PyTorch.

After reading this post, you will know:

How the Dropout regularization technique works
How to use Dropout on your input layers
How to use Dropout on your hidden layers
How to tune the dropout level on your problem

Let’s get started.

Overview

This post is divided into six parts:

Dropout Regularization for Neural Networks
Dropout Regularization in PyTorch
Using Dropout on the Input Layer
Using Dropout on the Hidden Layers
Dropout in Evaluation Mode
Tips for Using Dropout

Dropout Regularization for Neural Networks

Dropout is a regularization technique for neural network models proposed around 2012 to 2014. It is implemented as a layer in the neural network. During training, it takes the output from the previous layer, randomly selects some of the neurons, and zeroes them out before passing the result to the next layer, effectively ignoring them. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to those neurons on the backward pass.

When the model is used for inference, the dropout layer does not drop anything. In the original formulation, it only scales all the neurons by a constant factor to compensate for the effect of dropping out during training.
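To make the mechanism concrete, here is a minimal sketch of dropout in its original formulation: a random mask during training and a constant rescale at inference. The function name classic_dropout and the use of the keep probability $1-p$ as the inference scale are assumptions made for illustration; this is not how PyTorch implements its own layer.

```python
import torch

def classic_dropout(x, p=0.2, training=True):
    """Hypothetical helper: dropout with masking in training and rescaling at inference."""
    if training:
        # Keep each element with probability 1 - p, zero it otherwise
        mask = (torch.rand_like(x) >= p).float()
        return x * mask
    # At inference nothing is dropped; scale by the keep probability instead
    return x * (1 - p)

torch.manual_seed(0)
x = torch.ones(2, 5)
train_out = classic_dropout(x, p=0.2, training=True)
infer_out = classic_dropout(x, p=0.2, training=False)
print(train_out)   # each element is either 0 or 1
print(infer_out)   # every element is scaled by 0.8
```

With an input of all ones, the training output shows which elements were dropped, while the inference output keeps every element but shrinks it so that its expected magnitude matches what the next layer saw during training.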

Dropout is destructive but surprisingly can improve the model’s accuracy. As a neural network learns, neuron weights settle into their context within the network. Weights of neurons are tuned for specific features, providing some specialization. Neighboring neurons come to rely on this specialization, which, if taken too far, can result in a fragile model too specialized for the training data. This reliance on context for a neuron during training is referred to as complex co-adaptations.

You can imagine that if neurons are randomly dropped out of the network during training, other neurons will have to step in and handle the representation required to make predictions for the missing neurons. This is believed to result in multiple independent internal representations being learned by the network.

The effect is that the network becomes less sensitive to the specific weights of neurons. This, in turn, results in a network capable of better generalization and less likely to overfit the training data.

Dropout Regularization in PyTorch

You do not need to randomly select elements from a PyTorch tensor to implement dropout manually. You can introduce the nn.Dropout() layer from PyTorch into your model. During training, it randomly selects elements to be dropped out with a given probability $p$ (e.g., 20%). In PyTorch, the dropout layer further scales the resulting tensor by a factor of $\dfrac{1}{1-p}$ so that the average tensor value is maintained. Thanks to this scaling, the dropout layer at inference is an identity function (i.e., it has no effect and simply copies the input tensor to the output tensor). You should make sure to switch the model to inference mode when evaluating the model.
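The scaling behavior can be checked on a standalone nn.Dropout layer. The snippet below is a small sketch with an arbitrary all-ones input of 1000 elements and a fixed seed, both chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
p = 0.2
dropout = nn.Dropout(p)
x = torch.ones(1000)

# Training mode: some elements are zeroed, the rest are scaled by 1/(1-p)
dropout.train()
train_out = dropout(x)
kept = train_out[train_out != 0]
print(kept.unique())                     # the surviving value, close to 1/(1-p)
print((train_out == 0).float().mean())   # fraction dropped, roughly p
print(train_out.mean())                  # stays close to the input mean of 1

# Evaluation mode: the layer passes the input through unchanged
dropout.eval()
eval_out = dropout(x)
print(torch.equal(eval_out, x))          # True
```

Because the survivors are scaled up during training, the mean of the output stays near the mean of the input, which is why no further correction is needed at inference.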

Let’s see how to use nn.Dropout() in a PyTorch model.

The examples will use the Sonar dataset. This is a binary classification problem that aims to correctly identify rocks and mock-mines from sonar chirp returns. It is a good test dataset for neural networks because all the input values are numerical and have the same scale.

The dataset can be downloaded from the UCI Machine Learning repository. You can place the sonar dataset in your current working directory with the file name sonar.csv.

You will evaluate the developed models using scikit-learn with 10-fold cross validation in order to tease out differences in the results better.

There are 60 input values and a single output value. The code below passes the input values to the network as they are read from the file, without a separate standardization step. The baseline neural network model has two hidden layers, the first with 60 units and the second with 30. Stochastic gradient descent is used to train the model with a relatively low learning rate and momentum.

The full baseline model is listed below:

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold

# Read data
data = pd.read_csv("sonar.csv", header=None)
X = data.iloc[:, 0:60]
y = data.iloc[:, 60]

# Label encode the target from string to integer
encoder = LabelEncoder()
encoder.fit(y)
y = encoder.transform(y)

# Convert to 2D PyTorch tensors
X = torch.tensor(X.values, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

# Define PyTorch model
class SonarModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(60, 60)
        self.act1 = nn.ReLU()
        self.layer2 = nn.Linear(60, 30)
        self.act2 = nn.ReLU()
        self.output = nn.Linear(30, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.act1(self.layer1(x))
        x = self.act2(self.layer2(x))
        x = self.sigmoid(self.output(x))
        return x

# Helper function to train the model and return the validation result
def model_train(model, X_train, y_train, X_val, y_val,
                n_epochs=300, batch_size=16):
    loss_fn = nn.BCELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.8)
    batch_start = torch.arange(0, len(X_train), batch_size)

    model.train()
    for epoch in range(n_epochs):
        for start in batch_start:
            X_batch = X_train[start:start+batch_size]
            y_batch = y_train[start:start+batch_size]
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # evaluate accuracy after training
    model.eval()
    y_pred = model(X_val)
    acc = (y_pred.round() == y_val).float().mean()
    acc = float(acc)
    return acc

# run 10-fold cross validation
kfold = StratifiedKFold(n_splits=10, shuffle=True)
accuracies = []
for train, test in kfold.split(X, y):
    # create model, train, and get accuracy
    model = SonarModel()
    acc = model_train(model, X[train], y[train], X[test], y[test])
    print("Accuracy: %.2f" % acc)
    accuracies.append(acc)

# evaluate the model
mean = np.mean(accuracies)
std = np.std(accuracies)
print("Baseline: %.2f%% (+/- %.2f%%)" % (mean*100, std*100))

Running the example generates an estimated classification accuracy of 82%.

Accuracy: 0.81
Accuracy: 0.81
Accuracy: 0.76
Accuracy: 0.86
Accuracy: 0.81
Accuracy: 0.90
Accuracy: 0.86
Accuracy: 0.95
Accuracy: 0.65
Accuracy: 0.80
Baseline: 82.12% (+/- 7.78%)

Using Dropout on the Input Layer

Dropout can be applied to the input neurons, also called the visible layer.

In the example below, a new Dropout layer is added between the input and the first hidden layer. The dropout rate is set to 20%, meaning that on average one in five inputs will be randomly excluded from each update cycle.

Continuing from the baseline example above, the code below exercises the same network with input dropout:

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold

# Read data
data = pd.read_csv("sonar.csv", header=None)
X = data.iloc[:, 0:60]
y = data.iloc[:, 60]

# Label encode the target from string to integer
encoder = LabelEncoder()
encoder.fit(y)
y = encoder.transform(y)

# Convert to 2D PyTorch tensors
X = torch.tensor(X.values, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

# Define PyTorch model, with dropout at input
class SonarModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.dropout = nn.Dropout(0.2)
        self.layer1 = nn.Linear(60, 60)
        self.act1 = nn.ReLU()
        self.layer2 = nn.Linear(60, 30)
        self.act2 = nn.ReLU()
        self.output = nn.Linear(30, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.dropout(x)
        x = self.act1(self.layer1(x))
        x = self.act2(self.layer2(x))
        x = self.sigmoid(self.output(x))
        return x

# Helper function to train the model and return the validation result
def model_train(model, X_train, y_train, X_val, y_val,
                n_epochs=300, batch_size=16):
    loss_fn = nn.BCELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.8)
    batch_start = torch.arange(0, len(X_train), batch_size)

    model.train()
    for epoch in range(n_epochs):
        for start in batch_start:
            X_batch = X_train[start:start+batch_size]
            y_batch = y_train[start:start+batch_size]
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # evaluate accuracy after training
    model.eval()
    y_pred = model(X_val)
    acc = (y_pred.round() == y_val).float().mean()
    acc = float(acc)
    return acc

# run 10-fold cross validation
kfold = StratifiedKFold(n_splits=10, shuffle=True)
accuracies = []
for train, test in kfold.split(X, y):
    # create model, train, and get accuracy
    model = SonarModel()
    acc = model_train(model, X[train], y[train], X[test], y[test])
    print("Accuracy: %.2f" % acc)
    accuracies.append(acc)

# evaluate the model
mean = np.mean(accuracies)
std = np.std(accuracies)
print("Baseline: %.2f%% (+/- %.2f%%)" % (mean*100, std*100))

Running the example shows a slight drop in classification accuracy (79.40% versus 82.12% for the baseline), at least on a single test run.

Accuracy: 0.62
Accuracy: 0.90
Accuracy: 0.76
Accuracy: 0.62
Accuracy: 0.67
Accuracy: 0.86
Accuracy: 0.90
Accuracy: 0.86
Accuracy: 0.90
Accuracy: 0.85
Baseline: 79.40% (+/- 11.20%)

Using Dropout on Hidden Layers

Dropout can be applied to hidden neurons in the body of your network model. This is more common.

In the example below, Dropout is applied between the two hidden layers and between the last hidden layer and the output layer. Again a dropout rate of 20% is used:

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold

# Read data
data = pd.read_csv("sonar.csv", header=None)
X = data.iloc[:, 0:60]
y = data.iloc[:, 60]

# Label encode the target from string to integer
encoder = LabelEncoder()
encoder.fit(y)
y = encoder.transform(y)

# Convert to 2D PyTorch tensors
X = torch.tensor(X.values, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

# Define PyTorch model, with dropout at hidden layers
class SonarModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(60, 60)
        self.act1 = nn.ReLU()
        self.dropout1 = nn.Dropout(0.2)
        self.layer2 = nn.Linear(60, 30)
        self.act2 = nn.ReLU()
        self.dropout2 = nn.Dropout(0.2)
        self.output = nn.Linear(30, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.act1(self.layer1(x))
        x = self.dropout1(x)
        x = self.act2(self.layer2(x))
        x = self.dropout2(x)
        x = self.sigmoid(self.output(x))
        return x

# Helper function to train the model and return the validation result
def model_train(model, X_train, y_train, X_val, y_val,
                n_epochs=300, batch_size=16):
    loss_fn = nn.BCELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.8)
    batch_start = torch.arange(0, len(X_train), batch_size)

    model.train()
    for epoch in range(n_epochs):
        for start in batch_start:
            X_batch = X_train[start:start+batch_size]
            y_batch = y_train[start:start+batch_size]
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # evaluate accuracy after training
    model.eval()
    y_pred = model(X_val)
    acc = (y_pred.round() == y_val).float().mean()
    acc = float(acc)
    return acc

# run 10-fold cross validation
kfold = StratifiedKFold(n_splits=10, shuffle=True)
accuracies = []
for train, test in kfold.split(X, y):
    # create model, train, and get accuracy
    model = SonarModel()
    acc = model_train(model, X[train], y[train], X[test], y[test])
    print("Accuracy: %.2f" % acc)
    accuracies.append(acc)

# evaluate the model
mean = np.mean(accuracies)
std = np.std(accuracies)
print("Baseline: %.2f%% (+/- %.2f%%)" % (mean*100, std*100))

You can see that in this case, adding dropout layers improved the accuracy a bit.

Accuracy: 0.86
Accuracy: 1.00
Accuracy: 0.86
Accuracy: 0.90
Accuracy: 0.90
Accuracy: 0.86
Accuracy: 0.81
Accuracy: 0.81
Accuracy: 0.70
Accuracy: 0.85
Baseline: 85.50% (+/- 7.36%)

Dropout in Evaluation Mode

Dropout randomly resets some of its input to zero during training. If you wonder what happens after you have finished training, the answer is nothing! The PyTorch dropout layer runs as an identity function when the model is in evaluation mode. That’s why you call model.eval() before you evaluate the model. This is important because the goal of the dropout layer is to make sure the network learns enough clues about the input for the prediction, rather than depending on a rare phenomenon in the data. At inference, however, you should provide as much information as possible to the model.
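The switch between the two modes can be observed through the training flag that model.train() and model.eval() set on every submodule. The small nn.Sequential model and the random input below are placeholders for illustration only; wrapping the forward pass in torch.no_grad() is an optional extra that skips gradient tracking during evaluation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(60, 30),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(30, 1),
    nn.Sigmoid(),
)
x = torch.rand(4, 60)

model.train()
print(model[2].training)   # True: the dropout layer will drop elements

model.eval()
print(model[2].training)   # False: the dropout layer is now an identity function

# In evaluation mode, repeated forward passes give the same result
with torch.no_grad():
    out1 = model(x)
    out2 = model(x)
print(torch.equal(out1, out2))   # True
```

If you forget to call model.eval(), the dropout layer keeps dropping elements and the validation accuracy will fluctuate from one call to the next.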

Tips for Using Dropout

The original paper on Dropout provides experimental results on a suite of standard machine learning problems. Based on these results, the authors offer a number of useful heuristics to consider when using Dropout in practice.

Generally, use a small dropout value of 20%-50% of neurons, with 20% providing a good starting point. A probability too low has minimal effect, and a value too high results in under-learning by the network.
Use a larger network. You are likely to get better performance when Dropout is used on a larger network, giving the model more of an opportunity to learn independent representations.
Use Dropout on incoming (visible) as well as hidden units. Application of Dropout at each layer of the network has shown good results.
Use a large learning rate with decay and a large momentum. Increase your learning rate by a factor of 10 to 100 and use a high momentum value of 0.9 or 0.99.
Constrain the size of network weights. A large learning rate can result in very large network weights. Imposing a constraint on the size of network weights, such as max-norm regularization, with a size of 4 or 5 has been shown to improve results.
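Several of these tips can be combined in a training loop. The following is only a sketch under stated assumptions: the learning rate of 0.1, the decay factor of 0.99, the random stand-in data, and the helper name apply_max_norm are all hypothetical choices for illustration, not values taken from the experiments above. The max-norm constraint is approximated here by calling torch.renorm() on each linear layer after every optimizer step.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Dropout on the input as well as on the hidden layers
model = nn.Sequential(
    nn.Dropout(0.2), nn.Linear(60, 60), nn.ReLU(),
    nn.Dropout(0.2), nn.Linear(60, 30), nn.ReLU(),
    nn.Dropout(0.2), nn.Linear(30, 1), nn.Sigmoid(),
)

# Larger learning rate with decay, and a high momentum (illustrative values)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)

def apply_max_norm(model, max_norm=4.0):
    """Hypothetical helper: cap the L2 norm of each neuron's incoming weights."""
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, nn.Linear):
                # Each row of the weight matrix holds one neuron's incoming weights
                clipped = torch.renorm(module.weight, p=2, dim=0, maxnorm=max_norm)
                module.weight.copy_(clipped)

# Random stand-in data with the same shape as the Sonar inputs
X = torch.rand(32, 60)
y = torch.randint(0, 2, (32, 1)).float()
loss_fn = nn.BCELoss()

model.train()
for epoch in range(5):
    y_pred = model(X)
    loss = loss_fn(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    apply_max_norm(model)   # enforce the weight constraint after each update
    scheduler.step()        # decay the learning rate once per epoch
```

To tune the dropout level for your own problem, you could wrap the model construction in a function that takes the dropout rate as an argument and compare the cross-validation results for a few values between 0.2 and 0.5.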

Summary

In this post, you discovered the Dropout regularization technique for deep learning models. You learned:

What Dropout is and how it works
How you can use Dropout on your own deep learning models
Tips for getting the best results from Dropout on your own models

4 Responses to Using Dropout Regularization in PyTorch Models

Ante February 21, 2023 at 4:44 pm #

Thanks, great tutorial.

I am curious how one can use dropout in the INFERENCE stage. Any idea?
The reason for this dropout would be to effectively train only a SINGLE model, but at the same time you wish to have an ENSEMBLE of models (by having different neurons dropped) and use their predictions to estimate the uncertainty of the model predictions.

- Adrian Tam March 15, 2023 at 5:44 am #
  
  It is not usually done but if you insist, PyTorch has the model.eval() and model.train() to switch between training and inference mode. For many layers they are just the same but drop out layer will toggle between random drop or no drop.
  
Ganesh March 8, 2023 at 6:38 pm #

Thanks a great article. Can you please share how KFoldStratified would be run if we use the dataloader and Datasets?

- James Carmichael March 9, 2023 at 9:39 am #
  
  Hi Ganesh…best practices can be found here:
  
  https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/
