How to Improve Deep Learning Model Robustness by Adding Noise

Adding noise to an underconstrained neural network model with a small training dataset can have a regularizing effect and reduce overfitting.

Keras supports the addition of Gaussian noise via a separate layer called the GaussianNoise layer. This layer can be used to add noise to an existing model.

In this tutorial, you will discover how to add noise to deep learning models in Keras in order to reduce overfitting and improve model generalization.

After completing this tutorial, you will know:

  • Noise can be added to a neural network model via the GaussianNoise layer.
  • The GaussianNoise layer can be used to add noise to input values or between hidden layers.
  • How to add a GaussianNoise layer in order to reduce overfitting in a Multilayer Perceptron model for classification.

Let’s get started.

Photo by Michael Mueller, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Noise Regularization in Keras
  2. Noise Regularization in Models
  3. Noise Regularization Case Study

Noise Regularization in Keras

Keras supports the addition of noise to models via the GaussianNoise layer.

This is a layer that will add noise to inputs of a given shape. The noise has a mean of zero and requires that a standard deviation of the noise be specified as a parameter. For example:
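A minimal sketch of such a layer, assuming Keras is imported through `tensorflow.keras` and using an illustrative standard deviation of 0.1. The constant input batch is made up for demonstration. Note that the layer is a regularizer and only adds noise when called in training mode; at inference time it passes values through unchanged.

```python
import numpy as np
from tensorflow.keras.layers import GaussianNoise

# Zero-mean Gaussian noise with a standard deviation of 0.1 (illustrative value)
layer = GaussianNoise(0.1)

# A small batch of constant inputs makes the effect of the noise easy to see
x = np.ones((4, 2), dtype="float32")

# Noise is only added when the layer is called in training mode
noisy = np.asarray(layer(x, training=True))
clean = np.asarray(layer(x, training=False))

print(noisy.shape)  # same shape as the input
print(clean)        # unchanged values outside of training
```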

The output of the layer will have the same shape as the input, with the only modification being the addition of noise to the values.

Noise Regularization in Models

The GaussianNoise layer can be used in a few different ways with a neural network model.

Firstly, it can be used as an input layer to add noise to input variables directly. This is the traditional use of noise as a regularization method in neural networks.

Below is an example of defining a GaussianNoise layer as an input layer for a model that takes 2 input variables.
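A minimal sketch, assuming the `tensorflow.keras` API. The standard deviation of 0.01 and the single sigmoid output node are illustrative choices, not requirements.

```python
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense, GaussianNoise
from tensorflow.keras.models import Sequential

model = Sequential()
# The model takes 2 input variables
model.add(Input(shape=(2,)))
# Noise is added to the raw inputs before any weights are applied
model.add(GaussianNoise(0.01))
# Illustrative output layer so the model is complete
model.add(Dense(1, activation="sigmoid"))
```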

Noise can also be added between hidden layers in the model. Given the flexibility of Keras, the noise can be added before or after the use of the activation function. It may make more sense to add it before the activation; nevertheless, both options are possible.

Below is an example of a GaussianNoise layer that adds noise to the linear output of a Dense layer before a rectified linear activation function (ReLU), perhaps a more appropriate use of noise between hidden layers.
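A minimal sketch, assuming the `tensorflow.keras` API, a layer width of 32, and a noise standard deviation of 0.1 (all illustrative). The layers are called directly on a made-up batch so the effect can be inspected. Because the activation comes last, the values passed on to the next layer remain non-negative.

```python
import numpy as np
from tensorflow.keras.layers import Activation, Dense, GaussianNoise

dense = Dense(32)            # linear output (weighted sum), no activation yet
noise = GaussianNoise(0.1)   # noise added to the linear output
relu = Activation("relu")    # activation applied after the noise

# Made-up batch of 8 samples with 2 input variables
x = np.random.rand(8, 2).astype("float32")

# training=True switches the noise on, as it would be during fitting
hidden = np.asarray(relu(noise(dense(x), training=True)))

print(hidden.shape)
print(hidden.min())
```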

Noise can also be added after the activation function, much like using a noisy activation function. One downside of this usage is that the resulting values may fall outside the range that the activation function normally produces. For example, a value with added noise may be less than zero, whereas the ReLU activation function only ever outputs values of zero or larger.
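The out-of-range effect can be checked with a small sketch, assuming the `tensorflow.keras` API. The input batch and the standard deviation of 0.1 are made up for illustration.

```python
import numpy as np
from tensorflow.keras.layers import Activation, GaussianNoise

relu = Activation("relu")
noise = GaussianNoise(0.1)  # illustrative standard deviation

# Standard normal inputs, so roughly half of the ReLU outputs are exactly zero
rng = np.random.default_rng(1)
x = rng.standard_normal((1000, 4)).astype("float32")

activated = np.asarray(relu(x))
# Noise applied after the activation, in training mode
noisy = np.asarray(noise(activated, training=True))

print(activated.min())  # never below zero
print(noisy.min())      # can drop below zero once noise is added
```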

Let’s take a look at how noise regularization can be used with some common network types.

MLP Noise Regularization

The example below adds noise between two Dense fully connected layers.
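A minimal sketch, assuming the `tensorflow.keras` API. The input shape, layer sizes, and noise standard deviation are illustrative.

```python
from tensorflow.keras import Input
from tensorflow.keras.layers import Activation, Dense, GaussianNoise
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Input(shape=(2,)))   # assumed: 2 input variables
model.add(Dense(32))           # first fully connected layer (linear output)
model.add(GaussianNoise(0.1))  # noise between the two Dense layers
model.add(Activation("relu"))
model.add(Dense(1))            # second fully connected layer
```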

CNN Noise Regularization

The example below adds noise after a pooling layer in a convolutional network.
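A minimal sketch, assuming the `tensorflow.keras` API. The 28x28 single-channel input, filter counts, and noise standard deviation are illustrative. A Flatten layer is added so that the final Dense layer receives a vector.

```python
from tensorflow.keras import Input
from tensorflow.keras.layers import (Conv2D, Dense, Flatten, GaussianNoise,
                                     MaxPooling2D)
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Input(shape=(28, 28, 1)))  # assumed: small grayscale images
model.add(Conv2D(32, (3, 3), activation="relu"))
model.add(MaxPooling2D())
model.add(GaussianNoise(0.1))        # noise applied to the pooled feature maps
model.add(Flatten())
model.add(Dense(1, activation="sigmoid"))
```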

RNN Noise Regularization

The example below adds noise between an LSTM recurrent layer and a Dense fully connected layer.
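A minimal sketch, assuming the `tensorflow.keras` API. The input of 10 time steps with 1 feature, the 32 LSTM units, and the noise standard deviation are illustrative.

```python
from tensorflow.keras import Input
from tensorflow.keras.layers import LSTM, Dense, GaussianNoise
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Input(shape=(10, 1)))  # assumed: 10 time steps, 1 feature
model.add(LSTM(32))              # returns the final hidden state only
model.add(GaussianNoise(0.1))    # noise on the LSTM output
model.add(Dense(1))
```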

Now that we have seen how to add noise to neural network models, let’s look at a case study of adding noise to an overfit model to reduce generalization error.


Noise Regularization Case Study

In this section, we will demonstrate how to use noise regularization to reduce overfitting of an MLP on a simple binary classification problem.

This example provides a template for applying noise regularization to your own neural network for classification and regression problems.

Binary Classification Problem

We will use a standard binary classification problem that defines two concentric circles of two-dimensional observations, one circle for each class.

Each observation has two input variables with the same scale and a class output value of either 0 or 1. This dataset is called the “circles” dataset because of the shape of the observations in each class when plotted.

We can use the make_circles() function to generate observations from this problem. We will add noise to the data and seed the random number generator so that the same samples are generated each time the code is run.
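A minimal sketch using scikit-learn's `make_circles()`. The 100 samples match the dataset size discussed in this tutorial; the noise level of 0.1 and the seed of 1 are assumed values for illustration.

```python
from sklearn.datasets import make_circles

# 100 samples, Gaussian noise on the coordinates, fixed seed for repeatability
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)

print(X.shape, y.shape)
print(X.min(), X.max())  # coordinates are centered on zero
```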

We can plot the dataset where the two variables are taken as x and y coordinates on a graph and the class value is taken as the color of the observation.

The complete example of generating the dataset and plotting it is listed below.
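One way this could look, as a sketch that plots directly with matplotlib. The helper name `plot_circles`, the colors, and the dataset noise level and seed are assumptions for illustration.

```python
from matplotlib import pyplot
from sklearn.datasets import make_circles


def plot_circles(X, y):
    """Scatter plot of the samples, colored by class value (hypothetical helper)."""
    fig, ax = pyplot.subplots()
    for label, color in ((0, "red"), (1, "blue")):
        mask = y == label
        ax.scatter(X[mask, 0], X[mask, 1], c=color, label=str(label))
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()
    return ax


# Generate the 2D classification dataset (noise and seed are assumed values)
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
```

Calling `plot_circles(X, y)` followed by `pyplot.show()` displays the figure.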

Running the example creates a scatter plot showing the concentric circles shape of the observations in each class. We can see the noise in the dispersal of the points making the circles less obvious.

Scatter Plot of Circles Dataset with Color Showing the Class Value of Each Sample


This is a good test problem because the classes cannot be separated by a line, i.e. they are not linearly separable, requiring a nonlinear method such as a neural network to address.

We have only generated 100 samples, which is small for a neural network, providing the opportunity to overfit the training dataset and have higher error on the test dataset, a good case for using regularization. Further, the samples have noise, giving the model an opportunity to learn aspects of the samples that don’t generalize.

Overfit Multilayer Perceptron

We can develop an MLP model to address this binary classification problem.

The model will have one hidden layer with more nodes than may be required to solve this problem, providing an opportunity to overfit. We will also train the model for longer than is required to ensure the model overfits.

Before we define the model, we will split the dataset into train and test sets, using 30 examples to train the model and 70 to evaluate the fit model’s performance.
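A minimal sketch of the split, assuming the dataset is generated with `make_circles()` using an illustrative noise level and seed. The variable names are hypothetical.

```python
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=100, noise=0.1, random_state=1)

# First 30 samples for training, remaining 70 for testing
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]

print(trainX.shape, testX.shape)
```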

Next, we can define the model.

The hidden layer uses 500 nodes and the rectified linear activation function. A sigmoid activation function is used in the output layer in order to predict class values of 0 or 1. The model is optimized using the binary cross entropy loss function, suitable for binary classification problems, and the efficient Adam version of gradient descent.
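A sketch of this model definition, assuming the `tensorflow.keras` API and tracking accuracy as the reported metric.

```python
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Input(shape=(2,)))               # two input variables
model.add(Dense(500, activation="relu"))   # deliberately oversized hidden layer
model.add(Dense(1, activation="sigmoid"))  # probability of class 1
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```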

The defined model is then fit on the training data for 4,000 epochs and the default batch size of 32.

We will also use the test dataset as a validation dataset.

We can evaluate the performance of the model on the test dataset and report the result.

Finally, we will plot the performance of the model on both the train and test set each epoch.

If the model does indeed overfit the training dataset, we would expect the line plot of accuracy on the training set to continue to increase and the test set to rise and then fall again as the model learns statistical noise in the training dataset.

We can tie all of these pieces together; the complete example is listed below.
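One possible arrangement of these pieces, as a sketch that assumes the `tensorflow.keras` API, scikit-learn, and matplotlib. The function names, the dataset noise level, and the seed are assumptions. The steps are wrapped in functions so that the number of epochs can be reduced for a quick trial run.

```python
from matplotlib import pyplot
from sklearn.datasets import make_circles
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential


def load_data(n_train=30):
    """Generate the circles dataset and split it into train and test sets."""
    X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]


def build_model():
    """MLP with one oversized hidden layer and no regularization."""
    model = Sequential()
    model.add(Input(shape=(2,)))
    model.add(Dense(500, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model


def run(epochs=4000):
    """Fit the model, using the test set as validation data, then evaluate."""
    trainX, trainy, testX, testy = load_data()
    model = build_model()
    history = model.fit(trainX, trainy, validation_data=(testX, testy),
                        epochs=epochs, verbose=0)
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    return train_acc, test_acc, history


def plot_history(history):
    """Line plots of train and test accuracy for each training epoch."""
    fig, ax = pyplot.subplots()
    ax.plot(history.history["accuracy"], label="train")
    ax.plot(history.history["val_accuracy"], label="test")
    ax.set_xlabel("epoch")
    ax.set_ylabel("accuracy")
    ax.legend()
    return ax
```

Calling `train_acc, test_acc, history = run()` trains for the full 4,000 epochs; the accuracies can then be printed, and `plot_history(history)` followed by `pyplot.show()` draws the learning curves.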

Running the example reports the model performance on the train and test datasets.

We can see that the model has better performance on the training dataset than the test dataset, one possible sign of overfitting.

Your specific results may vary given the stochastic nature of the neural network and the training algorithm. Because the model is severely overfit, we generally would not expect much, if any, variance in the accuracy across repeated runs of the model on the same dataset.

A figure is created showing line plots of the model accuracy on the train and test sets.

We can see the expected shape of an overfit model, where test accuracy increases to a point and then begins to decrease again.

Line Plots of Accuracy on Train and Test Datasets While Training Showing an Overfit


MLP With Input Layer Noise

The dataset is defined by points that have a controlled amount of statistical noise.

Nevertheless, because the dataset is small, we can add further noise to the input values. This will have the effect of creating more samples or resampling the domain, making the structure of the input space artificially smoother. This may make the problem easier to learn and improve generalization performance.

We can add a GaussianNoise layer as the input layer. The amount of noise must be small. Given that the input values lie roughly within the range [-1, 1], we will add Gaussian noise with a mean of 0.0 and a standard deviation of 0.01, chosen arbitrarily.

The complete example with this change is listed below.
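A compact sketch of the change, assuming the `tensorflow.keras` API and scikit-learn. The function names, the dataset noise level, and the seed are assumptions; plotting is omitted for brevity.

```python
from sklearn.datasets import make_circles
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense, GaussianNoise
from tensorflow.keras.models import Sequential


def build_input_noise_model(stddev=0.01):
    """Overfit MLP with Gaussian noise applied directly to the inputs."""
    model = Sequential()
    model.add(Input(shape=(2,)))
    model.add(GaussianNoise(stddev))  # only active while training
    model.add(Dense(500, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model


def run(epochs=4000, n_train=30):
    """Fit on 30 samples and report accuracy on the train and test sets."""
    X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
    trainX, testX = X[:n_train], X[n_train:]
    trainy, testy = y[:n_train], y[n_train:]
    model = build_input_noise_model()
    history = model.fit(trainX, trainy, validation_data=(testX, testy),
                        epochs=epochs, verbose=0)
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    return train_acc, test_acc, history
```

Because the noise layer is inactive outside of training, the accuracies returned by `run()` are measured on the unmodified inputs.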

Running the example reports the model performance on the train and test datasets.

Your results will vary, given both the stochastic nature of the learning algorithm and the stochastic nature of the noise added to the model. Try running the example a few times.

In this case, we may see a small lift in performance of the model on the test dataset, with no negative impact on the training dataset.

We clearly see the impact of the added noise on the evaluation of the model during training as graphed on the line plot. The noise causes the accuracy of the model to jump around during training, possibly due to the noise introducing points that conflict with true points from the training dataset.

Perhaps a lower input noise standard deviation would be more appropriate.

The model still shows a pattern of being overfit, with a rise and then fall in test accuracy over training epochs.

Line Plot of Train and Test Accuracy With Input Layer Noise


MLP With Hidden Layer Noise

An alternative approach to adding noise to the input values is to add noise between the hidden layers.

This can be done by adding noise to the linear output of the layer (weighted sum) before the activation function is applied, in this case a rectified linear activation function. We can also use a larger standard deviation for the noise as the model is less sensitive to noise at this level given the presumably larger weights from being overfit. We will use a standard deviation of 0.1, again, chosen arbitrarily.

The complete example with Gaussian noise between the hidden layers is listed below.
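The sketch below shows only the model definition, assuming the `tensorflow.keras` API; data preparation, fitting, and evaluation are assumed to be unchanged from the baseline model described earlier. The function name is hypothetical.

```python
from tensorflow.keras import Input
from tensorflow.keras.layers import Activation, Dense, GaussianNoise
from tensorflow.keras.models import Sequential


def build_hidden_noise_model(stddev=0.1):
    """Overfit MLP with noise on the hidden layer's weighted sum, before ReLU."""
    model = Sequential()
    model.add(Input(shape=(2,)))
    model.add(Dense(500))             # linear output only
    model.add(GaussianNoise(stddev))  # noise added before the activation
    model.add(Activation("relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model


model = build_hidden_noise_model()
```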

Running the example reports the model performance on the train and test datasets.

Your results will vary, given both the stochastic nature of the learning algorithm and the stochastic nature of the noise added to the model. Try running the example a few times.

In this case, we can see a marked increase in the performance of the model on the hold out test set.

We can also see from the line plot of accuracy over training epochs that the model no longer appears to show the properties of being overfit.

Line Plot of Train and Test Accuracy With Hidden Layer Noise


We can also experiment and add the noise after the outputs of the first hidden layer pass through the activation function.

The complete example is listed below.
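As before, this sketch shows only the model definition, assuming the `tensorflow.keras` API and the same standard deviation of 0.1; the rest of the pipeline is assumed to be unchanged. The function name is hypothetical.

```python
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense, GaussianNoise
from tensorflow.keras.models import Sequential


def build_post_activation_noise_model(stddev=0.1):
    """Overfit MLP with noise added to the hidden layer's ReLU outputs."""
    model = Sequential()
    model.add(Input(shape=(2,)))
    model.add(Dense(500, activation="relu"))  # activation applied first
    model.add(GaussianNoise(stddev))          # noise added after the activation
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model


model = build_post_activation_noise_model()
```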

Running the example reports the model performance on the train and test datasets.

Surprisingly, we see little difference in the performance of the model.

Again, we can see from the line plot of accuracy over training epochs that the model no longer shows signs of overfitting.

Line Plot of Train and Test Accuracy With Hidden Layer Noise (alternate)


Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Repeated Evaluation. Update the example to use repeated evaluation of the model with and without noise and report performance as the mean and standard deviation over repeats.
  • Grid Search Standard Deviation. Develop a grid search in order to discover the amount of noise that reliably results in the best performing model.
  • Input and Hidden Noise. Update the example to introduce noise at both the input and hidden layers of the model.

If you explore any of these extensions, I’d love to know.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered how to add noise to deep learning models in Keras in order to reduce overfitting and improve model generalization.

Specifically, you learned:

  • Noise can be added to a neural network model via the GaussianNoise layer.
  • The GaussianNoise layer can be used to add noise to input values or between hidden layers.
  • How to add a GaussianNoise layer in order to reduce overfitting in a Multilayer Perceptron model for classification.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.




9 Responses to How to Improve Deep Learning Model Robustness by Adding Noise

  1. Nitin Panwar December 14, 2018 at 6:10 pm #

    Thanks Jason, nicely explained. Really enjoyed it.

  2. maedeh February 22, 2019 at 1:34 pm #

    Thanks, but I think we only have Gaussian noise layer. if I want to apply some attacks like cropping, do we have any layer in keras for this? do you have any suggestion for this? I look forward to hearing from you.

    • Jason Brownlee February 22, 2019 at 2:48 pm #

      Good question, generally no, you can use a custom data generator and perform random crops to images before they are fed into the model.

  3. Michael April 12, 2019 at 5:23 am #

    Hi Jason, what do you think about backward pass when you add noise to either weights or activations? For example, when adding noise to activations (which serve as layer inputs), to calculate weight gradients for that layer, you multiply incoming gradient by these activations. Would you use the original activations, or the distorted ones? Or when backpropagating errors we multiply them by transposed weight matrices in each layer, again, would you use the original weights or distorted ones?

    • Jason Brownlee April 12, 2019 at 7:56 am #

      Hmmm.

      I have not seen it often, except with models like GANs and stochastic label smoothing – required only because training GANs is so unstable.

      If you have an idea, try it. It has never been easier with such amazing tools!

      • Michael April 12, 2019 at 8:48 am #

        It actually does not seem easy to me. For example, say we want to add noise to activations (inputs to second layer), and then update weights of that second layer. Standard autodiff in either TF or Pytorch would pass upstream gradients right through the noise addition op, to be multiplied by the original second layer inputs. But how can I change this so that they get multiplied by the distorted inputs? I don’t think the distorted inputs are being preserved for the backward pass.

        In this case, I think the tools actually make it harder to experiment.

        • Michael April 12, 2019 at 8:51 am #

          Or, if it is the distorted inputs that are being preserved by autodiff, then how do I skip them and pass the gradients to the original ones?

          • Jason Brownlee April 12, 2019 at 2:40 pm #

            The model does not see distorted inputs, it sees inputs/outputs/activations. It just so happens that you’ve distorted them with noise. Updates happen per normal.

            Perhaps I don’t follow the nuance of what you’re trying to implement.
