How to Improve Deep Learning Model Robustness by Adding Noise

Adding noise to an underconstrained neural network model with a small training dataset can have a regularizing effect and reduce overfitting.

Keras supports the addition of Gaussian noise via a separate layer called the GaussianNoise layer. This layer can be used to add noise to an existing model.

In this tutorial, you will discover how to add noise to deep learning models in Keras in order to reduce overfitting and improve model generalization.

After completing this tutorial, you will know:

  • Noise can be added to a neural network model via the GaussianNoise layer.
  • The GaussianNoise layer can be used to add noise to input values or between hidden layers.
  • How to add a GaussianNoise layer in order to reduce overfitting in a Multilayer Perceptron model for classification.

Let’s get started.

How to Improve Deep Learning Model Robustness by Adding Noise
Photo by Michael Mueller, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Noise Regularization in Keras
  2. Noise Regularization in Models
  3. Noise Regularization Case Study

Noise Regularization in Keras

Keras supports the addition of noise to models via the GaussianNoise layer.

This is a layer that will add noise to inputs of a given shape. The noise has a mean of zero and requires that a standard deviation of the noise be specified as a parameter. For example:
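
The layer described above can be sketched as follows. This is a minimal illustration, assuming Keras is installed and importable as keras; the standard deviation of 0.1 and the toy input array are illustrative choices rather than values from the tutorial. Because GaussianNoise is a regularization layer, it only adds noise when called in training mode.

```python
import numpy as np
from keras.layers import GaussianNoise

# zero-mean Gaussian noise with a standard deviation of 0.1 (illustrative value)
layer = GaussianNoise(0.1)

# toy batch: 4 samples with 3 features each
x = np.ones((4, 3), dtype="float32")

# noise is added in training mode and skipped in inference mode
noisy = np.asarray(layer(x, training=True))
clean = np.asarray(layer(x, training=False))

print(noisy.shape)            # the shape is unchanged by the layer
print(np.allclose(clean, x))  # inference mode returns the inputs as-is
```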

The output of the layer will have the same shape as the input, with the only modification being the addition of noise to the values.

Noise Regularization in Models

The GaussianNoise layer can be used in a few different ways with a neural network model.

Firstly, it can be used as an input layer to add noise to input variables directly. This is the traditional use of noise as a regularization method in neural networks.

Below is an example of defining a GaussianNoise layer as an input layer for a model that takes 2 input variables.
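
A minimal sketch of this arrangement is shown here. It assumes a Sequential model with an explicit Input object declaring the two input variables; the noise standard deviation of 0.01 and the single sigmoid output node are hypothetical choices added only to make the model complete.

```python
from keras.layers import Input, GaussianNoise, Dense
from keras.models import Sequential

model = Sequential()
# the model expects 2 input variables
model.add(Input(shape=(2,)))
# noise is applied directly to the input values (illustrative stddev)
model.add(GaussianNoise(0.01))
# hypothetical output layer so that the sketch is a complete model
model.add(Dense(1, activation="sigmoid"))
```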

Noise can also be added between hidden layers in the model. Given the flexibility of Keras, the noise can be added before or after the use of the activation function. It may make more sense to add it before the activation; nevertheless, both options are possible.

Below is an example of a GaussianNoise layer that adds noise to the linear output of a Dense layer before a rectified linear activation function (ReLU), perhaps a more appropriate use of noise between hidden layers.
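
One way to express this ordering is to leave the activation off the Dense layer and apply it as a separate Activation layer after the noise. The layer size of 32 and the standard deviation of 0.1 in this sketch are assumed values.

```python
from keras.layers import Input, Dense, GaussianNoise, Activation
from keras.models import Sequential

model = Sequential()
model.add(Input(shape=(2,)))
# Dense layer with no activation, so it outputs the linear weighted sum
model.add(Dense(32))
# noise is added to the linear output
model.add(GaussianNoise(0.1))
# the activation is then applied to the noisy values
model.add(Activation("relu"))
```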

Noise can also be added after the activation function, much like using a noisy activation function. One downside of this usage is that the resulting values may fall outside the range that the activation function would normally produce. For example, a value with added noise may be less than zero, whereas the ReLU activation function will only ever output values of 0 or larger.
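
The out-of-range effect can be checked directly. The sketch below places the noise after a Dense layer that applies ReLU itself and passes some toy inputs through the model in training mode; the layer size, standard deviation, and random inputs are all assumptions made for illustration.

```python
import numpy as np
from keras.layers import Input, Dense, GaussianNoise
from keras.models import Sequential

# noise placed after a Dense layer that applies ReLU itself
model = Sequential()
model.add(Input(shape=(2,)))
model.add(Dense(32, activation="relu"))
model.add(GaussianNoise(0.1))

# toy inputs, passed through in training mode so that the noise is active
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2)).astype("float32")
out = np.asarray(model(x, training=True))

# ReLU alone never outputs negative values, but the noisy values can be below zero
print(out.min() < 0.0)
```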

Let’s take a look at how noise regularization can be used with some common network types.

MLP Noise Regularization

The example below adds noise between two Dense fully connected layers.
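
A sketch of such an MLP follows, with assumed layer sizes and noise level. The calls at the end illustrate that the noise only affects the forward pass when the model is run in training mode.

```python
import numpy as np
from keras.layers import Input, Dense, GaussianNoise, Activation
from keras.models import Sequential

model = Sequential()
model.add(Input(shape=(2,)))
model.add(Dense(32))
# noise sits between the two Dense layers, before the activation
model.add(GaussianNoise(0.1))
model.add(Activation("relu"))
model.add(Dense(32))

# the same inputs give repeatable outputs in inference mode
x = np.ones((5, 2), dtype="float32")
first = np.asarray(model(x, training=False))
second = np.asarray(model(x, training=False))
# with training=True the noise is active and the outputs change
noisy = np.asarray(model(x, training=True))

print(np.allclose(first, second))
print(np.allclose(first, noisy))
```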

CNN Noise Regularization

The example below adds noise after a pooling layer in a convolutional network.
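
A possible layout is sketched here. The 28x28 single-channel input shape, the filter count, and the output layer are hypothetical and stand in for whatever image problem is being modeled.

```python
from keras.layers import Input, Conv2D, MaxPooling2D, GaussianNoise, Flatten, Dense
from keras.models import Sequential

model = Sequential()
# hypothetical input: 28x28 grayscale images (channels last)
model.add(Input(shape=(28, 28, 1)))
model.add(Conv2D(32, (3, 3), activation="relu"))
model.add(MaxPooling2D((2, 2)))
# noise is added to the pooled feature maps
model.add(GaussianNoise(0.1))
model.add(Flatten())
model.add(Dense(1, activation="sigmoid"))
```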

RNN Noise Regularization

The example below adds noise between an LSTM recurrent layer and a Dense fully connected layer.
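
In sketch form, with an assumed input of 10 time steps and 1 feature per step, and an illustrative noise level:

```python
from keras.layers import Input, LSTM, GaussianNoise, Dense
from keras.models import Sequential

model = Sequential()
# hypothetical input: sequences of 10 time steps with 1 feature each
model.add(Input(shape=(10, 1)))
model.add(LSTM(32))
# noise is added to the LSTM output before the Dense layer
model.add(GaussianNoise(0.1))
model.add(Dense(1))
```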

Now that we have seen how to add noise to neural network models, let’s look at a case study of adding noise to an overfit model to reduce generalization error.

Noise Regularization Case Study

In this section, we will demonstrate how to use noise regularization to reduce overfitting of an MLP on a simple binary classification problem.

This example provides a template for applying noise regularization to your own neural network for classification and regression problems.

Binary Classification Problem

We will use a standard binary classification problem that defines two two-dimensional concentric circles of observations, one circle for each class.

Each observation has two input variables with the same scale and a class output value of either 0 or 1. This dataset is called the “circles” dataset because of the shape of the observations in each class when plotted.

We can use the make_circles() function to generate observations from this problem. We will add noise to the data and seed the random number generator so that the same samples are generated each time the code is run.
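
A minimal sketch of this step uses make_circles() from scikit-learn. The sample count of 100 matches the tutorial; the noise level of 0.1 and the seed of 1 are assumed values, since the text does not state them.

```python
from sklearn.datasets import make_circles

# generate 100 two-dimensional samples; noise level and seed are assumed values
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)

print(X.shape, y.shape)
```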

We can plot the dataset where the two variables are taken as x and y coordinates on a graph and the class value is taken as the color of the observation.

The complete example of generating the dataset and plotting it is listed below.
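
One way to write this example is sketched here, assuming matplotlib is available. The noise level, seed, colors, and output filename are illustrative choices; the figure is saved to a file so that the script also works without a display.

```python
from matplotlib import pyplot
from sklearn.datasets import make_circles

# generate the dataset (noise level and seed are assumed values)
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)

# scatter plot of the two input variables, one color per class value
fig, ax = pyplot.subplots()
for class_value, color in zip((0, 1), ("red", "blue")):
    mask = y == class_value
    ax.scatter(X[mask, 0], X[mask, 1], c=color, label=str(class_value))
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("circles.png")
```

Replacing the final line with pyplot.show() displays the plot interactively instead.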

Running the example creates a scatter plot showing the concentric circles shape of the observations in each class. We can see the noise in the dispersal of the points making the circles less obvious.

Scatter Plot of Circles Dataset with Color Showing the Class Value of Each Sample

This is a good test problem because the classes cannot be separated by a line, i.e. they are not linearly separable, requiring a nonlinear method such as a neural network to address.

We have only generated 100 samples, which is small for a neural network, providing the opportunity to overfit the training dataset and have higher error on the test dataset, a good case for using regularization. Further, the samples have noise, giving the model an opportunity to learn aspects of the samples that don’t generalize.

Overfit Multilayer Perceptron

We can develop an MLP model to address this binary classification problem.

The model will have one hidden layer with more nodes than may be required to solve this problem, providing an opportunity to overfit. We will also train the model for longer than is required to ensure the model overfits.

Before we define the model, we will split the dataset into train and test sets, using 30 examples to train the model and 70 to evaluate the fit model’s performance.
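
The split can be sketched with simple array slicing. The variable names and the dataset arguments other than the sample count are assumptions.

```python
from sklearn.datasets import make_circles

# generate the dataset (noise level and seed are assumed values)
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)

# first 30 samples for training, remaining 70 for testing
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]

print(trainX.shape, testX.shape)
```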

Next, we can define the model.

The hidden layer uses 500 nodes and the rectified linear activation function. A sigmoid activation function is used in the output layer in order to predict class values of 0 or 1. The model is optimized using the binary cross entropy loss function, suitable for binary classification problems, and the efficient Adam version of gradient descent.
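
The model described above might be defined as follows. This is a sketch that assumes the Sequential API with an explicit Input object; tracking accuracy as a metric is an added assumption so that it can be reported later.

```python
from keras.layers import Input, Dense
from keras.models import Sequential

model = Sequential()
model.add(Input(shape=(2,)))
# one hidden layer with 500 nodes and ReLU
model.add(Dense(500, activation="relu"))
# sigmoid output for class values of 0 or 1
model.add(Dense(1, activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```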

The defined model is then fit on the training data for 4,000 epochs and the default batch size of 32.

We will also use the test dataset as a validation dataset.

We can evaluate the performance of the model on the test dataset and report the result.

Finally, we will plot the performance of the model on both the train and test set each epoch.

If the model does indeed overfit the training dataset, we would expect the line plot of accuracy on the training set to continue to increase and the test set to rise and then fall again as the model learns statistical noise in the training dataset.

We can tie all of these pieces together; the complete example is listed below.
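
A sketch of how the pieces could fit together is given here. The function names, the dataset noise level and seed, and the output filename are assumptions. The steps are wrapped in functions so that the number of epochs can be reduced for a quick check, and "accuracy" and "val_accuracy" are the history keys used by recent Keras versions.

```python
from matplotlib import pyplot
from sklearn.datasets import make_circles
from keras.layers import Input, Dense
from keras.models import Sequential


def load_data(n_train=30):
    # noise level and seed are assumed values
    X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]


def build_model():
    model = Sequential()
    model.add(Input(shape=(2,)))
    model.add(Dense(500, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model


def run_experiment(n_epochs=4000):
    trainX, trainy, testX, testy = load_data()
    model = build_model()
    # the test set doubles as the validation set
    history = model.fit(trainX, trainy, validation_data=(testX, testy),
                        epochs=n_epochs, verbose=0)
    # evaluate() returns the loss followed by the accuracy metric
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print("Train: %.3f, Test: %.3f" % (train_acc, test_acc))
    return history


def plot_history(history, filename="accuracy.png"):
    # line plots of train and test accuracy for each epoch
    fig, ax = pyplot.subplots()
    ax.plot(history.history["accuracy"], label="train")
    ax.plot(history.history["val_accuracy"], label="test")
    ax.legend()
    fig.savefig(filename)
```

Calling plot_history(run_experiment()) trains for the full 4,000 epochs and saves the accuracy plot; passing a smaller n_epochs gives a faster smoke test.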

Running the example reports the model performance on the train and test datasets.

We can see that the model has better performance on the training dataset than the test dataset, one possible sign of overfitting.

Your specific results may vary given the stochastic nature of the neural network and the training algorithm. That said, because the model is severely overfit, we generally would not expect much, if any, variance in the accuracy across repeated runs of the model on the same dataset.

A figure is created showing line plots of the model accuracy on the train and test sets.

We can see the expected shape of an overfit model, where test accuracy increases to a point and then begins to decrease again.

Line Plots of Accuracy on Train and Test Datasets While Training Showing an Overfit

MLP With Input Layer Noise

The dataset is defined by points that have a controlled amount of statistical noise.

Nevertheless, because the dataset is small, we can add further noise to the input values. This will have the effect of creating more samples or resampling the domain, making the structure of the input space artificially smoother. This may make the problem easier to learn and improve generalization performance.

We can add a GaussianNoise layer as the input layer. The amount of noise must be small. Given that the input values are roughly within the range [-1, 1], we will add Gaussian noise with a mean of 0.0 and a standard deviation of 0.01, chosen arbitrarily.

The complete example with this change is listed below.
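
Only the model definition changes relative to the overfit baseline; the data preparation, training, and evaluation steps stay as described earlier. A sketch of the changed model, with a hypothetical function name, could look like this:

```python
from keras.layers import Input, Dense, GaussianNoise
from keras.models import Sequential


def build_input_noise_model(noise_std=0.01):
    model = Sequential()
    model.add(Input(shape=(2,)))
    # noise is applied to the two input variables
    model.add(GaussianNoise(noise_std))
    model.add(Dense(500, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model


model = build_input_noise_model()
```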

Running the example reports the model performance on the train and test datasets.

Your results will vary, given both the stochastic nature of the learning algorithm and the stochastic nature of the noise added to the model. Try running the example a few times.

In this case, we may see a small lift in performance of the model on the test dataset, with no negative impact on the training dataset.

We clearly see the impact of the added noise on the evaluation of the model during training as graphed on the line plot. The noise causes the accuracy of the model to jump around during training, possibly due to the noise introducing points that conflict with true points from the training dataset.

Perhaps a lower input noise standard deviation would be more appropriate.

The model still shows a pattern of being overfit, with a rise and then fall in test accuracy over training epochs.

Line Plot of Train and Test Accuracy With Input Layer Noise

MLP With Hidden Layer Noise

An alternative approach to adding noise to the input values is to add noise between the hidden layers.

This can be done by adding noise to the linear output of the layer (weighted sum) before the activation function is applied, in this case a rectified linear activation function. We can also use a larger standard deviation for the noise as the model is less sensitive to noise at this level given the presumably larger weights from being overfit. We will use a standard deviation of 0.1, again, chosen arbitrarily.

The complete example with Gaussian noise between the hidden layers is listed below.
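
Again, only the model definition differs from the baseline. In this sketch (the function name is hypothetical), the Dense layer is left without an activation so that the noise is added to the weighted sum before ReLU is applied.

```python
from keras.layers import Input, Dense, GaussianNoise, Activation
from keras.models import Sequential


def build_hidden_noise_model(noise_std=0.1):
    model = Sequential()
    model.add(Input(shape=(2,)))
    # linear output of the hidden layer (no activation yet)
    model.add(Dense(500))
    # noise is added before the activation function
    model.add(GaussianNoise(noise_std))
    model.add(Activation("relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model


model = build_hidden_noise_model()
```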

Running the example reports the model performance on the train and test datasets.

Your results will vary, given both the stochastic nature of the learning algorithm and the stochastic nature of the noise added to the model. Try running the example a few times.

In this case, we can see a marked increase in the performance of the model on the hold out test set.

We can also see from the line plot of accuracy over training epochs that the model no longer appears to show the properties of being overfit.

Line Plot of Train and Test Accuracy With Hidden Layer Noise

We can also experiment and add the noise after the outputs of the first hidden layer pass through the activation function.

The complete example is listed below.
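
For this variant, the ReLU activation is applied inside the Dense layer and the noise layer follows it. The sketch below shows only the changed model definition, with a hypothetical function name and the same assumed standard deviation of 0.1.

```python
from keras.layers import Input, Dense, GaussianNoise
from keras.models import Sequential


def build_post_activation_noise_model(noise_std=0.1):
    model = Sequential()
    model.add(Input(shape=(2,)))
    # the hidden layer applies ReLU itself
    model.add(Dense(500, activation="relu"))
    # noise is added to the activated outputs
    model.add(GaussianNoise(noise_std))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model


model = build_post_activation_noise_model()
```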

Running the example reports the model performance on the train and test datasets.

Surprisingly, we see little difference in the performance of the model.

Again, we can see from the line plot of accuracy over training epochs that the model no longer shows signs of overfitting.

Line Plot of Train and Test Accuracy With Hidden Layer Noise (alternate)

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Repeated Evaluation. Update the example to use repeated evaluation of the model with and without noise and report performance as the mean and standard deviation over repeats.
  • Grid Search Standard Deviation. Develop a grid search in order to discover the amount of noise that reliably results in the best performing model.
  • Input and Hidden Noise. Update the example to introduce noise at both the input and hidden layers of the model.

If you explore any of these extensions, I’d love to know.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered how to add noise to deep learning models in Keras in order to reduce overfitting and improve model generalization.

Specifically, you learned:

  • Noise can be added to a neural network model via the GaussianNoise layer.
  • The GaussianNoise layer can be used to add noise to input values or between hidden layers.
  • How to add a GaussianNoise layer in order to reduce overfitting in a Multilayer Perceptron model for classification.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.



19 Responses to How to Improve Deep Learning Model Robustness by Adding Noise

  1. Nitin Panwar December 14, 2018 at 6:10 pm #

    Thanks Jason, nicely explained. Really enjoyed it.

  2. maedeh February 22, 2019 at 1:34 pm #

    Thanks, but I think we only have Gaussian noise layer. if I want to apply some attacks like cropping, do we have any layer in keras for this? do you have any suggestion for this? I look forward to hearing from you.

    • Jason Brownlee February 22, 2019 at 2:48 pm #

      Good question, generally no, you can use a custom data generator and perform random crops to images before they are fed into the model.

  3. Michael April 12, 2019 at 5:23 am #

    Hi Jason, what do you think about backward pass when you add noise to either weights or activations? For example, when adding noise to activations (which serve as layer inputs), to calculate weight gradients for that layer, you multiply incoming gradient by these activations. Would you use the original activations, or the distorted ones? Or when backpropagating errors we multiply them by transposed weight matrices in each layer, again, would you use the original weights or distorted ones?

    • Jason Brownlee April 12, 2019 at 7:56 am #

      Hmmm.

      I have not seen it often, except with models like GANs and stochastic label smoothing – required only because training GANs is so unstable.

      If you have an idea, try it. It has never been easier with such amazing tools!

      • Michael April 12, 2019 at 8:48 am #

        It actually does not seem easy to me. For example, say we want to add noise to activations (inputs to second layer), and then update weights of that second layer. Standard autodiff in either TF or Pytorch would pass upstream gradients right through the noise addition op, to be multiplied by the original second layer inputs. But how can I change this so that they get multiplied by the distorted inputs? I don’t think the distorted inputs are being preserved for the backward pass.

        In this case, I think the tools actually make it harder to experiment.

        • Michael April 12, 2019 at 8:51 am #

          Or, if it is the distorted inputs that are being preserved by autodiff, then how do I skip them and pass the gradients to the original ones?

          • Jason Brownlee April 12, 2019 at 2:40 pm #

            The model does not see distorted inputs, it sees inputs/outputs/activations. It just so happens that you’ve distorted them with noise. Updates happen per normal.

            Perhaps I don’t follow the nuance of what you’re trying to implement.

  4. Borja May 4, 2019 at 10:47 pm #

    Hello Jason,

    I was wondering, if a layer of noise is added to the model architecture, would it then apply that noise to every test input as well? How would you go about training a model with noise, and then training with clean inputs?

    • Jason Brownlee May 5, 2019 at 6:29 am #

      It depends. Input or output noise is usually turned off at test time, although sometimes it is left on. Noise within the model is sometimes left on. Perhaps evaluate with and without noise at test time and compare.

      If you wanted, you could reformulate the final model without the noise layer.

  5. Robert May 8, 2019 at 11:14 am #

    Hi Jason,

    Great article, I have a question regarding the use of Gaussian Noise over some input that has been previously padded (with 0’s for example). Do you think the loss in the training could get worse in this case? An example could be padding different length inputs like speech spectrograms in order for them to have the same shape.

    • Jason Brownlee May 8, 2019 at 2:12 pm #

      Hmm, good question.

      Yes, noise over padding sounds like a bad idea.

      There are many ways to get noise into the system, get creative and test a suite of approaches.

  6. Nestak August 6, 2019 at 1:01 am #

    Hi! Is there also a simple way to tinker/augment the contrast? Something like model.add(Contrast(0.1))?

  7. Nestak August 8, 2019 at 4:20 am #

    I want to add some noise to the neural network I am using for the classification of jpg images. So, the inputs for my neural network are arrays of the pixels, which I have already normalized to be in the range 0 to 1. I wanted to do as in your suggestion:

    model.add(MaxPooling2D())
    model.add(GaussianNoise(x))

    But I am concerned that the GaussianNoise might make my data go outside the range 0 to 1 and spoil the training. Is this a valid concern or am I safe? Does it depend on the value x to be used in model.add(GaussianNoise(x)) and what x value would you use? Thanks

    • Jason Brownlee August 8, 2019 at 6:36 am #

      It should be fine, perhaps test it and evaluate the effects?

      Alternately, you could create your own custom layer to achieve exactly what you want.

  8. JG August 14, 2019 at 4:36 am #

    Hi Jason,

    Thank you for this tutorial!

    I have been playing with this tutorial, adding other options to the script in order to experiment with them in a kind of “grid search”. Here is my report.

    – I define my models with the Keras Model class API instead of Sequential, but I do not expect any impact on results (!).

    – I set up the model as you did, but I also used other “high level” model constructors such as ‘KerasClassifier’ and ‘cross_val_score’ (for k-fold statistical analysis) from the sklearn library, taken from another of your tutorials. In general, ‘cross_val_score’ gave a lower average accuracy (69% mean accuracy) compared with 85.7% accuracy for the model on the test input. I understand that.
    But curiously, I generally got better results when I used the KerasClassifier (e.g. 84.3%).

    And I do not understand why I got better results with KerasClassifier than with my “manual” API class model if I am using the same “validation_split” in both cases (70% for test, 30% for training).

    – I got the same validation results during training: a kind of “sinusoidal loss curve” (going down and up, but with the long-term trend going up, even when I re-train for up to 8000 epochs), and the same effect on validation accuracy (but with a slight downward trend). All of these cases are without adding Gaussian noise.

    – I observed that the X input data coming from sklearn’s make_circles are between -1.06… and +1.06…, so I decided to normalize or standardize the input data (with MinMaxScaler and StandardScaler from sklearn, as in your tutorials). In general I got slightly better performance with cross_val_score (it increased to 72% mean accuracy), better for my KerasClassifier (up to 88.6% accuracy), but a little worse for my “manual model”, at around 77% accuracy on the test set.

    – The biggest sensitivity in results came when I decided to swap the 70% test and 30% training split for 30% test and 70% training (a more natural use of the data). In this case I got 83% mean accuracy with cross_val_score with a sigma of 10.7%, 96.7% accuracy from KerasClassifier, and 90% accuracy for my manual model. The reason is clear in this scenario.

    – I also tried Dropout layers and weight constraint regularization (taken from your tutorials), but the results are not much different.

    – Of course, I applied the Gaussian noise layer (after the input or before the output layer), and I clearly obtain the right trend in the validation loss curve (the increase in validation loss over training epochs disappears), but I get similar accuracy for my manual model and slightly better accuracy for the cross_val_score constructor. I observed that the results are very sensitive to the sigma (standard deviation) applied to the Gaussian noise layer.

    – When I apply all of the regularization methods together in a kind of ‘totum revolutum’ (dropout layer + Gaussian noise + weight constraint regularization) plus an input data scaler … I get accuracy around 50% (not learning at all), so it is clear that I need more control over each of these tools… :-)

    – In summary, I do not get much impact on accuracy when applying the Gaussian noise layer (but, of course, better behavior in the loss and accuracy training curves), even when using both layers (after the input and before the output) at the same time… probably because the noise sigma (standard deviation) needs to be better tuned…

    thank you for your tutorial Jason

    • Jason Brownlee August 14, 2019 at 6:50 am #

      Wonderful experimentation, thanks for sharing.

      This would be valuable stuff if you wrote it up and shared it – valuable as in it shows how systematic and curious one must be to really dive into these techniques.

      Adding noise was really popular in the 90s, less so now that we have dropout. Yet, I see it pop up in big modern GAN models, so it’s still around and useful.
