How to Reduce Overfitting of a Deep Learning Model with Weight Regularization

Weight regularization provides an approach to reduce the overfitting of a deep learning neural network model on the training data and improve the performance of the model on new data, such as the holdout test set.

There are multiple types of weight regularization, such as L1 and L2 vector norms, and each requires a hyperparameter that must be configured.

In this tutorial, you will discover how to apply weight regularization to improve the performance of an overfit deep learning neural network in Python with Keras.

After completing this tutorial, you will know:

  • How to use the Keras API to add weight regularization to an MLP, CNN, or LSTM neural network.
  • Examples of weight regularization configurations used in books and recent research papers.
  • How to work through a case study for identifying an overfit model and improving test performance using weight regularization.

Let’s get started.

How to Reduce Overfitting in Deep Learning With Weight Regularization

How to Reduce Overfitting in Deep Learning With Weight Regularization
Photo by Seabamirum, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Weight Regularization in Keras
  2. Examples of Weight Regularization
  3. Weight Regularization Case Study

Weight Regularization API in Keras

Keras provides a weight regularization API that allows you to add a penalty for weight size to the loss function.

Three different regularizer instances are provided; they are:

  • L1: Sum of the absolute weights.
  • L2: Sum of the squared weights.
  • L1L2: Sum of the absolute and the squared weights.

The regularizers are provided under keras.regularizers and have the names l1, l2 and l1_l2. Each takes the regularizer hyperparameter as an argument. For example:

By default, no regularizer is used in any layers.

A weight regularizer can be added to each layer when the layer is defined in a Keras model.

This is achieved by setting the kernel_regularizer argument on each layer. A separate regularizer can also be used for the bias via the bias_regularizer argument, although this is less often used.

Let’s look at some examples.

Weight Regularization for Dense Layers

The example below sets an l2 regularizer on a Dense fully connected layer:

Weight Regularization for Convolutional Layers

Like the Dense layer, the Convolutional layers (e.g. Conv1D and Conv2D) also use the kernel_regularizer and bias_regularizer arguments to define a regularizer.

The example below sets an l2 regularizer on a Conv2D convolutional layer:

Weight Regularization for Recurrent Layers

Recurrent layers like the LSTM offer more flexibility in regularizing the weights.

The input, recurrent, and bias weights can all be regularized separately via the kernel_regularizer, recurrent_regularizer, and bias_regularizer arguments.

The example below sets an l2 regularizer on an LSTM recurrent layer:

Want Better Results with Deep Learning?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-Course

Examples of Weight Regularization

It can be helpful to look at some examples of weight regularization configurations reported in the literature.

It is important to select and tune a regularization technique specific to your network and dataset, although real examples can also give an idea of common configurations that may be a useful starting point.

Recall that 0.1 can be written in scientific notation as 1e-1 or 1E-1 or as an exponential 10^-1, 0.01 as 1e-2 or 10^-2 and so on.

Examples of MLP Weight Regularization

Weight regularization was borrowed from penalized regression models in statistics.

The most common type of regularization is L2, also called simply “weight decay,” with values often on a logarithmic scale between 0 and 0.1, such as 0.1, 0.001, 0.0001, etc.

Reasonable values of lambda [regularization hyperparameter] range between 0 and 0.1.

— Page 144, Applied Predictive Modeling, 2013.

The classic text on Multilayer Perceptrons “Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks” provides a worked example demonstrating the impact of weight decay by first training a model without any regularization, then steadily increasing the penalty. They demonstrate graphically that weight decay has the effect of improving the resulting decision function.

… net was trained […] with weight decay increasing from 0 to 1E-5 at 1200 epochs, to 1E-4 at 2500 epochs, and to 1E-3 at 400 epochs. […] The surface is smoother and transitions are more gradual

— Page 270, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

This is an interesting procedure that may be worth investigating. The authors also comment on the difficulty of predicting the effect of weight decay on a problem.

… it is difficult to predict ahead of time what value is needed to achieve desired results. The value of 0.001 was chosen arbitrarily because it is a typically cited round number

— Page 270, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

Examples of CNN Weight Regularization

Weight regularization does not seem widely used in CNN models, or if it is used, its use is not widely reported.

L2 weight regularization with very small regularization hyperparameters such as (e.g. 0.0005 or 5 x 10^−4) may be a good starting point.

Alex Krizhevsky, et al. from the University of Toronto in their 2012 paper titled “ImageNet Classification with Deep Convolutional Neural Networks” developed a deep CNN model for the ImageNet dataset, achieving then state-of-the-art results reported:

…and weight decay of 0.0005. We found that this small amount of weight decay was important for the model to learn. In other words, weight decay here is not merely a regularizer: it reduces the model’s training error.

Karen Simonyan and Andrew Zisserman from Oxford in their 2015 paper titled “Very Deep Convolutional Networks for Large-Scale Image Recognition” develop a CNN for the ImageNet dataset and report:

The training was regularised by weight decay (the L2 penalty multiplier set to 5 x 10^−4)

Francois Chollet from Google (and author of Keras) in his 2016 paper titled “Xception: Deep Learning with Depthwise Separable Convolutions” reported the weight decay for both the Inception V3 CNN model from Google (not clear from the Inception V3 paper) and the weight decay used in his improved Xception for the ImageNet dataset:

The Inception V3 model uses a weight decay (L2 regularization) rate of 4e−5, which has been carefully tuned for performance on ImageNet. We found this rate to be quite suboptimal for Xception and instead settled for 1e−5.

Examples of LSTM Weight Regularization

It is common to use weight regularization with LSTM models.

An often used configuration is L2 (weight decay) and very small hyperparameters (e.g. 10^−6). It is often not reported what weights are regularized (input, recurrent, and/or bias), although one would assume that both input and recurrent weights are regularized only.

Gabriel Pereyra, et al. from Google Brain in the 2017 paper titled “Regularizing Neural Networks by Penalizing Confident Output Distributions” apply a seq2seq LSTMs models to predicting characters from the Wall Street Journal and report:

All models used weight decay of 10^−6

Barret Zoph and Quoc Le from Google Brain in the 2017 paper titled “Neural Architecture Search with Reinforcement Learning” use LSTMs and reinforcement learning to learn network architectures to best address the CIFAR-10 dataset and report:

weight decay of 1e-4

Ron Weiss, et al. from Google Brain and Nvidia in their 2017 paper titled “Sequence-to-Sequence Models Can Directly Translate Foreign Speech” develop a sequence-to-sequence LSTM for speech translation and report:

L2 weight decay is used with a weight of 10^−6

Weight Regularization Case Study

In this section, we will demonstrate how to use weight regularization to reduce overfitting of an MLP on a simple binary classification problem.

This example provides a template for applying weight regularization to your own neural network for classification and regression problems.

Binary Classification Problem

We will use a standard binary classification problem that defines two semi-circles of observations: one semi-circle for each class.

Each observation has two input variables with the same scale and a class output value of either 0 or 1. This dataset is called the “moons” dataset because of the shape of the observations in each class when plotted.

We can use the make_moons() function to generate observations from this problem. We will add noise to the data and seed the random number generator so that the same samples are generated each time the code is run.

We can plot the dataset where the two variables are taken as x and y coordinates on a graph and the class value is taken as the color of the observation.

The complete example of generating the dataset and plotting it is listed below.

Running the example creates a scatter plot showing the semi-circle or moon shape of the observations in each class. We can see the noise in the dispersal of the points making the moons less obvious.

Scatter Plot of Moons Dataset With Color Showing the Class Value of Each Sample

Scatter Plot of Moons Dataset With Color Showing the Class Value of Each Sample

This is a good test problem because the classes cannot be separated by a line, e.g. are not linearly separable, requiring a nonlinear method such as a neural network to address.

We have only generated 100 samples, which is small for a neural network, providing the opportunity to overfit the training dataset and have higher error on the test dataset: a good case for using regularization. Further, the samples have noise, giving the model an opportunity to learn aspects of the samples that don’t generalize.

Overfit Multilayer Perceptron Model

We can develop an MLP model to address this binary classification problem.

The model will have one hidden layer with more nodes that may be required to solve this problem, providing an opportunity to overfit. We will also train the model for longer than is required to ensure the model overfits.

Before we define the model, we will split the dataset into train and test sets, using 30 examples to train the model and 70 to evaluate the fit model’s performance.

Next, we can define the model.

The model uses 500 nodes in the hidden layer and the rectified linear activation function.

A sigmoid activation function is used in the output layer in order to predict class values of 0 or 1.

The model is optimized using the binary cross entropy loss function, suitable for binary classification problems and the efficient Adam version of gradient descent.

The defined model is then fit on the training data for 4,000 epochs and the default batch size of 32.

Finally, we can evaluate the performance of the model on the test dataset and report the result.

We can tie all of these pieces together; the complete example is listed below.

Running the example reports the model performance on the train and test datasets.

We can see that the model has better performance on the training dataset than the test dataset, one possible sign of overfitting.

Your specific results may vary given the stochastic nature of the neural network and the training algorithm. Because the model is severely overfit, we generally would not expect much, if any, variance in the accuracy across repeated runs of the model on the same dataset.

Another sign of overfitting is a plot of the learning curves of the model for both train and test datasets while training.

An overfit model should show accuracy increasing on both train and test and at some point accuracy drops on the test dataset but continues to rise on the test dataset.

We can update the example to plot these curves. The complete example is listed below.

Running the example creates line plots of the model accuracy on the train and test sets.

We can see an expected shape of an overfit model where test accuracy increases to a point and then begins to decrease again.

Line Plots of Accuracy on Train and Test Datasets While Training

Line Plots of Accuracy on Train and Test Datasets While Training

MLP Model With Weight Regularization

We can add weight regularization to the hidden layer to reduce the overfitting of the model to the training dataset and improve the performance on the holdout set.

We will use the L2 vector norm also called weight decay with a regularization parameter (called alpha or lambda) of 0.001, chosen arbitrarily.

This can be done by adding the kernel_regularizer argument to the layer and setting it to an instance of l2.

The updated example of fitting and evaluating the model on the moons dataset with weight regularization is listed below.

Running the example reports the performance of the model on the train and test datasets.

We can see no change in the accuracy on the training dataset and an improvement on the test dataset.

We would expect that the telltale learning curve for overfitting would also have been changed through the use of weight regularization.

Instead of the accuracy of the model on the test set increasing and then decreasing again, we should see it continually rise during training.

The complete example of fitting the model and plotting the train and test learning curves is listed below.

Running the example creates line plots of the train and test accuracy for the model for each epoch during training.

As expected, we see the learning curve on the test dataset rise and then plateau, indicating that the model may not have overfit the training dataset.

Line Plots of Accuracy on Train and Test Datasets While Training Without Overfitting

Line Plots of Accuracy on Train and Test Datasets While Training Without Overfitting

Grid Search Regularization Hyperparameter

Once you can confirm that weight regularization may improve your overfit model, you can test different values of the regularization parameter.

It is a good practice to first grid search through some orders of magnitude between 0.0 and 0.1, then once a level is found, to grid search on that level.

We can grid search through the orders of magnitude by defining the values to test, looping through each and recording the train and test performance.

Once we have all of the values, we can graph the results as a line plot to help spot any patterns in the configurations to the train and test accuracies.

Because parameters jump orders of magnitude (powers of 10), we can create a line plot of the results using a logarithmic scale. The Matplotlib library allows this via the semilogx() function. For example:

The complete example for grid searching weight regularization values on the moon dataset is listed below.

Running the example prints the parameter value and the accuracy on the train and test sets for each evaluated model.

The results suggest that 0.01 or 0.001 may be sufficient and may provide good bounds for further grid searching.

A line plot of the results is also created, showing the increase in test accuracy with larger weight regularization parameter values, at least to a point.

We can see that using the largest value of 0.1 results in a large drop in both train and test accuracy.

Line Plot of Model Accuracy on Train and Test Datasets With Different Weight Regularization Parameters

Line Plot of Model Accuracy on Train and Test Datasets With Different Weight Regularization Parameters

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Try Alternates. Update the example to use L1 or the combined L1L2 methods instead of L2 regularization.
  • Report Weight Norm. Update the example to calculate the magnitude of the network weights and demonstrate that regularization indeed made the magnitude smaller.
  • Regularize Output Layer. Update the example to regularize the output layer of the model and compare the results.
  • Regularize Bias. Update the example to regularize the bias weight and compare the results.
  • Repeated Model Evaluation. Update the example to fit and evaluate the model multiple times and report the mean and standard deviation of model performance.
  • Grid Search Along Order of Magnitude. Update the grid search example to grid search within the best-performing order of magnitude of parameter values.
  • Repeated Regularization of Model. Create a new example to continue the training of a fit model with increasing levels of regularization (e.g. 1E-6, 1E-5, etc.) and see if it results in a better performing model on the test set.

If you explore any of these extensions, I’d love to know.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Posts

API

Summary

In this tutorial, you discovered how to apply weight regularization to improve the performance of an overfit deep learning neural network in Python with Keras.

Specifically, you learned:

  • How to use Keras API to add weight regularization to an MLP, CNN, or LSTM neural network.
  • Examples of weight regularization configurations used in books and recent research papers.
  • How to work through a case study for identifying an overfit model and improving test performance using weight regularization.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


Develop Better Deep Learning Models Today!

Better Deep Learning

Train Faster, Reduce Overftting, and Ensembles

…with just a few lines of python code

Discover how in my new Ebook:
Better Deep Learning

It provides self-study tutorials on topics like: weight decay, batch normalization, dropout, model stacking and much more…

Bring better deep learning to your projects!

Skip the Academics. Just Results.

Click to learn more.


6 Responses to How to Reduce Overfitting of a Deep Learning Model with Weight Regularization

  1. JG November 25, 2018 at 9:58 am #

    Hola Jason,

    – I already finished this nice weigh regularization case study with MLP neural model and using make_moons function from sklearn library as the dataset generator. Thank you. Very clear exercise and I could reproduce exactly all yours output numbers !.

    – Regarding some extension of this exercise, as you mention in the beginning that the LSTM and CNN models types also accept keras regularizers, such as l1, l2, l1_l2…I was wondering how could we implement another case study using the same make_moons function but now with LSTM and/or CNN model?
    Just adding a new Convolutional 2D layer (plus the maxpool layer plus flatten layer I guess) before the dense layer of your code? and what about LSTM, just switching directly the dense layer line code by a new regularized LSTM layer? or probably it is more complex and does not work for this moons dataset exercise? It could be convenient if someone could write down these few lines of CNN and/or LSTM model definition of code, if it is thinking helpful…

    thanks

    • Jason Brownlee November 26, 2018 at 6:13 am #

      Thanks, well done!

      CNN and LSTM would not be appropriate for the moons problem. They would require a sequence prediction problem. You could contrive a small sequence prediction problem for testing.

  2. JG November 25, 2018 at 9:27 pm #

    In addition to the previous question, about CNN/LST regularizer code implementation I will be grateful if you can provide us with some rule -recommendation (if anyone exist?) about:

    – how to implement regularizers in several model layers, in the sense of:
    do I have to set strong weight decay (e.g. higher L2 values) in the first layers (the ones that generalize better the problem solution) and weaker weight decay (e.g. lower L2 values) on the last model ayers (the ones more abstracts) ? or vice versa (the opposite way)?

    My intuition will suggest me to try higher attenuations (strong weight decay) in the first layers (the ones that generalize better) and lower attenuation (smoother decay) in the last ones (the abstracts layers)….
    Also because from the point of view of model instability (output response to input in term of convergence), the model it is more sensible to first layers weight values (so bigger ones will produce more instability)…But all of this are intuitions ideas not confirmed yet…What do you think Jason?

    • Jason Brownlee November 26, 2018 at 6:18 am #

      From what I have seen, the same level of weight regularization is used across all layers. Perhaps try experimenting with different approaches to see if it makes a difference.

  3. hello December 4, 2018 at 6:32 pm #

    Comparison of Human vs. Machine-learning:
    It is of great interest to know whether human perception is important in identifying spoofing and hence, humans can achieve better performance than automatic approaches in detecting spoofing attacks. There was a benchmark study that compared automatic systems against human performance on a speaker verification and synthetic speech spoofing detection tasks (SS and VC spoofs). It was observed that human listeners detect spoofing less well than most of the automatic approaches except USS speech. In a similar study, it was found that both the humans and machines have difficulties in spoofing detection when narrowband speech signals were used (8 kHz sampling frequency). Hence, for telephone line speech signals, it is more challenging to do SSD due to the lower available bandwidth up to 4 kHz. It may be of great interest to study human performance for replay SSD task.

Leave a Reply