The post Introduction to Regularization to Reduce Overfitting of Deep Learning Neural Networks appeared first on Machine Learning Mastery.

Training a deep neural network that can generalize well to new data is a challenging problem.

A model with too little capacity cannot learn the problem, whereas a model with too much capacity can learn it too well and overfit the training dataset. Both cases result in a model that does not generalize well.

A modern approach to reducing generalization error is to use a larger model than may be required, together with regularization during training that keeps the weights of the model small. These techniques not only reduce overfitting, but they can also lead to faster optimization of the model and better overall performance.

In this post, you will discover the problem of overfitting when training neural networks and how it can be addressed with regularization methods.

After reading this post, you will know:

- Underfitting can easily be addressed by increasing the capacity of the network, but overfitting requires the use of specialized techniques.
- Regularization methods like weight decay provide an easy way to control overfitting for large neural network models.
- A modern recommendation for regularization is to use early stopping with dropout and a weight constraint.

Let’s get started.

This tutorial is divided into four parts; they are:

- The Problem of Model Generalization and Overfitting
- Reduce Overfitting by Constraining Model Complexity
- Methods for Regularization
- Regularization Recommendations

The objective of a neural network is to have a final model that performs well both on the data that we used to train it (e.g. the training dataset) and the new data on which the model will be used to make predictions.

The central challenge in machine learning is that we must perform well on new, previously unseen inputs — not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization.

— Page 110, Deep Learning, 2016.

We require that the model learn from known examples and generalize from those known examples to new examples in the future. We use methods like a train/test split or k-fold cross-validation only to estimate the ability of the model to generalize to new data.

Learning and also generalizing to new cases is hard.

Too little learning and the model will perform poorly on the training dataset and on new data. The model will underfit the problem. Too much learning and the model will perform well on the training dataset and poorly on new data. The model will overfit the problem. In both cases, the model has not generalized.

- **Underfit Model**. A model that fails to sufficiently learn the problem and performs poorly on both the training dataset and a holdout sample.
- **Overfit Model**. A model that learns the training dataset too well, performing well on the training dataset but poorly on a holdout sample.
- **Good Fit Model**. A model that suitably learns the training dataset and generalizes well to the holdout dataset.

A model fit can be considered in the context of the bias-variance trade-off.

An underfit model has high bias and low variance. Regardless of the specific samples in the training data, it cannot learn the problem. An overfit model has low bias and high variance. The model learns the training data too well and performance varies widely with new unseen examples or even statistical noise added to examples in the training dataset.

In order to generalize well, a system needs to be sufficiently powerful to approximate the target function. If it is too simple to fit even the training data then generalization to new data is also likely to be poor. […] An overly complex system, however, may be able to approximate the data in many different ways that give similar errors and is unlikely to choose the one that will generalize best …

— Page 241, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

We can address underfitting by increasing the capacity of the model. Capacity refers to the ability of a model to fit a variety of functions; more capacity means that a model can fit more types of functions for mapping inputs to outputs. Increasing the capacity of a model is easily achieved by changing the structure of the model, such as adding more layers and/or more nodes to layers.
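To make the link between structure and capacity concrete, the sketch below counts the trainable parameters of a fully connected network from its layer sizes. The helper name `count_parameters` is ours and purely illustrative, and it assumes plain dense layers with one bias per node.

```python
# illustrative helper (not part of any library): count weights and biases in an MLP
def count_parameters(layer_sizes):
    """layer_sizes lists the number of nodes per layer, inputs first."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # each node has one weight per input plus one bias
        total += (n_in + 1) * n_out
    return total

# a small network, then a wider and a deeper variant
print(count_parameters([2, 10, 1]))
print(count_parameters([2, 500, 1]))
print(count_parameters([2, 500, 500, 1]))
```

Under these assumptions the count grows from 41 to 2,001 to 252,501 parameters, which gives a feel for how quickly extra nodes and layers add capacity.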

Because an underfit model is so easily addressed, it is more common to have an overfit model.

An overfit model is easily diagnosed by monitoring the performance of the model during training by evaluating it on both a training dataset and on a holdout validation dataset. Graphing line plots of the performance of the model during training, called learning curves, will show a familiar pattern.

For example, line plots of the loss (that we seek to minimize) of the model on train and validation datasets will show a line for the training dataset that drops and may plateau and a line for the validation dataset that drops at first, then at some point begins to rise again.

As training progresses, the generalization error may decrease to a minimum and then increase again as the network adapts to idiosyncrasies of the training data.

— Page 250, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

A learning curve plot tells the story of the model learning the problem until a point at which it begins overfitting and its ability to generalize to the unseen validation dataset begins to get worse.
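As a rough sketch of how this pattern can be read off programmatically, the snippet below finds the epoch at which validation loss is lowest. The loss values are made up for illustration and the helper name `best_epoch` is hypothetical; in practice the values would come from a real training run.

```python
# hypothetical per-epoch losses, invented for illustration only
train_loss = [0.90, 0.70, 0.55, 0.45, 0.38, 0.33, 0.29, 0.26]
val_loss = [0.95, 0.78, 0.66, 0.60, 0.58, 0.59, 0.63, 0.70]

def best_epoch(val_losses):
    """Return the index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])

epoch = best_epoch(val_loss)
# the gap between the curves at the end of training hints at overfitting
gap = val_loss[-1] - train_loss[-1]
print('validation loss is lowest at epoch %d' % epoch)
print('final train/validation gap: %.2f' % gap)
```

Training loss keeps falling in this made-up run, while validation loss turns upward after the reported epoch, which is the point where overfitting begins.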

There are two ways to approach an overfit model:

- Reduce overfitting by training the network on more examples.
- Reduce overfitting by changing the complexity of the network.

A benefit of very deep neural networks is that their performance continues to improve as they are fed larger and larger datasets. A model with a near-infinite number of examples will eventually plateau in terms of what the capacity of the network is capable of learning.

A model can overfit a training dataset because it has sufficient capacity to do so. Reducing the capacity of the model reduces the likelihood of the model overfitting the training dataset, to a point where it no longer overfits.

The capacity of a neural network model, its complexity, is defined by both its structure in terms of nodes and layers and its parameters in terms of weights. Therefore, we can reduce the complexity of a neural network to reduce overfitting in one of two ways:

- Change network complexity by changing the network structure (number of weights).
- Change network complexity by changing the network parameters (values of weights).

In the case of neural networks, the complexity can be varied by changing the number of adaptive parameters in the network. This is called structural stabilization. […] The second principal approach to controlling the complexity of a model is through the use of regularization which involves the addition of a penalty term to the error function.

— Page 332, Neural Networks for Pattern Recognition, 1995.

For example, the structure could be tuned, such as via grid search, until a suitable number of nodes and/or layers is found to reduce or remove overfitting for the problem. Alternately, the model could be overfit and then pruned by removing nodes until it achieves suitable performance on a validation dataset.

It is more common to instead constrain the complexity of the model by ensuring the parameters (weights) of the model remain small. Small parameters suggest a less complex and, in turn, more stable model that is less sensitive to statistical fluctuations in the input data.

Large weights tend to cause sharp transitions in the [activation] functions and thus large changes in output for small changes in inputs.

— Page 269, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

It is more common to focus on methods that constrain the size of the weights in a neural network because a single network structure can be defined that is under-constrained, e.g. has a much larger capacity than is required for the problem, and regularization can be used during training to ensure that the model does not overfit. In such cases, performance can even be better as the additional capacity can be focused on better learning generalizable concepts in the problem.

Techniques that seek to reduce overfitting (reduce generalization error) by keeping network weights small are referred to as regularization methods. More specifically, regularization refers to a class of approaches that add additional information to transform an ill-posed problem into a more stable well-posed problem.

A problem is said to be ill-posed if small changes in the given information cause large changes in the solution. This instability with respect to the data makes solutions unreliable because small measurement errors or uncertainties in parameters may be greatly magnified and lead to wildly different responses. […] The idea behind regularization is to use supplementary information to restate an ill-posed problem in a stable form.

— Page 266, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

Regularization methods are so widely used to reduce overfitting that the term “*regularization*” may be used for any method that improves the generalization error of a neural network model.

Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error. Regularization is one of the central concerns of the field of machine learning, rivaled in its importance only by optimization.

— Page 120, Deep Learning, 2016.

The simplest and perhaps most common regularization method is to add a penalty to the loss function in proportion to the size of the weights in the model.

**Weight Regularization**: Penalize the model during training based on the magnitude of the weights.

This will encourage the model to map the inputs to the outputs of the training dataset in such a way that the weights of the model are kept small. This approach is called weight regularization or weight decay and has proven very effective for decades for both simpler linear models and neural networks.
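A minimal sketch of the idea, assuming an L2 (sum of squared weights) penalty: the penalty is scaled by a coefficient and added to the ordinary data loss. The function names and the coefficient `lam` are illustrative choices, not taken from any library.

```python
# illustrative sketch of weight regularization with an L2 penalty
def l2_penalty(weights, lam):
    """Penalty proportional to the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def penalized_loss(data_loss, weights, lam=0.01):
    """Loss that the optimizer would minimize: data loss plus weight penalty."""
    return data_loss + l2_penalty(weights, lam)

# two hypothetical weight vectors that achieve the same data loss
small = [0.1, -0.2, 0.05]
large = [3.0, -4.0, 2.5]
print(penalized_loss(0.30, small))
print(penalized_loss(0.30, large))
```

Both weight vectors fit the data equally well in this example, but the one with larger weights pays a higher total loss, so the optimizer is nudged toward the smaller weights.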

A simple alternative to gathering more data is to reduce the size of the model or improve regularization, by adjusting hyperparameters such as weight decay coefficients …

— Page 427, Deep Learning, 2016.

Below is a list of five of the most common additional regularization methods.

- **Activity Regularization**: Penalize the model during training based on the magnitude of the activations.
- **Weight Constraint**: Constrain the magnitude of weights to be within a range or below a limit.
- **Dropout**: Probabilistically remove inputs during training.
- **Noise**: Add statistical noise to inputs during training.
- **Early Stopping**: Monitor model performance on a validation set and stop training when performance degrades.
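To illustrate one of these methods, the sketch below shows a weight constraint in the form of a maximum norm: if the weight vector of a node grows beyond a limit, it is rescaled back onto that limit. The function name and the limit of 2.0 are assumptions made for the example.

```python
import math

# illustrative max-norm weight constraint for the weights of a single node
def max_norm(weights, limit=2.0):
    """Rescale the weights so that their vector norm does not exceed limit."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= limit:
        # already within the constraint, leave unchanged
        return list(weights)
    scale = limit / norm
    return [w * scale for w in weights]

print(max_norm([0.3, 0.4]))  # norm 0.5, within the limit
print(max_norm([3.0, 4.0]))  # norm 5.0, rescaled to norm 2.0
```

Unlike a penalty, a constraint of this kind does not change the loss; it is applied to the weights directly, typically after each update.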

Most of these methods have been demonstrated (or proven) to approximate the effect of adding a penalty to the loss function.

Each method approaches the problem differently, offering benefits in terms of a mixture of generalization performance, configurability, and/or computational complexity.

This section outlines some recommendations for using regularization methods for deep learning neural networks.

You should always consider using regularization, unless you have a very large dataset, e.g. big-data scale.

Unless your training set contains tens of millions of examples or more, you should include some mild forms of regularization from the start.

— Page 426, Deep Learning, 2016.

A good general recommendation is to design a neural network structure that is under-constrained and to use regularization to reduce the likelihood of overfitting.

… controlling the complexity of the model is not a simple matter of finding the model of the right size, with the right number of parameters. Instead, … in practical deep learning scenarios, we almost always do find—that the best fitting model (in the sense of minimizing generalization error) is a large model that has been regularized appropriately.

— Page 229, Deep Learning, 2016.

Early stopping should almost universally be used in addition to a method to keep weights small during training.

Early stopping should be used almost universally.

— Page 426, Deep Learning, 2016.
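The logic of early stopping can be sketched in a few lines. The version below assumes a simple "patience" rule, where training halts once validation loss has failed to improve for a set number of epochs; the function name and the loss values are hypothetical.

```python
# illustrative early stopping rule based on validation loss and patience
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best = float('inf')
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # improvement: remember it and reset the counter
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # never triggered: training ran to the end
    return len(val_losses) - 1, best_epoch

# made-up validation losses that fall and then rise
val_loss = [0.95, 0.78, 0.66, 0.60, 0.58, 0.59, 0.63, 0.70]
stop, best = early_stopping_epoch(val_loss, patience=2)
print('stop at epoch %d, keep weights from epoch %d' % (stop, best))
```

In practice the weights from the best epoch would be saved and restored, so that the final model is the one with the lowest validation loss rather than the last one trained.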

Some more specific recommendations include:

- **Classical**: use early stopping and weight decay (L2 weight regularization).
- **Alternate**: use early stopping and added noise with a weight constraint.
- **Modern**: use early stopping and dropout, in addition to a weight constraint.

These recommendations would suit Multilayer Perceptrons and Convolutional Neural Networks.

Some recommendations for recurrent neural nets include:

- **Classical**: use early stopping with added weight noise and a weight constraint such as maximum norm.
- **Modern**: use early stopping with a backpropagation-through-time-aware version of dropout and a weight constraint.

There are no silver bullets when it comes to regularization and systematic experimentation is strongly encouraged.

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 7 Regularization for Deep Learning, Deep Learning, 2016.
- Section 5.5. Regularization in Neural Networks, Pattern Recognition and Machine Learning, 2006.
- Chapter 16, Heuristics for Improving Generalization, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.
- Chapter 9 Learning and Generalization, Neural Networks for Pattern Recognition, 1995.

- What is overfitting and how can I avoid it? Neural Network FAQ.
- Regularization (mathematics), Wikipedia.

In this post, you discovered the problem of overfitting when training neural networks and how it can be addressed with regularization methods.

Specifically, you learned:

- Underfitting can easily be addressed by increasing the capacity of the network, but overfitting requires the use of specialized techniques.
- Regularization methods like weight decay provide an easy way to control overfitting for large neural network models.
- A modern recommendation for regularization is to use early stopping with dropout and a weight constraint.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Improve Deep Learning Model Robustness by Adding Noise appeared first on Machine Learning Mastery.

Adding noise to an underconstrained neural network model with a small training dataset can have a regularizing effect and reduce overfitting.

Keras supports the addition of Gaussian noise via a separate layer called the GaussianNoise layer. This layer can be used to add noise to an existing model.

In this tutorial, you will discover how to add noise to deep learning models in Keras in order to reduce overfitting and improve model generalization.

After completing this tutorial, you will know:

- Noise can be added to a neural network model via the GaussianNoise layer.
- The GaussianNoise layer can be used to add noise to input values or between hidden layers.
- How to add a GaussianNoise layer in order to reduce overfitting in a Multilayer Perceptron model for classification.

Let’s get started.

This tutorial is divided into three parts; they are:

- Noise Regularization in Keras
- Noise Regularization in Models
- Noise Regularization Case Study

Keras supports the addition of noise to models via the GaussianNoise layer.

This is a layer that will add noise to inputs of a given shape. The noise has a mean of zero and requires that a standard deviation of the noise be specified as a parameter. For example:

```python
# import noise layer
from keras.layers import GaussianNoise
# define noise layer
layer = GaussianNoise(0.1)
```

The output of the layer will have the same shape as the input, with the only modification being the addition of noise to the values.
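To show what such a layer does to its inputs, here is a plain-Python stand-in. It is not the Keras implementation; the function `gaussian_noise` and its `training` flag are illustrative, and the flag mirrors the fact that noise layers of this kind are only active while training.

```python
import random

# illustrative stand-in for a Gaussian noise layer (not the Keras code)
def gaussian_noise(values, stddev, training=True, rng=random):
    """Add zero-mean Gaussian noise to each value; pass through when not training."""
    if not training:
        return list(values)
    return [v + rng.gauss(0.0, stddev) for v in values]

rng = random.Random(1)  # seeded so the run is repeatable
inputs = [0.5, -0.2, 0.9]
noisy = gaussian_noise(inputs, 0.1, rng=rng)
print(noisy)
print(gaussian_noise(inputs, 0.1, training=False))
```

The output has the same number of values as the input, each shifted by a small random amount during training and left untouched otherwise.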

The GaussianNoise layer can be used in a few different ways with a neural network model.

Firstly, it can be used as an input layer to add noise to input variables directly. This is the traditional use of noise as a regularization method in neural networks.

Below is an example of defining a GaussianNoise layer as an input layer for a model that takes 2 input variables.

```python
...
model.add(GaussianNoise(0.01, input_shape=(2,)))
...
```

Noise can also be added between hidden layers in the model. Given the flexibility of Keras, the noise can be added before or after the use of the activation function. It may make more sense to add it before the activation; nevertheless, both options are possible.

Below is an example of a GaussianNoise layer that adds noise to the linear output of a Dense layer before a rectified linear activation function, perhaps a more appropriate use of noise between hidden layers.

```python
...
model.add(Dense(32))
model.add(GaussianNoise(0.1))
model.add(Activation('relu'))
model.add(Dense(32))
...
```

Noise can also be added after the activation function, much like using a noisy activation function. One downside of this usage is that the resulting values may be out-of-range from what the activation function may normally provide. For example, a value with added noise may be less than zero, whereas the relu activation function will only ever output values 0 or larger.
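The difference between the two placements can be checked with a small stand-alone experiment. The sketch below uses plain Python rather than Keras, with randomly drawn values standing in for the weighted sums of a layer; all names are illustrative.

```python
import random

def relu(z):
    """Rectified linear activation."""
    return max(0.0, z)

rng = random.Random(1)  # seeded so the run is repeatable
# hypothetical weighted sums arriving at the activation function
pre_activations = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
stddev = 0.1

# noise added before the activation: outputs stay in the relu range
noise_before = [relu(z + rng.gauss(0.0, stddev)) for z in pre_activations]
# noise added after the activation: outputs can drop below zero
noise_after = [relu(z) + rng.gauss(0.0, stddev) for z in pre_activations]

print('min with noise before relu: %.3f' % min(noise_before))
print('min with noise after relu: %.3f' % min(noise_after))
```

With noise added before the activation, the minimum output can never fall below zero, whereas adding it afterwards produces some negative values.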

```python
...
model.add(Dense(32, activation='relu'))
model.add(GaussianNoise(0.1))
model.add(Dense(32))
...
```

Let’s take a look at how noise regularization can be used with some common network types.

The example below adds noise between two Dense fully connected layers.

```python
# example of noise between fully connected layers
from keras.layers import Dense
from keras.layers import GaussianNoise
from keras.layers import Activation
...
model.add(Dense(32))
model.add(GaussianNoise(0.1))
model.add(Activation('relu'))
model.add(Dense(1))
...
```

The example below adds noise after a pooling layer in a convolutional network.

```python
# example of noise for a CNN
from keras.layers import Dense
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import GaussianNoise
...
model.add(Conv2D(32, (3,3)))
model.add(Conv2D(32, (3,3)))
model.add(MaxPooling2D())
model.add(GaussianNoise(0.1))
model.add(Dense(1))
...
```

The example below adds noise between an LSTM recurrent layer and a Dense fully connected layer.

```python
# example of noise between LSTM and fully connected layers
from keras.layers import Dense
from keras.layers import Activation
from keras.layers import LSTM
from keras.layers import GaussianNoise
...
model.add(LSTM(32))
model.add(GaussianNoise(0.5))
model.add(Activation('relu'))
model.add(Dense(1))
...
```

Now that we have seen how to add noise to neural network models, let’s look at a case study of adding noise to an overfit model to reduce generalization error.

In this section, we will demonstrate how to use noise regularization to reduce overfitting of an MLP on a simple binary classification problem.

This example provides a template for applying noise regularization to your own neural network for classification and regression problems.

We will use a standard binary classification problem that defines two two-dimensional concentric circles of observations, one circle for each class.

Each observation has two input variables with the same scale and a class output value of either 0 or 1. This dataset is called the “*circles*” dataset because of the shape of the observations in each class when plotted.

We can use the make_circles() function to generate observations from this problem. We will add noise to the data and seed the random number generator so that the same samples are generated each time the code is run.

```python
# generate 2d classification dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
```

We can plot the dataset where the two variables are taken as x and y coordinates on a graph and the class value is taken as the color of the observation.

The complete example of generating the dataset and plotting it is listed below.

```python
# generate two circles dataset
from sklearn.datasets import make_circles
from matplotlib import pyplot
from pandas import DataFrame
# generate 2d classification dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# scatter plot, dots colored by class value
df = DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue'}
fig, ax = pyplot.subplots()
grouped = df.groupby('label')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
pyplot.show()
```

Running the example creates a scatter plot showing the concentric circles shape of the observations in each class. We can see the noise in the dispersal of the points making the circles less obvious.

This is a good test problem because the classes cannot be separated by a line, e.g. are not linearly separable, requiring a nonlinear method such as a neural network to address.

We have only generated 100 samples, which is small for a neural network, providing the opportunity to overfit the training dataset and have higher error on the test dataset, a good case for using regularization. Further, the samples have noise, giving the model an opportunity to learn aspects of the samples that don’t generalize.

We can develop an MLP model to address this binary classification problem.

The model will have one hidden layer with more nodes than may be required to solve this problem, providing an opportunity to overfit. We will also train the model for longer than is required to ensure the model overfits.

Before we define the model, we will split the dataset into train and test sets, using 30 examples to train the model and 70 to evaluate the fit model’s performance.

```python
# generate 2d classification dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
```

Next, we can define the model.

The hidden layer uses 500 nodes and the rectified linear activation function. A sigmoid activation function is used in the output layer in order to predict class values of 0 or 1. The model is optimized using the binary cross entropy loss function, suitable for binary classification problems, and the efficient Adam version of gradient descent.

```python
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```

The defined model is then fit on the training data for 4,000 epochs and the default batch size of 32.

We will also use the test dataset as a validation dataset.

```python
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
```

We can evaluate the performance of the model on the test dataset and report the result.

```python
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
```

Finally, we will plot the performance of the model on both the train and test set each epoch.

If the model does indeed overfit the training dataset, we would expect the line plot of accuracy on the training set to continue to increase and the test set to rise and then fall again as the model learns statistical noise in the training dataset.

```python
# plot history
# note: Keras 2.3 and later name these keys 'accuracy' and 'val_accuracy'
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
```

We can tie all of these pieces together; the complete example is listed below.

```python
# mlp overfit on the two circles dataset
from sklearn.datasets import make_circles
from keras.layers import Dense
from keras.models import Sequential
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
```

Running the example reports the model performance on the train and test datasets.

We can see that the model has better performance on the training dataset than the test dataset, one possible sign of overfitting.

Your specific results may vary given the stochastic nature of the neural network and the training algorithm. Because the model is severely overfit, we generally would not expect much, if any, variance in the accuracy across repeated runs of the model on the same dataset.

Train: 1.000, Test: 0.757

A figure is created showing line plots of the model accuracy on the train and test sets.

We can see the expected shape of an overfit model, where test accuracy increases to a point and then begins to decrease again.

The dataset is defined by points that have a controlled amount of statistical noise.

Nevertheless, because the dataset is small, we can add further noise to the input values. This will have the effect of creating more samples or resampling the domain, making the structure of the input space artificially smoother. This may make the problem easier to learn and improve generalization performance.

We can add a GaussianNoise layer as the input layer. The amount of noise must be small. Given that the input values are roughly within the range [-1, 1], we will add Gaussian noise with a mean of 0.0 and a standard deviation of 0.01, chosen arbitrarily.

```python
# define model
model = Sequential()
model.add(GaussianNoise(0.01, input_shape=(2,)))
model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```

The complete example with this change is listed below.

```python
# mlp overfit on the two circles dataset with input noise
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import GaussianNoise
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(GaussianNoise(0.01, input_shape=(2,)))
model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
```

Running the example reports the model performance on the train and test datasets.

Your results will vary, given both the stochastic nature of the learning algorithm and the stochastic nature of the noise added to the model. Try running the example a few times.

In this case, we may see a small lift in performance of the model on the test dataset, with no negative impact on the training dataset.

Train: 1.000, Test: 0.771

We clearly see the impact of the added noise on the evaluation of the model during training as graphed on the line plot. The noise causes the accuracy of the model to jump around during training, possibly due to the noise introducing points that conflict with true points from the training dataset.

Perhaps a lower input noise standard deviation would be more appropriate.
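One way to explore this is a small grid search over candidate standard deviations. The sketch below is generic: `score_for_stddev` is a hypothetical callable that would build, fit, and evaluate the model for a given noise level and return its test accuracy. Here a toy scoring function stands in for it so that the snippet runs on its own.

```python
# illustrative grid search over noise standard deviations
def grid_search_stddev(score_for_stddev, candidates):
    """Score each candidate and return (best_candidate, all_results)."""
    results = {s: score_for_stddev(s) for s in candidates}
    best = max(results, key=results.get)
    return best, results

# toy stand-in for "fit the model with this noise level and return test accuracy"
def toy_score(stddev):
    return 1.0 - abs(stddev - 0.05)

best, results = grid_search_stddev(toy_score, [0.001, 0.01, 0.05, 0.1])
for stddev, score in results.items():
    print('stddev=%.3f score=%.3f' % (stddev, score))
print('best stddev: %.3f' % best)
```

Because a real model gives different results on each run, the scoring function would ideally average several repeated fits for each candidate.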

The model still shows a pattern of being overfit, with a rise and then fall in test accuracy over training epochs.

An alternative approach to adding noise to the input values is to add noise between the hidden layers.

This can be done by adding noise to the linear output of the layer (weighted sum) before the activation function is applied, in this case a rectified linear activation function. We can also use a larger standard deviation for the noise as the model is less sensitive to noise at this level given the presumably larger weights from being overfit. We will use a standard deviation of 0.1, again, chosen arbitrarily.

```python
# define model
model = Sequential()
model.add(Dense(500, input_dim=2))
model.add(GaussianNoise(0.1))
model.add(Activation('relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```

The complete example with Gaussian noise between the hidden layers is listed below.

```python
# mlp overfit on the two circles dataset with hidden layer noise
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Activation
from keras.layers import GaussianNoise
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2))
model.add(GaussianNoise(0.1))
model.add(Activation('relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
```

Running the example reports the model performance on the train and test datasets.

Your results will vary, given both the stochastic nature of the learning algorithm and the stochastic nature of the noise added to the model. Try running the example a few times.

In this case, we can see a marked increase in the performance of the model on the hold out test set.

Train: 0.967, Test: 0.814

We can also see from the line plot of accuracy over training epochs that the model no longer appears to show the properties of being overfit.

We can also experiment and add the noise after the outputs of the first hidden layer pass through the activation function.

```python
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(GaussianNoise(0.1))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```

The complete example is listed below.

# mlp overfit on the two circles dataset with hidden layer noise (alternate)
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import GaussianNoise
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(GaussianNoise(0.1))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()

Running the example reports the model performance on the train and test datasets.

Surprisingly, we see little difference in the performance of the model.

Train: 0.967, Test: 0.814

Again, we can see from the line plot of accuracy over training epochs that the model no longer shows signs of overfitting.

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Repeated Evaluation**. Update the example to use repeated evaluation of the model with and without noise and report performance as the mean and standard deviation over repeats.
- **Grid Search Standard Deviation**. Develop a grid search in order to discover the amount of noise that reliably results in the best performing model.
- **Input and Hidden Noise**. Update the example to introduce noise at both the input and hidden layers of the model.
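
As a starting point for the repeated evaluation idea, the sketch below shows one possible harness in plain Python. The `evaluate_once()` function is a hypothetical stand-in that returns a made-up score; in practice it would build, fit, and evaluate the model and return the test accuracy.

```python
from statistics import mean, stdev
import random

def evaluate_once(seed):
    # Hypothetical stand-in: in practice this would build, fit, and
    # evaluate the model, then return its test accuracy.
    rng = random.Random(seed)
    return 0.80 + rng.uniform(-0.05, 0.05)

def repeated_evaluation(evaluate, n_repeats=30):
    # collect one score per independent run
    scores = [evaluate(seed) for seed in range(n_repeats)]
    return mean(scores), stdev(scores)

avg, spread = repeated_evaluation(evaluate_once)
print('Accuracy: %.3f (%.3f)' % (avg, spread))
```

The same harness could be called once for the model with noise and once without, so that the two means can be compared in light of their standard deviations.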

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Keras Regularizers API
- Keras Core Layers API
- Keras Convolutional Layers API
- Keras Recurrent Layers API
- Keras Noise API
- sklearn.datasets.make_circles API

In this tutorial, you discovered how to add noise to deep learning models in Keras in order to reduce overfitting and improve model generalization.

Specifically, you learned:

- Noise can be added to a neural network model via the GaussianNoise layer.
- The GaussianNoise layer can be used to add noise to input values or between hidden layers.
- How to add a GaussianNoise layer in order to reduce overfitting in a Multilayer Perceptron model for classification.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Improve Deep Learning Model Robustness by Adding Noise appeared first on Machine Learning Mastery.

The post Train Neural Networks With Noise to Reduce Overfitting appeared first on Machine Learning Mastery.

Training a neural network with a small dataset can cause the network to memorize all training examples, in turn leading to poor performance on a holdout dataset.

Small datasets may also represent a harder mapping problem for neural networks to learn, given the patchy or sparse sampling of points in the high-dimensional input space.

One approach to making the input space smoother and easier to learn is to add noise to inputs during training.

In this post, you will discover that adding noise to a neural network during training can improve the robustness of the network, resulting in better generalization and faster learning.

After reading this post, you will know:

- Small datasets can make learning challenging for neural nets and the examples can be memorized.
- Adding noise during training can make the training process more robust and reduce generalization error.
- Noise is traditionally added to the inputs, but can also be added to weights, gradients, and even activation functions.

Let’s get started.

This tutorial is divided into five parts; they are:

- Challenge of Small Training Datasets
- Add Random Noise During Training
- How and Where to Add Noise
- Examples of Adding Noise During Training
- Tips for Adding Noise During Training

Small datasets can introduce problems when training large neural networks.

The first problem is that the network may effectively memorize the training dataset. Instead of learning a general mapping from inputs to outputs, the model may learn the specific input examples and their associated outputs. This will result in a model that performs well on the training dataset, and poorly on new data, such as a holdout dataset.

The second problem is that a small dataset provides less opportunity to describe the structure of the input space and its relationship to the output. More training data provides a richer description of the problem from which the model may learn. Fewer data points means that rather than a smooth input space, the points may represent a jarring and disjointed structure that may result in a difficult, if not unlearnable, mapping function.

It is not always possible to acquire more data. Further, getting a hold of more data may not address these problems.

One approach to improving generalization error and to improving the structure of the mapping problem is to add random noise.

Many studies […] have noted that adding small amounts of input noise (jitter) to the training data often aids generalization and fault tolerance.

— Page 273, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

At first, this sounds like a recipe for making learning more challenging. It is a counter-intuitive suggestion for improving performance because one would expect noise to degrade performance of the model during training.

Heuristically, we might expect that the noise will ‘smear out’ each data point and make it difficult for the network to fit individual data points precisely, and hence will reduce over-fitting. In practice, it has been demonstrated that training with noise can indeed lead to improvements in network generalization.

— Page 347, Neural Networks for Pattern Recognition, 1995.

The addition of noise during the training of a neural network model has a regularization effect and, in turn, improves the robustness of the model. It has been shown to have a similar impact on the loss function as the addition of a penalty term, as in the case of weight regularization methods.

It is well known that the addition of noise to the input data of a neural network during training can, in some circumstances, lead to significant improvements in generalization performance. Previous work has shown that such training with noise is equivalent to a form of regularization in which an extra term is added to the error function.

— Training with Noise is Equivalent to Tikhonov Regularization, 1995.
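
As a minimal illustration of this equivalence (a special case worked out here, not the general result from the paper), consider a linear model with output w^T x, a target t, a squared error loss, and zero-mean Gaussian noise with variance sigma^2 added independently to each input. Averaging the error over the noise gives the noise-free error plus a penalty on the size of the weights:

```latex
\begin{aligned}
\mathbb{E}_{\epsilon}\left[\left(w^\top (x + \epsilon) - t\right)^2\right]
&= \left(w^\top x - t\right)^2
 + 2\left(w^\top x - t\right)\mathbb{E}_{\epsilon}\left[w^\top \epsilon\right]
 + \mathbb{E}_{\epsilon}\left[\left(w^\top \epsilon\right)^2\right] \\
&= \left(w^\top x - t\right)^2 + \sigma^2 \lVert w \rVert^2,
\qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I).
\end{aligned}
```

The cross term vanishes because the noise has zero mean, and the remaining extra term has the same form as an L2 weight penalty whose strength is set by the noise variance.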

In effect, adding noise expands the size of the training dataset. Each time a training sample is exposed to the model, random noise is added to the input variables making them different every time it is exposed to the model. In this way, adding noise to input samples is a simple form of data augmentation.

Injecting noise in the input to a neural network can also be seen as a form of data augmentation.

— Page 241, Deep Learning, 2016.

Adding noise means that the network is less able to memorize training samples because they are changing all of the time, resulting in smaller network weights and a more robust network that has lower generalization error.

The noise means that it is as though new samples are being drawn from the domain in the vicinity of known samples, smoothing the structure of the input space. This smoothing may mean that the mapping function is easier for the network to learn, resulting in better and faster learning.

… input noise and weight noise encourage the neural-network output to be a smooth function of the input or its weights, respectively.

— The Effects of Adding Noise During Backpropagation Training on a Generalization Performance, 1996.

The most common type of noise used during training is the addition of Gaussian noise to input variables.

Gaussian noise, or white noise, has a mean of zero and a configurable standard deviation, and can be generated as needed using a pseudorandom number generator. The addition of Gaussian noise to the inputs to a neural network was traditionally referred to as “*jitter*” or “*random jitter*” after the use of the term in signal processing to refer to the uncorrelated random noise in electrical circuits.

The amount of noise added (e.g. the spread or standard deviation) is a configurable hyperparameter. Too little noise has no effect, whereas too much noise makes the mapping function too challenging to learn.

This is generally done by adding a random vector onto each input pattern before it is presented to the network, so that, if the patterns are being recycled, a different random vector is added each time.

— Training with Noise is Equivalent to Tikhonov Regularization, 1995.

The standard deviation of the random noise controls the amount of spread and can be adjusted based on the scale of each input variable. It can be easier to configure if the scale of the input variables has first been normalized.

Noise is only added during training. No noise is added during the evaluation of the model or when the model is used to make predictions on new data.
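
The procedure described above can be sketched in plain Python as follows. This is only an illustration: the input values and the noise standard deviation of 0.1 are hypothetical, and the training update itself is left as a comment.

```python
import random

def add_jitter(sample, stddev, rng):
    # return a noisy copy; the stored training sample is left untouched
    return [value + rng.gauss(0.0, stddev) for value in sample]

rng = random.Random(1)
train_X = [[0.5, -1.2], [1.5, 0.3]]   # hypothetical, already-scaled inputs
stddev = 0.1                          # amount of noise, a hyperparameter to tune

for epoch in range(3):
    # a different random vector is added each time the patterns are recycled
    noisy_X = [add_jitter(sample, stddev, rng) for sample in train_X]
    # ... a training update on noisy_X would go here ...
    print(epoch, noisy_X[0])

# evaluation and prediction use the clean inputs, with no noise added
eval_X = train_X
```

Because the noise is drawn fresh on every pass, the model never sees exactly the same input twice, while the stored dataset itself is never modified.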

The addition of noise is also an important part of automatic feature learning, such as in the case of autoencoders, so-called denoising autoencoders that explicitly require models to learn robust features in the presence of noise added to inputs.

We have seen that the reconstruction criterion alone is unable to guarantee the extraction of useful features as it can lead to the obvious solution “simply copy the input” or similarly uninteresting ones that trivially maximizes mutual information. […] we change the reconstruction criterion for a both more challenging and more interesting objective: cleaning partially corrupted input, or in short denoising.

Although additional noise to the inputs is the most common and widely studied approach, random noise can be added to other parts of the network during training. Some examples include:

- **Add noise to activations**, i.e. the outputs of each layer.
- **Add noise to weights**, i.e. an alternative to the inputs.
- **Add noise to the gradients**, i.e. the direction to update weights.
- **Add noise to the outputs**, i.e. the labels or target variables.

The addition of noise to the layer activations allows noise to be used at any point in the network. This can be beneficial for very deep networks. Noise can be added to the layer outputs themselves, but this is more likely achieved via the use of a noisy activation function.

The addition of noise to weights allows the approach to be used throughout the network in a consistent way instead of adding noise to inputs and layer activations. This is particularly useful in recurrent neural networks.

Another way that noise has been used in the service of regularizing models is by adding it to the weights. This technique has been used primarily in the context of recurrent neural networks. […] Noise applied to the weights can also be interpreted as equivalent (under some assumptions) to a more traditional form of regularization, encouraging stability of the function to be learned.

— Page 242, Deep Learning, 2016.

The addition of noise to gradients focuses more on improving the robustness of the optimization process itself rather than the structure of the input domain. The amount of noise can start high at the beginning of training and decrease over time, much like a decaying learning rate. This approach has proven to be an effective method for very deep networks and for a variety of different network types.

We consistently see improvement from injected gradient noise when optimizing a wide variety of models, including very deep fully-connected networks, and special-purpose architectures for question answering and algorithm learning. […] Our experiments indicate that adding annealed Gaussian noise by decaying the variance works better than using fixed Gaussian noise

— Adding Gradient Noise Improves Learning for Very Deep Networks, 2015.
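
A decaying noise schedule of this kind might look like the sketch below. The functional form (a variance of eta / (1 + t)^gamma) and the default values of `eta` and `gamma` are assumptions modeled on the kind of schedule described in that paper; check the paper before relying on them. The gradient values are made up for illustration.

```python
import math
import random

def noise_stddev(step, eta=0.3, gamma=0.55):
    # variance decays as eta / (1 + step) ** gamma; eta and gamma are assumed values
    return math.sqrt(eta / (1.0 + step) ** gamma)

def noisy_gradient(gradient, step, rng):
    # add zero-mean Gaussian noise to every component of the gradient
    stddev = noise_stddev(step)
    return [g + rng.gauss(0.0, stddev) for g in gradient]

rng = random.Random(1)
for step in (0, 10, 100, 1000):
    print(step, round(noise_stddev(step), 4), noisy_gradient([0.2, -0.1], step, rng))
```

The printed standard deviation shrinks as the step count grows, so the noise dominates early updates and fades as training progresses.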

Adding noise to the activations, weights, or gradients provides a more generic approach to adding noise that is invariant to the types of input variables provided to the model.

If the problem domain is believed or expected to have mislabeled examples, then the addition of noise to the class label can improve the model’s robustness to this type of error, although it can easily derail the learning process.

Adding noise to a continuous target variable in the case of regression or time series forecasting is much like the addition of noise to the input variables and may be a better use case.

This section summarizes some examples where the addition of noise during training has been used.

Lasse Holmstrom and Petri Koistinen studied the addition of random noise both analytically and experimentally with MLPs in the 1992 paper titled “Using Additive Noise in Back-Propagation Training.” They recommend first standardizing input variables, then using cross-validation to choose the amount of noise to use during training.

If a single general-purpose noise design method should be suggested, we would pick maximizing the cross-validated likelihood function. This method is easy to implement, is completely data-driven, and has a validity that is supported by theoretical consistency results

Klaus Greff, et al. in their 2016 paper titled “LSTM: A Search Space Odyssey” used a hyperparameter search for the standard deviation for Gaussian noise on the input variables for a suite of sequence prediction tasks and found that it almost universally resulted in worse performance.

Additive Gaussian noise on the inputs, a traditional regularizer for neural networks, has been used for LSTM as well. However, we find that not only does it almost always hurt performance, it also slightly increases training times.

Alex Graves, et al. in their groundbreaking 2013 paper titled “Speech recognition with deep recurrent neural networks” that achieved then state-of-the-art results for speech recognition added noise to the weights of LSTMs during training.

… weight noise [was used] (the addition of Gaussian noise to the network weights during training). Weight noise was added once per training sequence, rather than at every timestep. Weight noise tends to ‘simplify’ neural networks, in the sense of reducing the amount of information required to transmit the parameters, which improves generalisation.

In a prior 2011 paper that studies different types of static and adaptive weight noise titled “Practical Variational Inference for Neural Networks,” Graves recommends using early stopping in conjunction with the addition of weight noise with LSTMs.

… in practice early stopping is required to prevent overfitting when training with weight noise.

This section provides some tips for adding noise during training with your neural network.

Noise can be added to training regardless of the type of problem that is being addressed.

It is appropriate to try adding noise to both classification and regression type problems.

The type of noise can be specialized to the types of data used as input to the model, for example, two-dimensional noise in the case of images and signal noise in the case of audio data.

Adding noise during training is a generic method that can be used regardless of the type of neural network that is being used.

It was a method used primarily with multilayer Perceptrons given their prior dominance, but can be and is used with Convolutional and Recurrent Neural Networks.

It is important that the addition of noise has a consistent effect on the model.

This requires that the input data is rescaled so that all variables have the same scale, so that when noise is added to the inputs with a fixed variance, it has the same effect. This also applies to adding noise to weights and gradients as they too are affected by the scale of the inputs.

This can be achieved via standardization or normalization of input variables.

If random noise is added after data scaling, then the variables may need to be rescaled again, perhaps per mini-batch.
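
A minimal sketch of why rescaling matters, in plain Python. The `age` and `income` variables are hypothetical and were chosen only because their raw scales differ by a factor of 1,000.

```python
from statistics import mean, pstdev

def standardize(column):
    # rescale one input variable to zero mean and unit standard deviation
    mu, sigma = mean(column), pstdev(column)
    return [(value - mu) / sigma for value in column]

# two hypothetical input variables on very different scales
age = [20.0, 30.0, 40.0, 50.0]
income = [20000.0, 30000.0, 40000.0, 50000.0]

scaled_age = standardize(age)
scaled_income = standardize(income)

# after scaling, a fixed noise standard deviation such as 0.1 is the same
# fraction of each variable's spread
print(pstdev(scaled_age), pstdev(scaled_income))
```

Without the rescaling step, noise with a standard deviation of 0.1 would be noticeable relative to `age` but negligible relative to `income`.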

You cannot know how much noise will benefit your specific model on your training dataset.

Experiment with different amounts, and even different types of noise, in order to discover what works best.

Be systematic and use controlled experiments, perhaps on smaller datasets across a range of values.

Noise is only added during the training of your model.

Be sure that any source of noise is not added during the evaluation of your model, or when your model is used to make predictions on new data.

This section provides more resources on the topic if you are looking to go deeper.

- Section 7.5 Noise Robustness, Deep Learning, 2016.
- Chapter 17 Training with Noisy Inputs, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.
- Section 9.3, Training with Noise, Neural Networks for Pattern Recognition, 1995.

- Creating artificial neural networks that generalize, 1991.
- Deep networks for robust visual recognition, 2010.
- Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion, 2010.
- Analyzing noise in autoencoders and deep networks, 2014.
- The Effects of Adding Noise During Backpropagation Training on a Generalization Performance, 1996.
- Training with Noise is Equivalent to Tikhonov Regularization, 1995.
- Adding Gradient Noise Improves Learning for Very Deep Networks, 2015.
- Noisy Activation Functions, 2016.

In this post, you discovered that adding noise to a neural network during training can improve the robustness of the network resulting in better generalization and faster learning.

Specifically, you learned:

- Small datasets can make learning challenging for neural nets and the examples can be memorized.
- Adding noise during training can make the training process more robust and reduce generalization error.
- Noise is traditionally added to the inputs, but can also be added to weights, gradients, and even activation functions.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Train Neural Networks With Noise to Reduce Overfitting appeared first on Machine Learning Mastery.

The post How to Stop Training Deep Neural Networks At the Right Time Using Early Stopping appeared first on Machine Learning Mastery.

A problem with training neural networks is in the choice of the number of training epochs to use.

Too many epochs can lead to overfitting of the training dataset, whereas too few may result in an underfit model. Early stopping is a method that allows you to specify an arbitrarily large number of training epochs and stop training once the model performance stops improving on a hold out validation dataset.

In this tutorial, you will discover the Keras API for adding early stopping to overfit deep learning neural network models.

After completing this tutorial, you will know:

- How to monitor the performance of a model during training using the Keras API.
- How to create and configure early stopping and model checkpoint callbacks using the Keras API.
- How to reduce overfitting by adding early stopping to an existing model.

Let’s get started.

This tutorial is divided into six parts; they are:

- Using Callbacks in Keras
- Evaluating a Validation Dataset
- Monitoring Model Performance
- Early Stopping in Keras
- Checkpointing in Keras
- Early Stopping Case Study

Callbacks provide a way to execute code and interact with the training model process automatically.

Callbacks can be provided to the *fit()* function via the “*callbacks*” argument.

First, callbacks must be instantiated.

... cb = Callback(...)

Then, one or more callbacks that you intend to use must be added to a Python list.

... cb_list = [cb, ...]

Finally, the list of callbacks is provided to the “*callbacks*” argument when fitting the model.

... model.fit(..., callbacks=cb_list)

Early stopping requires that a validation dataset is evaluated during training.

This can be achieved by specifying the validation dataset to the fit() function when training your model.

There are two ways of doing this.

The first involves you manually splitting your training data into a train and validation dataset and specifying the validation dataset to the *fit()* function via the *validation_data* argument. For example:

... model.fit(train_X, train_y, validation_data=(val_x, val_y))

Alternately, the *fit()* function can automatically split your training dataset into train and validation sets based on a percentage split specified via the *validation_split* argument.

The *validation_split* is a value between 0 and 1 and defines the fraction of the training dataset to use for the validation dataset. For example:

... model.fit(train_X, train_y, validation_split=0.3)

In both cases, the model is not trained on the validation dataset. Instead, the model is evaluated on the validation dataset at the end of each training epoch.
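
To make the automatic split concrete, here is a rough sketch in plain Python of holding out the final fraction of the samples, which is how the Keras documentation describes *validation_split* (the validation data is taken from the last samples provided, before shuffling). The `split_for_validation()` helper is hypothetical and is not part of Keras.

```python
import math

def split_for_validation(samples, validation_split):
    # hold out the final fraction of the samples, taken before any shuffling
    split_at = int(math.floor(len(samples) * (1.0 - validation_split)))
    return samples[:split_at], samples[split_at:]

samples = list(range(8))
train, val = split_for_validation(samples, validation_split=0.25)
print(train)  # first 6 samples
print(val)    # last 2 samples
```

One practical consequence is that if your data is ordered, for example by class, it is worth shuffling it yourself before relying on an automatic split of this kind.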

The loss function chosen to be optimized for your model is calculated at the end of each epoch.

To callbacks, this is made available via the name “*loss*.”

If a validation dataset is specified to the *fit()* function via the *validation_data* or *validation_split* arguments, then the loss on the validation dataset will be made available via the name “*val_loss*.”

Additional metrics can be monitored during the training of the model.

They can be specified when compiling the model via the “*metrics*” argument to the compile function. This argument takes a Python list of known metric functions, such as ‘*mse*‘ for mean squared error and ‘*acc*‘ for accuracy. For example:

... model.compile(..., metrics=['acc'])

If additional metrics are monitored during training, they are also available to the callbacks via the same name, such as ‘*acc*‘ for accuracy on the training dataset and ‘*val_acc*‘ for the accuracy on the validation dataset. Or, ‘*mse*‘ for mean squared error on the training dataset and ‘*val_mse*‘ on the validation dataset.

Keras supports the early stopping of training via a callback called *EarlyStopping*.

This callback allows you to specify the performance measure to monitor, the trigger, and once triggered, it will stop the training process.

The *EarlyStopping* callback is configured when instantiated via arguments.

The “*monitor*” allows you to specify the performance measure to monitor in order to end training. Recall from the previous section that the calculation of measures on the validation dataset will have the ‘*val_*‘ prefix, such as ‘*val_loss*‘ for the loss on the validation dataset.

es = EarlyStopping(monitor='val_loss')

Based on the choice of performance measure, the “*mode*” argument will need to be specified as whether the objective of the chosen metric is to increase (maximize or ‘*max*‘) or to decrease (minimize or ‘*min*‘).

For example, we would seek a minimum for validation loss and a minimum for validation mean squared error, whereas we would seek a maximum for validation accuracy.

es = EarlyStopping(monitor='val_loss', mode='min')

By default, mode is set to ‘*auto*‘ and knows that you want to minimize loss or maximize accuracy.

That is all that is needed for the simplest form of early stopping. Training will stop when the chosen performance measure stops improving. To discover the training epoch on which training was stopped, the “*verbose*” argument can be set to 1. Once stopped, the callback will print the epoch number.

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)

Often, the first sign of no further improvement may not be the best time to stop training. This is because the model may coast into a plateau of no improvement or even get slightly worse before getting much better.

We can account for this by adding a delay to the trigger in terms of the number of epochs on which we would like to see no improvement. This can be done by setting the “*patience*” argument.

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=50)

The exact amount of patience will vary between models and problems. Reviewing plots of your performance measure can be very useful to get an idea of how noisy the optimization process for your model on your data may be.

By default, any improvement in the performance measure, no matter how small, will be considered an improvement. You may want to consider an improvement that is a specific increment, such as 1 unit for mean squared error or 1% for accuracy. This can be specified via the “*min_delta*” argument. Note that accuracy is reported as a fraction between 0 and 1, so an increment of 1% is specified as 0.01.

es = EarlyStopping(monitor='val_acc', mode='max', min_delta=0.01)
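
To make the interaction between “*patience*” and “*min_delta*” concrete, the sketch below imitates the trigger logic in plain Python for a measure that should be minimized, such as validation loss. It is a simplified illustration, not the Keras implementation, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=0, min_delta=0.0):
    # simplified imitation of the trigger for a 'min' mode measure such as val_loss
    best = float('inf')
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # large enough improvement: remember it and reset the counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # never triggered: training ran for all epochs
    return len(val_losses) - 1, best_epoch

# hypothetical validation losses with a small bump before further improvement
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.69, 0.75, 0.76, 0.77]
print(early_stopping_epoch(losses, patience=2))
print(early_stopping_epoch(losses, patience=3))
```

Each call returns the epoch at which training stops and the epoch with the best loss seen up to that point. With a patience of 2, training stops during the bump and never reaches the lower loss at epoch 5, whereas a patience of 3 rides out the bump and finds it.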

Finally, it may be desirable to stop training if performance does not improve beyond a given threshold or baseline. For example, you may be familiar with the training of the model (e.g. from learning curves) and know that a run whose validation loss stays above a given value is not worth continuing. This can be specified by setting the “*baseline*” argument.

This might be more useful when fine tuning a model, after the initial wild fluctuations in the performance measure seen in the early stages of training a new model are past.

es = EarlyStopping(monitor='val_loss', mode='min', baseline=0.4)

The *EarlyStopping* callback will stop training once triggered, but the model at the end of training may not be the model with best performance on the validation dataset.

An additional callback is required that will save the best model observed during training for later use. This is the *ModelCheckpoint* callback.

The *ModelCheckpoint* callback is flexible in the way it can be used, but in this case we will use it only to save the best model observed during training as defined by a chosen performance measure on the validation dataset.

Saving and loading models requires that HDF5 support has been installed on your workstation. For example, using the *pip* Python installer, this can be achieved as follows:

sudo pip install h5py

You can learn more from the h5py Installation documentation.

The callback will save the model to file, which requires that a path and filename be specified via the first argument.

mc = ModelCheckpoint('best_model.h5')

The preferred loss function to be monitored can be specified via the monitor argument, in the same way as the *EarlyStopping* callback. For example, loss on the validation dataset (the default).

mc = ModelCheckpoint('best_model.h5', monitor='val_loss')

Also, as with the *EarlyStopping* callback, we must specify the “*mode*” as either minimizing or maximizing the performance measure. Again, the default is ‘*auto*,’ which is aware of the standard performance measures.

mc = ModelCheckpoint('best_model.h5', monitor='val_loss', mode='min')

Finally, we are interested in only the very best model observed during training, rather than the best compared to the previous epoch, which might not be the best overall if training is noisy. This can be achieved by setting the “*save_best_only*” argument to *True*.

mc = ModelCheckpoint('best_model.h5', monitor='val_loss', mode='min', save_best_only=True)

That is all that is needed to ensure the model with the best performance is saved when using early stopping, or in general.
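
The effect of “*save_best_only*” can be illustrated with a small plain-Python sketch that lists the epochs at which the file would be written. It is a simplified imitation for a measure that is minimized, not the Keras implementation, and the loss values are made up.

```python
def epochs_saved(val_losses, save_best_only=True):
    # list the epochs at which a checkpoint would be written
    saved, best = [], float('inf')
    for epoch, loss in enumerate(val_losses):
        if not save_best_only:
            saved.append(epoch)          # every epoch overwrites the file
        elif loss < best:
            best = loss                  # new best so far: overwrite the file
            saved.append(epoch)
    return saved

losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.69, 0.75]
print(epochs_saved(losses))
print(epochs_saved(losses, save_best_only=False))
```

With a single fixed filename, the file left on disk is from the last epoch in the list: the best epoch in the first case, and simply the final epoch in the second.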

It may be interesting to know the value of the performance measure and at what epoch the model was saved. This can be printed by the callback by setting the “*verbose*” argument to “*1*“.

mc = ModelCheckpoint('best_model.h5', monitor='val_loss', mode='min', save_best_only=True, verbose=1)

The saved model can then be loaded and evaluated any time by calling the *load_model()* function.

# load a saved model
from keras.models import load_model
saved_model = load_model('best_model.h5')

Now that we know how to use the early stopping and model checkpoint APIs, let’s look at a worked example.

In this section, we will demonstrate how to use early stopping to reduce overfitting of an MLP on a simple binary classification problem.

This example provides a template for applying early stopping to your own neural network for classification and regression problems.

We will use a standard binary classification problem that defines two semi-circles of observations, one semi-circle for each class.

Each observation has two input variables with the same scale and a class output value of either 0 or 1. This dataset is called the “*moons*” dataset because of the shape of the observations in each class when plotted.

We can use the make_moons() function to generate observations from this problem. We will add noise to the data and seed the random number generator so that the same samples are generated each time the code is run.

# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)

We can plot the dataset where the two variables are taken as x and y coordinates on a graph and the class value is taken as the color of the observation.

The complete example of generating the dataset and plotting it is listed below.

# generate two moons dataset
from sklearn.datasets import make_moons
from matplotlib import pyplot
from pandas import DataFrame
# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# scatter plot, dots colored by class value
df = DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue'}
fig, ax = pyplot.subplots()
grouped = df.groupby('label')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
pyplot.show()

Running the example creates a scatter plot showing the semi-circle or moon shape of the observations in each class. We can see the noise in the dispersal of the points making the moons less obvious.

This is a good test problem because the classes cannot be separated by a line, i.e. they are not linearly separable, requiring a nonlinear method such as a neural network to address.

We have only generated 100 samples, which is small for a neural network, providing the opportunity to overfit the training dataset and have higher error on the test dataset: a good case for using regularization. Further, the samples have noise, giving the model an opportunity to learn aspects of the samples that don’t generalize.

We can develop an MLP model to address this binary classification problem.

The model will have one hidden layer with more nodes than may be required to solve this problem, providing an opportunity to overfit. We will also train the model for longer than is required to ensure the model overfits.

Before we define the model, we will split the dataset into train and test sets, using 30 examples to train the model and 70 to evaluate the fit model’s performance.

# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]

Next, we can define the model.

The hidden layer uses 500 nodes and the rectified linear activation function. A sigmoid activation function is used in the output layer in order to predict class values of 0 or 1. The model is optimized using the binary cross entropy loss function, suitable for binary classification problems and the efficient Adam version of gradient descent.

    # define model
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

The defined model is then fit on the training data for 4,000 epochs and the default batch size of 32.

We will also use the test dataset as a validation dataset. This is just a simplification for this example. In practice, you would split the training set into train and validation and also hold back a test set for final model evaluation.

    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)

We can evaluate the performance of the model on the test dataset and report the result.

    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))

Finally, we will plot the loss of the model on both the train and test set each epoch.

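If the model does indeed overfit the training dataset, we would expect the line plot of loss on the training set to continue to decrease and the loss on the test set to fall and then rise again as the model learns statistical noise in the training dataset. A plot of accuracy would show the reverse: training accuracy continuing to increase and test accuracy rising and then falling again.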

    # plot training history
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    pyplot.show()

We can tie all of these pieces together; the complete example is listed below.

    # mlp overfit on the moons dataset
    from sklearn.datasets import make_moons
    from keras.layers import Dense
    from keras.models import Sequential
    from matplotlib import pyplot
    # generate 2d classification dataset
    X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
    # split into train and test
    n_train = 30
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    # plot training history
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    pyplot.show()

Running the example reports the model performance on the train and test datasets.

We can see that the model has better performance on the training dataset than the test dataset, one possible sign of overfitting.

Your specific results may vary given the stochastic nature of the neural network and the training algorithm. Because the model is severely overfit, we generally would not expect much, if any, variance in the accuracy across repeated runs of the model on the same dataset.

Train: 1.000, Test: 0.914

A figure is created showing line plots of the model loss on the train and test sets.

We can see the expected shape of an overfit model where test loss decreases to a point and then begins to increase again.

Reviewing the figure, we can also see flat spots and ups and downs in the validation loss. Any early stopping will have to account for these behaviors. We would also expect that a good time to stop training might be around epoch 800.
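Rather than reading a candidate stopping point off the plot by eye, it can be found from the recorded loss values. The snippet below is a minimal sketch of that idea. The helper name `best_epoch` is hypothetical and the `val_loss` list is made-up data for illustration, not the curve from this run; with Keras you would pass `history.history['val_loss']` instead.

```python
def best_epoch(val_loss):
    """Return (epoch, loss) for the lowest validation loss, counting epochs from 1."""
    index = min(range(len(val_loss)), key=lambda i: val_loss[i])
    return index + 1, val_loss[index]

# made-up validation losses for six epochs (illustration only)
val_loss = [0.70, 0.52, 0.41, 0.38, 0.40, 0.45]
epoch, loss = best_epoch(val_loss)
print('Lowest validation loss %.2f at epoch %d' % (loss, epoch))
```

On a noisy curve the single lowest point may be a lucky dip, so it is best treated as a rough guide to where stopping might happen rather than a precise answer.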

We can update the example and add very simple early stopping.

As soon as the loss of the model begins to increase on the test dataset, we will stop training.

First, we can define the early stopping callback.

    # simple early stopping
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)

We can then update the call to the *fit()* function and specify a list of callbacks via the “*callbacks*” argument.

    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0, callbacks=[es])

The complete example with the addition of simple early stopping is listed below.

    # mlp overfit on the moons dataset with simple early stopping
    from sklearn.datasets import make_moons
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.callbacks import EarlyStopping
    from matplotlib import pyplot
    # generate 2d classification dataset
    X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
    # split into train and test
    n_train = 30
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # simple early stopping
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0, callbacks=[es])
    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    # plot training history
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    pyplot.show()

Running the example reports the model performance on the train and test datasets.

We can also see that the callback stopped training at around epoch 200 (epoch 219 in this run). This is too early as we would expect an early stop to be around epoch 800. This is also highlighted by the classification accuracy on both the train and test sets, which is worse than with no early stopping.

    Epoch 00219: early stopping
    Train: 0.967, Test: 0.814

Reviewing the line plot of train and test loss, we can indeed see that training was stopped at the point when validation loss began to plateau for the first time.

We can improve the trigger for early stopping by waiting a while before stopping.

This can be achieved by setting the “*patience*” argument.

In this case, we will wait 200 epochs before training is stopped. Specifically, this means that we will allow training to continue for up to an additional 200 epochs after the point that validation loss started to degrade, giving the training process an opportunity to get across flat spots or find some additional improvement.

    # patient early stopping
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)
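To make the effect of patience concrete, here is a small pure-Python sketch of the counting logic. It mirrors the general idea only and is not the Keras implementation; the function name `stop_epoch_with_patience` and the loss values are assumptions made up for illustration.

```python
def stop_epoch_with_patience(val_loss, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_loss, start=1):
        if loss < best:
            # new best loss: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # no improvement: count this epoch against the patience budget
            wait += 1
            if wait >= patience:
                return epoch
    return None

# made-up validation losses with a small bump at epochs 3 and 4
val_loss = [0.9, 0.7, 0.71, 0.72, 0.6, 0.61, 0.62, 0.63]
print(stop_epoch_with_patience(val_loss, patience=1))
print(stop_epoch_with_patience(val_loss, patience=3))
```

With a patience of 1 the run stops at the first bump, whereas a patience of 3 rides out the bump, finds the lower loss at epoch 5, and only stops once three further epochs pass without improvement.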

The complete example with this change is listed below.

    # mlp overfit on the moons dataset with patient early stopping
    from sklearn.datasets import make_moons
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.callbacks import EarlyStopping
    from matplotlib import pyplot
    # generate 2d classification dataset
    X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
    # split into train and test
    n_train = 30
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # patient early stopping
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0, callbacks=[es])
    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    # plot training history
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    pyplot.show()

Running the example, we can see that training was stopped much later, in this case after epoch 1,000. Your specific results may differ given the stochastic nature of training neural networks.

We can also see that the performance on the test dataset is better than not using any early stopping.

    Epoch 01033: early stopping
    Train: 1.000, Test: 0.943

Reviewing the line plot of loss during training, we can see that the patience allowed the training to progress past some small flat and bad spots.

We can also see that test loss started to increase again in the last approximately 100 epochs.

This means that although the performance of the model has improved, we may not have the best performing or most stable model at the end of training. We can address this by using a *ModelCheckpoint* callback.

In this case, we are interested in saving the model with the best accuracy on the test dataset. We could also seek the model with the best loss on the test dataset, but this may or may not correspond to the model with the best accuracy.

This highlights an important concept in model selection. The notion of the “*best*” model during training may conflict when evaluated using different performance measures. Try to choose models based on the metric by which they will be evaluated and presented in the domain. In a balanced binary classification problem, this will most likely be classification accuracy. Therefore, we will use accuracy on the validation dataset in the *ModelCheckpoint* callback to save the best model observed during training.

mc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', verbose=1, save_best_only=True)

During training, the entire model will be saved to the file “*best_model.h5*” only when accuracy on the validation dataset improves overall across the entire training process. A verbose output will also inform us as to the epoch and accuracy value each time the model is saved to the same file (e.g. overwritten).

This new additional callback can be added to the list of callbacks when calling the *fit()* function.

history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0, callbacks=[es, mc])

We are no longer interested in the line plot of loss during training; it will be much the same as the previous run.

Instead, we want to load the saved model from file and evaluate its performance on the test dataset.

    # load the saved model
    saved_model = load_model('best_model.h5')
    # evaluate the model
    _, train_acc = saved_model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = saved_model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))

The complete example with these changes is listed below.

    # mlp overfit on the moons dataset with patient early stopping and model checkpointing
    from sklearn.datasets import make_moons
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.callbacks import EarlyStopping
    from keras.callbacks import ModelCheckpoint
    from matplotlib import pyplot
    from keras.models import load_model
    # generate 2d classification dataset
    X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
    # split into train and test
    n_train = 30
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # patient early stopping and model checkpoint
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)
    mc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', verbose=1, save_best_only=True)
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0, callbacks=[es, mc])
    # load the saved model
    saved_model = load_model('best_model.h5')
    # evaluate the model
    _, train_acc = saved_model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = saved_model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))

Running the example, we can see the verbose output from the *ModelCheckpoint* callback for both when a new best model is saved and from when no improvement was observed.

We can see that the best model was observed at epoch 879 during this run. Your specific results may vary given the stochastic nature of training neural networks.

Again, we can see that early stopping continued patiently until after epoch 1,000. Note that epoch 879 plus a patience of 200 is not epoch 1,044. Recall that early stopping is monitoring loss on the validation dataset and that the model checkpoint is saving models based on accuracy. As such, the patience of early stopping started at an epoch other than 879 (around epoch 844, that is, 1,044 minus 200).

    ...
    Epoch 00878: val_acc did not improve from 0.92857
    Epoch 00879: val_acc improved from 0.92857 to 0.94286, saving model to best_model.h5
    Epoch 00880: val_acc did not improve from 0.94286
    ...
    Epoch 01042: val_acc did not improve from 0.94286
    Epoch 01043: val_acc did not improve from 0.94286
    Epoch 01044: val_acc did not improve from 0.94286
    Epoch 01044: early stopping
    Train: 1.000, Test: 0.943

In this case, we don’t see any further improvement in model accuracy on the test dataset. Nevertheless, we have followed a good practice.

Why not monitor validation accuracy for early stopping?

This is a good question. The main reason is that accuracy is a coarse measure of model performance during training and that loss provides more nuance when using early stopping with classification problems. The same measure may be used for early stopping and model checkpointing in the case of regression, such as mean squared error.
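To see why accuracy is coarse, note that with 70 test examples it can only move in steps of 1/70, which is exactly the jump from 0.92857 (65 correct) to 0.94286 (66 correct) in the log above. The sketch below illustrates the other half of the point: two sets of predictions can have identical accuracy but quite different loss. The helper functions and the prediction values are made up for illustration and are not taken from the model in this tutorial.

```python
from math import log

def binary_cross_entropy(y_true, y_prob):
    """Mean binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that fall on the correct side of the threshold."""
    correct = sum(1 for y, p in zip(y_true, y_prob) if (p >= threshold) == (y == 1))
    return correct / len(y_true)

y_true = [1, 1, 0, 0]
confident = [0.9, 0.8, 0.2, 0.1]   # made-up predictions from one epoch
hesitant = [0.6, 0.55, 0.45, 0.4]  # made-up predictions from another epoch

for name, probs in (('confident', confident), ('hesitant', hesitant)):
    acc = accuracy(y_true, probs)
    loss = binary_cross_entropy(y_true, probs)
    print('%s: accuracy=%.3f, loss=%.3f' % (name, acc, loss))
```

Both sets of predictions classify every example correctly, so accuracy cannot tell them apart, while the loss is clearly lower for the more confident predictions.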

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Use Accuracy**. Update the example to monitor accuracy on the test dataset rather than loss, and plot learning curves showing accuracy.
- **Use True Validation Set**. Update the example to split the training set into train and validation sets, then evaluate the model on the test dataset.
- **Regression Example**. Create a new example of using early stopping to address overfitting on a simple regression problem and monitoring mean squared error.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Avoid Overfitting by Early Stopping With XGBoost in Python
- How to Check-Point Deep Learning Models in Keras

- H5Py Installation Documentation
- Keras Regularizers API
- Keras Core Layers API
- Keras Convolutional Layers API
- Keras Recurrent Layers API
- Keras Callbacks API
- sklearn.datasets.make_moons API

In this tutorial, you discovered the Keras API for adding early stopping to overfit deep learning neural network models.

Specifically, you learned:

- How to monitor the performance of a model during training using the Keras API.
- How to create and configure early stopping and model checkpoint callbacks using the Keras API.
- How to reduce overfitting by adding early stopping to an existing model.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Stop Training Deep Neural Networks At the Right Time Using Early Stopping appeared first on Machine Learning Mastery.

The post A Gentle Introduction to Early Stopping to Avoid Overtraining Deep Learning Neural Network Models appeared first on Machine Learning Mastery.

A major challenge in training neural networks is how long to train them.

Too little training will mean that the model will underfit the train and the test sets. Too much training will mean that the model will overfit the training dataset and have poor performance on the test set.

A compromise is to train on the training dataset but to stop training at the point when performance on a validation dataset starts to degrade. This simple, effective, and widely used approach to training neural networks is called early stopping.

In this post, you will discover that stopping the training of a neural network early before it has overfit the training dataset can reduce overfitting and improve the generalization of deep neural networks.

After reading this post, you will know:

- The challenge of training a neural network long enough to learn the mapping, but not so long that it overfits the training data.
- Model performance on a holdout validation dataset can be monitored during training and training stopped when generalization error starts to increase.
- The use of early stopping requires the selection of a performance measure to monitor, a trigger to stop training, and a selection of the model weights to use.

Let’s get started.

This tutorial is divided into five parts; they are:

- The Problem of Training Just Enough
- Stop Training When Generalization Error Increases
- How to Stop Training Early
- Examples of Early Stopping
- Tips for Early Stopping

Training neural networks is challenging.

When training a large network, there will be a point during training when the model will stop generalizing and start learning the statistical noise in the training dataset.

This overfitting of the training dataset will result in an increase in generalization error, making the model less useful at making predictions on new data.

The challenge is to train the network long enough that it is capable of learning the mapping from inputs to outputs, but not training the model so long that it overfits the training data.

However, all standard neural network architectures such as the fully connected multi-layer perceptron are prone to overfitting [10]: While the network seems to get better and better, i.e., the error on the training set decreases, at some point during training it actually begins to get worse again, i.e., the error on unseen examples increases.

— Early Stopping – But When?, 2002.

One approach to solving this problem is to treat the number of training epochs as a hyperparameter and train the model multiple times with different values, then select the number of epochs that result in the best performance on the train or a holdout test dataset.

The downside of this approach is that it requires multiple models to be trained and discarded. This can be computationally inefficient and time-consuming, especially for large models trained on large datasets over days or weeks.

An alternative approach is to train the model once for a large number of training epochs.

During training, the model is evaluated on a holdout validation dataset after each epoch. If the performance of the model on the validation dataset starts to degrade (e.g. loss begins to increase or accuracy begins to decrease), then the training process is stopped.

… the error measured with respect to independent data, generally called a validation set, often shows a decrease at first, followed by an increase as the network starts to over-fit. Training can therefore be stopped at the point of smallest error with respect to the validation data set

— Page 259, Pattern Recognition and Machine Learning, 2006.

The model at the time that training is stopped is then used and is known to have good generalization performance.

This procedure is called “*early stopping*” and is perhaps one of the oldest and most widely used forms of neural network regularization.

This strategy is known as early stopping. It is probably the most commonly used form of regularization in deep learning. Its popularity is due both to its effectiveness and its simplicity.

— Page 247, Deep Learning, 2016.

If regularization methods like weight decay that update the loss function to encourage less complex models are considered “*explicit*” regularization, then early stopping may be thought of as a type of “*implicit*” regularization, much like using a smaller network that has less capacity.

Regularization may also be implicit as is the case with early stopping.

— Understanding deep learning requires rethinking generalization, 2017.

Early stopping requires that you configure your network to be underconstrained, meaning that it has more capacity than is required for the problem.

When training the network, a larger number of training epochs is used than may normally be required, to give the network plenty of opportunity to fit, then begin to overfit the training dataset.

There are three elements to using early stopping; they are:

- Monitoring model performance.
- Trigger to stop training.
- The choice of model to use.

The performance of the model must be monitored during training.

This requires the choice of a dataset that is used to evaluate the model and a metric used to evaluate the model.

It is common to split the training dataset and use a subset, such as 30%, as a validation dataset used to monitor performance of the model during training. This validation set is not used to train the model. It is also common to use the loss on a validation dataset as the metric to monitor, although you may also use prediction error in the case of regression, or accuracy in the case of classification.

The loss of the model on the training dataset will also be available as part of the training procedure, and additional metrics may also be calculated and monitored on the training dataset.

Performance of the model is evaluated on the validation set at the end of each epoch, which adds an additional computational cost during training. This can be reduced by evaluating the model less frequently, such as every 2, 5, or 10 training epochs.

Once a scheme for evaluating the model is selected, a trigger for stopping the training process must be chosen.

The trigger will use a monitored performance metric to decide when to stop training. This is often the performance of the model on the holdout dataset, such as the loss.

In the simplest case, training is stopped as soon as the performance on the validation dataset decreases as compared to the performance on the validation dataset at the prior training epoch (e.g. an increase in loss).
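The simplest trigger can be sketched in a few lines of plain Python. The function name `first_degradation_epoch` and the loss values are assumptions made up for illustration.

```python
def first_degradation_epoch(val_loss):
    """Return the first 1-based epoch whose loss is higher than the epoch before, or None."""
    for i in range(1, len(val_loss)):
        if val_loss[i] > val_loss[i - 1]:
            return i + 1
    return None

# made-up validation losses: a brief rise at epoch 4, then further improvement
val_loss = [0.8, 0.6, 0.5, 0.52, 0.4, 0.3]
print(first_degradation_epoch(val_loss))
```

In this made-up curve the trigger fires at epoch 4, even though the loss goes on to reach lower values at epochs 5 and 6.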

More elaborate triggers may be required in practice. This is because the training of a neural network is stochastic and can be noisy. Plotted on a graph, the performance of a model on a validation dataset may go up and down many times. This means that the first sign of overfitting may not be a good place to stop training.

… the validation error can still go further down after it has begun to increase […] Real validation error curves almost always have more than one local minimum.

— Early Stopping – But When?, 2002.

Some more elaborate triggers may include:

- No change in metric over a given number of epochs.
- An absolute change in a metric.
- A decrease in performance observed over a given number of epochs.
- Average change in metric over a given number of epochs.
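As one possible reading of the last trigger in the list, the sketch below stops when the average per-epoch improvement in validation loss over a fixed window is no longer positive. The function name, the `window` and `min_improvement` parameters, and the loss values are all assumptions made up for illustration; other definitions of "average change" are equally reasonable.

```python
def stop_on_average_change(val_loss, window, min_improvement=0.0):
    """Return the first 1-based epoch at which the mean per-epoch decrease in loss
    over the last `window` steps is at or below `min_improvement`, or None."""
    for end in range(window, len(val_loss)):
        # the mean of the epoch-to-epoch changes telescopes to (first - last) / window
        average_decrease = (val_loss[end - window] - val_loss[end]) / window
        if average_decrease <= min_improvement:
            return end + 1
    return None

# made-up validation losses: one bump at epoch 3, then a slow rise from epoch 6
val_loss = [1.0, 0.8, 0.82, 0.7, 0.69, 0.70, 0.71, 0.72]
print(stop_on_average_change(val_loss, window=3))
```

Because the change is averaged over several epochs, the isolated bump at epoch 3 does not stop training; the trigger only fires once the loss has drifted upward across a whole window.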

Some delay or “*patience*” in stopping is almost always a good idea.

… results indicate that “slower” criteria, which stop later than others, on the average lead to improved generalization compared to “faster” ones. However, the training time that has to be expended for such improvements is rather large on average and also varies dramatically when slow criteria are used.

— Early Stopping – But When?, 2002.

At the time that training is halted, the model is known to have slightly worse generalization error than a model at a prior epoch.

As such, some consideration may need to be given as to exactly which model is saved. Specifically, the training epoch from which the model weights are saved to file.

This will depend on the trigger chosen to stop the training process. For example, if the trigger is a simple decrease in performance from one epoch to the next, then the weights for the model at the prior epoch will be preferred.

If the trigger is required to observe a decrease in performance over a fixed number of epochs, then the model at the beginning of the trigger period will be preferred.

Perhaps a simple approach is to always save the model weights if the performance of the model on a holdout dataset is better than the best performance observed so far. That way, you will always have the model with the best performance on the holdout set.

Every time the error on the validation set improves, we store a copy of the model parameters. When the training algorithm terminates, we return these parameters, rather than the latest parameters.

— Page 246, Deep Learning, 2016.
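The bookkeeping described in the quote can be sketched without any deep learning library. In the snippet below, the function name `keep_best_snapshot` is hypothetical, and the per-epoch parameters and validation losses are made-up stand-ins for what a real model and validation set would produce.

```python
import copy

def keep_best_snapshot(epoch_results):
    """Keep a copy of the parameters from the epoch with the lowest validation loss.

    epoch_results is an iterable of (params, val_loss) pairs, one per epoch.
    """
    best_loss = float('inf')
    best_params = None
    best_epoch = None
    for epoch, (params, val_loss) in enumerate(epoch_results, start=1):
        if val_loss < best_loss:
            best_loss = val_loss
            # copy the parameters, as a real model would keep updating them in place
            best_params = copy.deepcopy(params)
            best_epoch = epoch
    return best_epoch, best_loss, best_params

# made-up parameters and validation losses for four epochs
epochs = [
    ({'w': 0.1}, 0.90),
    ({'w': 0.4}, 0.55),
    ({'w': 0.7}, 0.48),
    ({'w': 0.9}, 0.61),
]
print(keep_best_snapshot(epochs))
```

The parameters returned are those from epoch 3, the epoch with the lowest validation loss, rather than those from the final epoch.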

This section summarizes some examples where early stopping has been used.

Yoon Kim in his seminal application of convolutional neural networks to sentiment analysis in the 2014 paper titled “Convolutional Neural Networks for Sentence Classification” used early stopping with 10% of the training dataset used as the validation holdout set.

We do not otherwise perform any dataset-specific tuning other than early stopping on dev sets. For datasets without a standard dev set we randomly select 10% of the training data as the dev set.

Chiyuan Zhang, et al. from MIT, Berkeley, and Google, in their 2017 paper titled “Understanding deep learning requires rethinking generalization,” highlight that early stopping may not always offer a benefit on very deep convolutional neural networks for photo classification where the dataset is abundant, as the model is less likely to overfit such large datasets.

[regarding] the training and testing accuracy on ImageNet [results suggest] a reference of potential performance gain for early stopping. However, on the CIFAR10 dataset, we do not observe any potential benefit of early stopping.

Yarin Gal and Zoubin Ghahramani from Cambridge in their 2015 paper titled “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks” use early stopping as an “*unregularized baseline*” for LSTM models on a suite of language modeling problems.

Lack of regularisation in RNN models makes it difficult to handle small data, and to avoid overfitting researchers often use early stopping, or small and under-specified models

Alex Graves, et al., in their famous 2013 paper titled “Speech recognition with deep recurrent neural networks” achieved state-of-the-art results with LSTMs for speech recognition, while making use of early stopping.

Regularisation is vital for good performance with RNNs, as their flexibility makes them prone to overfitting. Two regularisers were used in this paper: early stopping and weight noise …

This section provides some tips for using early stopping regularization with your neural network.

Early stopping is so easy to use, e.g. with the simplest trigger, that there is little reason to not use it when training neural networks.

Use of early stopping may be a staple of the modern training of deep neural networks.

Early stopping should be used almost universally.

— Page 425, Deep Learning, 2016.

Before using early stopping, it may be interesting to fit an underconstrained model and monitor the performance of the model on a train and validation dataset.

Plotting the performance of the model in real-time or at the end of a long run will show how noisy the training process is with your specific model and dataset.

This may help in the choice of a trigger for early stopping.

Loss is an easy metric to monitor during training and to trigger early stopping.

The problem is that loss does not always capture what is most important about the model to you and your project.

It may be better to choose a performance metric to monitor that best defines the performance of the model in terms of the way you intend to use it. This may be the metric that you intend to use to report the performance of the model.

A problem with early stopping is that the model does not make use of all available training data.

It may be desirable to avoid overfitting and to train on all possible data, especially on problems where the amount of training data is very limited.

A recommended approach would be to treat the number of training epochs as a hyperparameter and to grid search a range of different values, perhaps using k-fold cross-validation. This will allow you to fix the number of training epochs and fit a final model on all available data.

Early stopping could be used instead. The early stopping procedure could be repeated a number of times. The epoch number at which training was stopped could be recorded. Then, the average of the epoch number across all repeats of early stopping could be used when fitting a final model on all available training data.

This process could be performed using a different split of the training set into train and validation sets each time early stopping is run.
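A rough sketch of the two mechanical pieces of this procedure is given below: drawing a different train/validation split for each repeat, and averaging the recorded stopping epochs. The function names are hypothetical, and the stopping epochs are made-up numbers standing in for the results of real early stopping runs.

```python
import random

def split_indices(n_samples, val_fraction, seed):
    """Shuffle sample indices and split them into train and validation index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]

def mean_stop_epoch(stop_epochs):
    """Average the recorded stopping epochs and round to a whole number of epochs."""
    return round(sum(stop_epochs) / len(stop_epochs))

# a different split of ten samples for each of three repeats
for seed in range(3):
    train_idx, val_idx = split_indices(n_samples=10, val_fraction=0.3, seed=seed)
    print('repeat %d: validation indices %s' % (seed, sorted(val_idx)))

# made-up stopping epochs recorded from five early stopping runs
stop_epochs = [812, 790, 845, 801, 830]
print('epochs for final model: %d' % mean_stop_epoch(stop_epochs))
```

The averaged value would then be used as a fixed number of training epochs when fitting the final model on all available data.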

An alternative might be to use early stopping with a validation dataset, then update the final model with further training on the held out validation set.

Early stopping could be used with k-fold cross-validation, although it is not recommended.

The k-fold cross-validation procedure is designed to estimate the generalization error of a model by repeatedly refitting and evaluating it on different subsets of a dataset.

Early stopping is designed to monitor the generalization error of one model and stop training when generalization error begins to degrade.

They are at odds because cross-validation assumes you don’t know the generalization error and early stopping is trying to give you the best model based on knowledge of generalization error.

It may be desirable to use cross-validation to estimate the performance of models with different hyperparameter values, such as learning rate or network structure, whilst also using early stopping.

In this case, if you have the resources to repeatedly evaluate the performance of the model, then perhaps the number of training epochs may also be treated as a hyperparameter to be optimized, instead of using early stopping.

Instead of using cross-validation with early stopping, early stopping may be used directly without repeated evaluation when evaluating different hyperparameter values for the model (e.g. different learning rates).

One possible point of confusion is that early stopping is sometimes referred to as “*cross-validated training*.” Further, research into early stopping that compares triggers may use cross-validation to compare the impact of different triggers.

Repeating the early stopping procedure many times may result in the model overfitting the validation dataset.

This can happen just as easily as overfitting the training dataset.

One approach is to only use early stopping once all other hyperparameters of the model have been chosen.

Another strategy may be to use a different split of the training dataset into train and validation sets each time early stopping is used.

This section provides more resources on the topic if you are looking to go deeper.

- Section 7.8 Early Stopping, Deep Learning, 2016.
- Section 5.5.2 Early stopping, Pattern Recognition and Machine Learning, 2006.
- Section 16.1 Early Stopping, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

- Early Stopping – But When?, 2002.
- Improving model selection by nonconvergent methods, 1993.
- Automatic early stopping using cross validation: quantifying the criteria, 1997.
- Understanding deep learning requires rethinking generalization, 2017.

In this post, you discovered that stopping the training of a neural network early before it has overfit the training dataset can reduce overfitting and improve the generalization of deep neural networks.

Specifically, you learned:

- The challenge of training a neural network long enough to learn the mapping, but not so long that it overfits the training data.
- Model performance on a holdout validation dataset can be monitored during training and training stopped when generalization error starts to increase.
- The use of early stopping requires the selection of a performance measure to monitor, a trigger for stopping training, and a selection of the model weights to use.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Early Stopping to Avoid Overtraining Deep Learning Neural Network Models appeared first on Machine Learning Mastery.

]]>The post How to Reduce Overfitting With Dropout Regularization in Keras appeared first on Machine Learning Mastery.

]]>Dropout regularization is a computationally cheap way to regularize a deep neural network.

Dropout works by probabilistically removing, or “dropping out,” inputs to a layer, which may be input variables in the data sample or activations from a previous layer. It has the effect of simulating a large number of networks with very different network structure and, in turn, making nodes in the network generally more robust to the inputs.

In this tutorial, you will discover the Keras API for adding dropout regularization to deep learning neural network models.

After completing this tutorial, you will know:

- How to create a dropout layer using the Keras API.
- How to add dropout regularization to MLP, CNN, and RNN layers using the Keras API.
- How to reduce overfitting by adding a dropout regularization to an existing model.

Let’s get started.

This tutorial is divided into three parts; they are:

- Dropout Regularization in Keras
- Dropout Regularization on Layers
- Dropout Regularization Case Study

Keras supports dropout regularization.

The simplest form of dropout in Keras is provided by a Dropout core layer.

When created, the dropout rate can be specified to the layer as the probability of setting each input to the layer to zero. This is different from the definition of dropout rate from the papers, in which the rate refers to the probability of retaining an input.

Therefore, when a dropout rate of 0.8 is suggested in a paper (retain 80%), this will, in fact, be a dropout rate of 0.2 in Keras (set 20% of inputs to zero).
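The conversion between the two conventions is a one-line calculation. The helper below is a hypothetical convenience function written for this example, not part of the Keras API.

```python
def keras_rate_from_retain_probability(retain_probability):
    """Convert a paper-style retention probability into a Keras-style dropout rate."""
    if not 0.0 < retain_probability <= 1.0:
        raise ValueError("retain_probability must be in (0, 1]")
    return 1.0 - retain_probability

# a paper that retains 80% of inputs corresponds to dropping 20% of them
print(round(keras_rate_from_retain_probability(0.8), 2))
```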

Below is an example of creating a dropout layer with a 50% chance of setting inputs to zero.

layer = Dropout(0.5)

The Dropout layer is added to a model between existing layers and applies to outputs of the prior layer that are fed to the subsequent layer.

For example, given two dense layers:

    ...
    model.add(Dense(32))
    model.add(Dense(32))
    ...

We can insert a dropout layer between them, in which case the outputs or activations of the first layer have dropout applied to them, which are then taken as input to the next layer.

It is the input to this second layer that now has dropout applied.

    ...
    model.add(Dense(32))
    model.add(Dropout(0.5))
    model.add(Dense(32))
    ...

Dropout can also be applied to the visible layer, e.g. the inputs to the network.

This requires that you define the network with the Dropout layer as the first layer and add the *input_shape* argument to the layer to specify the expected shape of the input samples.

    ...
    model.add(Dropout(0.5, input_shape=(2,)))
    ...

Let’s take a look at how dropout regularization can be used with some common network types.

The example below adds dropout between two dense fully connected layers.

    # example of dropout between fully connected layers
    from keras.layers import Dense
    from keras.layers import Dropout
    ...
    model.add(Dense(32))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    ...

Dropout can be used after convolutional layers (e.g. Conv2D) and after pooling layers (e.g. MaxPooling2D).

Often, dropout is only used after the pooling layers, but this is just a rough heuristic.

    # example of dropout for a CNN
    from keras.layers import Dense
    from keras.layers import Conv2D
    from keras.layers import MaxPooling2D
    from keras.layers import Dropout
    ...
    model.add(Conv2D(32, (3,3)))
    model.add(Conv2D(32, (3,3)))
    model.add(MaxPooling2D())
    model.add(Dropout(0.5))
    model.add(Dense(1))
    ...

In this case, dropout is applied to each element or cell within the feature maps.

An alternative way to use dropout with convolutional neural networks is to drop out entire feature maps from the convolutional layer, which are then not used during pooling. This is called spatial dropout (or “*SpatialDropout*”).

Instead we formulate a new dropout method which we call SpatialDropout. For a given convolution feature tensor […] [we] extend the dropout value across the entire feature map.

— Efficient Object Localization Using Convolutional Networks, 2015.

Spatial Dropout is provided in Keras via the SpatialDropout2D layer (as well as 1D and 3D versions).

    # example of spatial dropout for a CNN
    from keras.layers import Dense
    from keras.layers import Conv2D
    from keras.layers import MaxPooling2D
    from keras.layers import SpatialDropout2D
    ...
    model.add(Conv2D(32, (3,3)))
    model.add(Conv2D(32, (3,3)))
    model.add(SpatialDropout2D(0.5))
    model.add(MaxPooling2D())
    model.add(Dense(1))
    ...
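To make the difference between the two approaches concrete, the NumPy sketch below builds both kinds of mask by hand. It is an illustration only and does not reproduce the Keras implementation: the activation tensor, its shape (channels last), and the scaling of retained values by 1 / (1 - rate) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
rate = 0.5                      # probability of zeroing, as in Keras
keep = 1.0 - rate

# hypothetical activations: one sample, 4x4 feature maps, 6 channels (channels last)
features = np.ones((4, 4, 6))

# standard dropout: an independent keep/drop decision for every element
element_mask = rng.random(features.shape) < keep
standard_out = features * element_mask / keep

# spatial dropout: one decision per channel, broadcast over height and width
channel_mask = rng.random((1, 1, features.shape[-1])) < keep
spatial_out = features * channel_mask / keep

# count how many distinct values appear inside each feature map
maps_standard = [np.unique(standard_out[:, :, c]).size for c in range(6)]
maps_spatial = [np.unique(spatial_out[:, :, c]).size for c in range(6)]
print(maps_standard)
print(maps_spatial)   # every map holds a single value: all dropped or all kept
```

With spatial dropout, a feature map is either removed entirely or passed through entirely, whereas standard dropout leaves a speckled pattern within each map.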

The example below adds dropout between two layers: an LSTM recurrent layer and a dense fully connected layer.

    # example of dropout between LSTM and fully connected layers
    from keras.layers import Dense
    from keras.layers import LSTM
    from keras.layers import Dropout
    ...
    model.add(LSTM(32))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    ...

This example applies dropout to, in this case, 32 outputs from the LSTM layer provided as input to the Dense layer.

Alternatively, the inputs to the LSTM may be subjected to dropout. In this case, a different dropout mask is applied to each time step within each sample presented to the LSTM.

    # example of dropout before LSTM layer
    from keras.layers import Dense
    from keras.layers import LSTM
    from keras.layers import Dropout
    ...
    model.add(Dropout(0.5, input_shape=(...)))
    model.add(LSTM(32))
    model.add(Dense(1))
    ...

There is an alternative way to use dropout with recurrent layers like the LSTM. The same dropout mask may be used by the LSTM for all inputs within a sample. The same approach may be used for recurrent input connections across the time steps of the sample. This approach to dropout with recurrent models is called a Variational RNN.

The proposed technique (Variational RNN […]) uses the same dropout mask at each time step, including the recurrent layers. […] Implementing our approximate inference is identical to implementing dropout in RNNs with the same network units dropped at each time step, randomly dropping inputs, outputs, and recurrent connections. This is in contrast to existing techniques, where different network units would be dropped at different time steps, and no dropout would be applied to the recurrent connections

— A Theoretically Grounded Application of Dropout in Recurrent Neural Networks, 2016.

Keras supports Variational RNNs (i.e. consistent dropout across the time steps of a sample for inputs and recurrent inputs) via two arguments on the recurrent layers, namely “*dropout*” for inputs and “*recurrent_dropout*” for recurrent inputs.

    # example of variational LSTM dropout
    from keras.layers import Dense
    from keras.layers import LSTM
    from keras.layers import Dropout
    ...
    model.add(LSTM(32, dropout=0.5, recurrent_dropout=0.5))
    model.add(Dense(1))
    ...
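The difference between a per-time-step mask and a mask shared across time steps can be sketched in NumPy. This is an illustration of the idea for the layer inputs only, not the Keras implementation; the sequence shape and the 1 / (1 - rate) scaling are assumptions made for the example. The same idea applies to the recurrent connections.

```python
import numpy as np

rng = np.random.default_rng(7)
rate = 0.5
keep = 1.0 - rate
time_steps, n_features = 5, 4

# hypothetical input sequence for one sample: (time steps, features)
sequence = np.ones((time_steps, n_features))

# a Dropout layer placed before the LSTM: a new mask at every time step
per_step_mask = rng.random((time_steps, n_features)) < keep
per_step_out = sequence * per_step_mask / keep

# variational dropout: one mask per sample, reused at every time step
shared_mask = rng.random((1, n_features)) < keep
variational_out = sequence * shared_mask / keep

print(per_step_out)
print(variational_out)   # all rows are identical
```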


In this section, we will demonstrate how to use dropout regularization to reduce overfitting of an MLP on a simple binary classification problem.

This example provides a template for applying dropout regularization to your own neural network for classification and regression problems.

We will use a standard binary classification problem that defines two two-dimensional concentric circles of observations, one circle for each class.

Each observation has two input variables with the same scale and a class output value of either 0 or 1. This dataset is called the “*circles*” dataset because of the shape of the observations in each class when plotted.

We can use the make_circles() function to generate observations from this problem. We will add noise to the data and seed the random number generator so that the same samples are generated each time the code is run.

    # generate 2d classification dataset
    X, y = make_circles(n_samples=100, noise=0.1, random_state=1)

We can plot the dataset where the two variables are taken as x and y coordinates on a graph and the class value is taken as the color of the observation.

The complete example of generating the dataset and plotting it is listed below.

    # generate two circles dataset
    from sklearn.datasets import make_circles
    from matplotlib import pyplot
    from pandas import DataFrame
    # generate 2d classification dataset
    X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
    # scatter plot, dots colored by class value
    df = DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
    colors = {0:'red', 1:'blue'}
    fig, ax = pyplot.subplots()
    grouped = df.groupby('label')
    for key, group in grouped:
        group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
    pyplot.show()

Running the example creates a scatter plot showing the concentric circles shape of the observations in each class. We can see the noise in the dispersal of the points making the circles less obvious.

This is a good test problem because the classes cannot be separated by a line, e.g. are not linearly separable, requiring a nonlinear method such as a neural network to address.

We have only generated 100 samples, which is small for a neural network, providing the opportunity to overfit the training dataset and have higher error on the test dataset: a good case for using regularization. Further, the samples have noise, giving the model an opportunity to learn aspects of the samples that don’t generalize.

We can develop an MLP model to address this binary classification problem.

The model will have one hidden layer with more nodes than may be required to solve this problem, providing an opportunity to overfit. We will also train the model for longer than is required to ensure the model overfits.

Before we define the model, we will split the dataset into train and test sets, using 30 examples to train the model and 70 to evaluate the fit model’s performance.

    # generate 2d classification dataset
    X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
    # split into train and test
    n_train = 30
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]

Next, we can define the model.

The hidden layer uses 500 nodes and the rectified linear activation function. A sigmoid activation function is used in the output layer in order to predict class values of 0 or 1.

The model is optimized using the binary cross entropy loss function, suitable for binary classification problems, and the efficient Adam version of gradient descent.

    # define model
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

The defined model is then fit on the training data for 4,000 epochs and the default batch size of 32.

We will also use the test dataset as a validation dataset.

    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)

We can evaluate the performance of the model on the test dataset and report the result.

    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))

Finally, we will plot the performance of the model on both the train and test set each epoch.

If the model does indeed overfit the training dataset, we would expect the line plot of accuracy on the training set to continue to increase and the test set to rise and then fall again as the model learns statistical noise in the training dataset.

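    # plot history
    # (the keys are 'accuracy' and 'val_accuracy' in Keras 2.3 and later;
    # older versions use 'acc' and 'val_acc')
    pyplot.plot(history.history['accuracy'], label='train')
    pyplot.plot(history.history['val_accuracy'], label='test')
    pyplot.legend()
    pyplot.show()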

We can tie all of these pieces together; the complete example is listed below.

    # mlp overfit on the two circles dataset
    from sklearn.datasets import make_circles
    from keras.layers import Dense
    from keras.models import Sequential
    from matplotlib import pyplot
    # generate 2d classification dataset
    X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
    # split into train and test
    n_train = 30
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    # plot history
    # (the keys are 'accuracy' and 'val_accuracy' in Keras 2.3 and later;
    # older versions use 'acc' and 'val_acc')
    pyplot.plot(history.history['accuracy'], label='train')
    pyplot.plot(history.history['val_accuracy'], label='test')
    pyplot.legend()
    pyplot.show()

Running the example reports the model performance on the train and test datasets.

We can see that the model has better performance on the training dataset than the test dataset, one possible sign of overfitting.

Your specific results may vary given the stochastic nature of the neural network and the training algorithm. Because the model is severely overfit, we generally would not expect much, if any, variance in the accuracy across repeated runs of the model on the same dataset.

Train: 1.000, Test: 0.757

A figure is created showing line plots of the model accuracy on the train and test sets.

We can see the expected shape of an overfit model, where test accuracy increases to a point and then begins to decrease again.

We can update the example to use dropout regularization.

We can do this by simply inserting a new Dropout layer between the hidden layer and the output layer. In this case, we will set the dropout rate (the probability of setting outputs from the hidden layer to zero) to 40%, or 0.4.

    # define model
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu'))
    model.add(Dropout(0.4))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

The complete updated example with the addition of dropout after the hidden layer is listed below:

    # mlp with dropout on the two circles dataset
    from sklearn.datasets import make_circles
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.layers import Dropout
    from matplotlib import pyplot
    # generate 2d classification dataset
    X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
    # split into train and test
    n_train = 30
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu'))
    model.add(Dropout(0.4))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    # plot history
    # (the keys are 'accuracy' and 'val_accuracy' in Keras 2.3 and later;
    # older versions use 'acc' and 'val_acc')
    pyplot.plot(history.history['accuracy'], label='train')
    pyplot.plot(history.history['val_accuracy'], label='test')
    pyplot.legend()
    pyplot.show()

Running the example reports the model performance on the train and test datasets.

Your results will likely vary. In this case, the resulting model has a high variance.

In this specific case, we can see that dropout resulted in a slight drop in accuracy on the training dataset, down from 100% to about 97%, and a lift in accuracy on the test set, up from about 76% to about 81%.

Train: 0.967, Test: 0.814

Reviewing the line plot of train and test accuracy during training, we can see that it no longer appears that the model has overfit the training dataset.

Model accuracy on both the train and test sets continues to increase to a plateau, albeit with a lot of noise given the use of dropout during training.

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Input Dropout**. Update the example to use dropout on the input variables and compare results.
- **Weight Constraint**. Update the example to add a max-norm weight constraint to the hidden layer and compare results.
- **Repeated Evaluation**. Update the example to repeat the evaluation of the overfit and dropout model and summarize and compare the average results.
- **Grid Search Rate**. Develop a grid search of dropout probabilities and report the relationship between dropout rate and test set accuracy.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Efficient Object Localization Using Convolutional Networks, 2015.
- A Theoretically Grounded Application of Dropout in Recurrent Neural Networks, 2016.

- Dropout Regularization in Deep Learning Models With Keras
- How to Use Dropout with LSTM Networks for Time Series Forecasting

- Keras Regularizers API
- Keras Core Layers API
- Keras Convolutional Layers API
- Keras Recurrent Layers API
- sklearn.datasets.make_circles API

In this tutorial, you discovered the Keras API for adding dropout regularization to deep learning neural network models.

Specifically, you learned:

- How to create a dropout layer using the Keras API.
- How to add dropout regularization to MLP, CNN, and RNN layers using the Keras API.
- How to reduce overfitting by adding a dropout regularization to an existing model.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Reduce Overfitting With Dropout Regularization in Keras appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Dropout for Regularizing Deep Neural Networks appeared first on Machine Learning Mastery.

]]>Deep learning neural networks are likely to quickly overfit a training dataset with few examples.

Ensembles of neural networks with different model configurations are known to reduce overfitting, but require the additional computational expense of training and maintaining multiple models.

A single model can be used to simulate having a large number of different network architectures by randomly dropping out nodes during training. This is called dropout and offers a very computationally cheap and remarkably effective regularization method to reduce overfitting and generalization error in deep neural networks of all kinds.

In this post, you will discover the use of dropout regularization for reducing overfitting and improving the generalization of deep neural networks.

After reading this post, you will know:

- Large weights in a neural network are a sign of a more complex network that has overfit the training data.
- Probabilistically dropping out nodes in the network is a simple and effective regularization method.
- A large network with more training and the use of a weight constraint are suggested when using dropout.

Let’s get started.

This tutorial is divided into five parts; they are:

- Problem With Overfitting
- Randomly Drop Nodes
- How to Dropout
- Examples of Using Dropout
- Tips for Using Dropout Regularization

Large neural nets trained on relatively small datasets can overfit the training data.

This has the effect of the model learning the statistical noise in the training data, which results in poor performance when the model is evaluated on new data, e.g. a test dataset. Generalization error increases due to overfitting.

One approach to reduce overfitting is to fit all possible different neural networks on the same dataset and to average the predictions from each model. This is not feasible in practice, but it can be approximated using a small collection of different models, called an ensemble.

With unlimited computation, the best way to “regularize” a fixed-sized model is to average the predictions of all possible settings of the parameters, weighting each setting by its posterior probability given the training data.

— Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014.

A problem even with the ensemble approximation is that it requires multiple models to be fit and stored, which can be a challenge if the models are large, requiring days or weeks to train and tune.

Dropout is a regularization method that approximates training a large number of neural networks with different architectures in parallel.

During training, some number of layer outputs are randomly ignored or “*dropped out*.” This has the effect of making the layer look like, and be treated like, a layer with a different number of nodes and connectivity to the prior layer. In effect, each update to a layer during training is performed with a different “*view*” of the configured layer.

By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections

— Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014.

Dropout has the effect of making the training process noisy, forcing nodes within a layer to probabilistically take on more or less responsibility for the inputs.

This conceptualization suggests that perhaps dropout breaks up situations where network layers co-adapt to correct mistakes from prior layers, in turn making the model more robust.

… units may change in a way that they fix up the mistakes of the other units. This may lead to complex co-adaptations. This in turn leads to overfitting because these co-adaptations do not generalize to unseen data. […]

— Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014.

Dropout simulates a sparse activation from a given layer, which interestingly, in turn, encourages the network to actually learn a sparse representation as a side-effect. As such, it may be used as an alternative to activity regularization for encouraging sparse representations in autoencoder models.

We found that as a side-effect of doing dropout, the activations of the hidden units become sparse, even when no sparsity inducing regularizers are present.

— Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014.

Because the outputs of a layer under dropout are randomly subsampled, it has the effect of reducing the capacity or thinning the network during training. As such, a wider network, e.g. more nodes, may be required when using dropout.


Dropout is implemented per-layer in a neural network.

It can be used with most types of layers, such as dense fully connected layers, convolutional layers, and recurrent layers such as the long short-term memory network layer.

Dropout may be implemented on any or all hidden layers in the network as well as the visible or input layer. It is not used on the output layer.

The term “dropout” refers to dropping out units (hidden and visible) in a neural network.

— Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014.

A new hyperparameter is introduced that specifies the probability at which outputs of the layer are dropped out, or inversely, the probability at which outputs of the layer are retained. The interpretation is an implementation detail that can differ from paper to code library.

A common value is a probability of 0.5 for retaining the output of each node in a hidden layer and a value close to 1.0, such as 0.8, for retaining inputs from the visible layer.

In the simplest case, each unit is retained with a fixed probability p independent of other units, where p can be chosen using a validation set or can simply be set at 0.5, which seems to be close to optimal for a wide range of networks and tasks. For the input units, however, the optimal probability of retention is usually closer to 1 than to 0.5.

— Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014.

Dropout is not used after training when making a prediction with the fit network.

The weights of the network will be larger than normal because of dropout. Therefore, before finalizing the network, the weights are first scaled by the probability with which units were retained during training. The network can then be used as per normal to make predictions.

If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time

— Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014.

The rescaling can be performed at training time instead, by scaling up the retained outputs of the layer on each forward pass. This is sometimes called “*inverse dropout*” (or inverted dropout) and does not require any modification of the weights at test time. Both the Keras and PyTorch deep learning libraries implement dropout in this way.

At test time, we scale down the output by the dropout rate. […] Note that this process can be implemented by doing both operations at training time and leaving the output unchanged at test time, which is often the way it’s implemented in practice

— Page 109, Deep Learning With Python, 2017.
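The two scaling schemes can be compared with a short NumPy sketch. It is a simplified illustration that scales layer outputs rather than weights, which has the same effect on the input to the next layer; the activation values and the retention probability `p = 0.5` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
p = 0.5                                   # probability of retaining a unit
activations = rng.random((10000, 20))     # hypothetical layer outputs
mask = rng.random(activations.shape) < p  # True where a unit is retained

# standard dropout: drop during training, scale by p at test time
train_standard = activations * mask
test_standard = activations * p

# inverted dropout: drop and scale up by 1/p during training, test unchanged
train_inverted = activations * mask / p
test_inverted = activations

print(train_standard.mean(), test_standard.mean())
print(train_inverted.mean(), test_inverted.mean())
```

In both schemes the average output seen by the next layer is approximately the same during training and at test time; the schemes differ only in where the correction is applied.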

Dropout works well in practice, perhaps replacing the need for weight regularization (e.g. weight decay) and activation regularization (e.g. representation sparsity).

… dropout is more effective than other standard computationally inexpensive regularizers, such as weight decay, filter norm constraints and sparse activity regularization. Dropout may also be combined with other forms of regularization to yield a further improvement.

— Page 265, Deep Learning, 2016.

This section summarizes some examples where dropout was used in recent research papers to provide a suggestion for how and where it may be used.

Geoffrey Hinton, et al. in their 2012 paper that first introduced dropout titled “Improving neural networks by preventing co-adaptation of feature detectors” applied the method with a range of different neural networks on different problem types achieving improved results, including handwritten digit recognition (MNIST), photo classification (CIFAR-10), and speech recognition (TIMIT).

… we use the same dropout rates – 50% dropout for all hidden units and 20% dropout for visible units

Nitish Srivastava, et al. in their 2014 journal paper introducing dropout titled “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” used dropout on a wide range of computer vision, speech recognition, and text classification tasks and found that it consistently improved performance on each problem.

We trained dropout neural networks for classification problems on data sets in different domains. We found that dropout improved generalization performance on all data sets compared to neural networks that did not use dropout.

On the computer vision problems, different dropout rates were used down through the layers of the network in conjunction with a max-norm weight constraint.

Dropout was applied to all the layers of the network with the probability of retaining the unit being p = (0.9, 0.75, 0.75, 0.5, 0.5, 0.5) for the different layers of the network (going from input to convolutional layers to fully connected layers). In addition, the max-norm constraint with c = 4 was used for all the weights. […]

A simpler configuration was used for the text classification task.

We used probability of retention p = 0.8 in the input layers and 0.5 in the hidden layers. Max-norm constraint with c = 4 was used in all the layers.

Alex Krizhevsky, et al. in their famous 2012 paper titled “ImageNet Classification with Deep Convolutional Neural Networks” achieved (at the time) state-of-the-art results for photo classification on the ImageNet dataset with deep convolutional neural networks and dropout regularization.

We use dropout in the first two fully-connected layers [of the model]. Without dropout, our network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge.

George Dahl, et al. in their 2013 paper titled “Improving deep neural networks for LVCSR using rectified linear units and dropout” used a deep neural network with rectified linear activation functions and dropout to achieve (at the time) state-of-the-art results on a standard speech recognition task. They used a Bayesian optimization procedure to configure the choice of activation function and the amount of dropout.

… the Bayesian optimization procedure learned that dropout wasn’t helpful for sigmoid nets of the sizes we trained. In general, ReLUs and dropout seem to work quite well together.

This section provides some tips for using dropout regularization with your neural network.

Dropout regularization is a generic approach.

It can be used with most, perhaps all, types of neural network models, not least the most common network types of Multilayer Perceptrons, Convolutional Neural Networks, and Long Short-Term Memory Recurrent Neural Networks.

In the case of LSTMs, it may be desirable to use different dropout rates for the input and recurrent connections.

The default interpretation of the dropout hyperparameter is the probability of retaining a given node in a layer, where 1.0 means no dropout, and 0.0 means no outputs from the layer.

Under this interpretation, a good value for a hidden layer is between 0.5 and 0.8. Input layers use a larger retention probability, such as 0.8.

It is common for larger networks (more layers or more nodes) to more easily overfit the training data.

When using dropout regularization, it is possible to use larger networks with less risk of overfitting. In fact, a large network (more nodes per layer) may be required as dropout will probabilistically reduce the capacity of the network.

A good rule of thumb is to divide the number of nodes in the layer before dropout by the proposed dropout rate and use that as the number of nodes in the new network that uses dropout. For example, a network with 100 nodes and a proposed dropout rate of 0.5 will require 200 nodes (100 / 0.5) when using dropout.

If n is the number of hidden units in any layer and p is the probability of retaining a unit […] a good dropout net should have at least n/p units

— Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014.
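The rule of thumb is simple arithmetic and can be written as a small helper. The function name `nodes_with_dropout` is hypothetical, and `retain_probability` follows the paper's convention (probability of keeping a unit), not the Keras rate.

```python
import math

def nodes_with_dropout(n_nodes, retain_probability):
    """Suggested layer width n / p for a layer that will use dropout."""
    if not 0.0 < retain_probability <= 1.0:
        raise ValueError("retain_probability must be in (0, 1]")
    return math.ceil(n_nodes / retain_probability)

# a layer that needed 100 nodes without dropout
for p in (1.0, 0.5, 0.25):
    print(p, nodes_with_dropout(100, p))
```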

Rather than guess at a suitable dropout rate for your network, test different rates systematically.

For example, test values between 1.0 and 0.1 in increments of 0.1.

This will both help you discover what works best for your specific model and dataset, as well as how sensitive the model is to the dropout rate. A more sensitive model may be unstable and could benefit from an increase in size.
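A systematic search over candidate values can be structured as in the sketch below. The `grid_search_dropout` helper is hypothetical, and `toy_evaluate` is a placeholder scoring function rather than a real model: in practice it would be replaced by code that trains the network with the given value and returns a validation score.

```python
def grid_search_dropout(evaluate, rates):
    """Evaluate each candidate value and return (best_rate, all_scores)."""
    scores = {rate: evaluate(rate) for rate in rates}
    best_rate = max(scores, key=scores.get)
    return best_rate, scores

# candidate values from 0.1 to 1.0 in increments of 0.1
rates = [round(0.1 * i, 1) for i in range(1, 11)]

# placeholder scoring function, NOT a real model: replace it with code that
# trains your network with the given value and returns validation accuracy
def toy_evaluate(rate):
    return 1.0 - abs(rate - 0.5)

best, scores = grid_search_dropout(toy_evaluate, rates)
print(best)
```

Keeping all of the scores, rather than only the best value, makes it possible to see how sensitive the model is to the setting.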

Network weights will increase in size in response to the probabilistic removal of layer activations.

Large weight size can be a sign of an unstable network.

To counter this effect, a weight constraint can be imposed to force the norm (magnitude) of all weights in a layer to be below a specified value. For example, the maximum norm constraint is recommended with a value between 3 and 4.

[…] we can use max-norm regularization. This constrains the norm of the vector of incoming weights at each hidden unit to be bound by a constant c. Typical values of c range from 3 to 4.

— Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014.
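The max-norm constraint described in the quote can be sketched in NumPy as a rescaling step applied to a weight matrix. This is an illustrative sketch, not a library implementation: the function name, the small `eps` term used to avoid division by zero, and the layout of one column of incoming weights per hidden unit are assumptions made for the example.

```python
import numpy as np

def max_norm(weights, c=3.0, eps=1e-7):
    """Rescale each column so that its L2 norm is at most c."""
    norms = np.sqrt(np.sum(np.square(weights), axis=0, keepdims=True))
    desired = np.clip(norms, 0.0, c)
    return weights * (desired / (eps + norms))

rng = np.random.default_rng(3)
# hypothetical weight matrix: 50 inputs feeding 8 hidden units (one column per unit)
weights = rng.normal(scale=1.0, size=(50, 8))
constrained = max_norm(weights, c=3.0)

print(np.linalg.norm(weights, axis=0).round(2))
print(np.linalg.norm(constrained, axis=0).round(2))
```

Columns whose norm is already below `c` are left essentially unchanged, while larger columns are scaled back onto the boundary.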

This does introduce an additional hyperparameter that may require tuning for the model.

Like other regularization methods, dropout is more effective on those problems where there is a limited amount of training data and the model is likely to overfit the training data.

Problems where there is a large amount of training data may see less benefit from using dropout.

For very large datasets, regularization confers little reduction in generalization error. In these cases, the computational cost of using dropout and larger models may outweigh the benefit of regularization.

— Page 265, Deep Learning, 2016.

This section provides more resources on the topic if you are looking to go deeper.

- Section 7.12 Dropout, Deep Learning, 2016.
- Section 4.4.3 Adding dropout, Deep Learning With Python, 2017.

- Improving neural networks by preventing co-adaptation of feature detectors, 2012.
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014.
- Improving deep neural networks for LVCSR using rectified linear units and dropout, 2013.
- Dropout Training as Adaptive Regularization, 2013.

- Dropout Regularization in Deep Learning Models With Keras
- How to Use Dropout with LSTM Networks for Time Series Forecasting

- Dropout (neural networks), Wikipedia.
- Regularization, CS231n Convolutional Neural Networks for Visual Recognition
- How was ‘Dropout’ conceived? Was there an ‘aha’ moment?

In this post, you discovered the use of dropout regularization for reducing overfitting and improving the generalization of deep neural networks.

Specifically, you learned:

- Large weights in a neural network are a sign of a more complex network that has overfit the training data.
- Probabilistically dropping out nodes in the network is a simple and effective regularization method.
- A large network with more training and the use of a weight constraint are suggested when using dropout.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Dropout for Regularizing Deep Neural Networks appeared first on Machine Learning Mastery.

The post How to Reduce Generalization Error in Deep Neural Networks With Activity Regularization in Keras appeared first on Machine Learning Mastery.

Activity regularization provides an approach to encourage a neural network to learn sparse features or internal representations of raw observations.

It is common to seek sparse learned representations in autoencoders, called sparse autoencoders, and in encoder-decoder models, although the approach can also be used generally to reduce overfitting and improve a model’s ability to generalize to new observations.

In this tutorial, you will discover the Keras API for adding activity regularization to deep learning neural network models.

After completing this tutorial, you will know:

- How to create vector norm regularizers using the Keras API.
- How to add activity regularization to MLP, CNN, and RNN layers using the Keras API.
- How to reduce overfitting by adding activity regularization to an existing model.

Let’s get started.

This tutorial is divided into three parts; they are:

- Activity Regularization in Keras
- Activity Regularization on Layers
- Activity Regularization Case Study

Keras supports activity regularization.

There are three different regularization techniques supported, each provided as a class in the *keras.regularizers* module:

- **l1**: Activity is calculated as the sum of absolute values.
- **l2**: Activity is calculated as the sum of the squared values.
- **l1_l2**: Activity is calculated as the sum of absolute and sum of the squared values.

Each of the *l1* and *l2* regularizers takes a single hyperparameter that controls the amount that each activity contributes to the sum. The *l1_l2* regularizer takes two hyperparameters, one for each of the l1 and l2 methods.
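The three calculations can be sketched in plain Python to show what each hyperparameter scales. These functions are illustrative stand-ins that mirror the descriptions above; they are not the Keras classes, and the activation values are made up.

```python
def l1_penalty(activations, coef=0.001):
    """coef times the sum of absolute activation values."""
    return coef * sum(abs(a) for a in activations)

def l2_penalty(activations, coef=0.001):
    """coef times the sum of squared activation values."""
    return coef * sum(a * a for a in activations)

def l1_l2_penalty(activations, l1=0.001, l2=0.001):
    """Sum of the two penalties, each with its own coefficient."""
    return l1_penalty(activations, l1) + l2_penalty(activations, l2)

acts = [0.5, -2.0, 0.0, 1.5]  # made-up layer outputs
print(l1_penalty(acts), l2_penalty(acts), l1_l2_penalty(acts))
```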

The regularizer class must be imported and then instantiated; for example:

    # import regularizer
    from keras.regularizers import l1
    # instantiate regularizer
    reg = l1(0.001)

Activity regularization is specified on a layer in Keras.

This can be achieved by setting the *activity_regularizer* argument on the layer to an instantiated and configured regularizer class.

The regularizer is applied to the output of the layer, but you have control over what the “*output*” of the layer actually means. Specifically, you have flexibility as to whether the layer output means that the regularization is applied before or after the ‘*activation*‘ function.

For example, you can specify the function and the regularization on the layer, in which case activation regularization is applied to the output of the activation function, in this case, relu.

    ...
    model.add(Dense(32, activation='relu', activity_regularizer=l1(0.001)))
    ...

Alternately, you can specify a linear activation function (the default, which does not perform any transform) so that the activation regularization is applied to the raw outputs. The activation function can then be added as a subsequent layer.

    ...
    model.add(Dense(32, activation='linear', activity_regularizer=l1(0.001)))
    model.add(Activation('relu'))
    ...

The latter is probably the preferred usage of activation regularization as described in “Deep Sparse Rectifier Neural Networks” in order to allow the model to learn to take activations to a true zero value in conjunction with the rectified linear activation function. Nevertheless, the two possible uses of activation regularization may be explored in order to discover what works best for your specific model and dataset.
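The difference between the two placements can be seen with a small plain-Python sketch. The values in `raw` are invented for illustration, and `l1_penalty` is a stand-in for the regularizer, not the Keras implementation.

```python
def relu(x):
    return max(0.0, x)

def l1_penalty(values, coef=0.001):
    return coef * sum(abs(v) for v in values)

raw = [-2.0, -0.5, 0.0, 1.0, 3.0]  # illustrative pre-activation outputs
after_relu = [relu(v) for v in raw]

# penalty on the raw outputs also counts the negative values
penalty_on_raw = l1_penalty(raw)
# penalty after relu ignores them, because relu has already zeroed them
penalty_on_relu = l1_penalty(after_relu)
print(penalty_on_raw, penalty_on_relu)
```

When the penalty is applied to the raw outputs, negative values still contribute to the loss even though relu will later map them to zero. This is one plausible reason why the two placements can behave differently in practice.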

Let’s take a look at how activity regularization can be used with some common layer types.

The example below sets l1 norm activity regularization on a Dense fully connected layer.

    # example of l1 norm on activity from a dense layer
    from keras.layers import Dense
    from keras.regularizers import l1
    ...
    model.add(Dense(32, activity_regularizer=l1(0.001)))
    ...

The example below sets l1 norm activity regularization on a Conv2D convolutional layer.

    # example of l1 norm on activity from a cnn layer
    from keras.layers import Conv2D
    from keras.regularizers import l1
    ...
    model.add(Conv2D(32, (3,3), activity_regularizer=l1(0.001)))
    ...

The example below sets l1 norm activity regularization on an LSTM recurrent layer.

    # example of l1 norm on activity from an lstm layer
    from keras.layers import LSTM
    from keras.regularizers import l1
    ...
    model.add(LSTM(32, activity_regularizer=l1(0.001)))
    ...

Now that we know how to use the activity regularization API, let’s look at a worked example.

In this section, we will demonstrate how to use activity regularization to reduce overfitting of an MLP on a simple binary classification problem.

Although activity regularization is most often used to encourage sparse learned representations in autoencoder and encoder-decoder models, it can also be used directly within normal neural networks to achieve the same effect and improve the generalization of the model.

This example provides a template for applying activity regularization to your own neural network for classification and regression problems.

We will use a standard binary classification problem that defines two two-dimensional concentric circles of observations, one circle for each class.

Each observation has two input variables with the same scale and a class output value of either 0 or 1. This dataset is called the “*circles*” dataset because of the shape of the observations in each class when plotted.

We can use the make_circles() function to generate observations from this problem. We will add noise to the data and seed the random number generator so that the same samples are generated each time the code is run.

    # generate 2d classification dataset
    X, y = make_circles(n_samples=100, noise=0.1, random_state=1)

The complete example of generating the dataset and plotting it is listed below.

    # generate two circles dataset
    from sklearn.datasets import make_circles
    from matplotlib import pyplot
    from pandas import DataFrame
    # generate 2d classification dataset
    X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
    # scatter plot, dots colored by class value
    df = DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
    colors = {0:'red', 1:'blue'}
    fig, ax = pyplot.subplots()
    grouped = df.groupby('label')
    for key, group in grouped:
        group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
    pyplot.show()

Running the example creates a scatter plot showing the concentric circles shape of the observations in each class.

We can see the noise in the dispersal of the points making the circles less obvious.

We have only generated 100 samples, which is small for a neural network, providing the opportunity to overfit the training dataset and have higher error on the test dataset: a good case for using regularization.

Further, the samples have noise, giving the model an opportunity to learn aspects of the samples that don’t generalize.

We can develop an MLP model to address this binary classification problem.

The model will have one hidden layer with more nodes than may be required to solve this problem, providing an opportunity to overfit. We will also train the model for longer than is required to ensure the model overfits.

Before defining the model, we split the dataset into train and test sets, using 30 examples to train the model and the remaining 70 to evaluate it.

    # generate 2d classification dataset
    X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
    # split into train and test
    n_train = 30
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]

Next, we can define the model.

The hidden layer uses 500 nodes and the rectified linear activation function. A sigmoid activation function is used in the output layer in order to predict class values of 0 or 1.

The model is optimized using the binary cross entropy loss function, which is suitable for binary classification problems, and the efficient Adam version of gradient descent.

We will also use the test dataset as a validation dataset.

We can evaluate the performance of the model on the test dataset and report the result.

Finally, we will plot the performance of the model on both the train and test set each epoch.

If the model does indeed overfit the training dataset, we would expect the line plot of accuracy on the training set to continue to increase and the test set to rise and then fall again as the model learns statistical noise in the training dataset.

    # plot history
    pyplot.plot(history.history['acc'], label='train')
    pyplot.plot(history.history['val_acc'], label='test')
    pyplot.legend()
    pyplot.show()

We can tie all of these pieces together; the complete example is listed below.

    # mlp overfit on the two circles dataset
    from sklearn.datasets import make_circles
    from keras.layers import Dense
    from keras.models import Sequential
    from matplotlib import pyplot
    # generate 2d classification dataset
    X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
    # split into train and test
    n_train = 30
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    # plot history
    pyplot.plot(history.history['acc'], label='train')
    pyplot.plot(history.history['val_acc'], label='test')
    pyplot.legend()
    pyplot.show()

Running the example reports the model performance on the train and test datasets.

Train: 1.000, Test: 0.786

A figure is created showing line plots of the model accuracy on the train and test sets.

We can see the expected shape of an overfit model where test accuracy increases to a point and then begins to decrease again.

We can update the example to use activation regularization.

There are a few different regularization methods to choose from, but it is probably a good idea to use the most common, which is the L1 vector norm.

This regularization has the effect of encouraging a sparse representation (lots of zeros), which is supported by the rectified linear activation function that permits true zero values.

We can do this by using the *keras.regularizers.l1* class in Keras.

We will configure the layer to use the linear activation function so that we can regularize the raw outputs, then add a relu activation layer after the regularized outputs of the layer. We will set the regularization hyperparameter to 1E-4 or 0.0001, found with a little trial and error.

    model.add(Dense(500, input_dim=2, activation='linear', activity_regularizer=l1(0.0001)))
    model.add(Activation('relu'))

The complete updated example with L1 norm activity regularization is listed below:

    # mlp overfit on the two circles dataset with activation regularization
    from sklearn.datasets import make_circles
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.regularizers import l1
    from keras.layers import Activation
    from matplotlib import pyplot
    # generate 2d classification dataset
    X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
    # split into train and test
    n_train = 30
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='linear', activity_regularizer=l1(0.0001)))
    model.add(Activation('relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    # plot history
    pyplot.plot(history.history['acc'], label='train')
    pyplot.plot(history.history['val_acc'], label='test')
    pyplot.legend()
    pyplot.show()

Running the example reports the model performance on the train and test datasets.

We can see that activity regularization resulted in a slight drop in accuracy on the training dataset, down from 100% to about 97%, and a lift in accuracy on the test set, up from about 79% to about 83%.

Train: 0.967, Test: 0.829

Reviewing the line plot of train and test accuracy, we can see that it no longer appears that the model has overfit the training dataset.

Model accuracy on both the train and test sets continues to increase to a plateau.

For completeness, we can compare results to a version of the model where activity regularization is applied after the relu activation function.

    model.add(Dense(500, input_dim=2, activation='relu', activity_regularizer=l1(0.0001)))

The complete example is listed below.

    # mlp overfit on the two circles dataset with activation regularization
    from sklearn.datasets import make_circles
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.regularizers import l1
    from matplotlib import pyplot
    # generate 2d classification dataset
    X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
    # split into train and test
    n_train = 30
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu', activity_regularizer=l1(0.0001)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    # plot history
    pyplot.plot(history.history['acc'], label='train')
    pyplot.plot(history.history['val_acc'], label='test')
    pyplot.legend()
    pyplot.show()

Running the example reports the model performance on the train and test datasets.

We can see that, at least on this problem and with this model, activation regularization after the activation function did not improve generalization error; in fact, it made it worse.

Train: 1.000, Test: 0.743

Reviewing the line plot of train and test accuracy, we can see that indeed the model still shows the signs of having overfit the training dataset.

This suggests that it may be worth experimenting with both approaches for implementing activity regularization with your own dataset, to confirm that you are getting the most out of the method.

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Report Activation Mean**. Update the example to calculate the mean activation of the regularized layer and confirm that indeed the activations have been made more sparse.
- **Grid Search**. Update the example to grid search different values for the regularization hyperparameter.
- **Alternate Norm**. Update the example to evaluate the L2 or L1_L2 vector norm for regularizing the hidden layer outputs.
- **Repeated Evaluation**. Update the example to fit and evaluate the model multiple times and report the mean and standard deviation of model performance.

If you explore any of these extensions, I’d love to know.
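As a starting point for the first extension, the sketch below summarizes a batch of hidden-layer outputs. It assumes you have already extracted the activations as a list of rows; the extraction from the model is not shown, the `batch` values are invented, and `activation_report` is a hypothetical helper.

```python
from statistics import mean

def activation_report(activations, tol=0.0):
    """Summarize a batch of layer outputs given as a list of rows."""
    flat = [a for row in activations for a in row]
    zeros = sum(1 for a in flat if abs(a) <= tol)
    return {
        "mean_abs": mean(abs(a) for a in flat),
        "fraction_zero": zeros / len(flat),
    }

# invented activations: two samples, four hidden nodes each
batch = [[0.0, 0.0, 1.2, 0.0],
         [0.4, 0.0, 0.0, 0.0]]
report = activation_report(batch)
print(report)
```

Comparing the report for the regularized and unregularized models is one way to check whether the penalty has actually made the representation sparser.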

This section provides more resources on the topic if you are looking to go deeper.

- Keras Regularizers API
- Keras Core Layers API
- Keras Convolutional Layers API
- Keras Recurrent Layers API
- sklearn.datasets.make_circles

In this tutorial, you discovered the Keras API for adding activity regularization to deep learning neural network models.

Specifically, you learned:

- How to create vector norm regularizers using the Keras API.
- How to add activity regularization to MLP, CNN, and RNN layers using the Keras API.
- How to reduce overfitting by adding an activity regularization to an existing model.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Activation Regularization for Reducing Generalization Error in Deep Learning Neural Networks appeared first on Machine Learning Mastery.

Deep learning models are capable of automatically learning a rich internal representation from raw input data.

This is called feature or representation learning. Better learned representations, in turn, can lead to better insights into the domain, e.g. via visualization of learned features, and to better predictive models that make use of the learned features.

A problem with learned features is that they can be too specialized to the training data, or overfit, and not generalize well to new examples. Large values in the learned representation can be a sign of the representation being overfit. Activity or representation regularization provides a technique to encourage the learned representations, the output or activation of the hidden layer or layers of the network, to stay small and sparse.

In this post, you will discover activation regularization as a technique to improve the generalization of learned features in neural networks.

After reading this post, you will know:

- Neural networks learn features from data and models, such as autoencoders and encoder-decoder models, explicitly seek effective learned representations.
- Similar to weights, large values in learned features, e.g. large activations, may indicate an overfit model.
- The addition of penalties to the loss function that penalize a model in proportion to the magnitude of the activations may result in more robust and generalized learned features.

Let’s get started.

This tutorial is divided into five parts; they are:

- Problem With Learned Features
- Encourage Small Activations
- How to Encourage Small Activations
- Examples of Activation Regularization
- Tips for Using Activation Regularization

Deep learning models are able to perform feature learning.

That is, during the training of the network, the model will automatically extract the salient features from the input patterns or “*learn features*.” These features may be used in the network in order to predict a quantity for regression or predict a class value for classification.

These internal representations are tangible things. The output of a hidden layer within the network represents the features learned by the model at that point in the network.

There is a field of study focused on the efficient and effective automatic learning of features, often investigated by having a network reduce an input to a small learned feature before using a second network to reconstruct the original input from the learned feature. Models of this type are called auto-encoders, or encoder-decoders, and their learned features can be useful to learn more about the domain (e.g. via visualization) and in predictive models.

The learned features, or “*encoded inputs*,” must be large enough to capture the salient features of the input but also focused enough to not over-fit the specific examples in the training dataset. As such, there is a tension between the expressiveness and the generalization of the learned features.

More importantly, when the dimension of the code in an encoder-decoder architecture is larger than the input, it is necessary to limit the amount of information carried by the code, lest the encoder-decoder may simply learn the identity function in a trivial way and produce uninteresting features.

— Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition, 2007.

In the same way that large weights in the network can signify an unstable and overfit model, large output values in the learned features can signify the same problems.

It is desirable to have small values in the learned features, e.g. small outputs or activations from the encoder network.

The loss function of the network can be updated to penalize models in proportion to the magnitude of their activation.

This is similar to “*weight regularization*” where the loss function is updated to penalize the model in proportion to the magnitude of the weights. The output of a layer is referred to as its ‘*activation*,’ as such, this form of penalty or regularization is referred to as ‘*activation regularization*.’

… place a penalty on the activations of the units in a neural network, encouraging their activations to be sparse.

— Page 254, Deep Learning, 2016.

The output of an encoder or, generally, the output of a hidden layer in a neural network may be considered the representation of the problem at that point in the model. As such, this type of penalty may also be referred to as ‘*representation regularization*.’

The desire to have small activations or even very few activations with mostly zero values is also called a desire for sparsity. As such, this type of penalty is also referred to as ‘*sparse feature learning*.’

One way to limit the information content of an overcomplete code is to make it sparse.

— Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition, 2007.

The encouragement of sparse learned features in autoencoder models is referred to as ‘*sparse autoencoders*.’

A sparse autoencoder is simply an autoencoder whose training criterion involves a sparsity penalty on the code layer, in addition to the reconstruction error

— Page 505, Deep Learning, 2016.

Sparsity is most commonly sought when a larger-than-required hidden layer (e.g. over-complete) is used to learn features that may encourage over-fitting. The introduction of a sparsity penalty counters this problem and encourages better generalization.

A sparse overcomplete learned feature has been shown to be more effective than other types of learned features offering better robustness to noise and even transforms in the input, e.g. learned features of images may have improved invariance to the position of objects in the image.

Sparse-overcomplete representations have a number of theoretical and practical advantages, as demonstrated in a number of recent studies. In particular, they have good robustness to noise, and provide a good tiling of the joint space of location and frequency. In addition, they are advantageous for classifiers because classification is more likely to be easier in higher dimensional spaces.

— Sparse Feature Learning for Deep Belief Networks, 2007.

There is a general focus on sparsity of the representations rather than small vector magnitudes. A study of these representations that is more general than the use of neural networks is known as ‘*sparse coding*.’

Sparse coding provides a class of algorithms for finding succinct representations of stimuli; given only unlabeled input data, it learns basis functions that capture higher-level features in the data.

— Efficient sparse coding algorithms, 2007.

An activation penalty can be applied per-layer, perhaps only at one layer that is the focus of the learned representation, such as the output of the encoder model or the middle (bottleneck) of an autoencoder model.

A constraint can be applied that adds a penalty proportional to the magnitude of the vector output of the layer.

The activation values may be positive or negative, so we cannot simply sum the values.

Two common methods for calculating the magnitude of the activation are:

- Sum of the absolute activation values, called the L1 vector norm.
- Sum of the squared activation values, called the squared L2 vector norm.

The L1 norm encourages sparsity, e.g. allows some activations to become zero, whereas the L2 norm encourages small activation values in general. The L1 norm is probably the more commonly used penalty for activation regularization.

A hyperparameter must be specified that indicates the amount or degree that the loss function will weight or pay attention to the penalty. Common values are on a logarithmic scale between 0 and 0.1, such as 0.1, 0.001, 0.0001, etc.
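A minimal sketch of how the penalty and its coefficient combine with the loss is given below. The base loss value and the activations are invented for illustration, and `penalized_loss` is a hypothetical helper rather than a library function.

```python
def penalized_loss(base_loss, activations, coef, norm="l1"):
    """Add a weighted activation penalty to an already computed loss value."""
    if norm == "l1":
        magnitude = sum(abs(a) for a in activations)
    elif norm == "l2":
        magnitude = sum(a * a for a in activations)
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return base_loss + coef * magnitude

acts = [0.0, -1.0, 2.0]  # invented layer outputs
for coef in (0.1, 0.001, 0.0001):
    print(coef, penalized_loss(0.5, acts, coef))
```

Smaller coefficients leave the loss dominated by the prediction error, while larger ones make the optimizer pay more attention to keeping the activations small.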

Activity regularization can be used in conjunction with other regularization techniques, such as weight regularization.

This section provides some examples of activation regularization in order to provide some context for how the technique may be used in practice.

Regularized or sparse activations were originally sought as an approach to support the development of much deeper neural networks, early in the history of deep learning. As such, many examples may make use of architectures like restricted Boltzmann machines (RBMs) that have been replaced by more modern methods. Another big application of activation regularization is in autoencoders with semi-labeled or unlabeled data, so-called sparse autoencoders.

Xavier Glorot, et al. at the University of Montreal introduced the use of the rectified linear activation function to encourage sparsity of representation. They used an L1 penalty and evaluated deep supervised MLPs on a range of classical computer vision classification tasks such as MNIST and CIFAR10.

Additionally, an L1 penalty on the activations with a coefficient of 0.001 was added to the cost function during pre-training and fine-tuning in order to increase the amount of sparsity in the learned representations

— Deep Sparse Rectifier Neural Networks, 2011.

Stephen Merity, et al. from Salesforce Research used L2 activation regularization with LSTMs on outputs and recurrent outputs for natural language processing in conjunction with dropout regularization. They tested a suite of different activation regularization coefficient values on a range of language modeling problems.

While simple to implement, activity regularization and temporal activity regularization are competitive with other far more complex regularization techniques and offer equivalent or better results.

— Revisiting Activation Regularization for Language RNNs, 2017.

This section provides some tips for using activation regularization with your neural network.

Activation regularization is a generic approach.

It can be used with most, perhaps all, types of neural network models, not least the most common network types of Multilayer Perceptrons, Convolutional Neural Networks, and Long Short-Term Memory Recurrent Neural Networks.

Activity regularization may be best suited to those model types that explicitly seek an efficient learned representation.

These include models such as autoencoders (i.e. sparse autoencoders) and encoder-decoder models, such as encoder-decoder LSTMs used for sequence-to-sequence prediction problems.

The most common activation regularization is the L1 norm as it encourages sparsity.

Experiment with other types of regularization such as the L2 norm or using both the L1 and L2 norms at the same time, e.g. like the Elastic Net linear regression algorithm.

The rectified linear activation function, also called relu, is an activation function that is now widely used in the hidden layer of deep neural networks.

Unlike classical activation functions such as tanh (hyperbolic tangent function) and sigmoid (logistic function), the relu function allows exact zero values easily. This makes it a good candidate when learning sparse representations, such as with the l1 vector norm activation regularization.
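The contrast is easy to check with the standard library. The input values below are arbitrary, and the helper names are illustrative.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def count_zeros(values):
    return sum(1 for v in values if v == 0.0)

inputs = [-3.0, -1.0, -0.1, 0.5, 2.0]  # arbitrary pre-activation values
relu_out = [relu(x) for x in inputs]
sigmoid_out = [sigmoid(x) for x in inputs]
tanh_out = [math.tanh(x) for x in inputs]

# relu maps every negative input to exactly zero; the other two do not
print(count_zeros(relu_out), count_zeros(sigmoid_out), count_zeros(tanh_out))
```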

It is common to use small values for the regularization hyperparameter that controls the contribution of each activation to the penalty.

Perhaps start by testing values on a log scale, such as 0.1, 0.001, and 0.0001. Then use a grid search at the order of magnitude that shows the most promise.
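This two-stage search could be sketched as follows. The multipliers used for the finer grid are an arbitrary choice, both helper names are hypothetical, and `evaluate` is a placeholder for a routine that fits your model with a given coefficient and returns a score where higher is better.

```python
def best_value(values, evaluate):
    """Return the candidate with the highest score from evaluate(value)."""
    return max(values, key=evaluate)

def fine_grid(center, multipliers=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Candidates clustered around the most promising order of magnitude."""
    return [center * m for m in multipliers]

coarse = [0.1, 0.001, 0.0001]

# placeholder scorer for demonstration only: pretends 0.001 is the best setting
def evaluate(value):
    return -abs(value - 0.001)

promising = best_value(coarse, evaluate)
print(promising, fine_grid(promising))
```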

It is generally good practice to rescale input variables to have the same scale.

When input variables have different scales, the scale of the weights of the network will, in turn, vary accordingly. Large weights can saturate the nonlinear transfer function and reduce the variance in the output from the layer. This may introduce a problem when using activation regularization.

This problem can be addressed by either normalizing or standardizing input variables.
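Both options can be sketched per input variable with the standard library. The sketch assumes each column contains at least two distinct values (so no division by zero occurs) and uses the population standard deviation; the sample values are invented.

```python
from statistics import mean, pstdev

def normalize(column):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def standardize(column):
    """Rescale values to zero mean and unit standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

values = [10.0, 20.0, 30.0, 40.0]  # invented input variable
print(normalize(values))
print(standardize(values))
```

In practice, the statistics should be computed on the training data only and then reused to transform the test data.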

Configure the layer chosen to be the learned features, e.g. the output of the encoder or the bottleneck in the autoencoder, to have more nodes than may be required.

This is called an overcomplete representation that will encourage the network to overfit the training examples. This can be countered with a strong activation regularization in order to encourage a rich learned representation that is also sparse.

This section provides more resources on the topic if you are looking to go deeper.

- 7.10 Sparse Representations, Deep Learning, 2016.

- Deep Sparse Rectifier Neural Networks, 2011.
- Sparse Feature Learning for Deep Belief Networks, 2007.
- Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition, 2007.
- Efficient sparse coding algorithms, 2007.
- Measuring Invariances in Deep Networks, 2009.
- Sparse deep belief net model for visual area V2, 2007.
- Revisiting Activation Regularization for Language RNNs, 2017.
- Sparse Activity and Sparse Connectivity in Supervised Learning, 2013.

In this post, you discovered activation regularization as a technique to improve the generalization of learned features.

Specifically, you learned:

- Neural networks learn features from data and models, such as autoencoders and encoder-decoder models, explicitly seek effective learned representations.
- Similar to weights, large values in learned features, e.g. large activations, may indicate an overfit model.
- The addition of penalties to the loss function that penalize a model in proportion to the magnitude of the activations may result in more robust and generalized learned features.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Reduce Overfitting in Deep Neural Networks Using Weight Constraints in Keras appeared first on Machine Learning Mastery.

Weight constraints provide an approach to reduce the overfitting of a deep learning neural network model on the training data and improve the performance of the model on new data, such as the holdout test set.

There are multiple types of weight constraints, such as maximum and unit vector norms, and some require a hyperparameter that must be configured.

In this tutorial, you will discover the Keras API for adding weight constraints to deep learning neural network models to reduce overfitting.

After completing this tutorial, you will know:

- How to create vector norm constraints using the Keras API.
- How to add weight constraints to MLP, CNN, and RNN layers using the Keras API.
- How to reduce overfitting by adding a weight constraint to an existing model.

Let’s get started.

This tutorial is divided into three parts; they are:

- Weight Constraints in Keras
- Weight Constraints on Layers
- Weight Constraint Case Study

The Keras API supports weight constraints.

The constraints are specified per-layer, but applied and enforced per-node within the layer.

Using a constraint generally involves setting the *kernel_constraint* argument on the layer for the input weights and the *bias_constraint* for the bias weights.

Generally, weight constraints are not used on the bias weights.

A suite of different vector norms can be used as constraints, provided as classes in the keras.constraints module. They are:

- **Maximum norm** (*max_norm*), to force weights to have a magnitude at or below a given limit.
- **Non-negative norm** (*non_neg*), to force weights to be non-negative.
- **Unit norm** (*unit_norm*), to force weights to have a magnitude of 1.0.
- **Min-Max norm** (*min_max_norm*), to force weights to have a magnitude within a given range.

For example, a constraint can be imported and instantiated:

```python
# import norm
from keras.constraints import max_norm
# instantiate norm
norm = max_norm(3.0)
```
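To make "specified per-layer, but enforced per-node" more concrete, the following is a minimal sketch of a max norm constraint in plain Python. The helper `apply_max_norm`, the small `epsilon` term, and the example weights are hypothetical; the sketch is meant to mirror the kind of per-node rescaling a max norm constraint performs after a weight update, not to reproduce the Keras implementation.

```python
# minimal sketch of a max norm constraint applied per node (plain Python, no Keras)
from math import sqrt

def apply_max_norm(weights, max_value, epsilon=1e-7):
    """Rescale each column of weights so its L2 norm is at most max_value.

    weights is a list of rows with shape (n_inputs, n_nodes), so each
    column holds the incoming weights of one node.
    """
    n_nodes = len(weights[0])
    # L2 norm of the incoming weights of each node
    norms = [sqrt(sum(row[j] ** 2 for row in weights)) for j in range(n_nodes)]
    # clip each norm to the limit and derive a per-node scale factor
    scales = [min(norm, max_value) / (epsilon + norm) for norm in norms]
    return [[row[j] * scales[j] for j in range(n_nodes)] for row in weights]

# two inputs, two nodes: node 0 has norm 5.0, node 1 has norm 0.5
weights = [[3.0, 0.3],
           [4.0, 0.4]]
constrained = apply_max_norm(weights, max_value=3.0)
for row in constrained:
    print(['%.3f' % w for w in row])
```

Only the node whose incoming weight vector exceeds the limit is scaled down; the other node is left essentially unchanged, even though both belong to the same layer.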

The weight norms can be used with most layers in Keras.

In this section, we will look at some common examples.

The example below sets a maximum norm weight constraint on a Dense fully connected layer.

```python
# example of max norm on a dense layer
from keras.layers import Dense
from keras.constraints import max_norm
...
model.add(Dense(32, kernel_constraint=max_norm(3), bias_constraint=max_norm(3)))
...
```

The example below sets a maximum norm weight constraint on a convolutional layer.

```python
# example of max norm on a cnn layer
from keras.layers import Conv2D
from keras.constraints import max_norm
...
model.add(Conv2D(32, (3,3), kernel_constraint=max_norm(3), bias_constraint=max_norm(3)))
...
```

Unlike other layer types, recurrent neural networks allow you to set a weight constraint on both the input weights and bias, as well as the recurrent input weights.

The constraint for the recurrent weights is set via the *recurrent_constraint* argument to the layer.

The example below sets a maximum norm weight constraint on an LSTM layer.

```python
# example of max norm on an lstm layer
from keras.layers import LSTM
from keras.constraints import max_norm
...
model.add(LSTM(32, kernel_constraint=max_norm(3), recurrent_constraint=max_norm(3), bias_constraint=max_norm(3)))
...
```

Now that we know how to use the weight constraint API, let’s look at a worked example.


In this section, we will demonstrate how to use weight constraints to reduce overfitting of an MLP on a simple binary classification problem.

This example provides a template for applying weight constraints to your own neural network for classification and regression problems.

We will use a standard binary classification problem that defines two semi-circles of observations, one semi-circle for each class.

Each observation has two input variables with the same scale and a class output value of either 0 or 1. This dataset is called the “*moons*” dataset because of the shape of the observations in each class when plotted.

We can use the make_moons() function to generate observations from this problem. We will add noise to the data and seed the random number generator so that the same samples are generated each time the code is run.

```python
# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
```

The complete example of generating the dataset and plotting it is listed below.

```python
# generate two moons dataset
from sklearn.datasets import make_moons
from matplotlib import pyplot
from pandas import DataFrame
# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# scatter plot, dots colored by class value
df = DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue'}
fig, ax = pyplot.subplots()
grouped = df.groupby('label')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
pyplot.show()
```

Running the example creates a scatter plot showing the semi-circle or moon shape of the observations in each class. We can see the noise in the dispersal of the points making the moons less obvious.

We have only generated 100 samples, which is small for a neural network, providing the opportunity to overfit the training dataset and have higher error on the test dataset: a good case for using regularization. Further, the samples have noise, giving the model an opportunity to learn aspects of the samples that don’t generalize.

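We can develop an MLP model to address this binary classification problem.

Before defining the model, we split the dataset into train and test sets, using the first 30 examples to train the model and the remaining 70 to evaluate it.

```python
# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
```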

Next, we can define the model.

The model has a single hidden layer with 500 nodes and the rectified linear activation function. A sigmoid activation function is used in the output layer in order to predict class values of 0 or 1.

The model is optimized using the binary cross entropy loss function, which is suitable for binary classification problems, and the efficient Adam version of gradient descent.

We will also use the test dataset as a validation dataset.

We can evaluate the performance of the model on the test dataset and report the result.

Finally, we will plot the performance of the model on both the train and test set each epoch.

We can tie all of these pieces together; the complete example is listed below.

```python
# mlp overfit on the moons dataset
from sklearn.datasets import make_moons
from keras.layers import Dense
from keras.models import Sequential
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot history (the key is 'accuracy' in newer Keras versions and 'acc' in older ones)
acc_key = 'accuracy' if 'accuracy' in history.history else 'acc'
pyplot.plot(history.history[acc_key], label='train')
pyplot.plot(history.history['val_' + acc_key], label='test')
pyplot.legend()
pyplot.show()
```

Running the example reports the model performance on the train and test datasets.

Your specific results may vary given the stochastic nature of the neural network and the training algorithm. Because the model is overfit, we generally would not expect much, if any, variance in the accuracy across repeated runs of the model on the same dataset.

Train: 1.000, Test: 0.914

A figure is created showing line plots of the model accuracy on the train and test sets.

We can update the example to use a weight constraint.

There are a few different weight constraints to choose from. A good simple constraint for this model is to normalize the weights so that the norm is equal to 1.0.

This constraint has the effect of forcing all incoming weights to be small.

We can do this by using the *unit_norm* in Keras. This constraint can be added to the first hidden layer as follows:

```python
model.add(Dense(500, input_dim=2, activation='relu', kernel_constraint=unit_norm()))
```

We can also achieve the same result by using the *min_max_norm* and setting the min and maximum to 1.0, for example:

```python
model.add(Dense(500, input_dim=2, activation='relu', kernel_constraint=min_max_norm(min_value=1.0, max_value=1.0)))
```

We cannot achieve the same result with the maximum norm constraint as it will allow norms at or below the specified limit; for example:

```python
model.add(Dense(500, input_dim=2, activation='relu', kernel_constraint=max_norm(1.0)))
```
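The difference between these three constraints can be sketched on a single node's incoming weight vector. The helper functions below are hypothetical plain-Python stand-ins, not the Keras classes; they assume a non-zero weight vector and that the constraint is enforced fully at every update.

```python
# minimal sketch comparing unit, min-max, and max norm on one node's weights
from math import sqrt

def l2_norm(vector):
    return sqrt(sum(v ** 2 for v in vector))

def rescale(vector, target_norm):
    """Scale a non-zero vector so that its L2 norm equals target_norm."""
    norm = l2_norm(vector)
    return [v * target_norm / norm for v in vector]

def unit_norm_sketch(vector):
    # always rescale to a norm of exactly 1.0
    return rescale(vector, 1.0)

def min_max_norm_sketch(vector, min_value, max_value):
    # clip the current norm into the range [min_value, max_value]
    target = min(max(l2_norm(vector), min_value), max_value)
    return rescale(vector, target)

def max_norm_sketch(vector, max_value):
    # only shrink vectors whose norm exceeds the limit
    target = min(l2_norm(vector), max_value)
    return rescale(vector, target)

for vector in ([0.3, 0.4], [3.0, 4.0]):  # L2 norms of 0.5 and 5.0
    results = (
        ('unit_norm', unit_norm_sketch(vector)),
        ('min_max_norm(1, 1)', min_max_norm_sketch(vector, 1.0, 1.0)),
        ('max_norm(1)', max_norm_sketch(vector, 1.0)),
    )
    for name, constrained in results:
        print('%s on norm %.1f -> norm %.3f' % (name, l2_norm(vector), l2_norm(constrained)))
```

The unit and min-max variants agree on both vectors, whereas the max norm variant leaves the small vector at its original norm, which is why it does not give the same result.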

The complete updated example with the unit norm constraint is listed below:

```python
# mlp overfit on the moons dataset with a unit norm constraint
from sklearn.datasets import make_moons
from keras.layers import Dense
from keras.models import Sequential
from keras.constraints import unit_norm
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu', kernel_constraint=unit_norm()))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot history (the key is 'accuracy' in newer Keras versions and 'acc' in older ones)
acc_key = 'accuracy' if 'accuracy' in history.history else 'acc'
pyplot.plot(history.history[acc_key], label='train')
pyplot.plot(history.history['val_' + acc_key], label='test')
pyplot.legend()
pyplot.show()
```

Running the example reports the model performance on the train and test datasets.

We can see that indeed the strict constraint on the size of the weights has improved the performance of the model on the holdout set without impacting performance on the training set.

Train: 1.000, Test: 0.943

Reviewing the line plot of train and test accuracy, we can see that it no longer appears that the model has overfit the training dataset.

Model accuracy on both the train and test sets continues to increase to a plateau.

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Report Weight Norm**. Update the example to calculate the magnitude of the network weights and demonstrate that the constraint indeed made the magnitude smaller.
- **Constrain Output Layer**. Update the example to add a constraint to the output layer of the model and compare the results.
- **Constrain Bias**. Update the example to add a constraint to the bias weight and compare the results.
- **Repeated Evaluation**. Update the example to fit and evaluate the model multiple times and report the mean and standard deviation of model performance.
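As a possible starting point for the weight norm extension, the sketch below reports the norm of each node's incoming weights in plain Python. The function names and the example kernel are hypothetical; with a fitted Keras model, the kernel of the first layer could be obtained with something like `model.layers[0].get_weights()[0].tolist()`, which is an assumption about how you would wire it in rather than part of the tutorial code.

```python
# minimal sketch for reporting the norm of each node's incoming weights
from math import sqrt

def node_norms(kernel):
    """Return the L2 norm of each column of a (n_inputs, n_nodes) kernel."""
    n_nodes = len(kernel[0])
    return [sqrt(sum(row[j] ** 2 for row in kernel)) for j in range(n_nodes)]

def summarize_norms(kernel):
    """Return the minimum, mean, and maximum per-node norm."""
    norms = node_norms(kernel)
    return min(norms), sum(norms) / len(norms), max(norms)

# hypothetical kernel with 2 inputs and 3 nodes, used only for illustration
kernel = [[0.6, 0.0, 3.0],
          [0.8, 1.0, 4.0]]
low, mean, high = summarize_norms(kernel)
print('min=%.3f, mean=%.3f, max=%.3f' % (low, mean, high))
```

Running the same summary on the kernels of the unconstrained and constrained models gives a direct way to compare the magnitude of their weights.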

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Keras Constraints API
- Keras constraints.py
- Keras Core Layers API
- Keras Convolutional Layers API
- Keras Recurrent Layers API
- sklearn.datasets.make_moons API

In this tutorial, you discovered the Keras API for adding weight constraints to deep learning neural network models.

Specifically, you learned:

- How to create vector norm constraints using the Keras API.
- How to add weight constraints to MLP, CNN, and RNN layers using the Keras API.
- How to reduce overfitting by adding a weight constraint to an existing model.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.
