How to Reduce Overfitting Using Weight Constraints in Keras

Weight constraints provide an approach to reduce the overfitting of a deep learning neural network model on the training data and improve the performance of the model on new data, such as the holdout test set.

There are multiple types of weight constraints, such as maximum and unit vector norms, and some require a hyperparameter that must be configured.

In this tutorial, you will discover the Keras API for adding weight constraints to deep learning neural network models to reduce overfitting.

After completing this tutorial, you will know:

  • How to create vector norm constraints using the Keras API.
  • How to add weight constraints to MLP, CNN, and RNN layers using the Keras API.
  • How to reduce overfitting by adding a weight constraint to an existing model.

Kick-start your project with my new book Better Deep Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

  • Updated Mar/2019: fixed typo using equality instead of assignment in some usage examples.
  • Updated Oct/2019: Updated for Keras 2.3 and TensorFlow 2.0.
How to Reduce Overfitting in Deep Neural Networks With Weight Constraints in Keras
Photo by Ian Sane, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Weight Constraints in Keras
  2. Weight Constraints on Layers
  3. Weight Constraint Case Study

Weight Constraints in Keras

The Keras API supports weight constraints.

The constraints are specified per-layer, but applied and enforced per-node within the layer.

Using a constraint generally involves setting the kernel_constraint argument on the layer for the input weights and the bias_constraint for the bias weights.

Generally, weight constraints are not used on the bias weights.

A suite of different vector norms can be used as constraints, provided as classes in the keras.constraints module. They are:

  • Maximum norm (max_norm), to force weights to have a magnitude at or below a given limit.
  • Non-negative norm (non_neg), to force weights to be non-negative (greater than or equal to zero).
  • Unit norm (unit_norm), to force weights to have a magnitude of 1.0.
  • Min-Max norm (min_max_norm), to force weights to have a magnitude within a specified range.

For example, a constraint can be imported and instantiated:
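
The snippet below is a minimal sketch, using the max norm constraint with an illustrative limit of 3.0; the classes live in keras.constraints (or tensorflow.keras.constraints under TensorFlow 2).

    # import a constraint class and create an instance of it
    from keras.constraints import max_norm
    # limit the vector norm of the weights to 3.0 (an illustrative value)
    constraint = max_norm(3.0)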

Weight Constraints on Layers

The weight norms can be used with most layers in Keras.

In this section, we will look at some common examples.

MLP Weight Constraint

The example below sets a maximum norm weight constraint on a Dense fully connected layer.
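A sketch of such a layer is shown below; the layer size, input dimension, and norm limit of 3 are illustrative choices.

    # sketch: max norm constraint on a fully connected layer
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.constraints import max_norm

    model = Sequential()
    # constrain both the input weights and the bias of the layer
    model.add(Dense(32, input_dim=10, activation='relu',
                    kernel_constraint=max_norm(3), bias_constraint=max_norm(3)))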

CNN Weight Constraint

The example below sets a maximum norm weight constraint on a convolutional layer.
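A sketch of a constrained convolutional layer; the filter count, kernel size, input shape, and norm limit are illustrative choices.

    # sketch: max norm constraint on a convolutional layer
    from keras.models import Sequential
    from keras.layers import Conv2D
    from keras.constraints import max_norm

    model = Sequential()
    # constrain the filter (kernel) weights and the bias of the layer
    model.add(Conv2D(32, (3, 3), input_shape=(28, 28, 1), activation='relu',
                     kernel_constraint=max_norm(3), bias_constraint=max_norm(3)))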

RNN Weight Constraint

Unlike other layer types, recurrent neural network layers allow you to set a weight constraint on the input weights and bias, as well as on the recurrent input weights.

The constraint for the recurrent weights is set via the recurrent_constraint argument to the layer.

The example below sets a maximum norm weight constraint on an LSTM layer.
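A sketch of a constrained LSTM layer; the number of units, input shape, and norm limit are illustrative choices.

    # sketch: max norm constraint on an LSTM layer
    from keras.models import Sequential
    from keras.layers import LSTM
    from keras.constraints import max_norm

    model = Sequential()
    # constrain the input weights, the recurrent weights, and the bias
    model.add(LSTM(32, input_shape=(10, 1),
                   kernel_constraint=max_norm(3),
                   recurrent_constraint=max_norm(3),
                   bias_constraint=max_norm(3)))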

Now that we know how to use the weight constraint API, let’s look at a worked example.

Weight Constraint Case Study

In this section, we will demonstrate how to use weight constraints to reduce overfitting of an MLP on a simple binary classification problem.

This example provides a template for applying weight constraints to your own neural network for classification and regression problems.

Binary Classification Problem

We will use a standard binary classification problem that defines two semi-circles of observations, one semi-circle for each class.

Each observation has two input variables with the same scale and a class output value of either 0 or 1. This dataset is called the “moons” dataset because of the shape of the observations in each class when plotted.

We can use the make_moons() function to generate observations from this problem. We will add noise to the data and seed the random number generator so that the same samples are generated each time the code is run.
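
A sketch of the data generation step; the noise level of 0.2 and the seed value of 1 are illustrative choices.

    # generate a noisy two-class moons dataset with a fixed seed
    from sklearn.datasets import make_moons
    X, y = make_moons(n_samples=100, noise=0.2, random_state=1)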

We can plot the dataset where the two variables are taken as x and y coordinates on a graph and the class value is taken as the color of the observation.

The complete example of generating the dataset and plotting it is listed below.
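
The listing below is a self-contained sketch of that example, using the same illustrative noise and seed values as above.

    # scatter plot of the moons dataset, colored by class value
    from sklearn.datasets import make_moons
    from matplotlib import pyplot
    # generate the dataset
    X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
    # plot the points for each class in a different color
    for class_value in range(2):
        row_ix = (y == class_value)
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(class_value))
    pyplot.legend()
    pyplot.show()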

Running the example creates a scatter plot showing the semi-circle or moon shape of the observations in each class. We can see the noise in the dispersal of the points making the moons less obvious.

Scatter Plot of Moons Dataset With Color Showing the Class Value of Each Sample

This is a good test problem because the classes cannot be separated by a straight line, i.e. they are not linearly separable, requiring a nonlinear method such as a neural network to address it.

We have only generated 100 samples, which is small for a neural network, providing the opportunity to overfit the training dataset and have higher error on the test dataset: a good case for using regularization. Further, the samples have noise, giving the model an opportunity to learn aspects of the samples that don’t generalize.

Overfit Multilayer Perceptron

We can develop an MLP model to address this binary classification problem.

The model will have one hidden layer with more nodes than may be required to solve this problem, providing an opportunity to overfit. We will also train the model for longer than is required to ensure the model overfits.

Before we define the model, we will split the dataset into train and test sets, using 30 examples to train the model and 70 to evaluate the fit model’s performance.
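
A sketch of the split, assuming the X and y arrays generated above.

    # split into train and test sets: 30 examples to train, 70 to evaluate
    n_train = 30
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]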

Next, we can define the model.

The hidden layer has 500 nodes and uses the rectified linear activation function. A sigmoid activation function is used in the output layer in order to predict class values of 0 or 1.

The model is optimized using the binary cross-entropy loss function, suitable for binary classification problems, and the efficient Adam version of gradient descent.
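
A sketch of the model definition as described above.

    # define an intentionally over-capacity model
    from keras.models import Sequential
    from keras.layers import Dense

    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])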

The defined model is then fit on the training data for 4,000 epochs and the default batch size of 32.

We will also use the test dataset as a validation dataset.
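
As a sketch, continuing from the model and the train/test arrays defined above:

    # fit the model, monitoring the test set as validation data each epoch
    history = model.fit(trainX, trainy, validation_data=(testX, testy),
                        epochs=4000, verbose=0)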

We can evaluate the performance of the model on the test dataset and report the result.
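
For example (a sketch, assuming the fit model from above):

    # evaluate the model on the train and test sets
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))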

Finally, we will plot the performance of the model on both the train and test set each epoch.

If the model does indeed overfit the training dataset, we would expect the line plot of accuracy on the training set to continue to increase and the test set to rise and then fall again as the model learns statistical noise in the training dataset.
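
A sketch of the plot, using the history object returned by the call to fit(); under Keras 2.3+/TensorFlow 2 the accuracy keys are 'accuracy' and 'val_accuracy' (older versions use 'acc' and 'val_acc').

    # plot accuracy learning curves for the train and test sets
    from matplotlib import pyplot
    pyplot.plot(history.history['accuracy'], label='train')
    pyplot.plot(history.history['val_accuracy'], label='test')
    pyplot.legend()
    pyplot.show()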

We can tie all of these pieces together; the complete example is listed below.
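
The listing below is a self-contained sketch assembled from the snippets above; the noise level and seed remain illustrative choices.

    # mlp that overfits the moons dataset
    from sklearn.datasets import make_moons
    from keras.models import Sequential
    from keras.layers import Dense
    from matplotlib import pyplot
    # generate a noisy two-class dataset
    X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
    # split into train and test sets
    n_train = 30
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define an intentionally over-capacity model
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit the model, monitoring the test set each epoch
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
    # evaluate the model on the train and test sets
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    # plot accuracy learning curves ('accuracy' keys under Keras 2.3+/TF 2)
    pyplot.plot(history.history['accuracy'], label='train')
    pyplot.plot(history.history['val_accuracy'], label='test')
    pyplot.legend()
    pyplot.show()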

Running the example reports the model performance on the train and test datasets.

We can see that the model has better performance on the training dataset than the test dataset, one possible sign of overfitting.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Because the model is overfit, we generally would not expect much, if any, variance in the accuracy across repeated runs of the model on the same dataset.

A figure is created showing line plots of the model accuracy on the train and test sets.

We can see the expected shape of an overfit model, where test accuracy increases to a point and then begins to decrease again.

Line Plots of Accuracy on Train and Test Datasets While Training Showing an Overfit

Overfit MLP With Weight Constraint

We can update the example to use a weight constraint.

There are a few different weight constraints to choose from. A good, simple constraint for this model is to normalize the weights so that the norm is equal to 1.0.

This constraint has the effect of forcing all incoming weights to be small.

We can do this by using the unit_norm constraint in Keras. This constraint can be added to the first hidden layer as follows:
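
A sketch of the change, keeping the 500-node hidden layer from the model above:

    # add a unit norm constraint to the hidden layer (model as defined above)
    from keras.constraints import unit_norm

    model.add(Dense(500, input_dim=2, activation='relu',
                    kernel_constraint=unit_norm()))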

We can also achieve the same result by using the min_max_norm constraint and setting both the minimum and maximum to 1.0, for example:
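
A sketch of the equivalent layer definition:

    # force the weight norm to be exactly 1.0 using min_max_norm
    from keras.constraints import min_max_norm

    model.add(Dense(500, input_dim=2, activation='relu',
                    kernel_constraint=min_max_norm(min_value=1.0, max_value=1.0)))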

We cannot achieve the same result with the maximum norm constraint as it will allow norms at or below the specified limit; for example:
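
A sketch of the max norm version, which is not equivalent because norms smaller than the limit are still permitted:

    # only caps the norm at 1.0; smaller norms are still allowed
    from keras.constraints import max_norm

    model.add(Dense(500, input_dim=2, activation='relu',
                    kernel_constraint=max_norm(1.0)))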

The complete updated example with the unit norm constraint is listed below:
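
The listing below is a self-contained sketch of the updated example, with the unit norm constraint added to the hidden layer; the noise level and seed remain illustrative choices.

    # mlp on the moons dataset with a unit norm weight constraint
    from sklearn.datasets import make_moons
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.constraints import unit_norm
    from matplotlib import pyplot
    # generate a noisy two-class dataset
    X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
    # split into train and test sets
    n_train = 30
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define the model with a unit norm constraint on the hidden layer
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu', kernel_constraint=unit_norm()))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit the model, monitoring the test set each epoch
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
    # evaluate the model on the train and test sets
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    # plot accuracy learning curves ('accuracy' keys under Keras 2.3+/TF 2)
    pyplot.plot(history.history['accuracy'], label='train')
    pyplot.plot(history.history['val_accuracy'], label='test')
    pyplot.legend()
    pyplot.show()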

Running the example reports the model performance on the train and test datasets.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that indeed the strict constraint on the size of the weights has improved the performance of the model on the holdout set without impacting performance on the training set.

Reviewing the line plot of train and test accuracy, we can see that it no longer appears that the model has overfit the training dataset.

Model accuracy on both the train and test sets continues to increase to a plateau.

Line Plots of Accuracy on Train and Test Datasets While Training With Weight Constraints

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Report Weight Norm. Update the example to calculate the magnitude of the network weights and demonstrate that the constraint indeed made the magnitude smaller.
  • Constrain Output Layer. Update the example to add a constraint to the output layer of the model and compare the results.
  • Constrain Bias. Update the example to add a constraint to the bias weight and compare the results.
  • Repeated Evaluation. Update the example to fit and evaluate the model multiple times and report the mean and standard deviation of model performance.

If you explore any of these extensions, I’d love to know.

Summary

In this tutorial, you discovered the Keras API for adding weight constraints to deep learning neural network models.

Specifically, you learned:

  • How to create vector norm constraints using the Keras API.
  • How to add weight constraints to MLP, CNN, and RNN layers using the Keras API.
  • How to reduce overfitting by adding a weight constraint to an existing model.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Develop Better Deep Learning Models Today!

Better Deep Learning

Train Faster, Reduce Overfitting, and Ensembles

...with just a few lines of Python code

Discover how in my new Ebook:
Better Deep Learning

It provides self-study tutorials on topics like:
weight decay, batch normalization, dropout, model stacking and much more...

Bring better deep learning to your projects!

Skip the Academics. Just Results.

See What's Inside

34 Responses to How to Reduce Overfitting Using Weight Constraints in Keras

  1. Philippe November 26, 2018 at 8:35 am #

    Is gradient clipping similar to a weight constraint?

    • Jason Brownlee November 26, 2018 at 2:00 pm #

      Great question!

      Not quite.

      Weight constraints are applied to the weights and are a regularization technique.

      Gradient clipping is applied to the error gradient used to update the weights and is used to avoid exploding gradients.

      • allen December 10, 2018 at 12:59 am #

        Thanks for sharing, but why is it like regularization when there is no penalty? I think weight constraints are more like normalization.

        • Jason Brownlee December 10, 2018 at 6:05 am #

          Yes, exactly like the normalization of weights after each update.

  2. diffusion November 26, 2018 at 12:34 pm #

    Awesome article. This helps to improve the prediction in the Kaggle competition, “Don’t call me turkey!”.

    Wishes,

  3. Oren November 27, 2018 at 7:04 am #

    Does constraining the weights in each layer, say to sum up to one, make the model easier to interpret?

    • Jason Brownlee November 27, 2018 at 2:07 pm #

      Maybe on the input layer, but perhaps not on hidden layers.

  4. Ruchit Dalwadi November 27, 2018 at 9:06 am #

    Say if kernel_constraint=max_norm(A), on what basis should I set the value of ‘A’?

    • Jason Brownlee November 27, 2018 at 2:10 pm #

      Experiment with a range of small integer values, often in [1,4]

  5. Ruchit Dalwadi November 27, 2018 at 9:24 am #

    On which layer should this be applied? Most CNN networks are quite deep. Is there a way to figure that out?

  6. JG December 2, 2018 at 10:46 pm #

    Hola Jason:

    Nice post! Thanks.

    I think this tutorial, aimed at helping to reduce overfitting via a “weight constraint”, is closely linked to the other tutorial named How to Reduce Overfitting of a Deep Learning Model with Weight Regularization, which uses “weight regularization” techniques instead. Both tutorials (the kernel_constraint vs kernel_regularizer arguments) have similar implementations and results.
    In fact, they achieve the same test accuracy of 94.3%.

    In that sense, I also decided to implement the grid search “limiter” hyperparameter analysis (omitted here), as a replication of the “grid regularizer” of the previous post, defining an ALPHA parameter list with the values [0.3, 0.6, 0.9, 1.0, 1.3, 1.6]; these are the values used as the argument of, e.g., min_max_norm (the same ALPHA value for both min and max).

    My ALPHA exploration indicates that, in addition to the value 1.0 (unit_norm, for example), ALPHA values of 0.9 and 1.3 reach the same maximum test accuracy (94.3%), while lower and higher ALPHA values give lower accuracy (92.9% on the test data).

    Thanks

  7. mamina sahu December 14, 2018 at 6:13 pm #

    Nice post.

  8. Vineetha December 19, 2018 at 8:41 pm #

    How can I calculate the equal error rate? Is there any formula?

  9. Vital March 11, 2019 at 5:28 am #

    Hi Jason,

    I tried using bias_constraint==max_norm(3) in a CNN, and the == causes an error:
    SyntaxError: positional argument follows keyword argument

    I believe it should be just =

    • Jason Brownlee March 11, 2019 at 6:55 am #

      Yes, use a single = for assignment.

      I have fixed the examples, thanks.

  10. mustafa mohammed September 5, 2019 at 4:56 am #

    Hello Jason Brownlee,
    What is the code to add weights manually to an LSTM?

  11. mustafa mohammed September 5, 2019 at 10:07 pm #

    How do I use model.set_weights() with an LSTM (regression)?

    How do I add weights from a CSV file?
    Can you help me with code?
    Thank you very much.

    • Jason Brownlee September 6, 2019 at 4:58 am #

      Sorry, I don’t have the capacity to prepare an example for you.

      Why do you want to set model weights from a CSV file?

  12. mustafa mohammed September 6, 2019 at 5:38 am #

    Because I generate the weights from a PSO algorithm, I want to put them in a CSV file and then load them into the LSTM.

    • Jason Brownlee September 6, 2019 at 1:53 pm #

      I see. How do you evaluate the weights without using an LSTM structure in the first place?

  13. GB December 3, 2020 at 9:52 pm #

    Hi Jason,
    when using constraints such as max_norm you are tackling the norm of the weight vector w, right? What if I also want to impose constraints on the individual weights (i.e. the vector elements w_i)?
    Say for example that I want the vector norm of the input layer to be equal to 1, but I also want all the individual weights on this layer to fall between, say, 0 and 0.05.
    How would you implement that in a simple case with only one input, one hidden and one output layer? Am I missing something obvious here?

    • Jason Brownlee December 4, 2020 at 6:42 am #

      Yes.

      It sounds like you’re describing weight clipping.

  14. Atis January 9, 2021 at 12:50 am #

    Maybe what is missing from this post is a discussion of the axis= parameter in the max_norm() specification.

    Different choices for axis/axes lead to constraints of different strength and meaning. And some choices are better motivated than others in certain contexts.

    https://www.tensorflow.org/api_docs/python/tf/keras/constraints/MaxNorm

  15. tiffany January 12, 2021 at 5:47 pm #

    Can weight constraints only be used on input weights, or can they also be used on filter weights?
    I am wondering whether we could use this to constrain the input weights’ min/max for post-training quantization.

    • Jason Brownlee January 13, 2021 at 6:11 am #

      Weight constraints can be used with any weights in the network.

      Typically they are specified layer-wise.

  16. Udit May 11, 2021 at 8:07 pm #

    A kernel constraint can be applied to each hidden layer; should the norm value be a function of the number of nodes in the previous hidden layer?
    For example, say there are two hidden layers with 500 and 10 nodes. If the unit norm is applied to both, the weights of the first hidden layer will be much smaller than those of the second layer. So, is it advisable to vary the norm value so that the weights remain of the same magnitude across hidden layers?
    Looking forward to your response; I haven’t found the answer anywhere.
