How to Get Better Deep Learning Results (7-Day Mini-Course)

By Jason Brownlee on January 8, 2020 in Deep Learning Performance

Better Deep Learning Neural Networks Crash Course.

Get Better Performance From Your Deep Learning Models in 7 Days.

Configuring neural network models is often referred to as a “dark art.”

This is because there are no hard and fast rules for configuring a network for a given problem. We cannot analytically calculate the optimal model type or model configuration for a given dataset.

Fortunately, there are techniques that are known to address specific issues when configuring and training a neural network that are available in modern deep learning libraries such as Keras.

In this crash course, you will discover how you can confidently get better performance from your deep learning models in seven days.

This is a big and important post. You might want to bookmark it.

Kick-start your project with my new book Better Deep Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Update Jan/2020: Updated API for Keras 2.3 and TensorFlow 2.0.

How to Get Better Deep Learning Performance (7 Day Mini-Course)
Photo by Damian Gadal, some rights reserved.

Who Is This Crash-Course For?

Before we get started, let’s make sure you are in the right place.

The list below provides some general guidelines as to who this course was designed for.

You need to know:

Your way around basic Python and NumPy.
The basics of Keras for deep learning.

You do NOT need to know:

How to be a math wiz!
How to be a deep learning expert!

This crash course will take you from a developer who knows a little deep learning to a developer who can get better performance on their own deep learning project.

Note: This crash course assumes you have a working Python 2 or 3 SciPy environment with at least NumPy and Keras 2 installed. If you need help with your environment, you can follow the step-by-step tutorial here:

How to Set Up a Python Environment for Machine Learning and Deep Learning With Anaconda

Crash-Course Overview

This crash course is broken down into seven lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below are seven lessons that will allow you to confidently improve the performance of your deep learning model:

Lesson 01: Better Deep Learning Framework
Lesson 02: Batch Size
Lesson 03: Learning Rate Schedule
Lesson 04: Batch Normalization
Lesson 05: Weight Regularization
Lesson 06: Adding Noise
Lesson 07: Early Stopping

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help (hint, I have all of the answers directly on this blog; use the search box).

I do provide more help in the form of links to related posts because I want you to build up some confidence and inertia.

Post your results in the comments; I’ll cheer you on!

Hang in there; don’t give up.

Note: This is just a crash course. For a lot more detail and fleshed out tutorials, see my book on the topic titled “Better Deep Learning.”

Lesson 01: Better Deep Learning Framework

In this lesson, you will discover a framework that you can use to systematically improve the performance of your deep learning model.

Modern deep learning libraries such as Keras allow you to define and start fitting a wide range of neural network models in minutes with just a few lines of code.

Nevertheless, it is still challenging to configure a neural network to get good performance on a new predictive modeling problem.

There are three types of problems that are straightforward to diagnose with regard to the poor performance of a deep learning neural network model; they are:

Problems with Learning. Problems with learning manifest in a model that cannot effectively learn a training dataset or shows slow progress or bad performance when learning the training dataset.
Problems with Generalization. Problems with generalization manifest in a model that overfits the training dataset and performs poorly on a holdout dataset.
Problems with Predictions. Problems with predictions manifest as the stochastic training algorithm having a strong influence on the final model, causing a high variance in behavior and performance.

The sequential relationship between the three areas in the proposed breakdown allows the issue of deep learning model performance to be first isolated, then targeted with a specific technique or methodology.

We can summarize techniques that assist with each of these problems as follows:

Better Learning. Techniques that improve or accelerate the adaptation of neural network model weights in response to a training dataset.
Better Generalization. Techniques that improve the performance of a neural network model on a holdout dataset.
Better Predictions. Techniques that reduce the variance in the performance of a final model.

You can use this framework to first diagnose the type of problem that you have and then identify a technique to evaluate to attempt to address your problem.
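
As a rough illustration only, the diagnosis step could be sketched as a small function that checks the three areas in order. The function name, the use of accuracy scores, and all three thresholds are hypothetical choices made for this sketch; they are not part of the framework itself.

```python
# sketch: mapping simple symptoms onto the three areas of the framework
# all thresholds below are hypothetical defaults, not values from the course
def diagnose(train_score, test_score, run_std, good=0.8, gap=0.1, spread=0.05):
    # poor score on the training dataset itself: a problem with learning
    if train_score < good:
        return 'Better Learning'
    # good on train but much worse on the holdout set: a problem with generalization
    if train_score - test_score > gap:
        return 'Better Generalization'
    # scores vary a lot across repeated training runs: a problem with predictions
    if run_std > spread:
        return 'Better Predictions'
    return 'No obvious problem'

# hypothetical accuracy scores, chosen only to exercise each branch
print(diagnose(0.60, 0.58, 0.01))
print(diagnose(0.99, 0.75, 0.01))
print(diagnose(0.90, 0.88, 0.08))
```

The checks run in the same order as the framework: there is little point in tuning generalization until the model can learn the training dataset at all.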

Your Task

For this lesson, you must list two techniques or areas of focus that belong to each of the three areas of the framework.

Having trouble? Note that we will be looking at some examples from two of the three areas as part of this mini-course.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to control the speed of learning with the batch size.

Lesson 02: Batch Size

In this lesson, you will discover the importance of the batch size when training neural networks.

Neural networks are trained using gradient descent where the estimate of the error used to update the weights is calculated based on a subset of the training dataset.

The number of examples from the training dataset used in the estimate of the error gradient is called the batch size and is an important hyperparameter that influences the dynamics of the learning algorithm.

The choice of batch size controls how quickly the algorithm learns, for example:

Batch Gradient Descent. Batch size is set to the number of examples in the training dataset, giving a more accurate estimate of the error but a longer time between weight updates.
Stochastic Gradient Descent. Batch size is set to 1, giving a noisy estimate of the error but frequent updates to the weights.
Minibatch Gradient Descent. Batch size is set to a value more than 1 and less than the number of training examples, giving a trade-off between batch and stochastic gradient descent.
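
To make the trade-off concrete, here is a minimal sketch that counts how many weight updates happen in one epoch for each type of gradient descent. The helper name is made up for this illustration, the training set size of 500 is an assumption, and 32 is simply the Keras default batch size.

```python
# sketch: how batch size changes the number of weight updates per epoch
from math import ceil

def updates_per_epoch(n_examples, batch_size):
    # one weight update is made per batch; the last batch may be smaller
    return ceil(n_examples / batch_size)

n_train = 500  # assumed training set size, for illustration only
for name, batch_size in [('batch', n_train), ('minibatch', 32), ('stochastic', 1)]:
    n_updates = updates_per_epoch(n_train, batch_size)
    print('%s: batch_size=%d, updates per epoch=%d' % (name, batch_size, n_updates))
```

Fewer updates per epoch means each update uses a better estimate of the error gradient, but the model needs more epochs to make the same number of changes to the weights.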

Keras allows you to configure the batch size via the batch_size argument to the fit() function, for example:

# fit model
history = model.fit(trainX, trainy, epochs=1000, batch_size=len(trainX))

The example below demonstrates a Multilayer Perceptron with batch gradient descent on a binary classification problem.

# example of batch gradient descent
from sklearn.datasets import make_circles
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD
from matplotlib import pyplot
# generate dataset
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
# split into train and test
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile model
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=1000, batch_size=len(trainX), verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot loss learning curves
pyplot.subplot(211)
pyplot.title('Cross-Entropy Loss', pad=-40)
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
# plot accuracy learning curves
pyplot.subplot(212)
pyplot.title('Accuracy', pad=-40)
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()

Your Task

For this lesson, you must run the code example with each type of gradient descent (batch, minibatch, and stochastic) and describe the effect that it has on the learning curves during training.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to fine-tune a model during training with a learning rate schedule.

Lesson 03: Learning Rate Schedule

In this lesson, you will discover how to configure an adaptive learning rate schedule to fine-tune the model during the training run.

Training a neural network is a search for a good set of weights. The amount of change to the model during each step of this search, or the step size, is called the “learning rate” and is perhaps the most important hyperparameter to tune in order to achieve good performance on your problem.

Configuring a fixed learning rate is very challenging and requires careful experimentation. An alternative to using a fixed learning rate is to instead vary the learning rate over the training process.

Keras provides the ReduceLROnPlateau learning rate schedule that will adjust the learning rate when a plateau in model performance is detected, e.g. no change for a given number of training epochs. For example:

# define learning rate schedule
rlrp = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, min_delta=1E-7, verbose=1)

This callback is designed to reduce the learning rate after the model stops improving with the hope of fine-tuning model weights during training.
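
The idea behind reducing the learning rate on a plateau can be sketched in plain Python. This is a simplified illustration and not the Keras implementation: the function name and the validation losses are made up, and options such as cooldown and a minimum learning rate are left out.

```python
# simplified sketch of reduce-on-plateau logic (not the Keras implementation)
def plateau_schedule(val_losses, lr=0.01, factor=0.1, patience=5, min_delta=1e-7):
    best = float('inf')
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best - min_delta:
            # validation loss improved: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # no meaningful improvement this epoch
            wait += 1
            if wait >= patience:
                # plateau detected: shrink the learning rate and start counting again
                lr = lr * factor
                wait = 0
        # record the learning rate in effect after this epoch
        history.append(lr)
    return history

# hypothetical validation losses: two improving epochs, then a plateau
losses = [0.70, 0.60] + [0.60] * 6
rates = plateau_schedule(losses)
print(rates)
```

With a patience of 5, the learning rate stays at its initial value until five epochs in a row fail to improve on the best validation loss, and is then multiplied by the factor.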

The example below demonstrates a Multilayer Perceptron with a learning rate schedule on a binary classification problem, where the learning rate will be reduced by an order of magnitude if no change is detected in validation loss over 5 training epochs.

# example of a learning rate schedule
from sklearn.datasets import make_circles
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD
from keras.callbacks import ReduceLROnPlateau
from matplotlib import pyplot
# generate dataset
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
# split into train and test
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile model
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
# define learning rate schedule
rlrp = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, min_delta=1E-7, verbose=1)
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=300, verbose=0, callbacks=[rlrp])
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot loss learning curves
pyplot.subplot(211)
pyplot.title('Cross-Entropy Loss', pad=-40)
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
# plot accuracy learning curves
pyplot.subplot(212)
pyplot.title('Accuracy', pad=-40)
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()

Your Task

For this lesson, you must run the code example with and without the learning rate schedule and describe the effect that the learning rate schedule has on the learning curves during training.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how you can accelerate the training process with batch normalization.

Lesson 04: Batch Normalization

In this lesson, you will discover how to accelerate the training process of your deep learning neural network using batch normalization.

Batch normalization, or batchnorm for short, is proposed as a technique to help coordinate the update of multiple layers in the model.

The authors of the paper introducing batch normalization refer to change in the distribution of inputs during training as “internal covariate shift.” Batch normalization was designed to counter the internal covariate shift by scaling the output of the previous layer, specifically by standardizing the activations of each input variable per mini-batch, such as the activations of a node from the previous layer.
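
The standardization step on its own can be sketched with NumPy. This is a minimal illustration under stated assumptions: the function name and the mini-batch values are made up, and the sketch leaves out the learned scale and shift parameters and the moving averages that a full batch normalization layer uses at prediction time.

```python
# sketch: standardizing activations per mini-batch with NumPy
import numpy as np

def standardize_batch(activations, epsilon=1e-3):
    # mean and variance of each input variable (column) across the mini-batch
    mean = activations.mean(axis=0)
    var = activations.var(axis=0)
    # subtract the mean and divide by the standard deviation;
    # epsilon avoids division by zero when a column has no variance
    return (activations - mean) / np.sqrt(var + epsilon)

# hypothetical mini-batch: 4 examples, 2 activations each, on very different scales
batch = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
standardized = standardize_batch(batch)
print(standardized.mean(axis=0))
print(standardized.std(axis=0))
```

After standardization, each column has a mean close to 0 and a standard deviation close to 1, regardless of the scale it started on.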

Keras supports Batch Normalization via a separate BatchNormalization layer that can be added between the hidden layers of your model. For example:

model.add(BatchNormalization())

The example below demonstrates a Multilayer Perceptron model with batch normalization on a binary classification problem.

# example of batch normalization
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.layers import BatchNormalization
from matplotlib import pyplot
# generate dataset
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
# split into train and test
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(1, activation='sigmoid'))
# compile model
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=300, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot loss learning curves
pyplot.subplot(211)
pyplot.title('Cross-Entropy Loss', pad=-40)
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
# plot accuracy learning curves
pyplot.subplot(212)
pyplot.title('Accuracy', pad=-40)
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()

Your Task

For this lesson, you must run the code example with and without batch normalization and describe the effect that batch normalization has on the learning curves during training.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to reduce overfitting using weight regularization.

Lesson 05: Weight Regularization

In this lesson, you will discover how to reduce overfitting of your deep learning neural network using weight regularization.

A model with large weights is more complex than a model with smaller weights. It is a sign of a network that may be overly specialized to training data.

The learning algorithm can be updated to encourage the network toward using small weights.

One way to do this is to change the calculation of loss used in the optimization of the network to also consider the size of the weights. This is called weight regularization or weight decay.
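
As a minimal sketch of the idea, the penalty can be computed as a coefficient times the sum of the squared weights and added to the loss. The function names, the weight values, and the data loss below are all made up for illustration.

```python
# sketch: adding an L2 weight penalty to a loss value
def l2_penalty(weights, alpha=0.01):
    # alpha times the sum of the squared weights
    return alpha * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, alpha=0.01):
    # the optimizer minimizes the data loss plus the weight penalty
    return data_loss + l2_penalty(weights, alpha)

# hypothetical weights and data loss, chosen only for illustration
small_weights = [0.1, -0.2, 0.3]
large_weights = [1.0, -2.0, 3.0]
data_loss = 0.5
print(regularized_loss(data_loss, small_weights))
print(regularized_loss(data_loss, large_weights))
```

For the same data loss, the model with larger weights receives a larger total loss, so the optimizer is pushed toward smaller weights.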

Keras supports weight regularization via the kernel_regularizer argument on a layer, which can be configured to use the L1 or L2 vector norm, for example:

model.add(Dense(500, input_dim=2, activation='relu', kernel_regularizer=l2(0.01)))

The example below demonstrates a Multilayer Perceptron model with weight decay on a binary classification problem.

# example of weight decay
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l2
from matplotlib import pyplot
# generate dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu', kernel_regularizer=l2(0.01)))
model.add(Dense(1, activation='sigmoid'))
# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot loss learning curves
pyplot.subplot(211)
pyplot.title('Cross-Entropy Loss', pad=-40)
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
# plot accuracy learning curves
pyplot.subplot(212)
pyplot.title('Accuracy', pad=-40)
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()

Your Task

For this lesson, you must run the code example with and without weight regularization and describe the effect that it has on the learning curves during training.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to reduce overfitting by adding noise to your model.

Lesson 06: Adding Noise

In this lesson, you will discover that adding noise to a neural network during training can improve the robustness of the network, resulting in better generalization and faster learning.

Training a neural network with a small dataset can cause the network to memorize all training examples, in turn leading to poor performance on a holdout dataset.

One approach to making the input space smoother and easier to learn is to add noise to inputs during training.

The addition of noise during the training of a neural network model has a regularization effect and, in turn, improves the robustness of the model.

Noise can be added to your model in Keras via the GaussianNoise layer. For example:

model.add(GaussianNoise(0.1))

Noise can be added to a model at the input layer or between hidden layers.
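
What such a noise layer does can be sketched with NumPy: zero-mean Gaussian noise is added to the values passing through during training, and the values pass through unchanged when making predictions. The function name, the seed, and the array of zeros are assumptions made for this sketch.

```python
# sketch: adding zero-mean Gaussian noise to activations during training only
import numpy as np

def add_gaussian_noise(values, stddev, training, rng):
    if not training:
        # noise is a regularizer, so values pass through unchanged at prediction time
        return values
    # draw noise with mean 0 and the given standard deviation, same shape as the input
    return values + rng.normal(loc=0.0, scale=stddev, size=values.shape)

rng = np.random.default_rng(1)  # seeded only to make the sketch repeatable
values = np.zeros((1000, 2))    # hypothetical activations
noisy = add_gaussian_noise(values, 0.1, training=True, rng=rng)
clean = add_gaussian_noise(values, 0.1, training=False, rng=rng)
print(noisy.std(), clean.std())
```

Because fresh noise is drawn on every call, the model never sees exactly the same inputs twice during training, which makes it harder to memorize a small dataset.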

The example below demonstrates a Multilayer Perceptron model with added noise between the hidden layers on a binary classification problem.

# example of adding noise
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import GaussianNoise
from matplotlib import pyplot
# generate dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(GaussianNoise(0.1))
model.add(Dense(1, activation='sigmoid'))
# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot loss learning curves
pyplot.subplot(211)
pyplot.title('Cross-Entropy Loss', pad=-40)
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
# plot accuracy learning curves
pyplot.subplot(212)
pyplot.title('Accuracy', pad=-40)
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()

Your Task

For this lesson, you must run the code example with and without the addition of noise and describe the effect that it has on the learning curves during training.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to reduce overfitting using early stopping.

Lesson 07: Early Stopping

In this lesson, you will discover that stopping the training of a neural network early before it has overfit the training dataset can reduce overfitting and improve the generalization of deep neural networks.

A major challenge in training neural networks is how long to train them.

Too little training will mean that the model will underfit the training and test sets. Too much training will mean that the model will overfit the training dataset and have poor performance on the test set.

A compromise is to train on the training dataset but to stop training at the point when performance on a validation dataset starts to degrade. This simple, effective, and widely used approach to training neural networks is called early stopping.
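
The stopping rule can be sketched in plain Python. This is a simplified illustration and not the Keras implementation: the function name and the validation losses are made up, and the patience value is kept small so the effect is easy to follow.

```python
# simplified sketch of patient early stopping (not the Keras implementation)
def early_stopping_epoch(val_losses, patience):
    best = float('inf')
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # validation loss improved: remember it and reset the counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # stop: no improvement for 'patience' epochs in a row
                return epoch, best_epoch
    # training ran to completion without triggering the stop
    return len(val_losses) - 1, best_epoch

# hypothetical validation losses: improve, then degrade as the model overfits
losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.8, 0.9]
stopped, best_epoch = early_stopping_epoch(losses, patience=3)
print('stopped at epoch %d, best epoch was %d' % (stopped, best_epoch))
```

Note that training stops some epochs after the best validation loss was seen, so the model at the stopping point is not necessarily the best model observed during the run.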

Keras supports early stopping via the EarlyStopping callback, which allows you to specify the metric to monitor during training. For example:

# patient early stopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)

The example below demonstrates a Multilayer Perceptron with early stopping on a binary classification problem that will stop when the validation loss has not improved for 200 training epochs.

# example of early stopping
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
from matplotlib import pyplot
# generate dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# patient early stopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0, callbacks=[es])
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot loss learning curves
pyplot.subplot(211)
pyplot.title('Cross-Entropy Loss', pad=-40)
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
# plot accuracy learning curves
pyplot.subplot(212)
pyplot.title('Accuracy', pad=-40)
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()

Your Task

For this lesson, you must run the code example with and without early stopping and describe the effect it has on the learning curves during training.

Post your answer in the comments below. I would love to see what you discover.

This was your final lesson.

The End!
(Look how far you have come!)

You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

A framework that you can use to systematically diagnose and improve the performance of your deep learning model.
Batch size can be used to control the precision of the estimated error and the speed of learning during training.
A learning rate schedule can be used to fine-tune the model weights during training.
Batch normalization can be used to dramatically accelerate the training process of neural network models.
Weight regularization will penalize models based on the size of the weights and reduce overfitting.
Adding noise will make the model more robust to differences in input and reduce overfitting.
Early stopping will halt the training process at the right time and reduce overfitting.

This is just the beginning of your journey with deep learning performance improvement. Keep practicing and developing your skills.

Take the next step and check out my book on getting better performance with deep learning.

Summary

How did you do with the mini-course?
Did you enjoy this crash course?

Do you have any questions? Were there any sticking points?
Let me know. Leave a comment below.

61 Responses to How to Get Better Deep Learning Results (7-Day Mini-Course)

Michael Adjeisah February 21, 2019 at 2:13 pm #

Hi Jason,
Thanks for the tutorial.
I was trying rlrp = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, min_delta=1E-7, verbose=1)
but got the following error:
TypeError Traceback (most recent call last)
in ()
5 # define learning rate schedule
6 # rlrp = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, epsilon=1E-7, verbose=1)
----> 7 rlrp = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, min_delta=1E-7, verbose=1)
8
9

TypeError: __init__() got an unexpected keyword argument 'min_delta'

Reply
- Jason Brownlee February 22, 2019 at 6:11 am #
  
  Sorry to hear that.
  
  Perhaps confirm that your version of Keras is up to date, e.g. 2.2.4+
  
  Also, here’s more help on the API:
  https://keras.io/callbacks/#reducelronplateau
  
  Reply
- Shamsan April 6, 2019 at 8:21 am #
  
  Very nice tutorial
  
  Reply
jakub February 23, 2019 at 8:53 am #

Hey Jason, this is by far the most enjoyable course I’ve done since taking on ML 2 months ago. I did some algebra 15 years ago and struggled to get started with the topic.
The practical tips in this article along with the code ready to play with allowed me to finally understand the topics.
Thanks!

Reply
- Jason Brownlee February 24, 2019 at 9:01 am #
  
  Thanks, I’m glad it helps!
  
  Reply
Francisco del Valle February 26, 2019 at 1:04 am #

Hi Jason, first of all, thanks for the tutorial.

Everything in the tutorial worked fine and I got reasonable results following it and drawing conclusions at every step, but in the batch normalization step the results are not as expected, because training takes longer with batch normalization than without it.

I ran two tests, one with the default 300 epochs and one with 3000 epochs, and these are the results:

– With batch normalization and 300 epochs -> 15.8s

– Without batch normalization and 300 epochs -> 12.8s

– With batch normalization and 3000 epochs -> 2m 28s

– Without batch normalization and 3000 epochs -> 1m 57s

Do you know why I get these results? Also, the accuracy and loss curves are way smoother on the model without batch normalization.

Thank you and excuse my English.

Reply
- Francisco del Valle February 26, 2019 at 1:06 am #
  
  (Extra info) Those tests ran on an Nvidia GTX 1080 Ti with GPU enabled for Keras.
  
  Reply
- Jason Brownlee February 26, 2019 at 6:26 am #
  
  Yes, it is slower given the increased computation required to standardize the activations.
  
  Well done for noticing!
  
  Reply
Charles Brauer February 26, 2019 at 5:58 am #

Hi Jason,

I would like to try your code on my dataset. It’s 6 columns by 30,000 rows and it fits in memory. Would it be advisable to use a batch size that is the number of rows in the dataset? In other words, why do we need a batch size at all since it fits in memory?

Charles

Reply
- Jason Brownlee February 26, 2019 at 6:30 am #
  
  Often mini batch performs better. Perhaps experiment and see what works best for your combination of model/config/data/lrate/etc.
  
  Reply
hosein March 14, 2019 at 10:47 pm #

Hello, and thanks a lot for sharing your important tricks.
I will start with Keras as soon as possible.

Reply
- Jason Brownlee March 15, 2019 at 5:31 am #
  
  Thanks.
  
  Reply
ali April 21, 2019 at 10:26 pm #

hi jason:

can change a little in one code to use it with Flickr8k because i really cant apply them using

https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/

pls try to help us

Reply
- Jason Brownlee April 22, 2019 at 6:23 am #
  
  Sorry, I don’t follow. What problem are you having exactly?
  
  Reply
ali April 22, 2019 at 1:26 pm #

How can I split the Flickr8k dataset like this:

n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]

Reply
- Jason Brownlee April 22, 2019 at 2:28 pm #
  
  You must split the images and the text separately.
  
  This might help:
  https://machinelearningmastery.com/prepare-photo-caption-dataset-training-deep-learning-model/
  
  Reply
chen Mei May 9, 2019 at 5:04 pm #

What kind of error is this:

File "C:\Users\Chen Mei\Anaconda3\lib\site-packages\matplotlib\artist.py", line 895, in _update_property
raise AttributeError('Unknown property %s' % k)

AttributeError: Unknown property pad

Reply
- Jason Brownlee May 10, 2019 at 8:12 am #
  
  I have not seen this error before, perhaps post to stackoverflow?
  
  Reply
chen Mei May 9, 2019 at 5:39 pm #

My Results against each step:

Batch Size: Train: 0.822, Test: 0.792
Learning Rate Schedule: Train: 0.838, Test: 0.846
Batch Normalization: Train: 0.836, Test: 0.858
Weight Regularization: Train: 0.967, Test: 0.814
Adding Noise: Train: 0.967, Test: 0.771
Early Stopping: Train: 0.967, Test: 0.829

Reply
- Jason Brownlee May 10, 2019 at 8:14 am #
  
  Well done!
  
  Reply
marcello September 29, 2019 at 10:20 pm #

Lesson 01:

1) Learning problems from small training dataset or imperfect data set

2) Generalization from polarized data set or incomplete data set

3) Predictions from training set composed of related elements or the input data should be randomized

Reply
- Jason Brownlee September 30, 2019 at 6:10 am #
  
  Nice work!
  
  Reply
Marcello September 30, 2019 at 3:41 am #

I use not exactly the same code but something that one of my colleagues has adapted with different amount of epochs:

Batch Size: 0.843, 0.734
Learning Rate Schedule: 0.898, 0.896
Batch Normalization: 0.816, 0.838
Weight Regularization: 0.997, 0.844
Adding Noise: 0.978, 0.751
Early Stopping: 0.977, 0.849

we use GTX 1080

Reply
- Jason Brownlee September 30, 2019 at 6:18 am #
  
  Nice work!
  
  Reply
Lai April 20, 2020 at 5:57 pm #

I’ve tried the steps using a dataset with 9 features and 12,000 observations for a binary classification problem. However, my results do not seem that good. May I know how to improve them?

Batch Size: Train acc: 61.56, Test acc: 60.38, Train loss: 65.32, Test loss: 67.01
Learning Rate Schedule: Train acc: 62.63, Test acc: 60.9, Train loss: 64.53, Test loss: 66.86
Batch Normalization: Train acc: 64.75, Test acc: 61.47, Train loss: 62.61, Test loss: 67.26
Weight Regularization: Train acc: 63.32, Test acc: 61.47, Train loss: 64.1, Test loss: 67.22
Adding Noise: Train acc: 80.34, Test acc: 62.23, Train loss: 41.38, Test loss: 90.38
Early Stopping: Train acc: 71.4, Test acc: 60.04, Train loss: 53.26, Test loss: 75.99

Reply
- Jason Brownlee April 21, 2020 at 5:50 am #
  
  Perhaps try changing the model.
  Perhaps try changing the learning algorithm.
  Perhaps try transforming the data.
  …
  
  More ideas here:
  https://machinelearningmastery.com/improve-deep-learning-performance/
  
  Reply
gecso75 May 5, 2020 at 1:09 am #

Regarding batch_size training:
1. In general, accuracy flattens while the corresponding loss is still decreasing: I think this is because accuracy is categorical (it tops out at the number of all predictions) whereas loss is continuous. So seeing a still-decreasing loss curve does not necessarily mean the model is still learning.
2. From the computation capacity side: fewer epochs does not necessarily mean fewer computations. I would rather use the number of back-propagations as the unit of measurement.
3. After doing several experiments, I see the optimal batch_size is the minimum batch with which the model learns steadily (least computations), but I would not be afraid of using a larger batch_size; still, the number of back-propagations is what really matters. Everything should fit in memory, of course.
Regarding lesson 3:
1. Surprised how the plateau flattens the learning curve even for batch_size=4 (of course, momentum is also used).
2. batch_size=16 still consumes fewer computations (back-propagations).

Reply
- Jason Brownlee May 5, 2020 at 6:32 am #
  
  Thanks for sharing.
  
  Reply
Philip Ching September 8, 2020 at 12:52 pm #

Hi Jason,
I am in Day 3: Learning Rate Schedule. The task says “I must run the code example with and without learning rate schedule …” So I ran the code:

1) with learning rate schedule
….
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=300, verbose=0, callbacks=[rlrp])
….

This gives me the result:
…
Epoch 00292: ReduceLROnPlateau reducing learning rate to 9.999999682655225e-22.

Epoch 00297: ReduceLROnPlateau reducing learning rate to 9.999999682655225e-23.
Train: 0.828, Test: 0.852

2) with no learning rate schedule
….
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=300, verbose=0)
….

This gives me the result:
…
Epoch 00292: ReduceLROnPlateau reducing learning rate to 9.999999682655225e-22.
….
2020-09-07 21:15:12.704822: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Train: 0.824, Test: 0.848

I see the two results appear about the same. Am I on the right track?

Thank you. Philip Ching

Reply
- Philip Ching September 8, 2020 at 12:58 pm #
  
  Note: In 2) the following line should not be there. It was my typo.
  
  Epoch 00292: ReduceLROnPlateau reducing learning rate to 9.999999682655225e-22.
  
  Reply
  - Jason Brownlee September 8, 2020 at 1:38 pm #
    
    Thanks.
    
    Reply
- Jason Brownlee September 8, 2020 at 1:37 pm #
  
  Nice work!
  
  Yes, you are on track.
  
  Reply
Abhi Bhagat September 9, 2020 at 11:34 am #

There were 2 types of problems defined in:
“Lesson 01: Better Deep Learning Framework”
– Problems with Generalization
– Problems with Predictions.

Generalization refers to test set performance.

1) So what’s the difference between it and predictions?
AND
2) What exactly is “variance in the performance of a final model”
(as defined in: Better Predictions. Techniques that reduce the variance in the performance of a final model.)?

Reply
- Jason Brownlee September 9, 2020 at 1:34 pm #
  
  Generalization is how well the model performs on new data, e.g. “performance”. A prediction is a single output from the model. Model performance is estimated from predictions on new data.
  
  Variance is the spread in predictions made by the model or the spread in performance of the model when evaluated on new data, related to the variance of the model – how sensitive it is to training data or the stochastic nature of the learning algorithm.
  
  Reply
  - Abhi Bhagat September 9, 2020 at 4:24 pm #
    
    So if a model has low variance in the final model performance,
    will it have a smooth accuracy graph (less spikes on the test vs epoch graph)?
    
    Reply
    - Jason Brownlee September 10, 2020 at 6:23 am #
      
      No, model variance is not represented on the learning curve.
      
      Reply
Abhi Bhagat September 9, 2020 at 4:14 pm #

From the BATCH-SIZE case study I learned that:
1.
Batch GD
takes a very small training time
for this dataset Batch GD gives an under-fit
the loss graph tells me that the model could be trained better
2.
Stochastic GD
takes a much larger training time
the train and test losses and accuracies look like a ‘good-fit’
but contain extremely high fluctuations
(noise in the error gradient)
3.
Mini-Batch GD
takes a normal training time.
for my batch size of 25 it was overfitting the data
(train loss quickly decreases and converges)
(test loss starts increasing after an early decrease)
(test accuracy more than train)

Reply
- Jason Brownlee September 10, 2020 at 6:22 am #
  
  Nice summary!
  
  Reply
Abhi Bhagat September 9, 2020 at 4:58 pm #

from LR case-study i learned that :
1.
No RLRP
loss takes a lot of time to converge
graphs “looks” like a good-fit
2.
RLRP
loss quickly converges and becomes constant
LR was changed by the rlrp on epoch 6,11,16,21,26…
the accuracy, loss graph becomes constant at epoch 25
measuring the lr change at epoch 26: 1.0000000195414814e-26.
best LR : 1.0000000195414814e-26.

Reply
- Jason Brownlee September 10, 2020 at 6:24 am #
  
  Well done!
  
  Reply
Abhi Bhagat September 9, 2020 at 11:41 pm #

from this Batch Norm case-study i learned:
without BN
Accuracy curves are fine
Train,Test Loss takes lots of epoch to decrease (slow training)

BN
Test Accuracy is better than train
lot of fluctuations (spikes in data)
Train,Test Loss drops quickly then converges (good-fit)
BN greatly speeds up the training process by 150 epochs

from this Weight Regularization case-study i learned:
without WR
Train loss is fine, but Test loss increases (highly over-fitting)
Test Accuracy decreases over epochs

WR
Test Accuracy is better than train
Train loss is fine, but Test loss decreases but is very high compared to train loss (less over-fitting than before)
Test Accuracy slightly increases over epochs

from this Gaussian Noise case-study i learned:
without GN
Train loss is fine, but Test loss increases (highly over-fitting)
Test Accuracy decreases over epochs

GN
too many spikes due to GN addition
on closely comparing the graphs
GN very slightly reduces over-fitting

Reply
- Jason Brownlee September 10, 2020 at 6:30 am #
  
  Well done!
  
  Reply
Walter Dini February 4, 2021 at 1:10 pm #

Lesson 2: Batch Size Results
Batch Train data size = 500 is fine
Stochastic Batch size = 1 , very slow and training is not good
MiniBatch Batch size = 250 I think is the better of the three, is faster and better training results.
MiniBatch values Train = 0.838, Test = 0.842

Reply
- Jason Brownlee February 4, 2021 at 1:40 pm #
  
  Well done!
  
  Reply
Walter Dini February 4, 2021 at 1:33 pm #

Lesson 3: Learning Rate Schedule

With RLRP
Train: 0.826, Test: 0.854
Without RLRP
Train: 0.828, Test: 0.856
Very small difference in the training

Reply
Walter Dini February 4, 2021 at 1:44 pm #

Lesson 4: Batch Normalization

Maybe I don’t understand this function, but with Batch Normalization I get a lot of noise on the plot and the training accuracy of the model is lower than without it.
With Batch Normalization
Train = 0.816, Test = 0.844
Without Batch Normalization
Train = 0.840 Test = 0.852

Reply
- Jason Brownlee February 5, 2021 at 5:32 am #
  
  Well done!
  
  Reply
Walter Dini February 4, 2021 at 2:00 pm #

Lesson 5: Weight Regularization

The training values are very similar, but the time difference to run both is huge.
With Weight Regularization
Train: 0.967, Test: 0.800
Without Weight Regularization
Train: 1.000, Test: 0.771

Reply
- Jason Brownlee February 5, 2021 at 5:32 am #
  
  Great work!
  
  Reply
Souleymane Sow February 18, 2021 at 6:48 am #

Hello Jason, your courses are great as always!!
However, I just noticed something about the size of the dataset used:

– when the dataset size is (1000, 2), you use a hidden layer of 50 neurons (for example model.add(Dense(50, input_dim=2, activation='relu')))

– when you take a dataset of size (100, 2), you use a hidden layer of 500 neurons (for example model.add(Dense(500, input_dim=2, activation='relu'))).

Is there an explanation for this choice of the number of neurons in the hidden layer?
Does the number of neurons in the hidden layer depend on the size of our dataset?
If so, can you give me a technique for choosing the optimal number of neurons to put in the hidden layer?

Thank you!!

Reply
- Jason Brownlee February 18, 2021 at 8:09 am #
  
  Thanks!
  
  No, the number of nodes is chosen after a little trial and error, more details here:
  https://machinelearningmastery.com/faq/single-faq/how-many-layers-and-nodes-do-i-need-in-my-neural-network
  
  Reply
Souleymane Sow February 20, 2021 at 11:41 am #

Thanks for your reply!!

Reply
- Jason Brownlee February 20, 2021 at 1:18 pm #
  
  You’re welcome.
  
  Reply
Nitin May 8, 2021 at 1:18 pm #

Lesson 01:

Better training: well-defined dataset with minimum noise, number of training samples, learning rate.

Better generalization: increasing/decreasing the number of layers (complexity), dropouts.

Better predictions: data from the same distribution, batch normalization.

Thank you for the quality content!

Reply
- Jason Brownlee May 9, 2021 at 5:52 am #
  
  Well done!
  
  Reply
Nitin May 8, 2021 at 3:00 pm #

Lesson 02: Batch Size

Hyperparameters: learning_rate=0.01, epochs=1000, train_samples=500.

Batch gradient descent: We calculate the error function over all the training samples. The loss is decreasing but requires more epochs, as we only compute the error once per epoch. So we are changing the weights only 1000 times. Since we update the weights after the whole training set, the weights are changed by considering all the training samples, so the curves are smooth. If we increase the number of epochs, we will get better results.

Stochastic gradient descent: We update weights after every training sample. So, with 500 training samples and 1000 epochs, there are 500*1000 weight updates. The curves seem noisy because we update the weights for one training sample in a way that may not be suitable for the next training sample. Since weight updates are frequent, the time taken to train the model is high.

Mini-batch gradient descent: We update weights after each mini-batch. It has the advantages of both the previous methods. We update weights (somewhat) frequently. The curve is smoother than with stochastic gradient descent because the weights are trained not on 1 sample but on a mini-batch (usually 2^n samples). The training is quicker and the curves are also smoother.
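
The weight-update counts discussed above can be checked with simple arithmetic. The sketch below is a hedged illustration: it assumes the 500 training samples and 1000 epochs quoted in this comment, a hypothetical mini-batch size of 32, and that a final partial batch still triggers one update. The helper name weight_updates is made up for the example and is not part of Keras.

```python
# count weight updates for different batch sizes
# (illustrative helper, not a Keras function)
from math import ceil

def weight_updates(n_samples, batch_size, epochs):
    """Total number of weight updates over a full training run."""
    # one update per batch; a final partial batch is assumed to count as one update
    updates_per_epoch = ceil(n_samples / batch_size)
    return updates_per_epoch * epochs

n_samples, epochs = 500, 1000
# batch size 32 is an assumed value chosen only for illustration
for name, batch_size in [('batch', 500), ('stochastic', 1), ('mini-batch', 32)]:
    print(name, weight_updates(n_samples, batch_size, epochs))
```

Under these assumptions, batch gradient descent makes 1,000 updates, stochastic gradient descent makes 500,000, and the mini-batch run makes 16,000, which is why the three configurations differ so much in training time.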

Reply
- Jason Brownlee May 9, 2021 at 5:52 am #
  
  Great work!
  
  Reply
- siegfried Vanaverbeke March 6, 2022 at 3:39 am #
  
  lesson 2:
  
  Train: 0.834, Test: 0.816 batch gradient descent
  Train: 0.816, Test: 0.798 stochastic gradient descent
  Train: 0.844, Test: 0.850 minibatch size=100
  Train: 0.842, Test: 0.846 minibatch size=250
  Train: 0.814, Test: 0.834 minibatch size=500
  Train: 0.828, Test: 0.856 batch gradient descent, n_epochs=5000
  
  I see the same behaviour that Nitin observes: with stochastic gradient descent, the number of weight updates is far bigger and the algorithm slows down, but the accuracy on both the train and validation sets quickly converges to a noisy saturation level. Introducing batches of various sizes speeds up the training process, smooths the curves and leads to a gradual convergence toward the saturation level around 0.8. However, varying the batch size alone does not lead to accuracies exceeding 90% even with 5000 epochs and full gradient descent.
  
  Reply
CY July 29, 2021 at 3:03 pm #

Hi Jason,

Not sure if this is right but:

Lesson 1:

Better training: train with more datasets, variety input data, make the order of the input data to be random or jumbled up

Better generalization: add more hidden layers, train more times (increase epochs?), rescaling data

better predictions: not really sure but what I know is the model takes the probability and using the number as an index to access an array for classification as the output.

Reply
- Jason Brownlee July 30, 2021 at 6:26 am #
  
  Well done!
  
  Reply
Stephen Fickas March 10, 2022 at 6:26 am #

Learning Rate Schedule

I thought optimizers like Adam focused on adjusting the learning rate for you. Does Adam do something different from ReduceLROnPlateau?

Reply
