The post Using Normalization Layers to Improve Deep Learning Models appeared first on Machine Learning Mastery.

Also, thinking about it, in neural networks, the output of each layer serves as the input to the next layer, so a natural question to ask is: If normalizing inputs to the model helps improve model performance, does standardizing the inputs into each layer help to improve model performance too?

The answer most of the time is yes! However, unlike normalizing our inputs to the model as a whole, it is slightly more complicated to normalize the inputs to intermediate layers as the activations are constantly changing. As such, it is infeasible, or at least, computationally expensive to continuously compute statistics over the entire train set over and over again. In this article, we’ll be exploring normalization layers to normalize your inputs to your model as well as batch normalization, a technique to standardize the inputs into each layer across batches.

Let’s get started!

This tutorial is split into 5 parts; they are:

- What is normalization and why is it helpful?
- Using Normalization layer in TensorFlow
- What is batch normalization and why should we use it?
- Batch normalization: Under the hood
- Normalization and Batch Normalization in Action

Normalizing a set of data transforms the set of data to be on a similar scale. For machine learning models, our goal is usually to recenter and rescale our data such that it is between 0 and 1 or -1 and 1, depending on the data itself. One common way to accomplish this is to calculate the mean and the standard deviation of the data and transform each sample by subtracting the mean and dividing by the standard deviation. This works well if we assume that the data follows a normal distribution, as this method standardizes the data to a standard normal distribution.

Normalization can help training of our neural networks as the different features are on a similar scale, which helps to stabilize the gradient descent step, allowing us to use larger learning rates or help models converge faster for a given learning rate.
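As a minimal sketch of the standardization described above, the following NumPy snippet rescales a toy dataset feature by feature. The data values and variable names are made up for illustration only.

```python
import numpy as np

# Toy dataset: 4 samples, 2 features on very different scales (made-up values)
data = np.array([[1.0, 100.0],
                 [2.0, 200.0],
                 [3.0, 300.0],
                 [4.0, 400.0]])

# Per-feature statistics, computed over the sample axis
mean = data.mean(axis=0)
std = data.std(axis=0)

# Standardize: subtract the mean, then divide by the standard deviation
standardized = (data - mean) / std

print(standardized.mean(axis=0))  # approximately 0 for each feature
print(standardized.std(axis=0))   # approximately 1 for each feature
```

After the transformation both features share the same scale, even though the raw values differ by a factor of 100.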

To normalize inputs in TensorFlow, we can use the Normalization layer in Keras. First, let’s define some sample data:

    import numpy as np

    sample1 = np.array([
        [1, 1, 1],
        [1, 1, 1],
        [1, 1, 1]
    ], dtype=np.float32)

    sample2 = np.array([
        [2, 2, 2],
        [2, 2, 2],
        [2, 2, 2]
    ], dtype=np.float32)

    sample3 = np.array([
        [3, 3, 3],
        [3, 3, 3],
        [3, 3, 3]
    ], dtype=np.float32)

Then we initialize our Normalization layer.

    import tensorflow as tf
    from tensorflow.keras.layers import Normalization

    normalization_layer = Normalization()

And then to get the mean and standard deviation of the dataset and set our Normalization layer to use those parameters, we can call the `Normalization.adapt()` method on our data.

    combined_batch = tf.constant(np.expand_dims(np.stack([sample1, sample2, sample3]), axis=-1), dtype=tf.float32)

    normalization_layer = Normalization()
    normalization_layer.adapt(combined_batch)

For this case, we used `expand_dims` to add an extra dimension, as the Normalization layer normalizes along the last dimension by default (each index in the last dimension gets its own mean and variance parameters computed on the training set). That dimension is assumed to be the feature dimension, which for RGB images is usually just the different color channels.

And then to normalize our data, we can call normalization layer on that data, as such:

normalization_layer(sample1)

which gives the output

    <tf.Tensor: shape=(1, 1, 3, 3), dtype=float32, numpy=
    array([[[[-1.2247449, -1.2247449, -1.2247449],
             [-1.2247449, -1.2247449, -1.2247449],
             [-1.2247449, -1.2247449, -1.2247449]]]], dtype=float32)>

And we can verify that this is the expected behavior by running `np.mean` and `np.std` on our original data, which gives us a mean of 2.0 and a standard deviation of 0.8165. With the input value of $$1$$, we have $$(1-2)/0.8165 = -1.2247$$.
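The check can be sketched in NumPy alone. The snippet below rebuilds the three constant samples (the names `samples` and `normalized_sample1` are introduced here for illustration) and reproduces the value returned by the Normalization layer.

```python
import numpy as np

# The same three constant 3x3 samples as above, stacked into one array
samples = np.stack([np.full((3, 3), v, dtype=np.float32) for v in (1, 2, 3)])

mean = np.mean(samples)  # 2.0
std = np.std(samples)    # about 0.8165

# Every entry of sample1 is 1, so each normalized value is (1 - 2) / 0.8165
normalized_sample1 = (samples[0] - mean) / std
print(mean, std)
print(normalized_sample1)  # every entry is about -1.2247
```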

Now that we’ve seen how to normalize our inputs, let’s take a look at another normalization method, batch normalization.

From the name, you can probably guess that batch normalization must have something to do with batches during training. Simply put, batch normalization standardizes the input of a layer across a single batch.

You might be thinking, why can’t we just calculate the mean and variance at a given layer and normalize it that way? The problem comes when we train our model as the parameters change during training, hence activations in the intermediate layers are constantly changing and calculating mean and variance across the entire training set for each iteration would be time consuming and potentially pointless since the activations are going to change at each iteration anyway. That’s where batch normalization comes in.

Introduced in “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” by Ioffe and Szegedy, batch normalization looks at standardizing the inputs to a layer in order to reduce the problem of internal covariate shift. In the paper, internal covariate shift is defined as the problem of “the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change.”

The idea of batch normalization fixing the problem of internal covariate shift has been disputed, notably in “How Does Batch Normalization Help Optimization?” by Santurkar, et al., where it was proposed that batch normalization helps to smooth the loss function over the parameter space instead. While it might not always be clear how batch normalization does it, it has achieved good empirical results on many different problems and models.

There is also some evidence that batch normalization can contribute significantly to addressing the vanishing gradient problem common with deep learning models. In the original ResNet paper, He, et al. mention in their analysis of ResNet vs plain networks that “backward propagated gradients exhibit healthy norms with BN (batch normalization)” even in plain networks.

It has also been suggested that batch normalization has other benefits as well such as allowing us to use higher learning rates as batch normalization can help to stabilize parameter growth. It can also help to regularize the model. From the original batch normalization paper,

“When training with Batch Normalization, a training example is seen in conjunction with other examples in the mini-batch, and the training network no longer producing deterministic values for a given training example. In our experiments, we found this effect to be advantageous to the generalization of the network.”

So, what does batch normalization actually do?

First, we need to calculate batch statistics, in particular, the mean and variance for each of the different activations across a batch. Since each layer’s output serves as an input into the next layer in a neural network, by standardizing the output of the layers, we are also standardizing the inputs to the next layer in our model (though in practice, it was suggested in the original paper to implement batch normalization before the activation function, however there’s some debate over this).

So, we calculate the mean and variance of each activation over the batch.

Then, for each of the activation maps, we normalize each value using the respective statistics.
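Written out, the two steps take the following form. This is a sketch assuming the standard definitions from the batch normalization paper; $$x_i$$ (the $$n$$ values of one activation across the batch) and the small constant $$\epsilon$$ are notation introduced here, while $$\hat\mu$$ and $$s^2$$ are the batch mean and variance:

$$
\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat\mu)^2
$$

$$
\hat{x}_i = \frac{x_i - \hat\mu}{\sqrt{s^2 + \epsilon}}
$$

The constant $$\epsilon$$ only guards against division by zero when the variance is very small.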

For Convolutional Neural Networks (CNNs) in particular, we calculate these statistics over all locations of the same channel. Hence there will be one $$\hat\mu$$ and $$s^2$$ for each channel, which will be applied to all pixels of the same channel in each sample in the same batch. From the original batch normalization paper,

“For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way”
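A rough NumPy sketch of the per-channel statistics might look as follows. The channels-last layout `(batch, height, width, channels)` and the random feature maps are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical batch of feature maps laid out as (batch, height, width, channels)
feature_maps = rng.normal(loc=5.0, scale=3.0, size=(8, 4, 4, 3))

# One mean and one variance per channel: reduce over batch, height and width
channel_mean = feature_maps.mean(axis=(0, 1, 2))
channel_var = feature_maps.var(axis=(0, 1, 2))

# Broadcasting applies each channel's statistics to every pixel of that channel
normalized = (feature_maps - channel_mean) / np.sqrt(channel_var)

print(channel_mean.shape)               # (3,)
print(normalized.mean(axis=(0, 1, 2)))  # approximately 0 per channel
print(normalized.var(axis=(0, 1, 2)))   # approximately 1 per channel
```

Because the reduction runs over the batch and both spatial axes, every location in a channel is normalized in the same way, which is the property the quote describes.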

Now that we’ve seen how to calculate the normalized activation maps, let’s explore how this can be implemented using Numpy arrays.

Suppose we had these activation maps with all of them representing a single channel,

    import numpy as np

    activation_map_sample1 = np.array([
        [1, 1, 1],
        [1, 1, 1],
        [1, 1, 1]
    ], dtype=np.float32)

    activation_map_sample2 = np.array([
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ], dtype=np.float32)

    activation_map_sample3 = np.array([
        [9, 8, 7],
        [6, 5, 4],
        [3, 2, 1]
    ], dtype=np.float32)

Then, we want to standardize each element in the activation map across all locations and across the different samples. To standardize, we compute their mean and standard deviation using

    # get mean across all locations and all samples in the batch (one value for the channel)
    activation_mean_bn = np.mean([activation_map_sample1, activation_map_sample2, activation_map_sample3])

    # get standard deviation across all locations and all samples in the batch
    activation_std_bn = np.std([activation_map_sample1, activation_map_sample2, activation_map_sample3])

    print(activation_mean_bn)
    print(activation_std_bn)

which outputs

    3.6666667
    2.8284268

Then, we can standardize an activation map by doing

    # get batch normalized activation maps for each sample
    activation_map_sample1_bn = (activation_map_sample1 - activation_mean_bn) / activation_std_bn
    activation_map_sample2_bn = (activation_map_sample2 - activation_mean_bn) / activation_std_bn
    activation_map_sample3_bn = (activation_map_sample3 - activation_mean_bn) / activation_std_bn

and these store the outputs

    activation_map_sample1_bn:
    [[-0.94280916 -0.94280916 -0.94280916]
     [-0.94280916 -0.94280916 -0.94280916]
     [-0.94280916 -0.94280916 -0.94280916]]

    activation_map_sample2_bn:
    [[-0.94280916 -0.58925575 -0.2357023 ]
     [ 0.11785112  0.47140455  0.82495797]
     [ 1.1785114   1.5320647   1.8856182 ]]

    activation_map_sample3_bn:
    [[ 1.8856182   1.5320647   1.1785114 ]
     [ 0.82495797  0.47140455  0.11785112]
     [-0.2357023  -0.58925575 -0.94280916]]

But we hit a snag when it comes to inference time. We might not have batches of examples at inference time, and even if we did, it would still be preferable for the output to be computed deterministically from the input. So, we need a fixed set of parameters to use at inference time. For this purpose, we keep a moving average of the batch means and variances during training and use these stored values at inference time to compute the outputs of the layers.

However, simply standardizing the inputs to a layer in this way also changes the representational ability of the layer. One example brought up in the batch normalization paper is the sigmoid nonlinearity, where normalizing the inputs would constrain them to the linear regime of the sigmoid function. To address this, a linear transformation is added after the normalization to scale and recenter the values, with 2 trainable parameters that learn the appropriate scale and center.
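To make these pieces concrete, here is a toy NumPy sketch of a batch normalization layer for inputs of shape `(batch, features)`. It is an illustration under simplifying assumptions, not the Keras implementation: the class name `SimpleBatchNorm` is made up, and the scale `gamma` and center `beta` are kept at their initial values instead of being trained.

```python
import numpy as np

class SimpleBatchNorm:
    """Toy batch normalization for inputs of shape (batch, features)."""

    def __init__(self, num_features, momentum=0.99, epsilon=1e-3):
        self.gamma = np.ones(num_features)   # scale (trainable in a real layer)
        self.beta = np.zeros(num_features)   # center (trainable in a real layer)
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)
        self.momentum = momentum
        self.epsilon = epsilon

    def __call__(self, x, training):
        if training:
            # Use the statistics of the current batch
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Update the moving averages that inference will rely on
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # Use the stored statistics, so the output is deterministic
            mean, var = self.moving_mean, self.moving_var
        x_hat = (x - mean) / np.sqrt(var + self.epsilon)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = SimpleBatchNorm(num_features=2, momentum=0.9)
for _ in range(200):
    batch = rng.normal(loc=[3.0, -1.0], scale=[2.0, 0.5], size=(64, 2))
    bn(batch, training=True)

single = np.array([[3.0, -1.0]])
out1 = bn(single, training=False)
out2 = bn(single, training=False)
print(bn.moving_mean)  # close to the true feature means [3, -1]
print(out1)            # the same result every time for the same input
```

In inference mode a single example can be processed on its own, and the result does not depend on which other examples happen to be in the batch.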

Now that we understand what goes on with batch normalization under the hood, let’s see how we can use Keras’ batch normalization layer as part of our deep learning models.

To implement batch normalization as part of our deep learning models in TensorFlow, we can use the `keras.layers.BatchNormalization` layer. Using the NumPy arrays from our previous example, we can apply BatchNormalization to them.

    import tensorflow as tf
    import tensorflow.keras as keras
    from tensorflow.keras.layers import BatchNormalization
    import numpy as np

    # expand dims to create the channels
    activation_maps = tf.constant(np.expand_dims(np.stack([
        activation_map_sample1,
        activation_map_sample2,
        activation_map_sample3
    ]), axis=0), dtype=tf.float32)

    print(f"activation_maps: \n{activation_maps}\n")
    print(BatchNormalization(axis=0)(activation_maps, training=True))

which gives us the output

    activation_maps:
    [[[[1. 1. 1.]
       [1. 1. 1.]
       [1. 1. 1.]]

      [[1. 2. 3.]
       [4. 5. 6.]
       [7. 8. 9.]]

      [[9. 8. 7.]
       [6. 5. 4.]
       [3. 2. 1.]]]]

    tf.Tensor(
    [[[[-0.9427501  -0.9427501  -0.9427501 ]
       [-0.9427501  -0.9427501  -0.9427501 ]
       [-0.9427501  -0.9427501  -0.9427501 ]]

      [[-0.9427501  -0.5892188  -0.2356875 ]
       [ 0.11784375  0.471375    0.82490635]
       [ 1.1784375   1.5319688   1.8855002 ]]

      [[ 1.8855002   1.5319688   1.1784375 ]
       [ 0.82490635  0.471375    0.11784375]
       [-0.2356875  -0.5892188  -0.9427501 ]]]], shape=(1, 3, 3, 3), dtype=float32)

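By default, the BatchNormalization layer uses a scale of 1 and center of 0 for the linear layer, hence these values are similar to the values that we computed earlier using NumPy functions. The small difference comes from the `epsilon` term (0.001 by default) that Keras adds to the variance to avoid division by zero.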
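The gap between the two sets of numbers can be accounted for in plain NumPy. The sketch below assumes the default `epsilon` of 0.001 in `keras.layers.BatchNormalization`; the variable names are introduced here for illustration.

```python
import numpy as np

# The same three single-channel activation maps as in the example above
maps = np.stack([
    np.ones((3, 3)),
    np.arange(1, 10).reshape(3, 3),
    np.arange(9, 0, -1).reshape(3, 3),
]).astype(np.float32)

mean, var = maps.mean(), maps.var()
epsilon = 1e-3  # default epsilon of keras.layers.BatchNormalization

without_eps = (maps - mean) / np.sqrt(var)
with_eps = (maps - mean) / np.sqrt(var + epsilon)

print(without_eps[0, 0, 0])  # about -0.94281, as in the NumPy example
print(with_eps[0, 0, 0])     # about -0.94275, as in the Keras output
```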

Now that we’ve seen how to implement the normalization and batch normalization layers in Tensorflow, let’s explore a LeNet-5 model that uses the normalization and batch normalization layers, as well as compare it to a model that does not use either of these layers.

First, let’s get our dataset. We’ll use CIFAR-10 for this example.

(trainX, trainY), (testX, testY) = keras.datasets.cifar10.load_data()

Using a LeNet-5 model with ReLU activation,

    from tensorflow.keras.layers import Dense, Input, Flatten, Conv2D, MaxPool2D
    from tensorflow.keras.models import Model
    import tensorflow as tf

    class LeNet5(tf.keras.Model):
        def __init__(self):
            super(LeNet5, self).__init__()
            # create the layers once here, so that the model tracks their weights
            self.conv1 = Conv2D(filters=6, kernel_size=(5,5), padding="same", activation="relu")
            self.max_pool2x2 = MaxPool2D(pool_size=(2,2))
            self.conv2 = Conv2D(filters=16, kernel_size=(5,5), padding="same", activation="relu")
            self.flatten = Flatten()
            self.fc1 = Dense(units=120, activation="relu")
            self.fc2 = Dense(units=84, activation="relu")
            self.fc3 = Dense(units=10, activation="sigmoid")

        def call(self, input_tensor):
            conv1 = self.conv1(input_tensor)
            maxpool1 = self.max_pool2x2(conv1)
            conv2 = self.conv2(maxpool1)
            maxpool2 = self.max_pool2x2(conv2)
            flatten = self.flatten(maxpool2)
            fc1 = self.fc1(flatten)
            fc2 = self.fc2(fc1)
            fc3 = self.fc3(fc2)
            return fc3

    input_layer = Input(shape=(32,32,3,))
    x = LeNet5()(input_layer)
    model = Model(inputs=input_layer, outputs=x)
    model.compile(optimizer="adam", loss=tf.keras.losses.SparseCategoricalCrossentropy(), metrics=["acc"])
    history = model.fit(x=trainX, y=trainY, batch_size=256, epochs=10, validation_data=(testX, testY))

Training the model gives us the output,

    Epoch 1/10
    196/196 [==============================] - 14s 15ms/step - loss: 3.8905 - acc: 0.2172 - val_loss: 1.9656 - val_acc: 0.2853
    Epoch 2/10
    196/196 [==============================] - 2s 12ms/step - loss: 1.8402 - acc: 0.3375 - val_loss: 1.7654 - val_acc: 0.3678
    Epoch 3/10
    196/196 [==============================] - 2s 12ms/step - loss: 1.6778 - acc: 0.3986 - val_loss: 1.6484 - val_acc: 0.4039
    Epoch 4/10
    196/196 [==============================] - 2s 12ms/step - loss: 1.5663 - acc: 0.4355 - val_loss: 1.5644 - val_acc: 0.4380
    Epoch 5/10
    196/196 [==============================] - 2s 12ms/step - loss: 1.4815 - acc: 0.4712 - val_loss: 1.5357 - val_acc: 0.4472
    Epoch 6/10
    196/196 [==============================] - 2s 12ms/step - loss: 1.4053 - acc: 0.4975 - val_loss: 1.4883 - val_acc: 0.4675
    Epoch 7/10
    196/196 [==============================] - 2s 12ms/step - loss: 1.3300 - acc: 0.5262 - val_loss: 1.4643 - val_acc: 0.4805
    Epoch 8/10
    196/196 [==============================] - 2s 12ms/step - loss: 1.2595 - acc: 0.5531 - val_loss: 1.4685 - val_acc: 0.4866
    Epoch 9/10
    196/196 [==============================] - 2s 12ms/step - loss: 1.1999 - acc: 0.5752 - val_loss: 1.4302 - val_acc: 0.5026
    Epoch 10/10
    196/196 [==============================] - 2s 12ms/step - loss: 1.1370 - acc: 0.5979 - val_loss: 1.4441 - val_acc: 0.5009

Next, let’s take a look at what happens if we add normalization and batch normalization layers. Here, each batch normalization layer is placed between a convolutional or dense layer and its ReLU activation. Amending our LeNet-5 model,

    class LeNet5_Norm(tf.keras.Model):
        def __init__(self, norm_layer, *args, **kwargs):
            super(LeNet5_Norm, self).__init__()
            self.conv1 = Conv2D(filters=6, kernel_size=(5,5), padding="same")
            self.norm1 = norm_layer(*args, **kwargs)
            self.relu = tf.nn.relu  # ReLU is applied separately, after normalization
            self.max_pool2x2 = MaxPool2D(pool_size=(2,2))
            self.conv2 = Conv2D(filters=16, kernel_size=(5,5), padding="same")
            self.norm2 = norm_layer(*args, **kwargs)
            self.flatten = Flatten()
            self.fc1 = Dense(units=120)
            self.norm3 = norm_layer(*args, **kwargs)
            self.fc2 = Dense(units=84)
            self.norm4 = norm_layer(*args, **kwargs)
            self.fc3 = Dense(units=10, activation="softmax")

        def call(self, input_tensor):
            conv1 = self.conv1(input_tensor)
            conv1 = self.norm1(conv1)
            conv1 = self.relu(conv1)
            maxpool1 = self.max_pool2x2(conv1)
            conv2 = self.conv2(maxpool1)
            conv2 = self.norm2(conv2)
            conv2 = self.relu(conv2)
            maxpool2 = self.max_pool2x2(conv2)
            flatten = self.flatten(maxpool2)
            fc1 = self.fc1(flatten)
            fc1 = self.norm3(fc1)
            fc1 = self.relu(fc1)
            fc2 = self.fc2(fc1)
            fc2 = self.norm4(fc2)
            fc2 = self.relu(fc2)
            fc3 = self.fc3(fc2)
            return fc3

And running the training again, this time with the normalization layer added.

    normalization_layer = Normalization()
    normalization_layer.adapt(trainX)

    input_layer = Input(shape=(32,32,3,))
    x = LeNet5_Norm(BatchNormalization)(normalization_layer(input_layer))
    model = Model(inputs=input_layer, outputs=x)
    model.compile(optimizer="adam", loss=tf.keras.losses.SparseCategoricalCrossentropy(), metrics=["acc"])
    history = model.fit(x=trainX, y=trainY, batch_size=256, epochs=10, validation_data=(testX, testY))

And we see that the model converges faster and gets a higher validation accuracy.

    Epoch 1/10
    196/196 [==============================] - 5s 17ms/step - loss: 1.4643 - acc: 0.4791 - val_loss: 1.3837 - val_acc: 0.5054
    Epoch 2/10
    196/196 [==============================] - 3s 14ms/step - loss: 1.1171 - acc: 0.6041 - val_loss: 1.2150 - val_acc: 0.5683
    Epoch 3/10
    196/196 [==============================] - 3s 14ms/step - loss: 0.9627 - acc: 0.6606 - val_loss: 1.1038 - val_acc: 0.6086
    Epoch 4/10
    196/196 [==============================] - 3s 14ms/step - loss: 0.8560 - acc: 0.7003 - val_loss: 1.0976 - val_acc: 0.6229
    Epoch 5/10
    196/196 [==============================] - 3s 14ms/step - loss: 0.7644 - acc: 0.7325 - val_loss: 1.1073 - val_acc: 0.6153
    Epoch 6/10
    196/196 [==============================] - 3s 15ms/step - loss: 0.6872 - acc: 0.7617 - val_loss: 1.1484 - val_acc: 0.6128
    Epoch 7/10
    196/196 [==============================] - 3s 14ms/step - loss: 0.6229 - acc: 0.7850 - val_loss: 1.1469 - val_acc: 0.6346
    Epoch 8/10
    196/196 [==============================] - 3s 14ms/step - loss: 0.5583 - acc: 0.8067 - val_loss: 1.2041 - val_acc: 0.6206
    Epoch 9/10
    196/196 [==============================] - 3s 15ms/step - loss: 0.4998 - acc: 0.8300 - val_loss: 1.3095 - val_acc: 0.6071
    Epoch 10/10
    196/196 [==============================] - 3s 14ms/step - loss: 0.4474 - acc: 0.8471 - val_loss: 1.2649 - val_acc: 0.6177

Plotting the train and validation accuracies of both models,

A few cautions when using batch normalization: it’s generally not advised to use batch normalization together with dropout, as batch normalization already has a regularizing effect. Also, very small batch sizes can be a problem, because the quality of the batch statistics (mean and variance) depends on the batch size. In the extreme case of a batch containing a single sample, every activation is normalized to 0 in a simple fully connected network. Consider using layer normalization (more resources in the further reading section below) if you plan to use small batch sizes.
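The extreme case of a single-sample batch can be illustrated with a short NumPy sketch. The helper `batch_standardize` is a made-up name, and the random data stands in for the activations of a simple dense layer.

```python
import numpy as np

def batch_standardize(x, epsilon=1e-3):
    """Standardize each feature over the batch axis (axis 0)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + epsilon)

rng = np.random.default_rng(0)
full_batch = rng.normal(size=(32, 4))
single_sample = full_batch[:1]  # a batch of size 1

full_out = batch_standardize(full_batch)
single_out = batch_standardize(single_sample)

print(full_out.std(axis=0))  # close to 1 for each feature
print(single_out)            # all zeros: the sample equals its own batch mean
```

With one sample, the batch mean is the sample itself and the batch variance is 0, so all information in the activations is lost.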

Here’s the complete code for the model with normalization too.

    from tensorflow.keras.layers import Dense, Input, Flatten, Conv2D, BatchNormalization, MaxPool2D, Normalization
    from tensorflow.keras.models import Model
    import tensorflow as tf
    import tensorflow.keras as keras

    class LeNet5_Norm(tf.keras.Model):
        def __init__(self, norm_layer, *args, **kwargs):
            super(LeNet5_Norm, self).__init__()
            self.conv1 = Conv2D(filters=6, kernel_size=(5,5), padding="same")
            self.norm1 = norm_layer(*args, **kwargs)
            self.relu = tf.nn.relu  # ReLU is applied separately, after normalization
            self.max_pool2x2 = MaxPool2D(pool_size=(2,2))
            self.conv2 = Conv2D(filters=16, kernel_size=(5,5), padding="same")
            self.norm2 = norm_layer(*args, **kwargs)
            self.flatten = Flatten()
            self.fc1 = Dense(units=120)
            self.norm3 = norm_layer(*args, **kwargs)
            self.fc2 = Dense(units=84)
            self.norm4 = norm_layer(*args, **kwargs)
            self.fc3 = Dense(units=10, activation="softmax")

        def call(self, input_tensor):
            conv1 = self.conv1(input_tensor)
            conv1 = self.norm1(conv1)
            conv1 = self.relu(conv1)
            maxpool1 = self.max_pool2x2(conv1)
            conv2 = self.conv2(maxpool1)
            conv2 = self.norm2(conv2)
            conv2 = self.relu(conv2)
            maxpool2 = self.max_pool2x2(conv2)
            flatten = self.flatten(maxpool2)
            fc1 = self.fc1(flatten)
            fc1 = self.norm3(fc1)
            fc1 = self.relu(fc1)
            fc2 = self.fc2(fc1)
            fc2 = self.norm4(fc2)
            fc2 = self.relu(fc2)
            fc3 = self.fc3(fc2)
            return fc3

    # load dataset, using cifar10 to show greater improvement in accuracy
    (trainX, trainY), (testX, testY) = keras.datasets.cifar10.load_data()

    normalization_layer = Normalization()
    normalization_layer.adapt(trainX)

    input_layer = Input(shape=(32,32,3,))
    x = LeNet5_Norm(BatchNormalization)(normalization_layer(input_layer))
    model = Model(inputs=input_layer, outputs=x)
    model.compile(optimizer="adam", loss=tf.keras.losses.SparseCategoricalCrossentropy(), metrics=["acc"])
    history = model.fit(x=trainX, y=trainY, batch_size=256, epochs=10, validation_data=(testX, testY))

Papers:

- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- How Does Batch Normalization Help Optimization?
- Deep Residual Learning for Image Recognition (the ResNet paper)

Here are some of the different types of normalization that you can implement in your model:

- Layer normalization
- Group normalization
- Instance Normalization: The Missing Ingredient for Fast Stylization

Tensorflow layers:

- Tensorflow addons (Layer, Instance, Group normalization): https://github.com/tensorflow/addons/blob/master/docs/tutorials/layers_normalizations.ipynb
- Batch normalization
- Normalization

In this post, you’ve discovered how normalization and batch normalization work, as well as how to implement them in TensorFlow. You have also seen how using these layers can help to significantly improve the performance of our machine learning models.

Specifically, you’ve learned:

- What normalization and batch normalization do
- How to use normalization and batch normalization in TensorFlow
- Some tips when using batch normalization in your machine learning model


The post Visualizing the vanishing gradient problem appeared first on Machine Learning Mastery.

In this tutorial, we visually examine why the vanishing gradient problem exists.

After completing this tutorial, you will know

- What is a vanishing gradient
- Which configuration of neural network will be susceptible to vanishing gradient
- How to run manual training loop in Keras
- How to extract weights and gradients from Keras model

Let’s get started

This tutorial is divided into 5 parts; they are:

- Configuration of multilayer perceptron models
- Example of vanishing gradient problem
- Looking at the weights of each layer
- Looking at the gradients of each layer
- The Glorot initialization

Because neural networks are trained by gradient descent, people believed that a differentiable function is required to be the activation function in neural networks. This caused us to conventionally use sigmoid function or hyperbolic tangent as activation.

For a binary classification problem, if we want to do logistic regression such that 0 and 1 are the ideal outputs, the sigmoid function is preferred as its output lies in this range:

$$
\sigma(x) = \frac{1}{1+e^{-x}}
$$

and if we need sigmoidal activation at the output, it is natural to use it in all layers of the neural network. Additionally, each layer in a neural network has a weight parameter. Initially, the weights have to be randomized and naturally we would use some simple way to do it, such as using uniform random or normal distribution.
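As a small numeric aside, the derivative of the sigmoid function, $$\sigma'(x) = \sigma(x)(1-\sigma(x))$$, never exceeds 0.25. The sketch below uses only the standard library and ignores the weights entirely, so the product it prints is merely an upper bound on the factor contributed by the activations when a gradient passes back through several sigmoid layers. The function names are chosen here for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # derivative of the sigmoid: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))  # 0.25, the largest value the derivative can take
print(sigmoid_grad(5.0))  # much smaller once the unit saturates

# Upper bound on the activation-derivative factor after n sigmoid layers
for n_layers in (1, 2, 4, 8):
    print(n_layers, 0.25 ** n_layers)
```

Even in the best case, each additional sigmoid layer shrinks this factor by at least 4 times, which hints at why stacking many such layers can be problematic.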

To illustrate the problem of vanishing gradient, let’s try an example. A neural network is a nonlinear function, hence it should be well suited to classifying a nonlinear dataset. We make use of scikit-learn’s `make_circles()` function to generate some data:

    from sklearn.datasets import make_circles
    import matplotlib.pyplot as plt

    # Make data: Two circles on x-y plane as a classification problem
    X, y = make_circles(n_samples=1000, factor=0.5, noise=0.1)

    plt.figure(figsize=(8,6))
    plt.scatter(X[:,0], X[:,1], c=y)
    plt.show()

This is not difficult to classify. A naive way is to build a 3-layer neural network, which can give a quite good result:

    from tensorflow.keras.layers import Dense, Input
    from tensorflow.keras import Sequential

    model = Sequential([
        Input(shape=(2,)),
        Dense(5, "relu"),
        Dense(1, "sigmoid")
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
    model.fit(X, y, batch_size=32, epochs=100, verbose=0)
    print(model.evaluate(X,y))

    32/32 [==============================] - 0s 1ms/step - loss: 0.2404 - acc: 0.9730
    [0.24042171239852905, 0.9729999899864197]

Note that we used rectified linear unit (ReLU) in the hidden layer above. By default, the dense layer in Keras uses linear activation (i.e. no activation), which is mostly not useful. We usually use ReLU in modern neural networks. But we can also try the old-school way, as everyone did two decades ago:

    model = Sequential([
        Input(shape=(2,)),
        Dense(5, "sigmoid"),
        Dense(1, "sigmoid")
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
    model.fit(X, y, batch_size=32, epochs=100, verbose=0)
    print(model.evaluate(X,y))

    32/32 [==============================] - 0s 1ms/step - loss: 0.6927 - acc: 0.6540
    [0.6926590800285339, 0.6539999842643738]

The accuracy is much worse. It turns out that it gets even worse when we add more layers (at least in my experiment):

    model = Sequential([
        Input(shape=(2,)),
        Dense(5, "sigmoid"),
        Dense(5, "sigmoid"),
        Dense(5, "sigmoid"),
        Dense(1, "sigmoid")
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
    model.fit(X, y, batch_size=32, epochs=100, verbose=0)
    print(model.evaluate(X,y))

    32/32 [==============================] - 0s 1ms/step - loss: 0.6922 - acc: 0.5330
    [0.6921834349632263, 0.5329999923706055]

Your result may vary given the stochastic nature of the training algorithm. You may or may not see the 5-layer sigmoidal network performing much worse than the 3-layer one. But the idea here is that you can’t get back the high accuracy we achieved with rectified linear unit activation by merely adding layers.

Shouldn’t we get a more powerful neural network with more layers?

Yes, it should. But it turns out that as we add more layers, we trigger the vanishing gradient problem. To illustrate what happens, let’s see what the weights look like as we train our network.

In Keras, we are allowed to plug a callback function into the training process. We are going to create our own callback object to intercept and record the weights of each layer of our multilayer perceptron (MLP) model at the end of each epoch.

    from tensorflow.keras.callbacks import Callback

    class WeightCapture(Callback):
        "Capture the weights of each layer of the model"
        def __init__(self, model):
            super().__init__()
            self.model = model
            self.weights = []
            self.epochs = []

        def on_epoch_end(self, epoch, logs=None):
            self.epochs.append(epoch) # remember the epoch axis
            weight = {}
            # use the model stored on the callback, not a global variable
            for layer in self.model.layers:
                if not layer.weights:
                    continue
                name = layer.weights[0].name.split("/")[0]
                weight[name] = layer.weights[0].numpy()
            self.weights.append(weight)

We derive from the `Callback` class and define the `on_epoch_end()` function. This class needs the created model to initialize. At the end of each epoch, it reads each layer and saves the weights into a NumPy array.

For the convenience of experimenting with different ways of creating an MLP, we make a helper function to set up the neural network model:

    def make_mlp(activation, initializer, name):
        "Create a model with specified activation and initializer"
        model = Sequential([
            Input(shape=(2,), name=name+"0"),
            Dense(5, activation=activation, kernel_initializer=initializer, name=name+"1"),
            Dense(5, activation=activation, kernel_initializer=initializer, name=name+"2"),
            Dense(5, activation=activation, kernel_initializer=initializer, name=name+"3"),
            Dense(5, activation=activation, kernel_initializer=initializer, name=name+"4"),
            Dense(1, activation="sigmoid", kernel_initializer=initializer, name=name+"5")
        ])
        return model

We deliberately create a neural network with 4 hidden layers so we can see how each layer responds to the training. We will vary the activation function of each hidden layer as well as the weight initialization. To make things easier to tell apart, we are going to name each layer instead of letting Keras assign a name. The input is a coordinate on the xy-plane, hence the input shape is a vector of 2. The output is a binary classification, therefore we use sigmoid activation to make the output fall in the range of 0 to 1.

Then we can `compile()` the model to provide the evaluation metrics and pass on the callback in the `fit()` call to train the model:

    from tensorflow.keras.initializers import RandomNormal

    initializer = RandomNormal(mean=0.0, stddev=1.0)
    batch_size = 32
    n_epochs = 100

    model = make_mlp("sigmoid", initializer, "sigmoid")
    capture_cb = WeightCapture(model)
    capture_cb.on_epoch_end(-1)
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
    model.fit(X, y, batch_size=batch_size, epochs=n_epochs, callbacks=[capture_cb], verbose=1)

Here we create the neural network by calling `make_mlp()` first. Then we set up our callback object. Since the weights of each layer in the neural network are initialized at creation, we deliberately call the callback function to remember what they are initialized to. Then we call `compile()` and `fit()` on the model as usual, with the callback object provided.

After we fit the model, we can evaluate it with the entire dataset:

    ...
    print(model.evaluate(X,y))

[0.6649572253227234, 0.5879999995231628]

This means the log-loss is 0.665 and the accuracy is 0.588 for this model, in which all layers use sigmoid activation.

What we can further look into is how the weights behave along the iterations of training. All the layers except the first and the last have their weights as a 5×5 matrix. We can check the mean and standard deviation of the weights to get a sense of what the weights look like:

    def plotweight(capture_cb):
        "Plot the weights' mean and s.d. across epochs"
        fig, ax = plt.subplots(2, 1, sharex=True, constrained_layout=True, figsize=(8, 10))
        ax[0].set_title("Mean weight")
        for key in capture_cb.weights[0]:
            ax[0].plot(capture_cb.epochs, [w[key].mean() for w in capture_cb.weights], label=key)
        ax[0].legend()
        ax[1].set_title("S.D.")
        for key in capture_cb.weights[0]:
            ax[1].plot(capture_cb.epochs, [w[key].std() for w in capture_cb.weights], label=key)
        ax[1].legend()
        plt.show()

    plotweight(capture_cb)

This results in the following figure:

We see that the mean weight moved quickly only in the first 10 epochs or so. Only the weights of the first layer are getting more diversified, as their standard deviation is moving up.

We can restart with the hyperbolic tangent (tanh) activation on the same process:

    # tanh activation, large variance gaussian initialization
    model = make_mlp("tanh", initializer, "tanh")
    capture_cb = WeightCapture(model)
    capture_cb.on_epoch_end(-1)
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
    model.fit(X, y, batch_size=batch_size, epochs=n_epochs, callbacks=[capture_cb], verbose=0)
    print(model.evaluate(X,y))
    plotweight(capture_cb)

[0.012918001972138882, 0.9929999709129333]

The log-loss and accuracy are both improved. If we look at the plot, we don’t see an abrupt change in the mean and standard deviation of the weights; instead, those of all layers converge slowly.

A similar case can be seen with ReLU activation:

    # relu activation, large variance gaussian initialization
    model = make_mlp("relu", initializer, "relu")
    capture_cb = WeightCapture(model)
    capture_cb.on_epoch_end(-1)
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
    model.fit(X, y, batch_size=batch_size, epochs=n_epochs, callbacks=[capture_cb], verbose=0)
    print(model.evaluate(X,y))
    plotweight(capture_cb)

[0.016895903274416924, 0.9940000176429749]

We saw the effect of different activation functions in the above. But what really matters is the gradient, as we are running gradient descent during training. The paper by Xavier Glorot and Yoshua Bengio, “Understanding the difficulty of training deep feedforward neural networks”, suggested looking at the gradient of each layer in each training iteration, as well as its standard deviation.

Bradley (2009) found that back-propagated gradients were smaller as one moves from the output layer towards the input layer, just after initialization. He studied networks with linear activation at each layer, finding that the variance of the back-propagated gradients decreases as we go backwards in the network

— “Understanding the difficulty of training deep feedforward neural networks” (2010)

To understand how the activation function relates to the gradient observed during training, we need to run the training loop manually.

In TensorFlow-Keras, a training loop can be run by turning on the gradient tape and then making the neural network model produce an output, after which we can obtain the gradient by automatic differentiation from the gradient tape. Subsequently, we can update the parameters (weights and biases) according to the gradient descent update rule.

Because the gradient is readily obtained in this loop, we can make a copy of it. The following is how we implement the training loop and at the same time, keep a copy of the gradients:

    optimizer = tf.keras.optimizers.RMSprop()
    loss_fn = tf.keras.losses.BinaryCrossentropy()

    def train_model(X, y, model, n_epochs=n_epochs, batch_size=batch_size):
        "Run training loop manually"
        train_dataset = tf.data.Dataset.from_tensor_slices((X, y))
        train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size)

        gradhistory = []
        losshistory = []
        def recordweight():
            data = {}
            for g,w in zip(grads, model.trainable_weights):
                if '/kernel:' not in w.name:
                    continue # skip bias
                name = w.name.split("/")[0]
                data[name] = g.numpy()
            gradhistory.append(data)
            losshistory.append(loss_value.numpy())
        for epoch in range(n_epochs):
            for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
                with tf.GradientTape() as tape:
                    y_pred = model(x_batch_train, training=True)
                    loss_value = loss_fn(y_batch_train, y_pred)

                grads = tape.gradient(loss_value, model.trainable_weights)
                optimizer.apply_gradients(zip(grads, model.trainable_weights))

                if step == 0:
                    recordweight()
        # After all epochs, record again
        recordweight()
        return gradhistory, losshistory

The key in the function above is the nested for-loop, in which we launch `tf.GradientTape()` and pass a batch of data to the model to get a prediction, which is then evaluated using the loss function. Afterwards, we can pull the gradient out of the tape by differentiating the loss with respect to the trainable weights of the model. Next, we update the weights using the optimizer, which handles the learning rate and momentum of the gradient descent algorithm implicitly.

As a refresher, the gradient here means the following. For a computed loss value $L$ and a layer with weights $W=[w_1, w_2, w_3, w_4, w_5]$ (e.g., the output layer), the gradient is the matrix

$$

\frac{\partial L}{\partial W} = \Big[\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \frac{\partial L}{\partial w_3}, \frac{\partial L}{\partial w_4}, \frac{\partial L}{\partial w_5}\Big]

$$
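To make this definition concrete, the following is a minimal sketch in plain Python rather than TensorFlow. It assumes a single sigmoid output unit with five weights and a binary cross-entropy loss; the input values, the label, and the learning rate are hypothetical and chosen only for illustration. It computes the five partial derivatives analytically, checks them against finite differences, and applies one plain gradient descent step (not the RMSprop update used in the tutorial):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    """Binary cross-entropy of a single sigmoid unit with weights w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def analytic_gradient(w, x, y):
    """dL/dw_i = (p - y) * x_i for sigmoid output with binary cross-entropy."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]

def numeric_gradient(w, x, y, eps=1e-6):
    """Central finite differences, perturbing one weight at a time."""
    grad = []
    for i in range(len(w)):
        w_hi = list(w)
        w_lo = list(w)
        w_hi[i] += eps
        w_lo[i] -= eps
        grad.append((loss(w_hi, x, y) - loss(w_lo, x, y)) / (2 * eps))
    return grad

# Hypothetical values: 5 inputs from the previous layer, 5 weights, label 1
w = [0.5, -0.3, 0.8, 0.1, -0.6]
x = [0.2, 0.7, 0.1, 0.9, 0.4]
y = 1

g = analytic_gradient(w, x, y)
g_num = numeric_gradient(w, x, y)

# One plain gradient descent step: w <- w - lr * dL/dw
lr = 0.1
w_new = [wi - lr * gi for wi, gi in zip(w, g)]
print("gradient:", [round(gi, 4) for gi in g])
print("loss before:", round(loss(w, x, y), 4), "after:", round(loss(w_new, x, y), 4))
```

The gradient has one entry per weight, matching the five-element vector above, and moving the weights a small step against it lowers the loss.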

In the training loop, before we start the next iteration, we have a chance to further manipulate the gradient: we match the gradients with the weights to get the name of each, then save a copy of the gradient as a NumPy array. We sample the gradient and loss only once per epoch, but you can change that to sample at a higher frequency.

With these, we can plot the gradient across epochs. In the following, we create the model (but do not call `compile()` because we will not call `fit()` afterwards) and run the manual training loop, then plot the gradient as well as the standard deviation of the gradient:

    from sklearn.metrics import accuracy_score

    def plot_gradient(gradhistory, losshistory):
        "Plot gradient mean and sd across epochs"
        fig, ax = plt.subplots(3, 1, sharex=True, constrained_layout=True, figsize=(8, 12))
        ax[0].set_title("Mean gradient")
        for key in gradhistory[0]:
            ax[0].plot(range(len(gradhistory)), [w[key].mean() for w in gradhistory], label=key)
        ax[0].legend()
        ax[1].set_title("S.D.")
        for key in gradhistory[0]:
            ax[1].semilogy(range(len(gradhistory)), [w[key].std() for w in gradhistory], label=key)
        ax[1].legend()
        ax[2].set_title("Loss")
        ax[2].plot(range(len(losshistory)), losshistory)
        plt.show()

    model = make_mlp("sigmoid", initializer, "sigmoid")
    print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
    gradhistory, losshistory = train_model(X, y, model)
    print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
    plot_gradient(gradhistory, losshistory)

It reported a weak classification result:

    Before training: Accuracy 0.5
    After training: Accuracy 0.652

and the plot we obtained shows vanishing gradient:

From the plot, the loss has not decreased significantly. The mean of the gradient (i.e., the mean of all elements in the gradient matrix) has a noticeable value only for the last layer, while it is virtually zero for all other layers. The standard deviation of the gradient is approximately between 0.001 and 0.01.
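One rough way to see why the earlier layers receive so little gradient is to look at the derivative of the activation function alone. The sketch below is a simplification that ignores the weight matrices entirely and evaluates every layer at the same input, so it illustrates the trend rather than modeling the network above. It multiplies the activation derivative across the four hidden layers that `make_mlp()` builds:

```python
import math

def d_sigmoid(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

def d_relu(z):
    return 1.0 if z > 0 else 0.0

def chained_factor(derivative, z, n_layers):
    """Product of the activation derivative over n_layers, all evaluated at z.
    The weight matrices are ignored, so this is only a rough illustration."""
    return derivative(z) ** n_layers

n_layers = 4  # the four hidden layers built by make_mlp()
best_sigmoid = chained_factor(d_sigmoid, 0.0, n_layers)  # sigmoid' peaks at z=0
best_tanh = chained_factor(d_tanh, 0.0, n_layers)        # tanh' peaks at z=0
active_relu = chained_factor(d_relu, 1.0, n_layers)      # any positive input
print(best_sigmoid, best_tanh, active_relu)
```

Even in its best case, the sigmoid derivative is 0.25, so four layers shrink the signal by a factor of 0.25 to the power of 4. The tanh derivative can reach 1 near zero but also shrinks as the input moves away from zero, while the ReLU derivative stays at 1 for any positive input.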

Repeating this with the tanh activation, we see a different result, which explains why the performance is better:

    model = make_mlp("tanh", initializer, "tanh")
    print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
    gradhistory, losshistory = train_model(X, y, model)
    print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
    plot_gradient(gradhistory, losshistory)

    Before training: Accuracy 0.502
    After training: Accuracy 0.994

From the plot of the mean of the gradients, we see that the gradients of every layer are wiggling equally. The standard deviation of the gradient is also an order of magnitude larger than in the case of sigmoid activation, at around 0.01 to 0.1.

Finally, we can see something similar with the rectified linear unit (ReLU) activation. In this case, the loss dropped quickly, hence we see it as the more efficient activation to use in neural networks:

    model = make_mlp("relu", initializer, "relu")
    print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
    gradhistory, losshistory = train_model(X, y, model)
    print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
    plot_gradient(gradhistory, losshistory)

    Before training: Accuracy 0.503
    After training: Accuracy 0.995

The following is the complete code:

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras.callbacks import Callback
    from tensorflow.keras.layers import Dense, Input
    from tensorflow.keras import Sequential
    from tensorflow.keras.initializers import RandomNormal
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_circles
    from sklearn.metrics import accuracy_score

    tf.random.set_seed(42)
    np.random.seed(42)

    # Make data: Two circles on x-y plane as a classification problem
    X, y = make_circles(n_samples=1000, factor=0.5, noise=0.1)
    plt.figure(figsize=(8,6))
    plt.scatter(X[:,0], X[:,1], c=y)
    plt.show()

    # Test performance with 3-layer binary classification network
    model = Sequential([
        Input(shape=(2,)),
        Dense(5, "relu"),
        Dense(1, "sigmoid")
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
    model.fit(X, y, batch_size=32, epochs=100, verbose=0)
    print(model.evaluate(X,y))

    # Test performance with 3-layer network with sigmoid activation
    model = Sequential([
        Input(shape=(2,)),
        Dense(5, "sigmoid"),
        Dense(1, "sigmoid")
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
    model.fit(X, y, batch_size=32, epochs=100, verbose=0)
    print(model.evaluate(X,y))

    # Test performance with 5-layer network with sigmoid activation
    model = Sequential([
        Input(shape=(2,)),
        Dense(5, "sigmoid"),
        Dense(5, "sigmoid"),
        Dense(5, "sigmoid"),
        Dense(1, "sigmoid")
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
    model.fit(X, y, batch_size=32, epochs=100, verbose=0)
    print(model.evaluate(X,y))

    # Illustrate weights across epochs
    class WeightCapture(Callback):
        "Capture the weights of each layer of the model"
        def __init__(self, model):
            super().__init__()
            self.model = model
            self.weights = []
            self.epochs = []

        def on_epoch_end(self, epoch, logs=None):
            self.epochs.append(epoch) # remember the epoch axis
            weight = {}
            # use the model attached to this callback, not a global variable
            for layer in self.model.layers:
                if not layer.weights:
                    continue
                name = layer.weights[0].name.split("/")[0]
                weight[name] = layer.weights[0].numpy()
            self.weights.append(weight)

    def make_mlp(activation, initializer, name):
        "Create a model with specified activation and initalizer"
        model = Sequential([
            Input(shape=(2,), name=name+"0"),
            Dense(5, activation=activation, kernel_initializer=initializer, name=name+"1"),
            Dense(5, activation=activation, kernel_initializer=initializer, name=name+"2"),
            Dense(5, activation=activation, kernel_initializer=initializer, name=name+"3"),
            Dense(5, activation=activation, kernel_initializer=initializer, name=name+"4"),
            Dense(1, activation="sigmoid", kernel_initializer=initializer, name=name+"5")
        ])
        return model

    def plotweight(capture_cb):
        "Plot the weights' mean and s.d. across epochs"
        fig, ax = plt.subplots(2, 1, sharex=True, constrained_layout=True, figsize=(8, 10))
        ax[0].set_title("Mean weight")
        for key in capture_cb.weights[0]:
            ax[0].plot(capture_cb.epochs, [w[key].mean() for w in capture_cb.weights], label=key)
        ax[0].legend()
        ax[1].set_title("S.D.")
        for key in capture_cb.weights[0]:
            ax[1].plot(capture_cb.epochs, [w[key].std() for w in capture_cb.weights], label=key)
        ax[1].legend()
        plt.show()

    initializer = RandomNormal(mean=0, stddev=1)
    batch_size = 32
    n_epochs = 100

    # Sigmoid activation
    model = make_mlp("sigmoid", initializer, "sigmoid")
    capture_cb = WeightCapture(model)
    capture_cb.on_epoch_end(-1)
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
    print("Before training: Accuracy", accuracy_score(y, (model(X).numpy() > 0.5).astype(int)))
    model.fit(X, y, batch_size=batch_size, epochs=n_epochs, callbacks=[capture_cb], verbose=0)
    print("After training: Accuracy", accuracy_score(y, (model(X).numpy() > 0.5).astype(int)))
    print(model.evaluate(X,y))
    plotweight(capture_cb)

    # tanh activation
    model = make_mlp("tanh", initializer, "tanh")
    capture_cb = WeightCapture(model)
    capture_cb.on_epoch_end(-1)
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
    print("Before training: Accuracy", accuracy_score(y, (model(X).numpy() > 0.5).astype(int)))
    model.fit(X, y, batch_size=batch_size, epochs=n_epochs, callbacks=[capture_cb], verbose=0)
    print("After training: Accuracy", accuracy_score(y, (model(X).numpy() > 0.5).astype(int)))
    print(model.evaluate(X,y))
    plotweight(capture_cb)

    # relu activation
    model = make_mlp("relu", initializer, "relu")
    capture_cb = WeightCapture(model)
    capture_cb.on_epoch_end(-1)
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
    print("Before training: Accuracy", accuracy_score(y, (model(X).numpy() > 0.5).astype(int)))
    model.fit(X, y, batch_size=batch_size, epochs=n_epochs, callbacks=[capture_cb], verbose=0)
    print("After training: Accuracy", accuracy_score(y, (model(X).numpy() > 0.5).astype(int)))
    print(model.evaluate(X,y))
    plotweight(capture_cb)

    # Show gradient across epochs
    optimizer = tf.keras.optimizers.RMSprop()
    loss_fn = tf.keras.losses.BinaryCrossentropy()

    def train_model(X, y, model, n_epochs=n_epochs, batch_size=batch_size):
        "Run training loop manually"
        train_dataset = tf.data.Dataset.from_tensor_slices((X, y))
        train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size)

        gradhistory = []
        losshistory = []
        def recordweight():
            data = {}
            for g,w in zip(grads, model.trainable_weights):
                if '/kernel:' not in w.name:
                    continue # skip bias
                name = w.name.split("/")[0]
                data[name] = g.numpy()
            gradhistory.append(data)
            losshistory.append(loss_value.numpy())
        for epoch in range(n_epochs):
            for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
                with tf.GradientTape() as tape:
                    y_pred = model(x_batch_train, training=True)
                    loss_value = loss_fn(y_batch_train, y_pred)

                grads = tape.gradient(loss_value, model.trainable_weights)
                optimizer.apply_gradients(zip(grads, model.trainable_weights))

                if step == 0:
                    recordweight()
        # After all epochs, record again
        recordweight()
        return gradhistory, losshistory

    def plot_gradient(gradhistory, losshistory):
        "Plot gradient mean and sd across epochs"
        fig, ax = plt.subplots(3, 1, sharex=True, constrained_layout=True, figsize=(8, 12))
        ax[0].set_title("Mean gradient")
        for key in gradhistory[0]:
            ax[0].plot(range(len(gradhistory)), [w[key].mean() for w in gradhistory], label=key)
        ax[0].legend()
        ax[1].set_title("S.D.")
        for key in gradhistory[0]:
            ax[1].semilogy(range(len(gradhistory)), [w[key].std() for w in gradhistory], label=key)
        ax[1].legend()
        ax[2].set_title("Loss")
        ax[2].plot(range(len(losshistory)), losshistory)
        plt.show()

    model = make_mlp("sigmoid", initializer, "sigmoid")
    print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
    gradhistory, losshistory = train_model(X, y, model)
    print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
    plot_gradient(gradhistory, losshistory)

    model = make_mlp("tanh", initializer, "tanh")
    print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
    gradhistory, losshistory = train_model(X, y, model)
    print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
    plot_gradient(gradhistory, losshistory)

    model = make_mlp("relu", initializer, "relu")
    print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
    gradhistory, losshistory = train_model(X, y, model)
    print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
    plot_gradient(gradhistory, losshistory)

We didn’t demonstrate it in the code above, but the most famous outcome of the paper by Glorot and Bengio is the Glorot initialization, which suggests initializing the weights of a layer of the neural network with a uniform distribution:

The normalization factor may therefore be important when initializing deep networks because of the multiplicative effect through layers, and we suggest the following initialization procedure to approximately satisfy our objectives of maintaining activation variances and back-propagated gradients variance as one moves up or down the network. We call it the normalized initialization:

$$

W \sim U\Big[-\frac{\sqrt{6}}{\sqrt{n_j+n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j+n_{j+1}}}\Big]

$$

— “Understanding the difficulty of training deep feedforward neural networks” (2010)

This is derived for the linear activation, under the condition that the standard deviation of the gradient stays consistent across the layers. In the sigmoid and tanh activations, the linear region is narrow. Therefore, we can understand why ReLU is the key to working around the vanishing gradient problem. Compared to replacing the activation function, changing the weight initialization has a less pronounced effect on the vanishing gradient problem. But this can be an exercise for you to explore how it can help improve the result.
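As a starting point for that exercise, the normalized initialization quoted above can be sketched in plain Python. The function names here are hypothetical and only for illustration, and the layer sizes assume the 5×5 hidden layers used in this tutorial:

```python
import math
import random

def glorot_uniform_limit(n_in, n_out):
    """Bound of the uniform distribution in the normalized initialization."""
    return math.sqrt(6.0) / math.sqrt(n_in + n_out)

def glorot_uniform(n_in, n_out, rng):
    """Sample an n_in x n_out weight matrix from U[-limit, limit]."""
    limit = glorot_uniform_limit(n_in, n_out)
    return [[rng.uniform(-limit, limit) for _ in range(n_out)] for _ in range(n_in)]

rng = random.Random(42)
n_in, n_out = 5, 5  # the 5x5 hidden layers used in this tutorial
limit = glorot_uniform_limit(n_in, n_out)
W = glorot_uniform(n_in, n_out, rng)

# A uniform distribution on [-a, a] has variance a^2 / 3,
# so these weights have variance 2 / (n_in + n_out)
target_var = limit ** 2 / 3
print("limit:", round(limit, 4), "std:", round(math.sqrt(target_var), 4))
```

For a 5×5 layer, this gives a much smaller spread than the `RandomNormal(mean=0, stddev=1)` initializer used in the experiments above. In Keras, the same scheme is available as the `glorot_uniform` initializer, so you can pass it as `kernel_initializer` when building the model instead of sampling the weights by hand.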

The Glorot and Bengio paper is available at:

- “Understanding the difficulty of training deep feedforward neural networks”, by Xavier Glorot and Yoshua Bengio, 2010.

(https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)

The vanishing gradient problem is well known enough in machine learning that many books cover it. For example,

*Deep Learning*, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016.

(https://www.amazon.com/dp/0262035618)

Previously, we published posts about vanishing and exploding gradients:

- How to fix vanishing gradients using the rectified linear activation function
- Exploding gradients in neural networks

You may also find the following documentation helpful to explain some syntax we used above:

- Writing a training loop from scratch in Keras: https://keras.io/guides/writing_a_training_loop_from_scratch/
- Writing your own callbacks in Keras: https://keras.io/guides/writing_your_own_callbacks/

In this tutorial, you visually saw how a rectified linear unit (ReLU) can help resolve the vanishing gradient problem.

Specifically, you learned:

- How the problem of vanishing gradients impacts the performance of a neural network
- Why ReLU activation is the solution to the vanishing gradient problem
- How to use a custom callback to extract data in the middle of training loop in Keras
- How to write a custom training loop
- How to read the weight and gradient from a layer in the neural network

The post Visualizing the vanishing gradient problem appeared first on Machine Learning Mastery.

]]>The post How to Demonstrate Your Basic Skills with Deep Learning appeared first on Machine Learning Mastery.

]]>Explaining that you are familiar with a technique or type of problem is very different to being able to use it effectively with open source APIs on real datasets.

Perhaps the most effective way of demonstrating skill as a deep learning practitioner is by developing models. A practitioner can practice on standard publicly available machine learning datasets and build up a portfolio of completed projects to both leverage on future projects and to demonstrate competence.

In this post, you will discover how you can use small projects to demonstrate basic competence for using deep learning for predictive modeling.

After reading this post, you will know:

- Explaining deep learning math, theory, and methods is not sufficient to demonstrate competence.
- Developing a portfolio of completed small projects allows you to demonstrate your ability to develop and deliver skillful models.
- Using a systematic five-step project template to execute projects and a nine-step template for presenting results allows you to both methodically complete projects and clearly communicate findings.

**Kick-start your project** with my new book Better Deep Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into five parts; they are:

- How Would You Demonstrate Basic Deep Learning Competence?
- Demonstrate Skills With a Portfolio
- How to Select Portfolio Projects
- Template for Systematic Projects
- Template for Presenting Results

How do you know that you have a basic competence with deep learning methods for predictive modeling problems?

- Perhaps you have read a book?
- Perhaps you have worked through some tutorials?
- Perhaps you are familiar with the APIs?

If you had to, how would you demonstrate this competence to someone else?

- Perhaps you could explain common problems and how to address them?
- Perhaps you could summarize popular techniques?
- Perhaps you could reference notable papers?

Is this enough?

If you were hiring a deep learning practitioner for a role, would this satisfy you?

**It’s not enough, and it would not satisfy me.**

The solution is to use the same techniques that modern businesses are using to hire developers.

Developers can be quizzed all day long on math and how algorithms work but what businesses need is someone who can deliver working and maintainable code.

The same applies to deep learning practitioners.

Practitioners can be quizzed all day long on the math of gradient descent and backpropagation, but what businesses need is someone who can deliver stable models and skillful predictions.

This can be achieved through developing a portfolio of completed projects using open source deep learning libraries and standard machine learning datasets.

The portfolio has three main uses:

- **Develop Skills**. The portfolio can be used by the practitioner to develop and demonstrate the skills incrementally, leveraging work from prior projects on larger and more challenging future projects.
- **Demonstrate Skills**. The portfolio can be used by an employer to confirm the practitioner can deliver stable results and skillful predictions.
- **Discuss Skills**. The portfolio can be used as a starting point for a discussion in an interview where techniques, results, and design decisions are described, understood, and defended.

There are many problem types and many specialized types of data loading and neural network models to address them, such as problems in computer vision, time series, and natural language processing.

Before specialization, you must be able to demonstrate foundational skills. Specifically, you must be able to demonstrate that you are able to work through the steps of an applied machine learning project systematically using the techniques from deep learning.

This then raises the question:

**What projects should you use to demonstrate foundational skills and how should those projects be structured to best demonstrate those skills?**

Use standard and publicly available machine learning datasets.

Ideally, there are datasets that are available with a permissive license such as public domain, GPL, or creative commons, so that you can freely copy them and perhaps even re-distribute them with your completed project.

There are many ways to choose a dataset, such as interest in the domain, prior experience, difficulty, etc.

Instead, I recommend being strategic in the choice of the datasets that you include in your portfolio. Three approaches to dataset selection that I recommend are:

- **Popularity**. A good starting point might be to select popular datasets, such as among the most viewed or most cited datasets. Popular datasets are familiar and can provide points of comparison and help to simplify their presentation.
- **Problem Type**. Another approach might be to choose datasets according to general classes of problems, such as regression, binary classification, and multi-class classification. Demonstrating skill across basic problem types is important.
- **Problem Property**. A final approach might be to select datasets based on a specific property of the dataset for which you want to demonstrate proficiency, such as class imbalance, mixture of input variable types, etc. This area is often overlooked, and real data is rarely as clean and simple as standard machine learning datasets; finding examples that offer more of a challenge provides excellent demonstrations.

Two excellent places to locate and download standard machine learning datasets are:

**Small Data**. I recommend starting with small datasets that fit in memory (RAM), such as many of those on the UCI Machine Learning Repository. This is because it allows you to focus on data preparation and modeling, at least initially, and work through many different configurations rapidly. Larger datasets result in models that are much slower to train and may require cloud infrastructure.

**Good Enough Performance**. I also recommend not aiming for the best possible model performance on the dataset. A dataset is really a manifestation of a predictive modeling problem, that in reality can become a research project of its own with no end. Instead, the focus is on establishing a threshold for defining a skillful model, then demonstrating that you can develop and wield a skillful model for the problem.

**Small Scope**. Finally, I recommend keeping the projects small, ideally completed in a normal work day, although you may need to spread out the work on nights and weekends. Each project has one aim: to work through the dataset systematically and deliver a skillful model. Be aware that without careful time boxing, the project can easily get away from you.

In summary:

- Use standard publicly available machine learning datasets.
- Choose a dataset strategically.
- Prefer smaller datasets that fit in memory.
- Aim to develop skillful models, not optimal models.
- Keep each project small and focused.

Completed projects of this nature offer a lot of benefits, including:

- Demonstration of a methodical approach to solving prediction problems.
- Demonstration of knowledge of appropriate APIs for data handling and model evaluation.
- Demonstration of the capability with specific deep learning models and techniques.
- Demonstration of time and scope management given that you deliver a skillful model.
- Demonstration of good communication in the presentation of the results and findings.

It is critical that a given dataset is worked through in a systematic manner.

There are standard steps in a predictive modeling problem and being systematic both demonstrates that you are aware of the steps and have considered them on the project.

Being systematic on portfolio projects highlights that you would be equally systematic on new projects.

The steps of a project in your portfolio may include the following.

1. **Problem Description**. Describe the predictive modeling problem including the domain and relevant background.
2. **Summarize Data**. Describe the available data, including statistical summaries and data visualization.
3. **Evaluate Models**. Spot-check a suite of model types, configurations, data preparation schemes, and more in order to narrow down what works well on the problem.
4. **Improve Performance**. Improve the performance of the model or models that work well with hyperparameter tuning and perhaps ensemble methods.
5. **Present Results**. Present the findings of the project.

A step before this process, a step zero, might be to choose the open source deep learning and machine learning libraries that you wish to use for the demonstration.

I would encourage you to narrow the scope wherever possible. Some additional tips include:

- **Use repeated k-fold cross-validation** to evaluate models, especially with smaller datasets that fit into memory.
- **Use a hold-out test set** that can be used to demonstrate the ability to make predictions and evaluate a final best-performing model.
- **Establish a baseline performance** in order to provide a threshold of what defines a skillful or non-skillful model.
- **Publicly present your results**, including all code and data, ideally in a public location you own and control such as GitHub or a blog.
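As a rough illustration of the first and third tips, the sketch below implements repeated k-fold splitting and a majority-class baseline in plain Python. The function names and the imbalanced label list are hypothetical, and the splitter is a simplified stand-in; on a real project you would more likely use a library such as scikit-learn, which provides `RepeatedKFold` and `DummyClassifier` for this purpose.

```python
import random
from collections import Counter

def repeated_kfold(n_samples, n_splits, n_repeats, seed=1):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh shuffle for every repeat
        for k in range(n_splits):
            test = idx[k::n_splits]  # every n_splits-th shuffled sample
            test_set = set(test)
            train = [i for i in idx if i not in test_set]
            yield train, test

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return sum(1 for label in y_test if label == majority) / len(y_test)

# Hypothetical imbalanced labels: 70 negatives, 30 positives
y = [0] * 70 + [1] * 30

scores = []
for train, test in repeated_kfold(len(y), n_splits=10, n_repeats=3):
    scores.append(majority_baseline_accuracy([y[i] for i in train],
                                             [y[i] for i in test]))

baseline = sum(scores) / len(scores)
print("folds evaluated:", len(scores), "baseline accuracy:", round(baseline, 3))
```

Any model you report should beat this baseline under the same test harness before it can be called skillful.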

Getting good at working through projects in this manner is invaluable. You will always be able to get good results, quickly.

Specifically, above average, perhaps even a few-percent-from-optimal-quality results within hours to days. Few practitioners are this disciplined and productive even on standard problems.

The project is probably only as good as your ability to present it, including results and findings.

I strongly encourage you to use one (or all) of the following approaches in order to present your projects:

- **Blog post**. Write up your results as a blog post on your own blog.
- **GitHub Repository**. Store all code and data in a GitHub repository and present results using a hosted Markdown file or Notebook that allows rich text and images.
- **YouTube Video**. Present your results and findings in video format, perhaps with slides.

I also strongly encourage you to define the structure of the presentation prior to starting the project, and fill in the details as you go.

A template that I recommend when presenting project results is as follows:

1. **Problem Description**. Describes the problem that is being solved, the source of the data, inputs, and outputs.
2. **Data Summary**. Describes the distribution and relationships in the data and perhaps ideas for data preparation and modeling.
3. **Test Harness**. Describes how model selection will be performed including the resampling method and model evaluation metrics.
4. **Baseline Performance**. Describes the baseline model performance (using the test harness) that defines whether a model is skillful or not.
5. **Experimental Results**. Presents the experimental results, perhaps testing a suite of models, model configurations, data preparation schemes, and more. Each subsection should have some form of:
    - 5.1 **Intent**: why run the experiment?
    - 5.2 **Expectations**: what was the expected outcome of the experiment?
    - 5.3 **Methods**: what data, models, and configurations are used in the experiment?
    - 5.4 **Results**: what were the actual results of the experiment?
    - 5.5 **Findings**: what do the results mean, how do they relate to expectations, what other experiments do they inspire?
6. **Improvements** (*optional*). Describes experimental results for attempts to improve the performance of the better performing models, such as hyperparameter tuning and ensemble methods.
7. **Final Model**. Describes the choice of a final model, including configuration and performance. It is a good idea to demonstrate saving and loading the model and demonstrate the ability to make predictions on a holdout dataset.
8. **Extensions**. Describes areas that were considered but not addressed in the project that could be explored in the future.
9. **Resources**. Describes relevant references to data, code, APIs, papers, and more.

These could be sections in a post or report, or sections of a slide presentation.

This section provides more resources on the topic if you are looking to go deeper.

- Applied Machine Learning Process
- How to Use a Machine Learning Checklist to Get Accurate Predictions
- Build a Machine Learning Portfolio

In this post, you discovered how to demonstrate basic competence for using deep learning for predictive modeling.

Specifically, you learned:

- Explaining deep learning math, theory, and methods is not sufficient to demonstrate competence.
- Developing a portfolio of completed small projects allows you to demonstrate your ability to develop and deliver skillful models.
- Using a systematic five-step project template to execute projects and a nine-step template for presenting results allows you to both methodically complete projects and clearly communicate findings.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


]]>The post Why Training a Neural Network Is Hard appeared first on Machine Learning Mastery.

]]>Fitting a neural network involves using a training dataset to update the model weights to create a good mapping of inputs to outputs.

This training process is solved using an optimization algorithm that searches through a space of possible values for the neural network model weights for a set of weights that results in good performance on the training dataset.

In this post, you will discover the challenge of training a neural network framed as an optimization problem.

After reading this post, you will know:

- Training a neural network involves using an optimization algorithm to find a set of weights to best map inputs to outputs.
- The problem is hard, not least because the error surface is non-convex and contains local minima, flat spots, and is highly multidimensional.
- The stochastic gradient descent algorithm is the best general algorithm to address this challenging problem.

**Kick-start your project** with my new book Better Deep Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into four parts; they are:

- Learning as Optimization
- Challenging Optimization
- Features of the Error Surface
- Implications for Training

Deep learning neural network models learn to map inputs to outputs given a training dataset of examples.

The training process involves finding a set of weights in the network that proves to be good, or good enough, at solving the specific problem.

This training process is iterative, meaning that it progresses step by step with small updates to the model weights each iteration and, in turn, a change in the performance of the model each iteration.

The iterative training process of neural networks solves an optimization problem that finds parameters (model weights) that result in a minimum error or loss when evaluating the examples in the training dataset.

Optimization is a directed search procedure and the optimization problem that we wish to solve when training a neural network model is very challenging.

This raises the question as to what exactly is so challenging about this optimization problem?

Training deep learning neural networks is very challenging.

The best general algorithm known for solving this problem is stochastic gradient descent, where model weights are updated each iteration using the backpropagation of error algorithm.

Optimization in general is an extremely difficult task. […] When training neural networks, we must confront the general non-convex case.

— Page 282, Deep Learning, 2016.

An optimization process can be understood conceptually as a search through a landscape for a candidate solution that is sufficiently satisfactory.

A point on the landscape is a specific set of weights for the model, and the elevation of that point is an evaluation of the set of weights, where valleys represent good models with small values of loss.

This is a common conceptualization of optimization problems and the landscape is referred to as an “*error surface*.”

In general, E(w) [the error function of the weights] is a multidimensional function and impossible to visualize. If it could be plotted as a function of w [the weights], however, E [the error function] might look like a landscape with hills and valleys …

— Page 113, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

The optimization algorithm iteratively steps across this landscape, updating the weights and seeking out good or low elevation areas.

For simple optimization problems, the shape of the landscape is a big bowl and finding the bottom is easy, so easy that very efficient algorithms can be designed to find the best solution.

These types of optimization problems are referred to mathematically as convex.

The error surface we wish to navigate when optimizing the weights of a neural network is not a bowl shape. It is a landscape with many hills and valleys.

These types of optimization problems are referred to mathematically as non-convex.
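The difference between the two cases can be seen with a small experiment. The sketch below is an illustration only: it uses two hypothetical one-dimensional loss functions, a bowl (`w ** 2`) and a bumpy curve with two valleys (`w ** 4 - 3 * w ** 2 + w`), and runs the same gradient descent from two starting points on each.

```python
# Sketch: the same gradient descent procedure on a convex and a non-convex loss.
# Both loss functions are made up for illustration.

def descend(gradient, w, learning_rate=0.01, steps=2000):
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

def convex_loss(w):
    return w ** 2                       # a single bowl

def convex_grad(w):
    return 2 * w

def bumpy_loss(w):
    return w ** 4 - 3 * w ** 2 + w      # two valleys of different depth

def bumpy_grad(w):
    return 4 * w ** 3 - 6 * w + 1

for start in (-2.0, 2.0):
    bowl_w = descend(convex_grad, start)
    bumpy_w = descend(bumpy_grad, start)
    print(f"start={start:+.1f}  bowl ends at w={bowl_w:+.4f}  "
          f"bumpy ends at w={bumpy_w:+.4f} (loss {bumpy_loss(bumpy_w):+.4f})")
```

On the bowl, both starting points arrive at the same bottom. On the bumpy curve, the starting point decides which valley the search settles in, and the two valleys do not have the same loss.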

In fact, no algorithm is known that finds an optimal set of weights for a neural network in polynomial time. Mathematically, the optimization problem solved by training a neural network is referred to as NP-complete (i.e. it is very hard to solve).

We prove this problem NP-complete and thus demonstrate that learning in neural networks has no efficient general solution.

— Neural Network Design and the Complexity of Learning, 1988.

There are many types of non-convex optimization problems, but the specific type of problem we are solving when training a neural network is particularly challenging.

We can characterize the difficulty in terms of the features of the landscape or error surface that the optimization algorithm may encounter and must navigate in order to be able to deliver a good solution.

There are many aspects of the optimization of neural network weights that make the problem challenging, but three often-mentioned features of the error landscape are the presence of local minima, flat regions, and the high-dimensionality of the search space.

Backpropagation can be very slow particularly for multilayered networks where the cost surface is typically non-quadratic, non-convex, and high dimensional with many local minima and/or flat regions.

— Page 13, Neural Networks: Tricks of the Trade, 2012.

Local minima or local optima refer to the fact that the error landscape contains multiple regions where the loss is relatively low.

These are valleys, where solutions in those valleys look good relative to the slopes and peaks around them. The problem is, in the broader view of the entire landscape, the valley has a relatively high elevation and better solutions may exist.

It is hard to know whether the optimization algorithm is in a valley or not. Therefore, it is good practice to start the optimization process with a lot of noise, allowing the landscape to be sampled widely before selecting a valley to fall into.

By contrast, the lowest point in the landscape is referred to as the “*global minimum*”.

Neural networks may have one or more global minima, and in practice the difference in loss between a local minimum and the global minimum may be small.

The implication of this is that often finding a “*good enough*” set of weights is more tractable and, in turn, more desirable than finding a globally optimal or best set of weights.

Nonlinear networks usually have multiple local minima of differing depths. The goal of training is to locate one of these minima.

— Page 14, Neural Networks: Tricks of the Trade, 2012.

A classical approach to addressing the problem of local minima is to restart the search process multiple times with a different starting point (random initial weights) and allow the optimization algorithm to find a different, and hopefully better, local minimum. This is called “*multiple restarts*”.

Random Restarts: One of the simplest ways to deal with local minima is to train many different networks with different initial weights.

— Page 121, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.
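A rough sketch of random restarts is shown below. It assumes a hypothetical one-weight loss with two valleys (`w ** 4 - 3 * w ** 2 + w`) standing in for a real network, draws several random starting weights, runs gradient descent from each, and keeps the solution with the lowest loss. The function names and the number of restarts are illustrative assumptions.

```python
import random

# Sketch of multiple restarts on a made-up loss with two valleys.

def loss(w):
    return w ** 4 - 3 * w ** 2 + w

def gradient(w):
    return 4 * w ** 3 - 6 * w + 1

def descend(w, learning_rate=0.01, steps=2000):
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

def random_restarts(n_restarts=10, seed=0):
    rng = random.Random(seed)
    # Each restart begins from a different random initial weight.
    solutions = [descend(rng.uniform(-2.0, 2.0)) for _ in range(n_restarts)]
    # Keep the run that ended in the lowest valley.
    return min(solutions, key=loss), solutions

best_w, solutions = random_restarts()
for w in solutions:
    print(f"restart ended at w={w:+.4f}, loss={loss(w):+.4f}")
print(f"best of all restarts: w={best_w:+.4f}, loss={loss(best_w):+.4f}")
```

Restarts do not guarantee that the deepest valley is found; they only raise the chance that at least one run begins on the right side of the hill.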

A flat region or saddle point is an area of the landscape where the gradient is zero or close to zero.

These are flat regions at the bottom of valleys or regions between peaks. The problem is that a zero gradient means that the optimization algorithm does not know which direction to move in order to improve the model.

… the presence of saddlepoints, or regions where the error function is very flat, can cause some iterative algorithms to become ‘stuck’ for extensive periods of time, thereby mimicking local minima.

— Page 255, Neural Networks for Pattern Recognition, 1995.
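The effect of a zero gradient can be sketched with the textbook saddle `f(x, y) = x ** 2 - y ** 2`, which curves up along `x` and down along `y`. This function is a stand-in chosen for illustration and is not a real network loss (it has no lowest point). Started exactly on the ridge, gradient descent slides into the saddle point and stops; a tiny nudge off the ridge lets it escape.

```python
# Sketch: gradient descent at a saddle point of f(x, y) = x^2 - y^2.
# The function and step settings are assumptions for illustration.

def f(x, y):
    return x ** 2 - y ** 2

def grad(x, y):
    return 2 * x, -2 * y

def descend(x, y, learning_rate=0.1, steps=100):
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - learning_rate * gx, y - learning_rate * gy
    return x, y

stuck = descend(1.0, 0.0)       # y starts at exactly 0, so its gradient is always 0
escaped = descend(1.0, 1e-6)    # a tiny offset in y is enough to roll downhill

print("stuck at:", stuck, "gradient:", grad(*stuck))
print("escaped to loss:", f(*escaped))
```

In the first run, the gradient shrinks toward zero in every direction, so the updates carry no information about where to go next, even though a lower loss is available nearby.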

Nevertheless, recent work suggests that local minima and flat regions may be less of a challenge than was previously believed.

Do neural networks enter and escape a series of local minima? Do they move at varying speed as they approach and then pass a variety of saddle points? […] we present evidence strongly suggesting that the answer to all of these questions is no.

— Qualitatively characterizing neural network optimization problems, 2015.

The optimization problem solved when training a neural network is high-dimensional.

Each weight in the network represents another parameter or dimension of the error surface. Deep neural networks often have millions of parameters, making the landscape to be navigated by the algorithm extremely high-dimensional, as compared to more traditional machine learning algorithms.

The problem of navigating a high-dimensional space is that each new dimension dramatically increases the volume (hypervolume) of the space and, in turn, the distance between points in it. This is often referred to as the “curse of dimensionality”.

This phenomenon is known as the curse of dimensionality. Of particular concern is that the number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases.

— Page 155, Deep Learning, 2016.
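Both effects are easy to measure. The sketch below makes two simplifying assumptions: each weight is restricted to 10 candidate values (to count configurations), and points are sampled uniformly from the unit hypercube (to measure distances). The function names are hypothetical.

```python
import math
import random

def grid_configurations(values_per_weight, n_weights):
    # Number of distinct weight settings on a grid: grows exponentially.
    return values_per_weight ** n_weights

def mean_pair_distance(n_dims, n_pairs=200, seed=0):
    # Average Euclidean distance between random points in the unit hypercube.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        a = [rng.random() for _ in range(n_dims)]
        b = [rng.random() for _ in range(n_dims)]
        total += math.dist(a, b)
    return total / n_pairs

for n_dims in (1, 2, 10, 100):
    print(f"dims={n_dims:3d}  grid configurations={grid_configurations(10, n_dims):.3e}  "
          f"mean distance={mean_pair_distance(n_dims):.3f}")
```

Even at 100 dimensions, far fewer than the millions of weights in a deep network, an exhaustive grid is out of reach and random points sit much farther apart than in two dimensions.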

The challenging nature of optimization problems to be solved when using deep learning neural networks has implications when training models in practice.

In general, stochastic gradient descent is the best algorithm available, and this algorithm makes no guarantees.

There is no formula to guarantee that (1) the network will converge to a good solution, (2) convergence is swift, or (3) convergence even occurs at all.

— Page 13, Neural Networks: Tricks of the Trade, 2012.

We can summarize these implications as follows:

- **Possibly Questionable Solution Quality**. The optimization process may or may not find a good solution and solutions can only be compared relatively, due to deceptive local minima.
- **Possibly Long Training Time**. The optimization process may take a long time to find a satisfactory solution, due to the iterative nature of the search.
- **Possible Failure**. The optimization process may fail to progress (get stuck) or fail to locate a viable solution, due to the presence of flat regions.

The task of effective training is to carefully configure, test, and tune the hyperparameters of the model and the learning process itself to best address this challenge.

Thankfully, modern advancements can dramatically simplify the search space and accelerate the search process, often discovering models much larger, deeper, and with better performance than previously thought possible.

This section provides more resources on the topic if you are looking to go deeper.

- Deep Learning, 2016.
- Neural Networks: Tricks of the Trade, 2012.
- Neural Networks for Pattern Recognition, 1995.
- Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

- Training a 3-Node Neural Network is NP-Complete, 1992.
- Qualitatively characterizing neural network optimization problems, 2015.
- Neural Network Design and the Complexity of Learning, 1988.

- The hard thing about deep learning, 2016.
- Saddle point, Wikipedia.
- Curse of dimensionality, Wikipedia.
- NP-completeness, Wikipedia.

In this post, you discovered the challenge of training a neural network framed as an optimization problem.

Specifically, you learned:

- Training a neural network involves using an optimization algorithm to find a set of weights to best map inputs to outputs.
- The problem is hard, not least because the error surface is non-convex, contains local minima and flat spots, and is highly multidimensional.
- The stochastic gradient descent algorithm is the best general algorithm to address this challenging problem.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to use Learning Curves to Diagnose Machine Learning Model Performance appeared first on Machine Learning Mastery.

Learning curves are a widely used diagnostic tool in machine learning for algorithms that learn from a training dataset incrementally. The model can be evaluated on the training dataset and on a hold-out validation dataset after each update during training, and plots of the measured performance can be created to show learning curves.

Reviewing learning curves of models during training can be used to diagnose problems with learning, such as an underfit or overfit model, as well as whether the training and validation datasets are suitably representative.

In this post, you will discover learning curves and how they can be used to diagnose the learning and generalization behavior of machine learning models, with example plots showing common learning problems.

After reading this post, you will know:

- Learning curves are plots that show changes in learning performance over time in terms of experience.
- Learning curves of model performance on the train and validation datasets can be used to diagnose an underfit, overfit, or well-fit model.
- Learning curves of model performance can be used to diagnose whether the train or validation datasets are not relatively representative of the problem domain.

**Kick-start your project** with my new book Better Deep Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Learning Curves
- Diagnosing Model Behavior
- Diagnosing Unrepresentative Datasets

Generally, a learning curve is a plot that shows time or experience on the x-axis and learning or improvement on the y-axis.

Learning curves (LCs) are deemed effective tools for monitoring the performance of workers exposed to a new task. LCs provide a mathematical representation of the learning process that takes place as task repetition occurs.

— Learning curve models and applications: Literature review and research directions, 2011.

For example, if you were learning a musical instrument, your skill on the instrument could be evaluated and assigned a numerical score each week for one year. A plot of the scores over the 52 weeks is a learning curve and would show how your learning of the instrument has changed over time.

**Learning Curve**: Line plot of learning (y-axis) over experience (x-axis).

Learning curves are widely used in machine learning for algorithms that learn (optimize their internal parameters) incrementally over time, such as deep learning neural networks.

The metric used to evaluate learning could be maximizing, meaning that better scores (larger numbers) indicate more learning. An example would be classification accuracy.

It is more common to use a score that is minimizing, such as loss or error whereby better scores (smaller numbers) indicate more learning and a value of 0.0 indicates that the training dataset was learned perfectly and no mistakes were made.

During the training of a machine learning model, the current state of the model at each step of the training algorithm can be evaluated. It can be evaluated on the training dataset to give an idea of how well the model is “*learning*.” It can also be evaluated on a hold-out validation dataset that is not part of the training dataset. Evaluation on the validation dataset gives an idea of how well the model is “*generalizing*.”

- **Train Learning Curve**: Learning curve calculated from the training dataset that gives an idea of how well the model is learning.
- **Validation Learning Curve**: Learning curve calculated from a hold-out validation dataset that gives an idea of how well the model is generalizing.

It is common to create dual learning curves for a machine learning model during training on both the training and validation datasets.
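As a rough sketch of how dual learning curves are collected, the code below trains a tiny linear model with stochastic gradient descent and records the loss on both datasets after every epoch. The toy problem (`y = 3x + 1` plus noise), the split sizes, and all function names are assumptions made for illustration; with a real library you would typically read these values from the training history instead.

```python
import random

def make_data(n, rng):
    # Hypothetical toy problem: y = 3x + 1 plus Gaussian noise.
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.1)))
    return data

def mean_squared_error(w, b, data):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def fit_with_curves(train, val, epochs=30, learning_rate=0.005):
    w, b = 0.0, 0.0
    train_curve, val_curve = [], []
    for _ in range(epochs):
        for x, y in train:                      # one pass over the training data
            error = w * x + b - y
            w -= learning_rate * 2 * error * x
            b -= learning_rate * 2 * error
        # Evaluate the current model on both datasets after each epoch.
        train_curve.append(mean_squared_error(w, b, train))
        val_curve.append(mean_squared_error(w, b, val))
    return train_curve, val_curve

rng = random.Random(1)
train, val = make_data(80, rng), make_data(20, rng)
train_curve, val_curve = fit_with_curves(train, val)
for epoch in (0, 9, 29):
    print(f"epoch {epoch + 1:2d}  train loss={train_curve[epoch]:.4f}  "
          f"validation loss={val_curve[epoch]:.4f}")
```

Plotting `train_curve` and `val_curve` against the epoch number on the same axes gives the dual learning curve plot described above.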

In some cases, it is also common to create learning curves for multiple metrics, such as in the case of classification predictive modeling problems, where the model may be optimized according to cross-entropy loss and model performance is evaluated using classification accuracy. In this case, two plots are created, one for the learning curves of each metric, and each plot can show two learning curves, one for each of the train and validation datasets.

- **Optimization Learning Curves**: Learning curves calculated on the metric by which the parameters of the model are being optimized, e.g. loss.
- **Performance Learning Curves**: Learning curves calculated on the metric by which the model will be evaluated and selected, e.g. accuracy.

Now that we are familiar with the use of learning curves in machine learning, let’s look at some common shapes observed in learning curve plots.

The shape and dynamics of a learning curve can be used to diagnose the behavior of a machine learning model and, in turn, perhaps suggest the type of configuration changes that may be made to improve learning and/or performance.

There are three common dynamics that you are likely to observe in learning curves; they are:

- Underfit.
- Overfit.
- Good Fit.

We will take a closer look at each with examples. The examples will assume that we are looking at a minimizing metric, meaning that smaller relative scores on the y-axis indicate more or better learning.

Underfitting refers to a model that cannot learn the training dataset.

Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set.

— Page 111, Deep Learning, 2016.

An underfit model can be identified from the learning curve of the training loss only.

It may show a flat line or noisy values of relatively high loss, indicating that the model was unable to learn the training dataset at all.

An example of this is provided below and is common when the model does not have a suitable capacity for the complexity of the dataset.

An underfit model may also be identified by a training loss that is decreasing and continues to decrease at the end of the plot.

This indicates that the model is capable of further learning and possible further improvements and that the training process was halted prematurely.

A plot of learning curves shows underfitting if:

- The training loss remains flat regardless of training.
- The training loss continues to decrease until the end of training.
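The two signs listed above can be turned into a rough automated check. The sketch below is a heuristic only: the function name, the `window` of recent epochs, and the 1% `tolerance` are arbitrary assumptions, and a noisy curve would need smoothing first.

```python
def underfit_signal(train_loss, window=3, tolerance=0.01):
    """Return 'flat', 'still decreasing', or None for a minimizing training loss.

    Heuristic sketch: thresholds are assumptions, not established rules.
    """
    start, end = train_loss[0], train_loss[-1]
    threshold = tolerance * abs(start)
    # Sign 1: the loss barely moved over the whole run.
    if start - end <= threshold:
        return "flat"
    # Sign 2: the loss was still dropping over the last few epochs.
    window = min(window, len(train_loss) - 1)
    recent_drop = train_loss[-window - 1] - end
    if recent_drop > threshold:
        return "still decreasing"
    return None

# Hypothetical loss values for three training runs.
print(underfit_signal([1.0, 0.999, 1.001, 1.0]))          # no learning at all
print(underfit_signal([1.0, 0.8, 0.6, 0.4, 0.2]))         # halted too early
print(underfit_signal([1.0, 0.3, 0.1, 0.1, 0.1, 0.1]))    # leveled off
```

A result of `None` only means that neither underfitting sign was detected; it says nothing about overfitting, which requires the validation curve.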

Overfitting refers to a model that has learned the training dataset too well, including the statistical noise or random fluctuations in the training dataset.

… fitting a more flexible model requires estimating a greater number of parameters. These more complex models can lead to a phenomenon known as overfitting the data, which essentially means they follow the errors, or noise, too closely.

— Page 22, An Introduction to Statistical Learning: with Applications in R, 2013.

The problem with overfitting is that the more specialized the model becomes to training data, the less well it is able to generalize to new data, resulting in an increase in generalization error. This increase in generalization error can be measured by the performance of the model on the validation dataset.

This is an example of overfitting the data, […]. It is an undesirable situation because the fit obtained will not yield accurate estimates of the response on new observations that were not part of the original training data set.

— Page 24, An Introduction to Statistical Learning: with Applications in R, 2013.

This often occurs if the model has more capacity than is required for the problem, and, in turn, too much flexibility. It can also occur if the model is trained for too long.

A plot of learning curves shows overfitting if:

- The plot of training loss continues to decrease with experience.
- The plot of validation loss decreases to a point and begins increasing again.

The turning point in validation loss, where it stops decreasing and begins to rise, may be the point at which training could be halted, as experience after that point shows the dynamics of overfitting.
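Locating that point from recorded curves is straightforward. The sketch below uses hypothetical loss values and a made-up helper, `best_epoch`, that returns the epoch with the lowest validation loss; it then checks whether the curves after that epoch show the overfitting pattern described above.

```python
def best_epoch(val_loss):
    # Index of the lowest validation loss (first one if there are ties).
    return min(range(len(val_loss)), key=val_loss.__getitem__)

def shows_overfitting(train_loss, val_loss):
    stop = best_epoch(val_loss)
    # Training loss kept improving while validation loss got worse.
    return train_loss[-1] < train_loss[stop] and val_loss[-1] > val_loss[stop]

# Hypothetical curves, one value per epoch.
train_loss = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18]
val_loss = [0.95, 0.70, 0.58, 0.55, 0.57, 0.63, 0.72]

stop = best_epoch(val_loss)
print(f"lowest validation loss at epoch index {stop}")
print(f"overfitting after that point: {shows_overfitting(train_loss, val_loss)}")
```

In practice the validation curve is noisy, so the stopping point is usually chosen with some patience (waiting several epochs without improvement) rather than at the first upward tick.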

The example plot below demonstrates a case of overfitting.

A good fit is the goal of the learning algorithm and exists between an overfit and underfit model.

A good fit is identified by a training and validation loss that decreases to a point of stability with a minimal gap between the two final loss values.

The loss of the model will almost always be lower on the training dataset than the validation dataset. This means that we should expect some gap between the train and validation loss learning curves. This gap is referred to as the “generalization gap.”

A plot of learning curves shows a good fit if:

- The plot of training loss decreases to a point of stability.
- The plot of validation loss decreases to a point of stability and has a small gap with the training loss.

Continued training of a good fit will likely lead to an overfit.

The example plot below demonstrates a case of a good fit.

Learning curves can also be used to diagnose properties of a dataset and whether it is relatively representative.

An unrepresentative dataset means a dataset that may not capture the statistical characteristics relative to another dataset drawn from the same domain, such as between a train and a validation dataset. This can commonly occur if the number of samples in a dataset is too small, relative to another dataset.

There are two common cases that could be observed; they are:

- Training dataset is relatively unrepresentative.
- Validation dataset is relatively unrepresentative.

An unrepresentative training dataset means that the training dataset does not provide sufficient information to learn the problem, relative to the validation dataset used to evaluate it.

This may occur if the training dataset has too few examples as compared to the validation dataset.

This situation can be identified by a learning curve for training loss that shows improvement and similarly a learning curve for validation loss that shows improvement, but a large gap remains between both curves.

An unrepresentative validation dataset means that the validation dataset does not provide sufficient information to evaluate the ability of the model to generalize.

This may occur if the validation dataset has too few examples as compared to the training dataset.

This case can be identified by a learning curve for training loss that looks like a good fit (or other fits) and a learning curve for validation loss that shows noisy movements around the training loss.

It may also be identified by a validation loss that is lower than the training loss. In this case, it indicates that the validation dataset may be easier for the model to predict than the training dataset.

This section provides more resources on the topic if you are looking to go deeper.

- How to Diagnose Overfitting and Underfitting of LSTM Models
- Overfitting and Underfitting With Machine Learning Algorithms

In this post, you discovered learning curves and how they can be used to diagnose the learning and generalization behavior of machine learning models.

Specifically, you learned:

- Learning curves are plots that show changes in learning performance over time in terms of experience.
- Learning curves of model performance on the train and validation datasets can be used to diagnose an underfit, overfit, or well-fit model.
- Learning curves of model performance can be used to diagnose whether the train or validation datasets are not relatively representative of the problem domain.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Recommendations for Deep Learning Neural Network Practitioners appeared first on Machine Learning Mastery.

Nevertheless, neural networks remain challenging to configure and train.

In his 2012 paper titled “*Practical Recommendations for Gradient-Based Training of Deep Architectures*” published as a preprint and a chapter of the popular 2012 book “*Neural Networks: Tricks of the Trade*,” Yoshua Bengio, one of the fathers of the field of deep learning, provides practical recommendations for configuring and tuning neural network models.

In this post, you will step through this long and interesting paper and pick out the most relevant tips and tricks for modern deep learning practitioners.

After reading this post, you will know:

- The early foundations for the deep learning renaissance including pretraining and autoencoders.
- Recommendations for the initial configuration for the range of neural network hyperparameters.
- How to effectively tune neural network hyperparameters and tactics to tune models more efficiently.

**Kick-start your project** with my new book Better Deep Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into five parts; they are:

- Required Reading for Practitioners
- Paper Overview
- Beginnings of Deep Learning
- Learning via Gradient Descent
- Hyperparameter Recommendations

In 2012, a second edition of the popular practical book “Neural Networks: Tricks of the Trade” was published.

The first edition was published in 1999 and contained 17 chapters (each written by different academics and experts) on how to get the most out of neural network models. The updated second edition added 13 more chapters, including an important chapter (chapter 19) by Yoshua Bengio titled “*Practical Recommendations for Gradient-Based Training of Deep Architectures*.”

The time that this second edition was published was an important time in the renewed interest in neural networks and the start of what has become “*deep learning*.” Yoshua Bengio’s chapter is important because it provides recommendations for developing neural network models, including the details for, at the time, very modern deep learning methods.

Although the chapter can be read as part of the second edition, Bengio also published a preprint of the chapter on the arXiv website.

The chapter is also important as it provides a valuable foundation for what became the de facto textbook on deep learning four years later, titled simply “Deep Learning,” for which Bengio was a co-author.

This chapter (I’ll refer to it as a paper from now on) is required reading for all neural network practitioners.

In this post, we will step through each section of the paper and point out some of the most salient recommendations.

The goal of the paper is to provide practitioners with practical recommendations for developing neural network models.

There are many types of neural network models and many types of practitioners, so the goal is broad and the recommendations are not specific to a given type of neural network or predictive modeling problem. This is good in that we can apply the recommendations liberally on our projects, but also frustrating as specific examples from literature or case studies are not given.

The focus of these recommendations is on the configuration of model hyperparameters, specifically those related to the stochastic gradient descent learning algorithm.

This chapter is meant as a practical guide with recommendations for some of the most commonly used hyper-parameters, in particular in the context of learning algorithms based on backpropagated gradient and gradient-based optimization.

Recommendations are presented in the context of the dawn of the field of deep learning, where modern methods and fast GPU hardware facilitated the development of networks with more depth and, in turn, more capability than had been seen before. Bengio traces this renaissance back to 2006 (six years before the time of writing) and the development of greedy layer-wise pretraining methods, which were later (after this paper was written) replaced by extensive use of ReLU, Dropout, BatchNorm, and other methods that aided in developing very deep models.

The 2006 Deep Learning breakthrough centered on the use of unsupervised learning to help learning internal representations by providing a local training signal at each level of a hierarchy of features.

The paper is divided into six main sections, with section three providing the main reading focus on recommendations for configuring hyperparameters. The full table of contents for the paper is provided below.

- Abstract
- 1 Introduction
    - 1.1 Deep Learning and Greedy Layer-Wise Pretraining
    - 1.2 Denoising and Contractive AutoEncoders
    - 1.3 Online Learning and Optimization of Generalization Error
- 2 Gradients
    - 2.1 Gradient Descent and Learning Rate
    - 2.2 Gradient Computation and Automatic Differentiation
- 3 Hyper-Parameters
    - 3.1 Neural Network HyperParameters
        - 3.1.1 Hyper-Parameters of the Approximate Optimization
    - 3.2 Hyper-Parameters of the Model and Training Criterion
    - 3.3 Manual Search and Grid Search
        - 3.3.1 General guidance for the exploration of hyper-parameters
        - 3.3.2 Coordinate Descent and MultiResolution Search
        - 3.3.3 Automated and Semi-automated Grid Search
        - 3.3.4 Layer-wise optimization of hyperparameters
    - 3.4 Random Sampling of HyperParameters
- 4 Debugging and Analysis
    - 4.1 Gradient Checking and Controlled Overfitting
    - 4.2 Visualizations and Statistics
- 5 Other Recommendations
    - 5.1 Multi-core machines, BLAS and GPUs
    - 5.2 Sparse High-Dimensional Inputs
    - 5.3 Symbolic Variables, Embeddings, Multi-Task Learning and MultiRelational Learning
- 6 Open Questions
    - 6.1 On the Added Difficulty of Training Deeper Architectures
    - 6.2 Adaptive Learning Rates and Second-Order Methods
    - 6.3 Conclusion

We will not touch on each section, but instead focus on the beginning of the paper and specifically the recommendations for hyperparameters and model tuning.

The introduction section spends some time on the beginnings of deep learning, which is fascinating if viewed as a historical snapshot of the field.

At the time, the deep learning renaissance was driven by the development of neural network models with many more layers than could be used previously based on techniques such as greedy layer-wise pretraining and representation learning via autoencoders.

One of the most commonly used approaches for training deep neural networks is based on greedy layer-wise pre-training.

Not only was the approach important because it allowed the development of deeper models, but also the unsupervised form allowed the use of unlabeled examples, e.g. semi-supervised learning, which too was a breakthrough.

Another important motivation for feature learning and Deep Learning is that they can be done with unlabeled examples …

As such, reuse (literal reuse) was a major theme.

The notion of reuse, which explains the power of distributed representations is also at the heart of the theoretical advantages behind Deep Learning.

Although a single or two-layer neural network of sufficient capacity can be shown to approximate any function in theory, he offers a gentle reminder that deep networks provide a computational short-cut to approximating more complex functions. This is an important reminder and helps in motivating the development of deep models.

Theoretical results clearly identify families of functions where a deep representation can be exponentially more efficient than one that is insufficiently deep.

Time is spent stepping through two of the major “*deep learning*” breakthroughs: greedy layer-wise pretraining (both supervised and unsupervised) and autoencoders (both denoising and contractive).

The third breakthrough, RBMs, was left for discussion in another chapter of the book written by Hinton, the developer of the method.

- Restricted Boltzmann Machine (RBM).
- Greedy Layer-Wise Pretraining (Unsupervised and Supervised).
- Autoencoders (Denoising and Contractive).

Although milestones, none of these techniques are preferred and used widely today (six years later) in the development of deep learning, and, perhaps with the exception of autoencoders, none are as vigorously researched as they once were.

Section two provides a foundation on gradients and gradient learning algorithms, the main optimization technique used to fit neural network weights to training datasets.

This includes the important distinction between batch and stochastic gradient descent, and approximations via mini-batch gradient descent, today all simply referred to as stochastic gradient descent.

- **Batch Gradient Descent**. Gradient is estimated using all examples in the training dataset.
- **Mini-Batch Gradient Descent**. Gradient is estimated using subsets of samples in the training dataset.
- **Stochastic (Online) Gradient Descent**. Gradient is estimated using each single pattern in the training dataset.

The mini-batch variant is offered as a way to achieve the speed of convergence offered by stochastic gradient descent with the improved estimate of the error gradient offered by batch gradient descent.

Larger batch sizes slow down convergence.

On the other hand, as B [the batch size] increases, the number of updates per computation done decreases, which slows down convergence (in terms of error vs number of multiply-add operations performed) because less updates can be done in the same computing time.

Smaller batch sizes offer a regularizing effect due to the introduction of statistical noise in the gradient estimate.

… smaller values of B [the batch size] may benefit from more exploration in parameter space and a form of regularization both due to the “noise” injected in the gradient estimator, which may explain the better test results sometimes observed with smaller B.
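The three variants differ only in how many examples feed each gradient estimate, so they can be sketched as one routine with a `batch_size` argument. The example below is an illustration under simple assumptions: a hypothetical one-weight model `y = w * x`, noise-free data, and made-up function names. It also counts the number of weight updates per epoch, which is the quantity the quote about batch size B refers to.

```python
import random

def gradient(w, batch):
    # Gradient of mean squared error for the model y = w * x over one batch.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def run_epoch(w, data, batch_size, learning_rate, rng):
    data = data[:]                 # copy so the caller's list is not reordered
    rng.shuffle(data)
    updates = 0
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        w -= learning_rate * gradient(w, batch)   # one update per batch
        updates += 1
    return w, updates

# 20 noise-free examples of y = 2x (assumed toy data).
data = [(i / 10.0, 2.0 * i / 10.0) for i in range(1, 21)]
rng = random.Random(0)

for name, batch_size in [("batch", len(data)), ("mini-batch", 5), ("stochastic", 1)]:
    w, updates = run_epoch(0.0, data, batch_size, learning_rate=0.05, rng=rng)
    print(f"{name:11s} batch_size={batch_size:2d}  updates per epoch={updates:2d}  "
          f"w after one epoch={w:.3f}")
```

With the same single pass over the data, the smaller batch sizes make more updates and move the weight closer to the target value of 2.0, while the full batch makes only one, more accurate, step.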

This period also saw the introduction and wider adoption of automatic differentiation in the development of neural network models.

The gradient can be either computed manually or through automatic differentiation.

This was of particular interest to Bengio given his involvement in the development of the Theano Python mathematical library and pylearn2 deep learning library, both now defunct, succeeded perhaps by TensorFlow and Keras respectively.

Manually implementing differentiation for neural networks is easy to mess up and errors can be hard to debug and cause sub-optimal performance.

When implementing gradient descent algorithms with manual differentiation the result tends to be verbose, brittle code that lacks modularity – all bad things in terms of software engineering.

Automatic differentiation is painted as a more robust approach to developing neural networks as graphs of mathematical operations, each of which knows how to differentiate, which can be defined symbolically.

A better approach is to express the flow graph in terms of objects that modularize how to compute outputs from inputs as well as how to compute the partial derivatives necessary for gradient descent.
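To make the idea of objects that know how to compute both outputs and partial derivatives concrete, here is a minimal sketch of reverse-mode automatic differentiation. The `Node` class and `backward` function are hypothetical names for this example; real libraries are far more complete.

```python
class Node:
    """A value in the flow graph that remembers how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent node, local partial derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    """Propagate d(output)/d(node) to every node using the chain rule."""
    # Build a topological order so each node's gradient is complete
    # before it is passed on to its parents.
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


# loss = (w * x + b)^2, built only from add and multiply nodes.
w, x, b = Node(2.0), Node(3.0), Node(1.0)
y = w * x + b
loss = y * y
backward(loss)
```

After the call to `backward`, `w.grad` and `b.grad` hold the partial derivatives of the loss, which is exactly what a gradient descent update needs. No derivative was written by hand for the full expression; each operation only supplied its own local rule.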

The flexibility of the graph-based approach to defining models and the reduced likelihood of error in calculating error derivatives means that this approach has become a standard, at least in the underlying mathematical libraries, for modern open source neural network libraries.

The main focus of the paper is on the configuration of the hyperparameters that control the convergence and generalization of the model under stochastic gradient descent.

The section starts off with the importance of using a separate validation dataset from the train and test sets for tuning model hyperparameters.

For any hyper-parameter that has an impact on the effective capacity of a learner, it makes more sense to select its value based on out-of-sample data (outside the training set), e.g., a validation set performance, online error, or cross-validation error.

It also stresses the importance of not using the validation dataset to evaluate the performance of the model.

Once some out-of-sample data has been used for selecting hyper-parameter values, it cannot be used anymore to obtain an unbiased estimator of generalization performance, so one typically uses a test set (or double cross-validation, in the case of small datasets) to estimate generalization error of the pure learning algorithm (with hyper-parameter selection hidden inside).

Cross-validation is often not used with neural network models given that they can take days, weeks, or even months to train. Nevertheless, on smaller datasets where cross-validation can be used, the double cross-validation technique is suggested, where hyperparameter tuning is performed within each cross-validation fold.

Double cross-validation applies recursively the idea of cross-validation, using an outer loop cross-validation to evaluate generalization error and then applying an inner loop cross-validation inside each outer loop split’s training subset (i.e., splitting it again into training and validation folds) in order to select hyper-parameters for that split.
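The nested structure is easier to see in code. The sketch below uses a deliberately trivial learner (a shrunken mean with one hyperparameter) so that the two loops stay in focus; the learner, the fold counts, and all function names are assumptions for illustration.

```python
def k_folds(items, k):
    """Split items into k (train, held_out) pairs."""
    folds = [items[i::k] for i in range(k)]
    return [
        ([x for j, fold in enumerate(folds) if j != i for x in fold], folds[i])
        for i in range(k)
    ]


def fit(train, shrink):
    """Toy learner: predict a shrunken mean; `shrink` is the hyperparameter."""
    return shrink * sum(train) / len(train)


def error(prediction, held_out):
    """Mean squared error of a constant prediction."""
    return sum((y - prediction) ** 2 for y in held_out) / len(held_out)


def cv_error(data, shrink, k):
    """Ordinary k-fold cross-validation error for one hyperparameter value."""
    splits = k_folds(data, k)
    return sum(error(fit(tr, shrink), va) for tr, va in splits) / k


def double_cross_validation(data, candidates, outer_k=5, inner_k=4):
    outer_errors = []
    for train, test in k_folds(data, outer_k):
        # Inner loop: select the hyperparameter using only this split's training subset.
        best = min(candidates, key=lambda s: cv_error(train, s, inner_k))
        # Outer loop: score the tuned learner on data never used for tuning.
        outer_errors.append(error(fit(train, best), test))
    return sum(outer_errors) / len(outer_errors)


data = [float(v) for v in range(1, 21)]
estimate = double_cross_validation(data, candidates=[0.0, 0.5, 1.0])
```

The returned estimate describes the whole procedure, including hyperparameter selection, rather than any single hyperparameter value.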

A suite of learning hyperparameters is then introduced, sprinkled with recommendations.

The hyperparameters in the suite are:

- **Initial Learning Rate**. The proportion that weights are updated; 0.01 is a good start.
- **Learning Rate Schedule**. Decrease in learning rate over time; 1/T is a good start.
- **Mini-batch Size**. Number of samples used to estimate the gradient; 32 is a good start.
- **Training Iterations**. Number of updates to the weights; set large and use early stopping.
- **Momentum**. Use history from prior weight updates; set large (e.g. 0.9).
- **Layer-Specific Hyperparameters**. Possible, but rarely done.
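A 1/T learning rate schedule can be sketched in a couple of lines. The exact form below, which holds the initial rate constant for the first `tau` updates and then decays in proportion to 1/step, is one common way to write it; the value of `tau` is an assumption chosen for illustration.

```python
def learning_rate(step, initial_rate=0.01, tau=100):
    """Hold the initial rate for `tau` updates, then decay in proportion to 1/step."""
    return initial_rate * tau / max(step, tau)


# Rates at updates 1, 100, 200 and 1000.
rates = [learning_rate(t) for t in (1, 100, 200, 1000)]
```

With these settings the rate stays at 0.01 through update 100, halves by update 200, and has fallen by a factor of ten by update 1000.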

The learning rate is presented as the most important parameter to tune. Although a value of 0.01 is a recommended starting point, dialing it in for a specific dataset and model is required.

This is often the single most important hyperparameter and one should always make sure that it has been tuned […] A default value of 0.01 typically works for standard multi-layer neural networks but it would be foolish to rely exclusively on this default value.

He goes so far as to say that if only one hyperparameter can be tuned, it should be the learning rate.

If there is only time to optimize one hyper-parameter and one uses stochastic gradient descent, then this is the hyper-parameter that is worth tuning.

The batch size is presented as a control on the speed of learning rather than a means of tuning test set performance (generalization error).

In theory, this hyper-parameter should impact training time and not so much test performance, so it can be optimized separately of the other hyperparameters, by comparing training curves (training and validation error vs amount of training time), after the other hyper-parameters (except learning rate) have been selected.

Model hyperparameters are then introduced, again sprinkled with recommendations.

They are:

- **Number of Nodes**. Control over the capacity of the model; use larger models with regularization.
- **Weight Regularization**. Penalize models with large weights; try L2 generally or L1 for sparsity.
- **Activity Regularization**. Penalize model for large activations; try L1 for sparse representations.
- **Activation Function**. Used as the output of nodes in hidden layers; use sigmoidal functions (logistic and tanh) or rectifier (now the standard).
- **Weight Initialization**. The starting point for the optimization process; influenced by activation function and size of the prior layer.
- **Random Seeds**. Stochastic nature of optimization process; average models from multiple runs.
- **Preprocessing**. Prepare data prior to modeling; at least standardize and remove correlations.

Configuring the number of nodes in a layer is challenging and perhaps one of the most asked questions by beginners. He suggests that using the same number of nodes in each hidden layer might be a good starting point.

In a large comparative study, we found that using the same size for all layers worked generally better or the same as using a decreasing size (pyramid-like) or increasing size (upside down pyramid), but of course this may be data-dependent.

He also recommends using an overcomplete configuration for the first hidden layer.

For most tasks that we worked on, we find that an overcomplete (larger than the input vector) first hidden layer works better than an undercomplete one.

Given the focus on layer-wise training and autoencoders, the sparsity of the representation (the output of hidden layers) was a focus at the time. Hence the recommendation to use activity regularization, which may still be useful in larger encoder-decoder models.

Sparse representations may be advantageous because they encourage representations that disentangle the underlying factors of representation.

At the time, the linear rectifier activation function was just beginning to be used and had not widely been adopted. Today, using the rectifier (ReLU) is the standard given that models using it readily out-perform models using logistic or hyperbolic tangent nonlinearities.

The default configurations do well for most neural networks on most problems.

Nevertheless, hyperparameter tuning is required to get the most out of a given model on a given dataset.

Tuning hyperparameters can be challenging both because of the computational resources required and because it can be easy to overfit the validation dataset, resulting in misleading findings.

One has to think of hyperparameter selection as a difficult form of learning: there is both an optimization problem (looking for hyper-parameter configurations that yield low validation error) and a generalization problem: there is uncertainty about the expected generalization after optimizing validation performance, and it is possible to overfit the validation error and get optimistically biased estimators of performance when comparing many hyper-parameter configurations.

Tuning one hyperparameter for a model and plotting the results often results in a U-shaped curve showing the pattern of poor performance, good performance, and back up to poor performance (e.g. minimizing loss or error). The goal is to find the bottom of the “*U*.”

The problem is, many hyperparameters interact and the bottom of the “*U*” can be noisy.

Although to first approximation we expect a kind of U-shaped curve (when considering only a single hyper-parameter, the others being fixed), this curve can also have noisy variations, in part due to the use of finite data sets.

To aid in this search, he then provides three valuable tips to consider generally when tuning model hyperparameters:

- **Best value on the border**. Consider expanding the search if a good value is found on the edge of the interval searched.
- **Scale of values considered**. Consider searching on a log scale, at least at first (e.g. 0.1, 0.01, 0.001, etc.).
- **Computational considerations**. Consider giving up fidelity of the result in order to accelerate the search.

Three systematic hyperparameter search strategies are suggested:

- **Coordinate Descent**. Dial-in each hyperparameter one at a time.
- **Multi-Resolution Search**. Iteratively zoom in the search interval.
- **Grid Search**. Define an n-dimensional grid of values and test each in turn.

These strategies can be used separately or even combined.

The grid search is perhaps the most commonly understood and widely used method for tuning model hyperparameters. It is exhaustive, but parallelizable, a benefit that can be exploited using cheap cloud computing infrastructure.

The advantage of the grid search, compared to many other optimization strategies (such as coordinate descent), is that it is fully parallelizable.

Often, the process is repeated via iterative grid searches, combining the multi-resolution and grid search.

Typically, a single grid search is not enough and practitioners tend to proceed with a sequence of grid searches, each time adjusting the ranges of values considered based on the previous results obtained.
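A sequence of progressively narrower grid searches over a single hyperparameter might look like the sketch below. The `validation_error` function is a hypothetical stand-in for training a model and measuring its validation error, and the search interval is an assumption for illustration.

```python
import math


def validation_error(lr):
    """Stand-in for training a model and measuring validation error (hypothetical)."""
    return (math.log10(lr) + 2.3) ** 2


def log_grid(low, high, points):
    """Evenly spaced values between low and high on a log10 scale."""
    lo, hi = math.log10(low), math.log10(high)
    step = (hi - lo) / (points - 1)
    return [10 ** (lo + i * step) for i in range(points)]


def multi_resolution_search(low, high, rounds=3, points=5):
    best = None
    for _ in range(rounds):
        grid = log_grid(low, high, points)
        errors = [validation_error(v) for v in grid]
        i = errors.index(min(errors))
        best = grid[i]
        # Zoom in: the next interval spans the neighbours of the current best.
        low, high = grid[max(i - 1, 0)], grid[min(i + 1, points - 1)]
    return best


best_lr = multi_resolution_search(1e-5, 1.0)
```

Each round evaluates a coarse grid on a log scale and then narrows the interval around the best value found. Note that this sketch never widens the interval, so if the best value sits on the border of the original range, the range should be expanded by hand before searching again.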

He also suggests keeping a human in the loop to keep an eye out for bugs and use pattern recognition to identify trends and change the shape of the search space.

Humans can get very good at performing hyperparameter search, and having a human in the loop also has the advantage that it can help detect bugs or unwanted or unexpected behavior of a learning algorithm.

Nevertheless, it is important to automate as much as possible to ensure the process is repeatable for new problems and models in the future.

The grid search is exhaustive and slow.

A serious problem with the grid search approach to find good hyper-parameter configurations is that it scales exponentially badly with the number of hyperparameters considered.

He suggests using a random sampling strategy, which has been shown to be effective. The interval of each hyperparameter can be searched uniformly. This distribution can be biased by including priors, such as the choice of sensible defaults.

The idea of random sampling is to replace the regular grid by a random (typically uniform) sampling. Each tested hyper-parameter configuration is selected by independently sampling each hyper-parameter from a prior distribution (typically uniform in the log-domain, inside the interval of interest).
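Sampling uniformly in the log-domain is simple to implement. In the sketch below, the hyperparameter names and intervals are illustrative assumptions rather than values from the paper.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_configuration(rng):
    """Sample each hyperparameter independently from its own prior."""
    # The intervals below are illustrative assumptions, not values from the paper.
    return {
        "learning_rate": sample_log_uniform(rng, 1e-4, 1e-1),
        "l2_penalty": sample_log_uniform(rng, 1e-6, 1e-2),
        "batch_size": rng.choice([16, 32, 64, 128]),
    }


rng = random.Random(0)
configurations = [sample_configuration(rng) for _ in range(20)]
```

Each configuration would then be trained and scored on the validation set. Because the configurations are sampled independently, they can be evaluated in parallel just like the points of a grid.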

The paper ends with more general recommendations, including techniques for debugging the learning process, speeding up training with GPU hardware, and remaining open questions.

This section provides more resources on the topic if you are looking to go deeper.

- Neural Networks: Tricks of the Trade, First Edition, 1999.
- Neural Networks: Tricks of the Trade, Second Edition, 2012.
- Practical Recommendations for Gradient-Based Training of Deep Architectures, Preprint, 2012.
- Deep Learning, 2016.
- Automatic Differentiation, Wikipedia.

In this post, you discovered the salient recommendations, tips, and tricks from Yoshua Bengio’s 2012 paper titled “*Practical Recommendations for Gradient-Based Training of Deep Architectures*.”


The post Recommendations for Deep Learning Neural Network Practitioners appeared first on Machine Learning Mastery.

The post 8 Tricks for Configuring Backpropagation to Train Better Neural Networks appeared first on Machine Learning Mastery.

The optimization problem solved when training a neural network model is very challenging, and although stochastic gradient descent and backpropagation are widely used because they perform so well in practice, there are no guarantees that they will converge to a good model in a timely manner.

The challenge of training neural networks really comes down to the challenge of configuring the training algorithms.

In this post, you will discover tips and tricks for getting the most out of the backpropagation algorithm when training neural network models.

After reading this post, you will know:

- The challenge of training a neural network is really the balance between learning the training dataset and generalizing to new examples beyond the training dataset.
- Eight specific tricks that you can use to train better neural network models, faster.
- Second order optimization algorithms that can also be used to train neural networks under certain circumstances.


Let’s get started.

This tutorial is divided into five parts; they are:

- Efficient BackProp Overview
- Learning and Generalization
- 8 Practical Tricks for Backpropagation
- Second Order Optimization Algorithms
- Discussion and Conclusion

The 1998 book titled “Neural Networks: Tricks of the Trade” provides a collection of chapters by academics and neural network practitioners that describe best practices for configuring and using neural network models.

The book was updated at the cusp of the deep learning renaissance and a second edition was released in 2012 including 13 new chapters.

The first chapter in both editions is titled “*Efficient BackProp*” and was written by Yann LeCun, Leon Bottou (both at Facebook AI), Genevieve Orr, and Klaus-Robert Muller (the latter two also co-editors of the book).

The chapter is also available online for free as a pre-print.

- Efficient BackProp, Preprint, 1998.

The chapter was also summarized in a preface in both editions of the book titled “*Speed Learning*.”

It is an important chapter and document as it provides a near-exhaustive summary of how to best configure backpropagation under stochastic gradient descent as of 1998, and much of the advice is just as relevant today.

In this post, we will focus on this chapter or paper and attempt to distill the most relevant advice for modern deep learning practitioners.

For reference, the chapter is divided into 10 sections; they are:

- 1.1: Introduction
- 1.2: Learning and Generalization
- 1.3: Standard Backpropagation
- 1.4: A Few Practical Tricks
- 1.5: Convergence of Gradient Descent
- 1.6: Classical Second Order Optimization Methods
- 1.7: Tricks to Compute the Hessian Information in Multilayer Networks
- 1.8: Analysis of the Hessian in Multi-layer Networks
- 1.9: Applying Second Order Methods to Multilayer Networks
- 1.10: Discussion and Conclusion

We will focus on the tips and tricks for configuring backpropagation and stochastic gradient descent.

The chapter begins with a description of the general problem of the dual challenge of learning and generalization with neural network models.

The authors motivate the article by highlighting that the backpropagation algorithm is the most widely used algorithm to train neural network models because it works and because it is efficient.

Backpropagation is a very popular neural network learning algorithm because it is conceptually simple, computationally efficient, and because it often works. However, getting it to work well, and sometimes to work at all, can seem more of an art than a science.

The authors also remind us that training neural networks with backpropagation is really hard. Although the algorithm is both effective and efficient, it requires the careful configuration of multiple model properties and model hyperparameters, each of which requires deep knowledge of the algorithm and experience to set correctly.

And yet, there are no rules to follow to “*best*” configure a model and training process.

Designing and training a network using backprop requires making many seemingly arbitrary choices such as the number and types of nodes, layers, learning rates, training and test sets, and so forth. These choices can be critical, yet there is no foolproof recipe for deciding them because they are largely problem and data dependent.

Training a neural network model is challenging because it requires solving two hard problems at once:

- **Learning** the training dataset in order to best minimize the loss.
- **Generalizing** the model performance in order to make predictions on unseen examples.

There is a trade-off between these concerns, as a model that learns too well will generalize poorly, and a model that generalizes well may be underfit. The goal of training a neural network well is to find a happy balance between these two concerns.

This chapter is focused on strategies for improving the process of minimizing the cost function. However, these strategies must be used in conjunction with methods for maximizing the network’s ability to generalize, that is, to predict the correct targets for patterns the learning system has not previously seen.

Interestingly, the problem of training a neural network model is cast in terms of the bias-variance trade-off, often used to describe machine learning algorithms in general.

When fitting a neural network model, these terms can be defined as:

- **Bias**: A measure of how the network output averaged across all datasets differs from the desired function.
- **Variance**: A measure of how much the network output varies across datasets.
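These two definitions can be estimated empirically by fitting the same model to many datasets drawn from the same process. The sketch below uses a one-parameter linear model in place of a network to keep the code short; the true function, noise level, and sample sizes are all assumptions for illustration.

```python
import random


def true_function(x):
    """The desired function that generated the data (assumed for this example)."""
    return 2.0 * x


def fit_slope(dataset):
    """Least-squares slope for a line through the origin."""
    return sum(x * y for x, y in dataset) / sum(x * x for x, _ in dataset)


def make_dataset(rng, n=20, noise=0.5):
    """Draw one noisy training dataset from the true function."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, true_function(x) + rng.gauss(0.0, noise)) for x in xs]


rng = random.Random(42)
x0 = 1.0  # the input at which the model output is inspected

# Fit one model per dataset and record its output at x0.
predictions = [fit_slope(make_dataset(rng)) * x0 for _ in range(500)]
mean_prediction = sum(predictions) / len(predictions)

# Bias: how far the average output is from the desired function.
bias = mean_prediction - true_function(x0)
# Variance: how much the output varies from dataset to dataset.
variance = sum((p - mean_prediction) ** 2 for p in predictions) / len(predictions)
```

Here the model family contains the true function, so the bias is close to zero and the remaining error comes from variance caused by the noise in each dataset.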

This framing casts defining the capacity of the model as a choice of bias, controlling the range of functions that can be learned. It casts variance as a function of the training process and the balance struck between overfitting the training dataset and generalization error.

This framing can also help in understanding the dynamics of model performance during training. That is, the model moves from large bias and small variance at the beginning of training to lower bias and higher variance at the end of training.

Early in training, the bias is large because the network output is far from the desired function. The variance is very small because the data has had little influence yet. Late in training, the bias is small because the network has learned the underlying function.

These are the normal dynamics of the model, although when training, we must guard against training the model too much and overfitting the training dataset. This makes the model fragile, pushing the bias down, specializing the model to training examples and, in turn, causing much larger variance.

However, if trained too long, the network will also have learned the noise specific to that dataset. This is referred to as overtraining. In such a case, the variance will be large because the noise varies between datasets.

A focus on the backpropagation algorithm means a focus on “*learning*” while temporarily ignoring “*generalization*,” which can be addressed later with the introduction of regularization techniques.

A focus on learning means a focus on minimizing loss both quickly (fast learning) and effectively (learning well).

The idea of this chapter, therefore, is to present minimization strategies (given a cost function) and the tricks associated with increasing the speed and quality of the minimization.

The focus of the chapter is a sequence of practical tricks for backpropagation to better train neural network models.

There are eight tricks; they are:

- 1.4.1: Stochastic Versus Batch Learning
- 1.4.2: Shuffling the Examples
- 1.4.3: Normalizing the Inputs
- 1.4.4: The Sigmoid
- 1.4.5: Choosing Target Values
- 1.4.6: Initializing the Weights
- 1.4.7: Choosing Learning Rates
- 1.4.8: Radial Basis Function vs Sigmoid

The section starts off with a comment that the optimization problem that we are trying to solve with stochastic gradient descent and backpropagation is challenging.

Backpropagation can be very slow particularly for multilayered networks where the cost surface is typically non-quadratic, non-convex, and high dimensional with many local minima and/or flat regions.

The authors go on to highlight that in choosing stochastic gradient descent and the backpropagation algorithm to optimize and update weights, we have no guarantees of performance.

There is no formula to guarantee that (1) the network will converge to a good solution, (2) convergence is swift, or (3) convergence even occurs at all.

These comments provide the context for the tricks that also make no guarantees but instead increase the likelihood of finding a better model, faster.

Let’s take a closer look at each trick in turn.

Many of the tricks are focused on sigmoid (s-shaped) activation functions, which are no longer best practice for use in hidden layers, having been replaced by the rectified linear activation function. As such, we will spend less time on sigmoid-related tricks.

This tip highlights the choice between using either stochastic or batch gradient descent when training your model.

Stochastic gradient descent, also called online gradient descent, refers to a version of the algorithm where the error gradient is estimated from a single randomly selected example from the training dataset and the model parameters (weights) are then updated.

It has the effect of training the model fast, although it can result in large, noisy updates to model weights.

Stochastic learning is generally the preferred method for basic backpropagation for the following three reasons:

1. Stochastic learning is usually much faster than batch learning.

2. Stochastic learning also often results in better solutions.

3. Stochastic learning can be used for tracking changes.

Batch gradient descent involves estimating the error gradient using the average from all examples in the training dataset. It is faster to execute and is better understood from a theoretical perspective, but results in slower learning.

Despite the advantages of stochastic learning, there are still reasons why one might consider using batch learning:

1. Conditions of convergence are well understood.

2. Many acceleration techniques (e.g. conjugate gradient) only operate in batch learning.

3. Theoretical analysis of the weight dynamics and convergence rates are simpler.

Generally, the authors recommend using stochastic gradient descent where possible because it offers faster training of the model.

Despite the advantages of batch updates, stochastic learning is still often the preferred method particularly when dealing with very large data sets because it is simply much faster.

They suggest making use of a learning rate decay schedule in order to counter the noisy effect of the weight updates seen during stochastic gradient descent.

… noise, which is so critical for finding better local minima also prevents full convergence to the minimum. […] So in order to reduce the fluctuations we can either decrease (anneal) the learning rate or have an adaptive batch size.

They also suggest using mini-batches of samples to reduce the noise of the weight updates. This is where the error gradient is estimated across a small subset of samples from the training dataset instead of one sample in the case of stochastic gradient descent or all samples in the case of batch gradient descent.

This variation later became known as Mini-Batch Gradient Descent and is the default when training neural networks.

Another method to remove noise is to use “mini-batches”, that is, start with a small batch size and increase the size as training proceeds.

This tip highlights the importance that the order of examples shown to the model during training has on the training process.

Generally, the authors highlight that the learning algorithm performs better when the next example used to update the model is different from the previous example. Ideally, it is the most different or unfamiliar to the model.

Networks learn the fastest from the most unexpected sample. Therefore, it is advisable to choose a sample at each iteration that is the most unfamiliar to the system.

One simple way to implement this trick is to ensure that successive examples used to update the model parameters are from different classes.

… a very simple trick that crudely implements this idea is to simply choose successive examples that are from different classes since training examples belonging to the same class will most likely contain similar information.

This trick can also be implemented by repeatedly showing the model the examples on which it makes the largest prediction errors. This approach can be effective, but can also lead to disaster if the examples that are over-represented during training are outliers.

Choose Examples with Maximum Information Content

1. Shuffle the training set so that successive training examples never (rarely) belong to the same class.

2. Present input examples that produce a large error more frequently than examples that produce a small error
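The first of these suggestions can be sketched as a simple round-robin over the classes. The function name and the representation of examples as `(features, label)` pairs are assumptions for this example.

```python
import random
from collections import defaultdict
from itertools import zip_longest


def interleave_by_class(examples, seed=0):
    """Order (features, label) pairs so neighbouring examples rarely share a class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example in examples:
        by_class[example[1]].append(example)
    # Shuffle within each class so the order changes with the seed.
    for group in by_class.values():
        rng.shuffle(group)
    # Take one example from each class in turn until every group is exhausted.
    ordered = []
    for round_ in zip_longest(*by_class.values()):
        ordered.extend(e for e in round_ if e is not None)
    return ordered


# Three balanced classes with four examples each (made-up data).
examples = [([i], label) for label in ("a", "b", "c") for i in range(4)]
ordered = interleave_by_class(examples)
```

With balanced classes, no two successive examples share a label. With imbalanced classes, the tail of the ordering will contain runs from the largest class, which is why the advice says "never (rarely)".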

This tip highlights the importance of data preparation prior to training a neural network model.

The authors point out that neural networks often learn faster when each input variable averages to zero over the training dataset. This can be achieved by subtracting the mean value from each input variable, called centering.

Convergence is usually faster if the average of each input variable over the training set is close to zero.

They also comment that this centering of inputs also improves the convergence of the model when applied to the inputs to hidden layers from prior layers. This is fascinating as it lays the foundation for the Batch Normalization technique developed and made widely popular more than 15 years later.

Therefore, it is good to shift the inputs so that the average over the training set is close to zero. This heuristic should be applied at all layers which means that we want the average of the outputs of a node to be close to zero because these outputs are the inputs to the next layer

The authors also comment on the need to normalize the spread of the input variables. This can be achieved by dividing the values by their standard deviation. For variables that have a Gaussian distribution, centering and normalizing values in this way means that they will be reduced to a standard Gaussian with a mean of zero and a standard deviation of one.

Scaling speeds learning because it helps to balance out the rate at which the weights connected to the input nodes learn.

Finally, they suggest de-correlating the input variables. This means removing any linear dependence between the input variables and can be achieved using a Principal Component Analysis as a data transform.

Principal component analysis (also known as the Karhunen-Loeve expansion) can be used to remove linear correlations in inputs

This tip on data preparation can be summarized as follows:

Transforming the Inputs

1. The average of each input variable over the training set should be close to zero.

2. Scale input variables so that their covariances are about the same.

3. Input variables should be uncorrelated if possible.
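The first two steps, centering and scaling, take only a few lines with the Python standard library. The sketch below assumes each input variable is stored as its own list and that no variable is constant (a constant column would cause a division by zero); the de-correlation step is not shown.

```python
from statistics import mean, pstdev


def standardize(columns):
    """Center each input variable at zero and scale it to unit standard deviation."""
    result = []
    for values in columns:
        mu, sigma = mean(values), pstdev(values)
        result.append([(v - mu) / sigma for v in values])
    return result


# Two input variables on very different scales (made-up values).
age = [20.0, 30.0, 40.0, 50.0]
income = [20000.0, 50000.0, 30000.0, 60000.0]
age_z, income_z = standardize([age, income])
```

In practice, the mean and standard deviation would be computed on the training set only and then reused to transform validation and test data.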

These recommended three steps of data preparation of centering, normalizing, and de-correlating are summarized nicely in a figure, reproduced from the book below:

The centering of input variables may or may not be the best approach when using the more modern ReLU activation functions in the hidden layers of your network, so I’d recommend evaluating both standardization and normalization procedures when preparing data for your model.

This tip recommends the use of sigmoid activation functions in the hidden layers of your network.

Nonlinear activation functions are what give neural networks their nonlinear capabilities. One of the most common forms of activation function is the sigmoid …

Specifically, the authors refer to a sigmoid activation function as any S-shaped function, such as the logistic (referred to as sigmoid) or hyperbolic tangent function (referred to as tanh).

Symmetric sigmoids such as hyperbolic tangent often converge faster than the standard logistic function.

The authors recommend modifying the default functions (if needed) so that the midpoint of the function is at zero.

The use of logistic and tanh activation functions for the hidden layers is no longer a sensible default, as models that use ReLU converge much faster.

This tip highlights a more careful consideration of the choice of target variables.

In the case of binary classification problems, target variables may be in the set {0, 1} for the limits of the logistic activation function or in the set {-1, 1} for the hyperbolic tangent function when using the cross-entropy or hinge loss functions respectively, even in modern neural networks.

The authors suggest that using values at the extremes of the activation function may make learning the problem more challenging.

Common wisdom might seem to suggest that the target values be set at the value of the sigmoid’s asymptotes. However, this has several drawbacks.

They suggest that achieving values at the point of saturation of the activation function (edges) may require larger and larger weights, which could make the model unstable.

One approach to addressing this is to use target values away from the edge of the output function.

Choose target values at the point of the maximum second derivative on the sigmoid so as to avoid saturating the output units.

I recall that in the 1990s, it was common advice to use target values in the set {0.1, 0.9} with the logistic function instead of {0, 1}.

This tip highlights the importance of the choice of weight initialization scheme and how it is tightly related to the choice of activation function.

In the context of the sigmoid activation function, they suggest that the initial weights for the network should be chosen to activate the function in the linear region (e.g. the line part not the curve part of the S-shape).

The starting values of the weights can have a significant effect on the training process. Weights should be chosen randomly but in such a way that the sigmoid is primarily activated in its linear region.

This advice may also apply to weight initialization for the ReLU, where the linear part of the function is the positive region.

This highlights the important impact that initial weights have on learning, where large weights saturate the activation function, resulting in unstable learning, and small weights result in very small gradients and, in turn, slow learning. Ideally, we seek model weights that are over the linear (non-curvy) part of the activation function.

… weights that range over the sigmoid’s linear region have the advantage that (1) the gradients are large enough that learning can proceed and (2) the network will learn the linear part of the mapping before the more difficult nonlinear part.

The authors suggest a random weight initialization scheme that uses the number of nodes in the previous layer, the so-called fan-in. This is interesting as it is a precursor of what became known as the Xavier weight initialization scheme.
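A fan-in based scheme can be sketched as follows. This example assumes weights are drawn from a zero-mean Gaussian whose standard deviation is one over the square root of the fan-in; treat the exact distribution as an assumption rather than a definitive reading of the chapter.

```python
import random


def init_weights(fan_in, fan_out, seed=0):
    """Draw weights from a zero-mean Gaussian with standard deviation fan_in ** -0.5."""
    rng = random.Random(seed)
    sigma = fan_in ** -0.5
    return [[rng.gauss(0.0, sigma) for _ in range(fan_in)] for _ in range(fan_out)]


# A layer with 400 inputs and 50 nodes (sizes chosen for illustration).
weights = init_weights(fan_in=400, fan_out=50)

# Check the spread of the sampled weights against the target of 1 / sqrt(400) = 0.05.
flat = [w for row in weights for w in row]
sample_std = (sum(w * w for w in flat) / len(flat)) ** 0.5
```

The more inputs a node has, the smaller its initial weights, which keeps the weighted sum of standardized inputs at a similar scale regardless of layer width.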

This tip highlights the importance of choosing the learning rate.

The learning rate is the amount that the model weights are updated each iteration of the algorithm. A small learning rate can cause slower convergence but perhaps a better result, whereas a larger learning rate can result in faster convergence but perhaps to a less optimal result.
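To see the effect of the step size in isolation, here is a toy sketch that minimizes f(w) = w^2 with plain gradient descent. The function, the starting point, and the learning rates are all arbitrary choices for illustration. It only shows the speed of convergence and divergence, not the quality of the final solution on a real loss surface.

```python
# Toy sketch: gradient descent on f(w) = w**2 with different learning rates.
# All values here are arbitrary illustrative choices.

def gradient_descent(lr, steps=20, w=5.0):
    """Minimize f(w) = w**2 using the update w <- w - lr * f'(w)."""
    for _ in range(steps):
        grad = 2.0 * w  # derivative of w**2
        w = w - lr * grad
    return w

for lr in (0.01, 0.1, 0.45, 1.1):
    print('lr=%.2f -> final w = %.6f' % (lr, gradient_descent(lr)))
```

The smallest rate is still far from the minimum after 20 steps, the middle rates get close to zero, and the largest rate overshoots further on every step and diverges.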

The authors suggest decreasing the learning rate when the weight values begin changing back and forth, e.g. oscillating.

Most of those schemes decrease the learning rate when the weight vector “oscillates”, and increase it when the weight vector follows a relatively steady direction.

They comment that this is a hard strategy when using online gradient descent as, by default, the weights will oscillate a lot.
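The oscillation idea can be sketched with a very simple rule: compare the current gradient with the previous one, shrink the learning rate if the direction reversed, and grow it otherwise. This is my own simplified rule written in the spirit of the description, not one of the specific schemes the authors cite. The function name and the `up` and `down` factors are hypothetical.

```python
# Sketch: adapt a single learning rate based on whether successive gradients
# point in opposing directions. A simplified, hypothetical rule.

def adapt_learning_rate(lr, grad, prev_grad, up=1.1, down=0.5):
    """Shrink lr if the gradient direction reversed, otherwise grow it."""
    dot = sum(g * p for g, p in zip(grad, prev_grad))
    return lr * down if dot < 0 else lr * up

# minimize f(w) = w0**2 + w1**2, starting with a learning rate that oscillates
w = [5.0, -3.0]
lr = 0.9
prev_grad = [0.0, 0.0]
for step in range(30):
    grad = [2.0 * wi for wi in w]
    lr = adapt_learning_rate(lr, grad, prev_grad)
    w = [wi - lr * gi for wi, gi in zip(w, grad)]
    prev_grad = grad
print('final lr=%.4f, w=%s' % (lr, w))
```

On this toy problem the rate is cut as soon as the weights start jumping from one side of the minimum to the other, then creeps back up while progress is steady.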

The authors also recommend using one learning rate for each parameter in the model. The goal is to help each part of the model to converge at the same rate.

… it is clear that picking a different learning rate (eta) for each weight can improve the convergence. […] The main philosophy is to make sure that all the weights in the network converge roughly at the same speed.

They refer to this property as “*equalizing the learning speeds*” of each model parameter.

Equalize the Learning Speeds

– give each weight its own learning rate

– learning rates should be proportional to the square root of the number of inputs to the unit

– weights in lower layers should typically be larger than in the higher layers

In addition to using a learning rate per parameter, the authors also recommend using momentum and using adaptive learning rates.

It’s interesting that these recommendations later became enshrined in methods like AdaGrad and Adam that are now popular defaults.
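To make the link concrete, below is a minimal sketch of the AdaGrad update, which divides each weight's step by the square root of its accumulated squared gradients. The toy loss, the learning rate of 0.5, and the function name `adagrad_step` are assumptions for illustration only.

```python
# Sketch of the AdaGrad update: every weight gets its own effective learning
# rate. The toy problem and hyperparameter values are illustrative.
import math

def adagrad_step(w, grad, accum, lr=0.5, eps=1e-8):
    """Apply one AdaGrad update in place and return the weights and accumulators."""
    for i, g in enumerate(grad):
        accum[i] += g * g
        w[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return w, accum

# f(w) = 0.5 * (w0**2 + 100 * w1**2): the two weights have very different curvature
w, accum = [1.0, 1.0], [0.0, 0.0]
for _ in range(100):
    grad = [w[0], 100.0 * w[1]]
    w, accum = adagrad_step(w, grad, accum)
print(w)
```

Even though the gradient for the second weight is 100 times larger, both weights move at almost the same pace, which is the "equalize the learning speeds" idea in action.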

This final tip is perhaps less relevant today; it recommends trying radial basis functions (RBF) instead of sigmoid activation functions in some cases.

The authors suggest that training RBF units can be faster than training units using a sigmoid activation.

Unlike sigmoidal units which can cover the entire space, a single RBF unit covers only a small local region of the input space. This can be an advantage because learning can be faster.

After these tips, the authors go on to provide a theoretical grounding for why many of these tips are a good idea and are expected to result in better or faster convergence when training a neural network model.

Specifically, the tips supported by this analysis are:

- Subtract the means from the input variables
- Normalize the variances of the input variables.
- De-correlate the input variables.
- Use a separate learning rate for each weight.
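The first two of these tips can be sketched in a few lines of plain Python. The data values and the function name `standardize` are made up, and de-correlation (for example, with PCA) is left out to keep the example short.

```python
# Sketch: center each input variable and scale it to unit standard deviation.
# De-correlation is intentionally omitted. Data values are made up.
from statistics import mean, pstdev

def standardize(columns):
    """Center each input variable at 0 and rescale it to a standard deviation of 1."""
    result = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        result.append([(x - mu) / sigma for x in col])
    return result

# two input variables on very different scales
columns = [[1.0, 2.0, 3.0, 4.0], [100.0, 300.0, 500.0, 700.0]]
scaled = standardize(columns)
for col in scaled:
    print('mean=%.3f, std=%.3f' % (mean(col), pstdev(col)))
```

In practice, the mean and standard deviation would be estimated on the training dataset only and then reused for any new data.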

The remainder of the chapter focuses on the use of second order optimization algorithms for training neural network models.

This may not be everyone’s cup of tea and requires a background and good memory of matrix calculus. You may want to skip it.

You may recall that the first derivative is the slope of a function (how steep it is) and that backpropagation uses the first derivative to update the model weights in proportion to the output error. These methods are referred to as first order optimization algorithms, i.e. optimization algorithms that use the first derivative of the error in the output of the model.

You may also recall from calculus that the second order derivative is the rate of change in the first order derivative, or in this case, the gradient of the error gradient itself. It gives an idea of how curved the loss function is for the current set of weights. Algorithms that use the second derivative are referred to as second order optimization algorithms.

The authors go on to introduce five second order optimization algorithms, specifically:

- Newton
- Conjugate Gradient
- Gauss-Newton
- Levenberg-Marquardt
- Quasi-Newton (BFGS)

These algorithms require access to the Hessian matrix or an approximation of the Hessian matrix. You may also recall the Hessian matrix if you covered a theoretical introduction to the backpropagation algorithm. In a hand-wavy way, we use the Hessian to describe the second order derivatives for the model weights.

The authors proceed to outline a number of methods that can be used to approximate the Hessian matrix (for use in second order optimization algorithms), such as: finite difference, square Jacobian approximation, the diagonal of the Hessian, and more.
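As a rough illustration of the finite difference idea, the sketch below estimates the first derivatives and the diagonal of the Hessian for a toy loss and uses them for a single Newton-style step. The loss function, the step size `h`, and all of the function names are assumptions made for the example; this is not the authors' procedure.

```python
# Sketch: finite-difference estimates of the gradient and the Hessian diagonal
# on a toy quadratic loss, followed by one Newton-style step. Illustrative only.

def loss(w):
    """Toy quadratic loss with different curvature along each weight."""
    return 0.5 * (3.0 * w[0] ** 2 + 10.0 * w[1] ** 2)

def grad_fd(f, w, i, h=1e-3):
    """Central finite-difference estimate of the first derivative w.r.t. w[i]."""
    up, down = list(w), list(w)
    up[i] += h
    down[i] -= h
    return (f(up) - f(down)) / (2.0 * h)

def hess_diag_fd(f, w, i, h=1e-3):
    """Central finite-difference estimate of the second derivative w.r.t. w[i]."""
    up, down = list(w), list(w)
    up[i] += h
    down[i] -= h
    return (f(up) - 2.0 * f(w) + f(down)) / (h * h)

w = [2.0, -1.0]
curvature = [hess_diag_fd(loss, w, i) for i in range(2)]
# step for each weight = first derivative divided by second derivative
new_w = [w[i] - grad_fd(loss, w, i) / curvature[i] for i in range(2)]
print('curvature:', curvature)
print('weights after one step:', new_w)
```

Because the toy loss is quadratic with no interaction between the weights, dividing the slope by the curvature lands almost exactly on the minimum in one step. Real neural network losses are not this well behaved.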

They then go on to analyze the Hessian in multilayer neural networks and the effectiveness of second order optimization algorithms.

In summary, they highlight that perhaps second order methods are more appropriate for smaller neural network models trained using batch gradient descent.

Classical second-order methods are impractical in almost all useful cases.

The chapter ends with a very useful summary of tips for getting the most out of backpropagation when training neural network models.

This summary is reproduced below:

– shuffle the examples

– center the input variables by subtracting the mean

– normalize the input variable to a standard deviation of 1

– if possible, de-correlate the input variables.

– pick a network with the sigmoid function shown in figure 1.4

– set the target values within the range of the sigmoid, typically +1 and -1.

– initialize the weights to random values (as prescribed by 1.16).

This section provides more resources on the topic if you are looking to go deeper.

- Neural Networks: Tricks of the Trade, First Edition, 1998.
- Neural Networks: Tricks of the Trade, Second Edition, 2012.
- Efficient BackProp, Preprint, 1998.
- Hessian matrix, Wikipedia.

In this post, you discovered tips and tricks for getting the most out of the backpropagation algorithm when training neural network models.

Have you tried any of these tricks on your projects?

Let me know about your results in the comments below.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post 8 Tricks for Configuring Backpropagation to Train Better Neural Networks appeared first on Machine Learning Mastery.

The post Neural Networks: Tricks of the Trade Review appeared first on Machine Learning Mastery.

There are decades of tips and tricks spread across hundreds of research papers, source code, and in the heads of academics and practitioners.

The book “Neural Networks: Tricks of the Trade,” originally published in 1998 and updated in 2012 at the cusp of the deep learning renaissance, ties together the disparate tips and tricks into a single volume. It includes advice that is required reading for all deep learning neural network practitioners.

In this post, you will discover the book “*Neural Networks: Tricks of the Trade*” that provides advice by neural network academics and practitioners on how to get the most out of your models.

After reading this post, you will know:

- The motivation for why the book was written.
- A breakdown of the chapters and topics in the first and second editions.
- A list and summary of the must-read chapters for every neural network practitioner.

**Kick-start your project** with my new book Better Deep Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

Neural Networks: Tricks of the Trade is a collection of papers on techniques to get better performance from neural network models.

The first edition, published in 1998, comprises five parts and 17 chapters. The second edition was published right on the cusp of the new deep learning renaissance in 2012 and includes three more parts and 13 new chapters.

If you are a deep learning practitioner, then it is a must-read book.

I own and reference both editions.

The motivation for the book was to collate the empirical and theoretically grounded tips, tricks, and best practices used to get the best performance from neural network models in practice.

The editors’ concern is that many of the useful tips and tricks are tacit knowledge in the field, trapped in people’s heads, code bases, or at the end of conference papers, and that beginners to the field should be aware of them.

It is our belief that researchers and practitioners acquire, through experience and word-of-mouth, techniques and heuristics that help them successfully apply neural networks to difficult real-world problems. […] they are usually hidden in people’s heads or in the back pages of space-constrained conference papers.

The book is an effort to try to group the tricks together, after the success of a workshop at the 1996 NIPS conference with the same name.

This book is an outgrowth of a 1996 NIPS workshop called Tricks of the Trade whose goal was to begin the process of gathering and documenting these tricks. The interest that the workshop generated motivated us to expand our collection and compile it into this book.

— Page 1, Neural Networks: Tricks of the Trade, Second Edition, 2012.

The first edition of the book was edited by Genevieve Orr and Klaus-Robert Müller. It comprises five parts and 17 chapters and was published 20 years ago, in 1998.

Each part includes a useful preface that summarizes what to expect in the upcoming chapters, and each chapter is written by one or more academics in the field.

The breakdown of this first edition was as follows:

- Chapter 1: Efficient BackProp

- Chapter 2: Early Stopping – But When?
- Chapter 3: A Simple Trick for Estimating the Weight Decay Parameter
- Chapter 4: Controlling the Hyperparameter Search in MacKay’s Bayesian Neural Network Framework
- Chapter 5: Adaptive Regularization in Neural Network Modeling
- Chapter 6: Large Ensemble Averaging

- Chapter 7: Square Unit Augmented, Radially Extended, Multilayer Perceptrons
- Chapter 8: A Dozen Tricks with Multitask Learning
- Chapter 9: Solving the Ill-Conditioning in Neural Network Learning
- Chapter 10: Centering Neural Network Gradient Factors
- Chapter 11: Avoiding Roundoff Error in Backpropagating Derivatives

- Chapter 12: Transformation Invariance in Pattern Recognition – Tangent Distance and Tangent Propagation
- Chapter 13: Combining Neural Networks and Context-Driven Search for On-Line Printed Handwriting Recognition in the Newton
- Chapter 14: Neural Network Classification and Prior Class Probabilities
- Chapter 15: Applying Divide and Conquer to Large Scale Pattern Recognition Tasks

- Chapter 16: Forecasting the Economy with Neural Nets: A Survey of Challenges and Solutions
- Chapter 17: How to Train Neural Networks

It is an expensive book, and if you can pick up a cheap second-hand copy of this first edition, then I highly recommend it.

The second edition of the book was released in 2012, seemingly right at the beginning of the large push that became “deep learning.” As such, the book captures the new techniques at the time such as layer-wise pretraining and restricted Boltzmann machines.

It was too early to focus on the ReLU, ImageNet with CNNs, and use of large LSTMs.

Nevertheless, the second edition included three new parts and 13 new chapters.

The breakdown of the additions in the second edition is as follows:

- Chapter 18: Stochastic Gradient Descent Tricks
- Chapter 19: Practical Recommendations for Gradient-Based Training of Deep Architectures
- Chapter 20: Training Deep and Recurrent Networks with Hessian-Free Optimization
- Chapter 21: Implementing Neural Networks Efficiently

- Chapter 22: Learning Feature Representations with K-Means
- Chapter 23: Deep Big Multilayer Perceptrons for Digit Recognition
- Chapter 24: A Practical Guide to Training Restricted Boltzmann Machines
- Chapter 25: Deep Boltzmann Machines and the Centering Trick
- Chapter 26: Deep Learning via Semi-supervised Embedding

- Chapter 27: A Practical Guide to Applying Echo State Networks
- Chapter 28: Forecasting with Recurrent Neural Networks: 12 Tricks
- Chapter 29: Solving Partially Observable Reinforcement Learning Problems with Recurrent Neural Networks
- Chapter 30: 10 Steps and Some Tricks to Set up Neural Reinforcement Controllers

The whole book is a good read, although I don’t recommend reading all of it if you are looking for quick and useful tips that you can use immediately.

This is because many of the chapters focus on the writers’ pet projects, or on highly specialized methods. Instead, I recommend reading four specific chapters, two from the first edition and two from the second.

The second edition of the book is worth purchasing for these four chapters alone, and I highly recommend picking up a copy for yourself, your team, or your office.

Fortunately, there are pre-print PDFs of these chapters available for free online.

The recommended chapters are:

- **Chapter 1**: Efficient BackProp, by Yann LeCun, et al.
- **Chapter 2**: Early Stopping – But When?, by Lutz Prechelt.
- **Chapter 18**: Stochastic Gradient Descent Tricks, by Leon Bottou.
- **Chapter 19**: Practical Recommendations for Gradient-Based Training of Deep Architectures, by Yoshua Bengio.

Let’s take a closer look at each of these chapters in turn.

This chapter focuses on providing very specific tips to get the most out of the stochastic gradient descent optimization algorithm and the backpropagation weight update algorithm.

Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work.

— Page 9, Neural Networks: Tricks of the Trade, First Edition, 1998.

The chapter proceeds to provide a dense and theoretically supported list of tips for configuring the algorithm, preparing input data, and more.

The chapter is so dense that it is hard to summarize, although a good list of recommendations is provided in the “*Discussion and Conclusion*” section at the end, quoted from the book below:

– shuffle the examples

– center the input variables by subtracting the mean

– normalize the input variable to a standard deviation of 1

– if possible, decorrelate the input variables.

– pick a network with the sigmoid function shown in figure 1.4

– set the target values within the range of the sigmoid, typically +1 and -1.

– initialize the weights to random values as prescribed by 1.16.

The preferred method for training the network should be picked as follows:

– if the training set is large (more than a few hundred samples) and redundant, and if the task is classification, use stochastic gradient with careful tuning, or use the stochastic diagonal Levenberg Marquardt method.

– if the training set is not too large, or if the task is regression, use conjugate gradient.

— Pages 47-48, Neural Networks: Tricks of the Trade, First Edition, 1998.

The field of applied neural networks has come a long way in the twenty years since this was published (e.g. the comments on sigmoid activation functions are no longer relevant), yet the basics have not changed.

This chapter is required reading for all deep learning practitioners.

This chapter describes the simple yet powerful regularization method called early stopping that will halt the training of a neural network when the performance of the model begins to degrade on a hold-out validation dataset.

Validation can be used to detect when overfitting starts during supervised training of a neural network; training is then stopped before convergence to avoid the overfitting (“early stopping”)

— Page 55, Neural Networks: Tricks of the Trade, First Edition, 1998.

The challenge of early stopping is the choice and configuration of the trigger used to stop the training process, and the systematic configuration of early stopping is the focus of the chapter.

The general early stopping criteria are described as:

- **GL**: stop as soon as the generalization loss exceeds a specified threshold.
- **PQ**: stop as soon as the quotient of generalization loss and progress exceeds a threshold.
- **UP**: stop when the generalization error increases in strips.
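The three criteria can be sketched in code as follows. The formulas are my reading of the definitions in the chapter (generalization loss as the percent increase of validation error over the best seen so far, and progress measured over a strip of the last k epochs), so treat them as assumptions and check the chapter for the exact form. The function names, thresholds, and error values are all hypothetical.

```python
# Sketch of the GL, PQ and UP stopping criteria. Formulas are assumptions based
# on my reading of the chapter; names, thresholds and data are illustrative.

def generalization_loss(val_errors):
    """GL: percent increase of the latest validation error over the best so far."""
    return 100.0 * (val_errors[-1] / min(val_errors) - 1.0)

def training_progress(train_errors, k=5):
    """How much the mean training error over the last strip exceeds its minimum."""
    strip = train_errors[-k:]
    return 1000.0 * (sum(strip) / (len(strip) * min(strip)) - 1.0)

def stop_gl(val_errors, alpha=5.0):
    """Stop when the generalization loss exceeds the threshold alpha."""
    return generalization_loss(val_errors) > alpha

def stop_pq(train_errors, val_errors, alpha=0.5, k=5):
    """Stop when generalization loss divided by training progress exceeds alpha."""
    progress = training_progress(train_errors, k)
    # simplification: never stop while the progress measure is zero
    return progress > 0 and generalization_loss(val_errors) / progress > alpha

def stop_up(val_errors, s=2, k=5):
    """Stop when validation error rose across each of the last s strips of length k."""
    if len(val_errors) < s * k + 1:
        return False
    ends = [val_errors[-1 - i * k] for i in range(s + 1)]  # newest first
    return all(ends[i] > ends[i + 1] for i in range(s))

train = [1.0, 0.8, 0.6, 0.5, 0.45, 0.42, 0.40, 0.39, 0.38, 0.375, 0.37]
val = [1.1, 0.9, 0.75, 0.7, 0.68, 0.69, 0.71, 0.74, 0.78, 0.82, 0.87]
print('GL=%.2f, progress=%.2f' % (generalization_loss(val), training_progress(train)))
print('stop_gl:', stop_gl(val))
print('stop_pq:', stop_pq(train, val))
print('stop_up (s=1):', stop_up(val, s=1))
print('stop_up (s=2):', stop_up(val, s=2))
```

On this made-up run, validation error has been rising for several epochs, so the GL and PQ rules fire, while the stricter two-strip UP rule does not yet.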

Three recommendations are provided, i.e. “*the trick*”:

1. Use fast stopping criteria unless small improvements of network performance (e.g. 4%) are worth large increases of training time (e.g. factor 4).

2. To maximize the probability of finding a “good” solution (as opposed to maximizing the average quality of solutions), use a GL criterion.

3. To maximize the average quality of solutions, use a PQ criterion if the network overfits only very little or an UP criterion otherwise.

— Page 60, Neural Networks: Tricks of the Trade, First Edition, 1998.

The rules are analyzed empirically over a large number of training runs and test problems. The crux of the finding is that being more patient with the early stopping criteria results in better hold-out performance at the cost of additional computational complexity.

I conclude slower stopping criteria allow for small improvements in generalization (here: about 4% on average), but cost much more training time (here: about factor 4 longer on average).

— Page 55, Neural Networks: Tricks of the Trade, First Edition, 1998.

This chapter focuses on a detailed review of the stochastic gradient descent optimization algorithm and tips to help get the most out of it.

This chapter provides background material, explains why SGD is a good learning algorithm when the training set is large, and provides useful recommendations.

— Page 421, Neural Networks: Tricks of the Trade, Second Edition, 2012.

There is a lot of overlap with *Chapter 1: Efficient BackProp*, and although the chapter calls out tips along the way with boxes, a useful list of tips is not summarized at the end of the chapter.

Nevertheless, it is a compulsory read for all neural network practitioners.

Below is my own summary of the tips called out in boxes throughout the chapter, mostly quoting directly from the second edition:

- Use stochastic gradient descent (batch=1) when training time is the bottleneck.
- Randomly shuffle the training examples.
- Use preconditioning techniques.
- Monitor both the training cost and the validation error.
- Check the gradients using finite differences.
- Experiment with the learning rates [with] a small sample of the training set.
- Leverage the sparsity of the training examples.
- Use a decaying learning rate.
- Try averaged stochastic gradient (i.e. a specific variant of the algorithm).

Some of these tips are pithy without context; I recommend reading the chapter.
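As one example of adding that context, the tip about checking gradients can be sketched as below: compute the gradient analytically, estimate it again with finite differences, and compare. The single logistic unit, the squared error loss, and all of the names and values are my own choices for illustration and are not taken from the chapter.

```python
# Sketch: check an analytic gradient against a finite-difference estimate.
# The model (one logistic unit with squared error) is an illustrative choice.
import math

def loss(w, x, y):
    """Squared error of a single logistic unit on one example."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return 0.5 * (p - y) ** 2

def analytic_grad(w, x, y):
    """Gradient of the loss with respect to each weight, via the chain rule."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * p * (1.0 - p) * xi for xi in x]

def numeric_grad(w, x, y, h=1e-5):
    """Central finite-difference estimate of the same gradient."""
    grads = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += h
        down[i] -= h
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2.0 * h))
    return grads

w, x, y = [0.3, -0.2, 0.1], [1.0, 2.0, -1.5], 1.0
max_diff = max(abs(a - n) for a, n in zip(analytic_grad(w, x, y), numeric_grad(w, x, y)))
print('largest difference: %.2e' % max_diff)
```

If the two gradients disagree by more than a tiny amount, there is likely a bug in the hand-written gradient code.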

This chapter focuses on the effective training of neural networks and early deep learning models.

It ties together the classical advice from Chapters 1 and 29 but adds comments on (at the time) recent deep learning developments like greedy layer-wise pretraining, modern hardware like GPUs, modern efficient code libraries like BLAS, and advice from real projects on tuning the training of models, such as the order in which to tune hyperparameters.

This chapter is meant as a practical guide with recommendations for some of the most commonly used hyper-parameters, in particular in the context of learning algorithms based on backpropagated gradient and gradient-based optimization.

— Page 437, Neural Networks: Tricks of the Trade, Second Edition, 2012.

It’s also long, divided into six main sections:

- **Deep Learning Innovations**. Including greedy layer-wise pretraining, denoising autoencoders, and online learning.
- **Gradients**. Including mini-batch gradient descent and automatic differentiation.
- **Hyperparameters**. Including learning rate, mini-batch size, epochs, momentum, nodes, weight regularization, activity regularization, hyperparameter search, and recommendations.
- **Debugging and Analysis**. Including monitoring loss for overfitting, visualization, and statistics.
- **Other Recommendations**. Including GPU hardware and use of efficient linear algebra libraries such as BLAS.
- **Open Questions**. Including the difficulty of training deep models and adaptive learning rates.

There’s far too much for me to summarize; the chapter is dense with useful advice for configuring and tuning neural network models.

Without a doubt, this is required reading and provided the seeds for the recommendations later described in the 2016 book Deep Learning, of which Yoshua Bengio was one of three authors.

The chapter finishes on a strong, optimistic note.

The practice summarized here, coupled with the increase in available computing power, now allows researchers to train neural networks on a scale that is far beyond what was possible at the time of the first edition of this book, helping to move us closer to artificial intelligence.

— Page 473, Neural Networks: Tricks of the Trade, Second Edition, 2012.

- Neural Networks: Tricks of the Trade, First Edition, 1998.
- Neural Networks: Tricks of the Trade, Second Edition, 2012.

- Neural Networks: Tricks of the Trade, Second Edition, 2012. Springer Homepage.
- Neural Networks: Tricks of the Trade, Second Edition, 2012. Google Books

- Efficient BackProp, 1998.
- Early Stopping – But When?, 1998.
- Stochastic Gradient Descent Tricks, 2012.
- Practical Recommendations for Gradient-Based Training of Deep Architectures, 2012.

In this post, you discovered the book “*Neural Networks: Tricks of the Trade*” that provides advice from neural network academics and practitioners on how to get the most out of your models.

Have you read some or all of this book? What do you think of it?

Let me know in the comments below.

The post How to Get Better Deep Learning Results (7-Day Mini-Course) appeared first on Machine Learning Mastery.

Configuring neural network models is often referred to as a “*dark art*.”

This is because there are no hard and fast rules for configuring a network for a given problem. We cannot analytically calculate the optimal model type or model configuration for a given dataset.

Fortunately, there are techniques that are known to address specific issues when configuring and training a neural network that are available in modern deep learning libraries such as Keras.

In this crash course, you will discover how you can confidently get better performance from your deep learning models in seven days.

This is a big and important post. You might want to bookmark it.

**Kick-start your project** with my new book Better Deep Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Update Jan/2020**: Updated API for Keras 2.3 and TensorFlow 2.0.

Before we get started, let’s make sure you are in the right place.

The list below provides some general guidelines as to who this course was designed for.

You need to know:

- Your way around basic Python and NumPy.
- The basics of Keras for deep learning.

You do NOT need to know:

- How to be a math wiz!
- How to be a deep learning expert!

This crash course will take you from a developer that knows a little deep learning to a developer who can get better performance on your deep learning project.

Note: This crash course assumes you have a working Python 2 or 3 SciPy environment with at least NumPy and Keras 2 installed. If you need help with your environment, you can follow the step-by-step tutorial here:

This crash course is broken down into seven lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below are seven lessons that will allow you to confidently improve the performance of your deep learning model:

- **Lesson 01**: Better Deep Learning Framework
- **Lesson 02**: Batch Size
- **Lesson 03**: Learning Rate Schedule
- **Lesson 04**: Batch Normalization
- **Lesson 05**: Weight Regularization
- **Lesson 06**: Adding Noise
- **Lesson 07**: Early Stopping

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help (hint, I have all of the answers directly on this blog; use the search box).

I do provide more help in the form of links to related posts because I want you to build up some confidence and inertia.

Post your results in the comments; I’ll cheer you on!

Hang in there; don’t give up.

**Note**: This is just a crash course. For a lot more detail and fleshed out tutorials, see my book on the topic titled “Better Deep Learning.”

In this lesson, you will discover a framework that you can use to systematically improve the performance of your deep learning model.

Modern deep learning libraries such as Keras allow you to define and start fitting a wide range of neural network models in minutes with just a few lines of code.

Nevertheless, it is still challenging to configure a neural network to get good performance on a new predictive modeling problem.

There are three types of problems that are straightforward to diagnose with regard to the poor performance of a deep learning neural network model; they are:

- **Problems with Learning**. Problems with learning manifest in a model that cannot effectively learn a training dataset or shows slow progress or bad performance when learning the training dataset.
- **Problems with Generalization**. Problems with generalization manifest in a model that overfits the training dataset and performs poorly on a holdout dataset.
- **Problems with Predictions**. Problems with predictions manifest as the stochastic training algorithm having a strong influence on the final model, causing a high variance in behavior and performance.

The sequential relationship between the three areas in the proposed breakdown allows the issue of deep learning model performance to be first isolated, then targeted with a specific technique or methodology.

We can summarize techniques that assist with each of these problems as follows:

- **Better Learning**. Techniques that improve or accelerate the adaptation of neural network model weights in response to a training dataset.
- **Better Generalization**. Techniques that improve the performance of a neural network model on a holdout dataset.
- **Better Predictions**. Techniques that reduce the variance in the performance of a final model.

You can use this framework to first diagnose the type of problem that you have and then identify a technique to evaluate to attempt to address your problem.

For this lesson, you must list two techniques or areas of focus that belong to each of the three areas of the framework.

Having trouble? Note that we will be looking at some examples from two of the three areas as part of this mini-course.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to control the speed of learning with the batch size.

In this lesson, you will discover the importance of the batch size when training neural networks.

Neural networks are trained using gradient descent where the estimate of the error used to update the weights is calculated based on a subset of the training dataset.

The number of examples from the training dataset used in the estimate of the error gradient is called the batch size and is an important hyperparameter that influences the dynamics of the learning algorithm.

The choice of batch size controls how quickly the algorithm learns, for example:

- **Batch Gradient Descent**. Batch size is set to the number of examples in the training dataset, more accurate estimate of error but longer time between weight updates.
- **Stochastic Gradient Descent**. Batch size is set to 1, noisy estimate of error but frequent updates to weights.
- **Minibatch Gradient Descent**. Batch size is set to a value more than 1 and less than the number of training examples, trade-off between batch and stochastic gradient descent.
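A quick way to see the trade-off is to count how many weight updates happen in one pass over the training dataset for each setting. The sketch below assumes 500 training examples and a minibatch size of 32; both numbers and the function name are just illustrative choices.

```python
# Sketch: number of weight updates per epoch for each type of gradient descent.
# The dataset size and minibatch size are illustrative assumptions.
import math

def updates_per_epoch(n_examples, batch_size):
    """Number of weight updates in one pass over the training dataset."""
    return math.ceil(n_examples / batch_size)

n_train = 500
for name, batch_size in [('batch', n_train), ('minibatch', 32), ('stochastic', 1)]:
    n_updates = updates_per_epoch(n_train, batch_size)
    print('%s: batch_size=%d, updates per epoch=%d' % (name, batch_size, n_updates))
```

Batch gradient descent makes a single update per epoch, whereas stochastic gradient descent makes one update for every training example.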

Keras allows you to configure the batch size via the *batch_size* argument to the *fit()* function, for example:

```python
# fit model
history = model.fit(trainX, trainy, epochs=1000, batch_size=len(trainX))
```

The example below demonstrates a Multilayer Perceptron with batch gradient descent on a binary classification problem.

```python
# example of batch gradient descent
from sklearn.datasets import make_circles
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD
from matplotlib import pyplot
# generate dataset
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
# split into train and test
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile model
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=1000, batch_size=len(trainX), verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot loss learning curves
pyplot.subplot(211)
pyplot.title('Cross-Entropy Loss', pad=-40)
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
# plot accuracy learning curves
pyplot.subplot(212)
pyplot.title('Accuracy', pad=-40)
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()
```

For this lesson, you must run the code example with each type of gradient descent (batch, minibatch, and stochastic) and describe the effect that it has on the learning curves during training.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to fine-tune a model during training with a learning rate schedule.

In this lesson, you will discover how to configure an adaptive learning rate schedule to fine tune the model during the training run.

The amount of change to the model during each step of this search process, or the step size, is called the “*learning rate*” and provides perhaps the most important hyperparameter to tune for your neural network in order to achieve good performance on your problem.

Configuring a fixed learning rate is very challenging and requires careful experimentation. An alternative to using a fixed learning rate is to instead vary the learning rate over the training process.

Keras provides the *ReduceLROnPlateau* learning rate schedule that will adjust the learning rate when a plateau in model performance is detected, e.g. no change for a given number of training epochs. For example:

```python
# define learning rate schedule
rlrp = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, min_delta=1E-7, verbose=1)
```

This callback is designed to reduce the learning rate after the model stops improving with the hope of fine-tuning model weights during training.
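To build some intuition for what happens on a plateau, here is a simplified pure-Python simulation of the idea. It is a sketch of the logic only and not the Keras implementation, which has additional options; the function name `reduce_on_plateau` and the loss values are made up.

```python
# Simplified sketch of a reduce-on-plateau rule. This is NOT the Keras
# implementation, only an illustration of the idea. Loss values are made up.

def reduce_on_plateau(losses, lr=0.01, factor=0.1, patience=5, min_delta=1e-7):
    """Return the learning rate in effect after each epoch."""
    best = float('inf')
    wait = 0
    history = []
    for loss in losses:
        if loss < best - min_delta:
            # the monitored loss improved, so reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # no improvement for `patience` epochs: shrink the learning rate
                lr *= factor
                wait = 0
        history.append(lr)
    return history

# validation loss improves for four epochs and then stalls
losses = [0.9, 0.7, 0.6, 0.55] + [0.55] * 6
history = reduce_on_plateau(losses)
for epoch, lr in enumerate(history, start=1):
    print('epoch %d: lr=%.4f' % (epoch, lr))
```

The learning rate stays at its initial value until the loss has failed to improve for five epochs in a row, at which point it drops by an order of magnitude.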

The example below demonstrates a Multilayer Perceptron with a learning rate schedule on a binary classification problem, where the learning rate will be reduced by an order of magnitude if no change is detected in validation loss over 5 training epochs.

```python
# example of a learning rate schedule
from sklearn.datasets import make_circles
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD
from keras.callbacks import ReduceLROnPlateau
from matplotlib import pyplot
# generate dataset
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
# split into train and test
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile model
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
# define learning rate schedule
rlrp = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, min_delta=1E-7, verbose=1)
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=300, verbose=0, callbacks=[rlrp])
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot loss learning curves
pyplot.subplot(211)
pyplot.title('Cross-Entropy Loss', pad=-40)
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
# plot accuracy learning curves
pyplot.subplot(212)
pyplot.title('Accuracy', pad=-40)
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()
```

For this lesson, you must run the code example with and without the learning rate schedule and describe the effect that the learning rate schedule has on the learning curves during training.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how you can accelerate the training process with batch normalization.

In this lesson, you will discover how to accelerate the training process of your deep learning neural network using batch normalization.

Batch normalization, or batchnorm for short, was proposed as a technique to help coordinate the update of multiple layers in the model.

The authors of the paper introducing batch normalization refer to change in the distribution of inputs during training as “*internal covariate shift*.” Batch normalization was designed to counter the internal covariate shift by scaling the output of the previous layer, specifically by standardizing the activations of each input variable per mini-batch, such as the activations of a node from the previous layer.
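The per-mini-batch standardization can be sketched in a few lines of NumPy. This is an illustration of the training-time calculation only, not the Keras layer itself. The function name `batch_standardize` is hypothetical, and the scale (`gamma`), shift (`beta`), and small `eps` constant are assumptions drawn from the original batch normalization paper rather than from this lesson. The moving averages used at inference time are left out.

```python
import numpy as np

def batch_standardize(activations, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each column (one node's activations) across the mini-batch."""
    mean = activations.mean(axis=0)   # per-node mean over the batch
    var = activations.var(axis=0)     # per-node variance over the batch
    normalized = (activations - mean) / np.sqrt(var + eps)
    # gamma and beta are learned in a real layer; fixed here for illustration
    return gamma * normalized + beta

rng = np.random.default_rng(1)
# a mini-batch of 32 examples with activations from 4 nodes, far from zero mean
batch = rng.normal(loc=5.0, scale=3.0, size=(32, 4))
out = batch_standardize(batch)
print(out.mean(axis=0))  # each close to 0
print(out.std(axis=0))   # each close to 1
```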

Keras supports Batch Normalization via a separate *BatchNormalization* layer that can be added between the hidden layers of your model. For example:

model.add(BatchNormalization())

The example below demonstrates a Multilayer Perceptron model with batch normalization on a binary classification problem.

# example of batch normalization
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.layers import BatchNormalization
from matplotlib import pyplot
# generate dataset
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
# split into train and test
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(1, activation='sigmoid'))
# compile model (recent Keras versions name this argument learning_rate rather than lr)
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=300, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot loss learning curves
pyplot.subplot(211)
pyplot.title('Cross-Entropy Loss', pad=-40)
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
# plot accuracy learning curves
pyplot.subplot(212)
pyplot.title('Accuracy', pad=-40)
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()

For this lesson, you must run the code example with and without batch normalization and describe the effect that batch normalization has on the learning curves during training.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to reduce overfitting using weight regularization.

In this lesson, you will discover how to reduce overfitting of your deep learning neural network using weight regularization.

A model with large weights is more complex than a model with smaller weights. It is a sign of a network that may be overly specialized to training data.

The learning algorithm can be updated to encourage the network toward using small weights.

One way to do this is to change the calculation of loss used in the optimization of the network to also consider the size of the weights. This is called weight regularization or weight decay.
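As a rough illustration of how the loss calculation changes, the sketch below adds an L2 penalty (the sum of squared weights scaled by a coefficient) to a data loss. The function name `l2_penalized_loss`, the coefficient `lam`, and the example numbers are assumptions chosen for this sketch.

```python
import numpy as np

def l2_penalized_loss(data_loss, weights, lam=0.01):
    """Add an L2 weight penalty to a loss value computed on the data."""
    penalty = lam * np.sum(np.square(weights))
    return data_loss + penalty

small = np.array([0.1, -0.2, 0.05])
large = np.array([3.0, -4.0, 2.5])
# same data loss, but the model with larger weights is penalized more
loss_small = l2_penalized_loss(0.5, small)
loss_large = l2_penalized_loss(0.5, large)
print(loss_small, loss_large)
```

Because the penalty grows with the size of the weights, minimizing the combined loss nudges the optimizer toward smaller weights.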

Keras supports weight regularization via the *kernel_regularizer* argument on a layer, which can be configured to use the L1 or L2 vector norm, for example:

model.add(Dense(500, input_dim=2, activation='relu', kernel_regularizer=l2(0.01)))

The example below demonstrates a Multilayer Perceptron model with weight decay on a binary classification problem.

# example of weight decay
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l2
from matplotlib import pyplot
# generate dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu', kernel_regularizer=l2(0.01)))
model.add(Dense(1, activation='sigmoid'))
# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot loss learning curves
pyplot.subplot(211)
pyplot.title('Cross-Entropy Loss', pad=-40)
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
# plot accuracy learning curves
pyplot.subplot(212)
pyplot.title('Accuracy', pad=-40)
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()

For this lesson, you must run the code example with and without weight regularization and describe the effect that it has on the learning curves during training.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to reduce overfitting by adding noise to your model.

In this lesson, you will discover that adding noise to a neural network during training can improve the robustness of the network, resulting in better generalization and faster learning.

Training a neural network with a small dataset can cause the network to memorize all training examples, in turn leading to poor performance on a holdout dataset.

One approach to making the input space smoother and easier to learn is to add noise to inputs during training.

The addition of noise during the training of a neural network model has a regularization effect and, in turn, improves the robustness of the model.
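A minimal NumPy sketch of the idea: zero-mean Gaussian noise is added to values during training and skipped otherwise. The function name `add_gaussian_noise` and its `training` flag are hypothetical, chosen only for this illustration.

```python
import numpy as np

def add_gaussian_noise(x, stddev=0.1, training=True, rng=None):
    """Add zero-mean Gaussian noise during training; pass values through otherwise."""
    if not training:
        return x
    rng = np.random.default_rng() if rng is None else rng
    return x + rng.normal(0.0, stddev, size=x.shape)

inputs = np.array([[0.5, -1.0], [2.0, 0.0]])
noisy = add_gaussian_noise(inputs, stddev=0.1, rng=np.random.default_rng(1))
clean = add_gaussian_noise(inputs, training=False)
print(noisy)
print(clean)
```

Each training pass sees a slightly different version of the same values, which makes it harder for the network to memorize specific examples.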

Noise can be added to your model in Keras via the *GaussianNoise* layer. For example:

model.add(GaussianNoise(0.1))

Noise can be added to a model at the input layer or between hidden layers.

The example below demonstrates a Multilayer Perceptron model with added noise between the hidden layers on a binary classification problem.

# example of adding noise
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import GaussianNoise
from matplotlib import pyplot
# generate dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(GaussianNoise(0.1))
model.add(Dense(1, activation='sigmoid'))
# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot loss learning curves
pyplot.subplot(211)
pyplot.title('Cross-Entropy Loss', pad=-40)
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
# plot accuracy learning curves
pyplot.subplot(212)
pyplot.title('Accuracy', pad=-40)
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()

For this lesson, you must run the code example with and without the addition of noise and describe the effect that it has on the learning curves during training.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to reduce overfitting using early stopping.

In this lesson, you will discover that stopping the training of a neural network early before it has overfit the training dataset can reduce overfitting and improve the generalization of deep neural networks.

A major challenge in training neural networks is how long to train them.

Too little training will mean that the model will underfit the training and test sets. Too much training will mean that the model will overfit the training dataset and have poor performance on the test set.

A compromise is to train on the training dataset but to stop training at the point when performance on a validation dataset starts to degrade. This simple, effective, and widely used approach to training neural networks is called early stopping.
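The stopping rule itself can be sketched in plain Python. This is a simplified illustration rather than the Keras callback; the function name `early_stopping_epoch` and the toy validation losses are assumptions made for this example.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # validation loss improved: remember it and reset the counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # no improvement for `patience` epochs: stop here
                return epoch, best_epoch
    # ran out of epochs without triggering the rule
    return len(val_losses) - 1, best_epoch

# toy validation losses: improve, then start to degrade
val = [0.9, 0.7, 0.6, 0.55, 0.58, 0.6, 0.65, 0.7]
stopped_at, best_at = early_stopping_epoch(val, patience=3)
print(stopped_at, best_at)  # 6 3
```

The `patience` value trades off how long training continues through noisy validation scores against how quickly it halts once the model starts to overfit.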

Keras supports early stopping via the *EarlyStopping* callback that allows you to specify the metric to monitor during training.

# patient early stopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)

The example below demonstrates a Multilayer Perceptron with early stopping on a binary classification problem that will stop when the validation loss has not improved for 200 training epochs.

# example of early stopping
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
from matplotlib import pyplot
# generate dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# patient early stopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0, callbacks=[es])
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot loss learning curves
pyplot.subplot(211)
pyplot.title('Cross-Entropy Loss', pad=-40)
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
# plot accuracy learning curves
pyplot.subplot(212)
pyplot.title('Accuracy', pad=-40)
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()

For this lesson, you must run the code example with and without early stopping and describe the effect it has on the learning curves during training.

Post your answer in the comments below. I would love to see what you discover.

This was your final lesson.

You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

- A framework that you can use to systematically diagnose and improve the performance of your deep learning model.
- Batch size can be used to control the precision of the estimated error and the speed of learning during training.
- A learning rate schedule can be used to fine-tune the model weights during training.
- Batch normalization can be used to dramatically accelerate the training process of neural network models.
- Weight regularization will penalize models based on the size of the weights and reduce overfitting.
- Adding noise will make the model more robust to differences in input and reduce overfitting.
- Early stopping will halt the training process at the right time and reduce overfitting.

This is just the beginning of your journey with deep learning performance improvement. Keep practicing and developing your skills.

Take the next step and check out my book on getting better performance with deep learning.

How did you do with the mini-course?

Did you enjoy this crash course?

Do you have any questions? Were there any sticking points?

Let me know. Leave a comment below.

The post How to Get Better Deep Learning Results (7-Day Mini-Course) appeared first on Machine Learning Mastery.

The post A Gentle Introduction to the Challenge of Training Deep Learning Neural Network Models appeared first on Machine Learning Mastery.

This is achieved by updating the weights of the network in response to the errors the model makes on the training dataset. Updates are made to continually reduce this error until either a good enough model is found or the learning process gets stuck and stops.

The process of training neural networks is the most challenging part of using the technique in general and is by far the most time consuming, both in terms of effort required to configure the process and computational complexity required to execute the process.

In this post, you will discover the challenge of finding model parameters for deep learning neural networks.

After reading this post, you will know:

- Neural networks learn a mapping function from inputs to outputs that can be summarized as solving the problem of function approximation.
- Unlike other machine learning algorithms, the parameters of a neural network must be found by solving a non-convex optimization problem with many good solutions and many misleadingly good solutions.
- The stochastic gradient descent algorithm is used to solve the optimization problem where model parameters are updated each iteration using the backpropagation algorithm.

**Kick-start your project** with my new book Better Deep Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into four parts; they are:

- Neural Nets Learn a Mapping Function
- Learning Network Weights Is Hard
- Navigating the Error Surface
- Components of the Learning Algorithm

Deep learning neural networks learn a mapping function.

Developing a model requires historical data from the domain that is used as training data. This data is comprised of observations or examples from the domain with input elements that describe the conditions and an output element that captures what the observation means.

For example, a problem where the output is a quantity would be described generally as a regression predictive modeling problem, whereas a problem where the output is a label would be described generally as a classification predictive modeling problem.

A neural network model uses the examples to learn how to map specific sets of input variables to the output variable. It must do this in such a way that this mapping works well for the training dataset, but also works well on new examples not seen by the model during training. This ability to work well on specific examples and new examples is called the ability of the model to generalize.

A multilayer perceptron is just a mathematical function mapping some set of input values to output values.

— Page 5, Deep Learning, 2016.

We can describe the relationship between the input variables and the output variables as a complex mathematical function. For a given model problem, we must believe that a true mapping function exists to best map input variables to output variables and that a neural network model can do a reasonable job at approximating the true unknown underlying mapping function.

A feedforward network defines a mapping and learns the value of the parameters that result in the best function approximation.

— Page 168, Deep Learning, 2016.

As such, we can describe the broader problem that neural networks solve as “*function approximation*.” They learn to approximate an unknown underlying mapping function given a training dataset. They do this by learning weights and the model parameters, given a specific network structure that we design.

It is best to think of feedforward networks as function approximation machines that are designed to achieve statistical generalization, occasionally drawing some insights from what we know about the brain, rather than as models of brain function.

— Page 169, Deep Learning, 2016.

Finding the parameters for neural networks in general is hard.

For many simpler machine learning algorithms, we can calculate an optimal model given the training dataset.

For example, we can use linear algebra to calculate the specific coefficients of a linear regression model that minimize the squared error on a training dataset.
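As a small illustration of this direct calculation, the sketch below fits a linear regression with NumPy's least-squares solver. The synthetic data and the names `X_design` and `true_coef` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic data: y = 0.5 + 2*x0 - 3*x1 plus a little noise
X = rng.uniform(-1, 1, size=(100, 2))
true_coef = np.array([2.0, -3.0])
y = X @ true_coef + 0.5 + rng.normal(0, 0.01, size=100)

# add a column of ones so the intercept is estimated as a coefficient
X_design = np.column_stack([np.ones(len(X)), X])
# solve the least-squares problem directly, with no iterative training
coef, residuals, rank, _ = np.linalg.lstsq(X_design, y, rcond=None)
print(coef)  # approximately [0.5, 2.0, -3.0]
```

A single call returns the coefficients that minimize the squared error, which is the kind of guarantee that is not available when training a neural network.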

Similarly, we can use optimization algorithms that offer convergence guarantees when finding an optimal set of model parameters for nonlinear algorithms such as logistic regression or support vector machines.

Finding parameters for many machine learning algorithms involves solving a convex optimization problem: that is, an error surface that is shaped like a bowl with a single best solution.

This is not the case for deep learning neural networks.

We can neither directly compute the optimal set of weights for a model, nor can we get global convergence guarantees to find an optimal set of weights.

Stochastic gradient descent applied to non-convex loss functions has no […] convergence guarantee, and is sensitive to the values of the initial parameters.

— Page 177, Deep Learning, 2016.

In fact, training a neural network is the most challenging part of using the technique.

It is quite common to invest days to months of time on hundreds of machines in order to solve even a single instance of the neural network training problem.

— Page 274, Deep Learning, 2016.

The use of nonlinear activation functions in the neural network means that the optimization problem that we must solve in order to find model parameters is not convex.

It is not a simple bowl shape with a single best set of weights that we are guaranteed to find. Instead, there is a landscape of peaks and valleys with many good and many misleadingly good sets of parameters that we may discover.

Solving this optimization is challenging, not least because the error surface contains many local optima, flat spots, and cliffs.

An iterative process must be used to navigate the non-convex error surface of the model. A naive algorithm that navigates the error is likely to become misled, lost, and ultimately stuck, resulting in a poorly performing model.

Neural network models can be thought to learn by navigating a non-convex error surface.

A model with a specific set of weights can be evaluated on the training dataset, and the average error over all examples in the training dataset can be thought of as the error of the model. A change to the model weights will result in a change to the model error. Therefore, we seek a set of weights that results in a model with a small error.

This involves repeating the steps of evaluating the model and updating the model parameters in order to step down the error surface. This process is repeated until a set of parameters is found that is good enough or the search process gets stuck.

This is a search or an optimization process and we refer to optimization algorithms that operate in this way as gradient optimization algorithms, as they naively follow along the error gradient. They are computationally expensive, slow, and their empirical behavior means that using them in practice is more art than science.

The algorithm that is most commonly used to navigate the error surface is called stochastic gradient descent, or SGD for short.

Nearly all of deep learning is powered by one very important algorithm: stochastic gradient descent or SGD.

— Page 151, Deep Learning, 2016.

Global optimization algorithms designed for non-convex optimization problems, such as a genetic algorithm, could be used instead, but stochastic gradient descent is more efficient as it uses the gradient information specifically to update the model weights via an algorithm called backpropagation.

[Backpropagation] describes a method to calculate the derivatives of the network training error with respect to the weights by a clever application of the derivative chain-rule.

— Page 49, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

Backpropagation refers to a technique from calculus to calculate the derivative (e.g. the slope or the gradient) of the model error for specific model parameters, allowing model weights to be updated to move down the gradient. As such, the algorithm used to train neural networks is also often referred to as simply backpropagation.

Actually, back-propagation refers only to the method for computing the gradient, while another algorithm, such as stochastic gradient descent, is used to perform learning using this gradient.

— Page 204, Deep Learning, 2016.
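The chain-rule calculation can be sketched for the smallest possible case: a single sigmoid node with one weight and a squared-error loss. This is an illustrative toy, not a general backpropagation implementation. The function names and the example values are assumptions, and the result is checked against a numerical estimate of the slope.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Squared error of a single sigmoid node on one example."""
    return 0.5 * (sigmoid(w * x + b) - y) ** 2

def backprop_gradient(w, b, x, y):
    """Derivative of the loss with respect to w via the chain rule."""
    z = w * x + b
    a = sigmoid(z)
    dloss_da = a - y          # how the loss changes with the activation
    da_dz = a * (1.0 - a)     # how the activation changes with the weighted input
    dz_dw = x                 # how the weighted input changes with the weight
    return dloss_da * da_dz * dz_dw

w, b, x, y = 0.5, -0.1, 2.0, 1.0
analytic = backprop_gradient(w, b, x, y)
# central finite difference as an independent check on the derivative
h = 1e-6
numeric = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
print(analytic, numeric)
```

The gradient here is negative, so a gradient descent update would increase the weight to reduce the error.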

Stochastic gradient descent can be used to find the parameters for other machine learning algorithms, such as linear regression, and it is used when working with very large datasets, although if there are sufficient resources, then convex-based optimization algorithms are significantly more efficient.

Training a deep learning neural network model using stochastic gradient descent with backpropagation involves choosing a number of components and hyperparameters. In this section, we’ll take a look at each in turn.

An error function must be chosen, often called the objective function, cost function, or the loss function. Typically, a specific probabilistic framework for inference is chosen called Maximum Likelihood. Under this framework, the commonly chosen loss functions are cross entropy for classification problems and mean squared error for regression problems.

**Loss Function**. The function used to estimate the performance of a model with a specific set of weights on examples from the training dataset.
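The two loss functions named above can be written in a few lines of NumPy. This is a minimal sketch; the function names and the clipping constant `eps` (used to avoid taking the log of zero) are assumptions for this example.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference, commonly used for regression."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of binary labels under predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

mse = mean_squared_error(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0]))
labels = np.array([1.0, 0.0])
bce_good = binary_cross_entropy(labels, np.array([0.9, 0.1]))
bce_bad = binary_cross_entropy(labels, np.array([0.1, 0.9]))
print(mse, bce_good, bce_bad)
```

Confident wrong predictions receive a much larger cross-entropy than confident correct ones, which is what drives the weight updates during training.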

The search or optimization process requires a starting point from which to begin model updates. The starting point is defined by the initial model parameters or weights. Because the error surface is non-convex, the optimization algorithm is sensitive to the initial starting point. As such, small random values are chosen as the initial model weights, although different techniques can be used to select the scale and distribution of these values. These techniques are referred to as “*weight initialization*” methods.

**Weight Initialization**. The procedure by which the initial small random values are assigned to model weights at the beginning of the training process.

When updating the model, a number of examples from the training dataset must be used to calculate the model error, often referred to simply as “*loss*.” All examples in the training dataset may be used, which may be appropriate for smaller datasets. Alternately, a single example may be used, which may be appropriate for problems where examples are streamed or where the data changes often. A hybrid approach may be used where a subset of examples from the training dataset is chosen and used to estimate the error gradient. The choice of the number of examples is referred to as the batch size.

**Batch Size**. The number of examples used to estimate the error gradient before updating the model parameters.

Once an error gradient has been estimated, the derivative of the error can be calculated and used to update each parameter. There may be statistical noise in the training dataset and in the estimate of the error gradient. Also, the depth of the model (number of layers) and the fact that model parameters are updated separately mean that it is hard to calculate exactly how much to change each model parameter to best move the whole model down the error gradient.

Instead, a small portion of the update to the weights is performed each iteration. A hyperparameter called the “*learning rate*” controls how much to update model weights and, in turn, controls how fast a model learns on the training dataset.

**Learning Rate**. The amount that each model parameter is updated per cycle of the learning algorithm.

The training process must be repeated many times until a good or good enough set of model parameters is discovered. The total number of iterations of the process is bounded by the number of complete passes through the training dataset after which the training process is terminated. This is referred to as the number of training “*epochs*.”

**Epochs**. The number of complete passes through the training dataset before the training process is terminated.
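To show how these components fit together, here is a minimal NumPy sketch of mini-batch stochastic gradient descent for a single-layer logistic model. It is an illustration under stated assumptions, not a recipe for deep networks: the toy dataset, the variable names, and the specific hyperparameter values are all chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# toy binary classification data: label is 1 when x0 + x1 > 0
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# loss function: cross-entropy for classification
def cross_entropy(y_true, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# weight initialization: small random values
w = rng.normal(scale=0.01, size=2)
b = 0.0

batch_size = 32       # examples used per gradient estimate
learning_rate = 0.1   # size of each update step
epochs = 50           # complete passes through the training data

initial_loss = cross_entropy(y, sigmoid(X @ w + b))
for epoch in range(epochs):
    order = rng.permutation(len(X))  # shuffle examples each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        p = sigmoid(X[idx] @ w + b)
        error = p - y[idx]
        # gradient of the cross-entropy loss estimated on this mini-batch
        grad_w = X[idx].T @ error / len(idx)
        grad_b = error.mean()
        # step a small way down the estimated gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

final_loss = cross_entropy(y, sigmoid(X @ w + b))
accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
print(initial_loss, final_loss, accuracy)
```

Changing the batch size, learning rate, or number of epochs in this sketch is a quick way to see how each one affects the final loss.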

There are many extensions to the learning algorithm, although these five hyperparameters generally control the learning algorithm for deep learning neural networks.

This section provides more resources on the topic if you are looking to go deeper.

- Deep Learning, 2016.
- Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.
- Neural Networks for Pattern Recognition, 1995.

In this post, you discovered the challenge of finding model parameters for deep learning neural networks.

Specifically, you learned:

- Neural networks learn a mapping function from inputs to outputs that can be summarized as solving the problem of function approximation.
- Unlike other machine learning algorithms, the parameters of a neural network must be found by solving a non-convex optimization problem with many good solutions and many misleadingly good solutions.
- The stochastic gradient descent algorithm is used to solve the optimization problem where model parameters are updated each iteration using the backpropagation algorithm.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.
