Visualizing the vanishing gradient problem

Deep learning is a relatively recent invention. Its rise is partly due to improved computational power, which allows us to use more layers of perceptrons in a neural network. But at the same time, we can train a deep network only because we know how to work around the vanishing gradient problem.

In this tutorial, we visually examine why the vanishing gradient problem exists.

After completing this tutorial, you will know:

  • What a vanishing gradient is
  • Which configurations of a neural network are susceptible to the vanishing gradient problem
  • How to run a manual training loop in Keras
  • How to extract weights and gradients from a Keras model

Let’s get started.

Visualizing the vanishing gradient problem. Photo by Alisa Anton, some rights reserved.

Tutorial overview

This tutorial is divided into 5 parts; they are:

  1. Configuration of multilayer perceptron models
  2. Example of vanishing gradient problem
  3. Looking at the weights of each layer
  4. Looking at the gradients of each layer
  5. The Glorot initialization

Configuration of multilayer perceptron models

Because neural networks are trained by gradient descent, it was long believed that the activation function must be differentiable. Consequently, we conventionally used the sigmoid function or the hyperbolic tangent as the activation.

For a binary classification problem, if we want to do logistic regression such that 0 and 1 are the ideal outputs, the sigmoid function is preferred because its output lies in this range:
$$
\sigma(x) = \frac{1}{1+e^{-x}}
$$
and if we need sigmoidal activation at the output, it is natural to use it in all layers of the neural network. Additionally, each layer in a neural network has a weight parameter. The weights have to be initialized randomly, and naturally we would do so in some simple way, such as drawing from a uniform or normal distribution.

Example of vanishing gradient problem

To illustrate the problem of vanishing gradients, let’s try an example. A neural network is a nonlinear function; hence it should be well suited to classifying a nonlinear dataset. We make use of scikit-learn’s make_circles() function to generate some data:
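
The listing below is a minimal sketch of this step; the sample count, noise level, and random seed are illustrative choices, not necessarily the original ones:

```python
from sklearn.datasets import make_circles
import matplotlib.pyplot as plt

# generate two concentric circles of points on the xy-plane
X, y = make_circles(n_samples=1000, factor=0.5, noise=0.1, random_state=42)
X = X.astype("float32")   # match the float32 default of Keras layers

plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
```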

This is not difficult to classify. A naive way is to build a 3-layer neural network, which can give quite a good result:
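
A sketch of such a network follows; the hidden-layer width of 5 and the training settings are assumed values:

```python
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Sequential

model = Sequential([
    Input(shape=(2,)),
    Dense(5, activation="relu"),      # one hidden layer with ReLU
    Dense(1, activation="sigmoid"),   # sigmoid output for binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, batch_size=32, epochs=100, verbose=0)
print(model.evaluate(X, y, verbose=0))   # [loss, accuracy]
```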

Note that we used a rectified linear unit (ReLU) in the hidden layer above. By default, a dense layer in Keras uses linear activation (i.e., no activation), which is mostly not useful. We usually use ReLU in modern neural networks. But we can also try the old-school way, as everyone did two decades ago:
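
For example, by swapping the hidden activation for a sigmoid (same assumed settings as above):

```python
model = Sequential([
    Input(shape=(2,)),
    Dense(5, activation="sigmoid"),   # old-school sigmoid hidden layer
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, batch_size=32, epochs=100, verbose=0)
print(model.evaluate(X, y, verbose=0))
```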

The accuracy is much worse. It turns out to be even worse when we add more layers (at least in my experiment):
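
A sketch of a 5-layer version, with three sigmoid hidden layers instead of one:

```python
model = Sequential([
    Input(shape=(2,)),
    Dense(5, activation="sigmoid"),
    Dense(5, activation="sigmoid"),
    Dense(5, activation="sigmoid"),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, batch_size=32, epochs=100, verbose=0)
print(model.evaluate(X, y, verbose=0))
```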

Your result may vary given the stochastic nature of the training algorithm. You may or may not see the 5-layer sigmoidal network performing much worse than the 3-layer one. But the idea here is that you can’t recover the high accuracy achievable with rectified linear unit activation by merely adding layers.

Looking at the weights of each layer

Shouldn’t we get a more powerful neural network with more layers?

In theory, yes. But it turns out that as we added more layers, we triggered the vanishing gradient problem. To illustrate what happened, let’s see what the weights look like as we train our network.

In Keras, we are allowed to plug a callback function into the training process. We are going to create our own callback object to intercept and record the weights of each layer of our multilayer perceptron (MLP) model at the end of each epoch.

We derive from the Callback class and define the on_epoch_end() method. This class needs the created model to initialize. At the end of each epoch, it reads each layer and saves the weights into a NumPy array.
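
A sketch of such a callback; the class name WeightCapture and the choice to record only the kernel (ignoring the bias) are assumptions of this sketch:

```python
import tensorflow as tf

class WeightCapture(tf.keras.callbacks.Callback):
    "Capture the kernel weights of each layer at the end of each epoch"
    def __init__(self, model):
        super().__init__()
        self._model = model   # the model being monitored
        self.weights = []     # one dict of {layer name: kernel array} per epoch
        self.epochs = []
    def on_epoch_end(self, epoch, logs=None):
        self.epochs.append(epoch)
        weight = {}
        for layer in self._model.layers:
            if not layer.weights:
                continue      # skip layers without weights
            weight[layer.name] = layer.weights[0].numpy()  # weights[0] is the kernel
        self.weights.append(weight)
```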

For the convenience of experimenting with different ways of creating an MLP, we make a helper function to set up the neural network model:
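
Something along the following lines; the exact signature and the layer width of 5 are assumptions:

```python
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Sequential

def make_mlp(activation, initializer, name):
    "Create a model with 4 hidden layers of 5 units each, plus a sigmoid output"
    model = Sequential([
        Input(shape=(2,)),   # input is a coordinate on the xy-plane
        Dense(5, activation=activation, kernel_initializer=initializer, name=name+"1"),
        Dense(5, activation=activation, kernel_initializer=initializer, name=name+"2"),
        Dense(5, activation=activation, kernel_initializer=initializer, name=name+"3"),
        Dense(5, activation=activation, kernel_initializer=initializer, name=name+"4"),
        Dense(1, activation="sigmoid", kernel_initializer=initializer, name=name+"5"),
    ])
    return model
```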

We deliberately create a neural network with 4 hidden layers so we can see how each layer responds to the training. We will vary the activation function of each hidden layer as well as the weight initialization. To make things easier to follow, we are going to name each layer instead of letting Keras assign one. The input is a coordinate on the xy-plane, hence the input shape is a vector of 2. The output is a binary classification; therefore we use sigmoid activation to make the output fall in the range of 0 to 1.

Then we can compile() the model to set up the evaluation metrics, and pass the callback in the fit() call to train the model:
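
Continuing the sketch; the random-normal initializer and the epoch count are assumed settings:

```python
initializer = tf.keras.initializers.RandomNormal(mean=0, stddev=1)

model = make_mlp("sigmoid", initializer, "sigmoid")
capture_cb = WeightCapture(model)
capture_cb.on_epoch_end(-1)   # record the weights as initialized, before any training
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, batch_size=32, epochs=100, callbacks=[capture_cb], verbose=0)
```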

Here we create the neural network by calling make_mlp() first. Then we set up our callback object. Since the weights of each layer of the neural network are initialized at creation, we deliberately invoke the callback once to remember what they were initialized to. Then we call compile() and fit() on the model as usual, with the callback object provided.

After we fit the model, we can evaluate it with the entire dataset:
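
For example (the exact numbers will vary between runs):

```python
# evaluate on the whole dataset at once
loss, acc = model.evaluate(X, y, verbose=0)
print("loss:", loss, "accuracy:", acc)
```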

Here, the log-loss is 0.665 and the accuracy is 0.588 for this model, in which all layers use sigmoid activation.

What we can further look into is how the weights behave over the iterations of training. All layers except the first and the last have their weights as a 5×5 matrix. We can check the mean and standard deviation of the weights to get a sense of what they look like:
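
A plotting sketch; the two-panel layout is one reasonable choice:

```python
import matplotlib.pyplot as plt

def plot_weight(capture_cb):
    "Plot the mean and standard deviation of each layer's kernel across epochs"
    fig, axs = plt.subplots(2, 1, sharex=True, figsize=(8, 8))
    for name in capture_cb.weights[0]:
        axs[0].plot(capture_cb.epochs, [w[name].mean() for w in capture_cb.weights], label=name)
        axs[1].plot(capture_cb.epochs, [w[name].std() for w in capture_cb.weights], label=name)
    axs[0].set_title("Mean of weights")
    axs[1].set_title("S.D. of weights")
    axs[0].legend()
    plt.show()

plot_weight(capture_cb)
```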

This results in the following figure:

We see that the mean weight moved quickly only in the first 10 iterations or so. Only the weights of the first layer become more diversified, as their standard deviation moves up.

We can restart the same process with the hyperbolic tangent (tanh) activation:
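
Reusing the helpers above, the rerun is a one-for-one substitution of the activation:

```python
model = make_mlp("tanh", initializer, "tanh")
capture_cb = WeightCapture(model)
capture_cb.on_epoch_end(-1)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, batch_size=32, epochs=100, callbacks=[capture_cb], verbose=0)
print(model.evaluate(X, y, verbose=0))
plot_weight(capture_cb)
```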

The log-loss and accuracy are both improved. Looking at the plot, we don’t see abrupt changes in the mean and standard deviation of the weights; instead, those of all layers converge slowly.

A similar behavior can be seen with ReLU activation:
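
The same rerun with "relu" substituted:

```python
model = make_mlp("relu", initializer, "relu")
capture_cb = WeightCapture(model)
capture_cb.on_epoch_end(-1)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, batch_size=32, epochs=100, callbacks=[capture_cb], verbose=0)
print(model.evaluate(X, y, verbose=0))
plot_weight(capture_cb)
```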

Looking at the gradients of each layer

We saw the effect of different activation functions above. But what really matters is the gradient, as we are running gradient descent during training. The paper by Xavier Glorot and Yoshua Bengio, “Understanding the difficulty of training deep feedforward neural networks”, suggests looking at the gradient of each layer in each training iteration, as well as its standard deviation.

Bradley (2009) found that back-propagated gradients were smaller as one moves from the output layer towards the input layer, just after initialization. He studied networks with linear activation at each layer, finding that the variance of the back-propagated gradients decreases as we go backwards in the network.

— “Understanding the difficulty of training deep feedforward neural networks” (2010)

To understand how the activation function relates to the gradient as perceived during training, we need to run the training loop manually.

In TensorFlow Keras, a training loop can be run by turning on the gradient tape and then making the neural network model produce an output, from which we can obtain the gradient by automatic differentiation. Subsequently, we can update the parameters (weights and biases) according to the gradient descent update rule.

Because the gradient is readily obtained in this loop, we can make a copy of it. The following is how we implement the training loop while keeping a copy of the gradients at the same time:
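
The function below is a sketch of such a loop; the optimizer choice and the decision to sample only the first batch of each epoch are assumptions:

```python
import tensorflow as tf

def train_model(X, y, model, n_epochs=100, batch_size=32):
    "Run the training loop manually, keeping a copy of the gradients and loss per epoch"
    optimizer = tf.keras.optimizers.Adam()
    loss_fn = tf.keras.losses.BinaryCrossentropy()
    gradhistory = []
    losshistory = []
    def recordweight():
        data = {}
        for g, w in zip(grads, model.trainable_weights):
            # keep kernels only; names follow tf.keras's "layer/kernel:0" convention
            if "/kernel:" not in w.name:
                continue
            data[w.name.split("/")[0]] = g.numpy()
        gradhistory.append(data)
        losshistory.append(loss_value.numpy())
    for epoch in range(n_epochs):
        for step in range(len(X) // batch_size):
            x_batch = X[step*batch_size:(step+1)*batch_size]
            y_batch = y[step*batch_size:(step+1)*batch_size]
            with tf.GradientTape() as tape:
                y_pred = model(x_batch, training=True)
                loss_value = loss_fn(y_batch, y_pred)
            grads = tape.gradient(loss_value, model.trainable_weights)
            optimizer.apply_gradients(zip(grads, model.trainable_weights))
            if step == 0:
                recordweight()   # sample the gradient once per epoch
    recordweight()               # one last sample after training finishes
    return gradhistory, losshistory
```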

The key in the function above is the nested for-loop, in which we launch tf.GradientTape() and pass a batch of data to the model to get a prediction, which is then evaluated using the loss function. Afterwards, we can pull the gradient out of the tape by differentiating the loss with respect to the trainable weights of the model. Next, we update the weights using the optimizer, which handles the learning rate and momentum of the gradient descent algorithm implicitly.

As a refresher, the gradient here means the following. For a computed loss value $L$ and a layer with weights $W=[w_1, w_2, w_3, w_4, w_5]$ (e.g., the output layer), the gradient is the matrix

$$
\frac{\partial L}{\partial W} = \Big[\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \frac{\partial L}{\partial w_3}, \frac{\partial L}{\partial w_4}, \frac{\partial L}{\partial w_5}\Big]
$$

But before we start the next iteration of training, we have a chance to further manipulate the gradient: we match the gradients with the weights to get the name of each, then save a copy of the gradient as a NumPy array. We sample the gradient and loss only once per epoch, but you can change that to sample at a higher frequency.

With these, we can plot the gradient across epochs. In the following, we create the model (without calling compile(), because we will not call fit() afterwards), run the manual training loop, and then plot the gradient as well as the standard deviation of the gradient:
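
A sketch of the plotting helper and the run; the three-panel layout and the log scale for the standard deviation are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

def plot_gradient(gradhistory, losshistory):
    "Plot the mean and s.d. of each layer's gradient, plus the loss, across samples"
    fig, axs = plt.subplots(3, 1, sharex=True, figsize=(8, 12))
    for name in gradhistory[0]:
        axs[0].plot(range(len(gradhistory)), [g[name].mean() for g in gradhistory], label=name)
        axs[1].semilogy(range(len(gradhistory)), [g[name].std() for g in gradhistory], label=name)
    axs[0].set_title("Mean of gradients")
    axs[1].set_title("S.D. of gradients")
    axs[2].plot(range(len(losshistory)), losshistory)
    axs[2].set_title("Loss")
    axs[0].legend()
    plt.show()

model = make_mlp("sigmoid", initializer, "sigmoid")
print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
gradhistory, losshistory = train_model(X, y, model)
print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
plot_gradient(gradhistory, losshistory)
```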

It reported a weak classification result:

and the plot we obtained shows the vanishing gradient:

From the plot, the loss did not decrease significantly. The mean of the gradient (i.e., the mean of all elements in the gradient matrix) has a noticeable value only for the last layer, while those of all other layers are virtually zero. The standard deviation of the gradient is approximately between 0.001 and 0.01.

Repeating this with tanh activation, we see a different result, which explains why the performance is better:
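
Again a one-for-one substitution of the activation:

```python
model = make_mlp("tanh", initializer, "tanh")
print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
gradhistory, losshistory = train_model(X, y, model)
print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
plot_gradient(gradhistory, losshistory)
```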

From the plot of the mean of the gradients, we see the gradients from every layer wiggling equally. The standard deviation of the gradients is also an order of magnitude larger than in the case of sigmoid activation, at around 0.01 to 0.1.

Finally, we can also see something similar with rectified linear unit (ReLU) activation. In this case the loss dropped quickly; hence we regard ReLU as the more efficient activation to use in neural networks:
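
And the ReLU counterpart:

```python
model = make_mlp("relu", initializer, "relu")
print("Before training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
gradhistory, losshistory = train_model(X, y, model)
print("After training: Accuracy", accuracy_score(y, (model(X) > 0.5)))
plot_gradient(gradhistory, losshistory)
```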

The complete code simply combines the data generation, the make_mlp() helper, the manual training loop, and the plotting functions shown above.

The Glorot initialization

We didn’t demonstrate it in the code above, but the most famous outcome of the paper by Glorot and Bengio is the Glorot initialization, which suggests initializing the weights of a layer of the neural network from a uniform distribution:

The normalization factor may therefore be important when initializing deep networks because of the multiplicative effect through layers, and we suggest the following initialization procedure to approximately satisfy our objectives of maintaining activation variances and back-propagated gradients variance as one moves up or down the network. We call it the normalized initialization:
$$
W \sim U\Big[-\frac{\sqrt{6}}{\sqrt{n_j+n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j+n_{j+1}}}\Big]
$$

— “Understanding the difficulty of training deep feedforward neural networks” (2010)

This is derived under linear activation, on the condition that the standard deviation of the gradient stays consistent across the layers. For sigmoid and tanh activations, the linear region is narrow; this is why ReLU is the key to working around the vanishing gradient problem. Compared to replacing the activation function, changing the weight initialization is less pronounced in helping to resolve the vanishing gradient problem. But this can be an exercise for you to explore, to see how it can help improve the result.
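
Keras provides a Glorot uniform initializer, so trying it in the experiments above is a one-line change (a sketch):

```python
# swap the random-normal initializer for Glorot uniform and rerun any experiment above
initializer = tf.keras.initializers.GlorotUniform()
model = make_mlp("sigmoid", initializer, "sigmoid")
```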

Further readings

The Glorot and Bengio paper is available at:

  • Xavier Glorot and Yoshua Bengio. “Understanding the difficulty of training deep feedforward neural networks.” In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

The vanishing gradient problem is well known enough in machine learning that many books cover it.

Summary

In this tutorial, you visually saw how a rectified linear unit (ReLU) can help resolve the vanishing gradient problem.

Specifically, you learned:

  • How the vanishing gradient problem impacts the performance of a neural network
  • Why ReLU activation is the solution to the vanishing gradient problem
  • How to use a custom callback to extract data in the middle of the training loop in Keras
  • How to write a custom training loop
  • How to read the weights and gradients from a layer in a neural network

