Understand the Impact of Learning Rate on Neural Network Performance

Deep learning neural networks are trained using the stochastic gradient descent optimization algorithm.

The learning rate is a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated. Choosing the learning rate is challenging as a value too small may result in a long training process that could get stuck, whereas a value too large may result in learning a sub-optimal set of weights too fast or an unstable training process.

The learning rate may be the most important hyperparameter when configuring your neural network. Therefore it is vital to know how to investigate the effects of the learning rate on model performance and to build an intuition about the dynamics of the learning rate on model behavior.

In this tutorial, you will discover the effects of the learning rate, learning rate schedules, and adaptive learning rates on model performance.

After completing this tutorial, you will know:

  • How large learning rates result in unstable training and tiny rates result in a failure to train.
  • Momentum can accelerate training and learning rate schedules can help to converge the optimization process.
  • Adaptive learning rates can accelerate training and alleviate some of the pressure of choosing a learning rate and learning rate schedule.

Kick-start your project with my new book Better Deep Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

  • Updated Feb/2019: Fixed issue where callbacks were mistakenly defined on compile() instead of fit() functions.
  • Updated Oct/2019: Updated for Keras 2.3 and TensorFlow 2.0.
  • Update Jan/2020: Updated for changes in scikit-learn v0.22 API.
Understand the Dynamics of Learning Rate on Model Performance With Deep Learning Neural Networks

Understand the Dynamics of Learning Rate on Model Performance With Deep Learning Neural Networks
Photo by Abdul Rahman some rights reserved

Tutorial Overview

This tutorial is divided into six parts; they are:

  1. Learning Rate and Gradient Descent
  2. Configure the Learning Rate in Keras
  3. Multi-Class Classification Problem
  4. Effect of Learning Rate and Momentum
  5. Effect of Learning Rate Schedules
  6. Effect of Adaptive Learning Rates

Learning Rate and Gradient Descent

Deep learning neural networks are trained using the stochastic gradient descent algorithm.

Stochastic gradient descent is an optimization algorithm that estimates the error gradient for the current state of the model using examples from the training dataset, then updates the weights of the model using the back-propagation of errors algorithm, referred to as simply backpropagation.

The amount that the weights are updated during training is referred to as the step size or the “learning rate.”

Specifically, the learning rate is a configurable hyperparameter used in the training of neural networks that has a small positive value, often in the range between 0.0 and 1.0.

The learning rate controls how quickly the model is adapted to the problem. Smaller learning rates require more training epochs given the smaller changes made to the weights each update, whereas larger learning rates result in rapid changes and require fewer training epochs.

A learning rate that is too large can cause the model to converge too quickly to a suboptimal solution, whereas a learning rate that is too small can cause the process to get stuck.

The challenge of training deep learning neural networks involves carefully selecting the learning rate. It may be the most important hyperparameter for the model.

The learning rate is perhaps the most important hyperparameter. If you have time to tune only one hyperparameter, tune the learning rate.

— Page 429, Deep Learning, 2016.

Now that we are familiar with what the learning rate is, let’s look at how we can configure the learning rate for neural networks.

For more on what the learning rate is and how it works, see the post:

Configure the Learning Rate in Keras

The Keras deep learning library allows you to easily configure the learning rate for a number of different variations of the stochastic gradient descent optimization algorithm.

Stochastic Gradient Descent

Keras provides the SGD class that implements the stochastic gradient descent optimizer with a learning rate and momentum.

First, an instance of the class must be created and configured, then specified to the “optimizer” argument when calling the fit() function on the model.

The default learning rate is 0.01 and no momentum is used by default.

The learning rate can be specified via the “lr” argument and the momentum can be specified via the “momentum” argument.

The class also supports learning rate decay via the “decay” argument.

With learning rate decay, the learning rate is calculated each update (e.g. end of each mini-batch) as follows:

Where lrate is the learning rate for the current epoch, initial_lrate is the learning rate specified as an argument to SGD, decay is the decay rate which is greater than zero and iteration is the current update number.

Learning Rate Schedule

Keras supports learning rate schedules via callbacks.

The callbacks operate separately from the optimization algorithm, although they adjust the learning rate used by the optimization algorithm. It is recommended to use the SGD when using a learning rate schedule callback.

Callbacks are instantiated and configured, then specified in a list to the “callbacks” argument of the fit() function when training the model.

Keras provides the ReduceLROnPlateau that will adjust the learning rate when a plateau in model performance is detected, e.g. no change for a given number of training epochs. This callback is designed to reduce the learning rate after the model stops improving with the hope of fine-tuning model weights.

The ReduceLROnPlateau requires you to specify the metric to monitor during training via the “monitor” argument, the value that the learning rate will be multiplied by via the “factor” argument and the “patience” argument that specifies the number of training epochs to wait before triggering the change in learning rate.

For example, we can monitor the validation loss and reduce the learning rate by an order of magnitude if validation loss does not improve for 100 epochs:

Keras also provides LearningRateScheduler callback that allows you to specify a function that is called each epoch in order to adjust the learning rate.

You can define your Python function that takes two arguments (epoch and current learning rate) and returns the new learning rate.

Adaptive Learning Rate Gradient Descent

Keras also provides a suite of extensions of simple stochastic gradient descent that support adaptive learning rates.

Because each method adapts the learning rate, often one learning rate per model weight, little configuration is often required.

Three commonly used adaptive learning rate methods include:

RMSProp Optimizer

Adagrad Optimizer

Adam Optimizer

Want Better Results with Deep Learning?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Multi-Class Classification Problem

We will use a small multi-class classification problem as the basis to demonstrate the effect of learning rate on model performance.

The scikit-learn class provides the make_blobs() function that can be used to create a multi-class classification problem with the prescribed number of samples, input variables, classes, and variance of samples within a class.

The problem has two input variables (to represent the x and y coordinates of the points) and a standard deviation of 2.0 for points within each group. We will use the same random state (seed for the pseudorandom number generator) to ensure that we always get the same data points.

The results are the input and output elements of a dataset that we can model.

In order to get a feeling for the complexity of the problem, we can plot each point on a two-dimensional scatter plot and color each point by class value.

The complete example is listed below.

Running the example creates a scatter plot of the entire dataset. We can see that the standard deviation of 2.0 means that the classes are not linearly separable (separable by a line), causing many ambiguous points.

This is desirable as it means that the problem is non-trivial and will allow a neural network model to find many different “good enough” candidate solutions.

Scatter Plot of Blobs Dataset With Three Classes and Points Colored by Class Value

Scatter Plot of Blobs Dataset With Three Classes and Points Colored by Class Value

Effect of Learning Rate and Momentum

In this section, we will develop a Multilayer Perceptron (MLP) model to address the blobs classification problem and investigate the effect of different learning rates and momentum.

Learning Rate Dynamics

The first step is to develop a function that will create the samples from the problem and split them into train and test datasets.

Additionally, we must also one hot encode the target variable so that we can develop a model that predicts the probability of an example belonging to each class.

The prepare_data() function below implements this behavior, returning train and test sets split into input and output elements.

Next, we can develop a function to fit and evaluate an MLP model.

First, we will define a simple MLP model that expects two input variables from the blobs problem, has a single hidden layer with 50 nodes, and an output layer with three nodes to predict the probability for each of the three classes. Nodes in the hidden layer will use the rectified linear activation function (ReLU), whereas nodes in the output layer will use the softmax activation function.

We will use the stochastic gradient descent optimizer and require that the learning rate be specified so that we can evaluate different rates. The model will be trained to minimize cross entropy.

The model will be fit for 200 training epochs, found with a little trial and error, and the test set will be used as the validation dataset so we can get an idea of the generalization error of the model during training.

Once fit, we will plot the accuracy of the model on the train and test sets over the training epochs.

The fit_model() function below ties together these elements and will fit a model and plot its performance given the train and test datasets as well as a specific learning rate to evaluate.

We can now investigate the dynamics of different learning rates on the train and test accuracy of the model.

In this example, we will evaluate learning rates on a logarithmic scale from 1E-0 (1.0) to 1E-7 and create line plots for each learning rate by calling the fit_model() function.

Tying all of this together, the complete example is listed below.

Running the example creates a single figure that contains eight line plots for the eight different evaluated learning rates. Classification accuracy on the training dataset is marked in blue, whereas accuracy on the test dataset is marked in orange.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The plots show oscillations in behavior for the too-large learning rate of 1.0 and the inability of the model to learn anything with the too-small learning rates of 1E-6 and 1E-7.

We can see that the model was able to learn the problem well with the learning rates 1E-1, 1E-2 and 1E-3, although successively slower as the learning rate was decreased. With the chosen model configuration, the results suggest a moderate learning rate of 0.1 results in good model performance on the train and test sets.

Line Plots of Train and Test Accuracy for a Suite of Learning Rates on the Blobs Classification Problem

Line Plots of Train and Test Accuracy for a Suite of Learning Rates on the Blobs Classification Problem

Momentum Dynamics

Momentum can smooth the progression of the learning algorithm that, in turn, can accelerate the training process.

We can adapt the example from the previous section to evaluate the effect of momentum with a fixed learning rate. In this case, we will choose the learning rate of 0.01 that in the previous section converged to a reasonable solution, but required more epochs than the learning rate of 0.1

The fit_model() function can be updated to take a “momentum” argument instead of a learning rate argument, that can be used in the configuration of the SGD class and reported on the resulting plot.

The updated version of this function is listed below.

It is common to use momentum values close to 1.0, such as 0.9 and 0.99.

In this example, we will demonstrate the dynamics of the model without momentum compared to the model with momentum values of 0.5 and the higher momentum values.

Tying all of this together, the complete example is listed below.

Running the example creates a single figure that contains four line plots for the different evaluated momentum values. Classification accuracy on the training dataset is marked in blue, whereas accuracy on the test dataset is marked in orange.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the addition of momentum does accelerate the training of the model. Specifically, momentum values of 0.9 and 0.99 achieve reasonable train and test accuracy within about 50 training epochs as opposed to 200 training epochs when momentum is not used.

In all cases where momentum is used, the accuracy of the model on the holdout test dataset appears to be more stable, showing less volatility over the training epochs.

Line Plots of Train and Test Accuracy for a Suite of Momentums on the Blobs Classification Problem

Line Plots of Train and Test Accuracy for a Suite of Momentums on the Blobs Classification Problem

Effect of Learning Rate Schedules

We will look at two learning rate schedules in this section.

The first is the decay built into the SGD class and the second is the ReduceLROnPlateau callback.

Learning Rate Decay

The SGD class provides the “decay” argument that specifies the learning rate decay.

It may not be clear from the equation or the code as to the effect that this decay has on the learning rate over updates. We can make this clearer with a worked example.

The function below implements the learning rate decay as implemented in the SGD class.

We can use this function to calculate the learning rate over multiple updates with different decay values.

We will compare a range of decay values [1E-1, 1E-2, 1E-3, 1E-4] with an initial learning rate of 0.01 and 200 weight updates.

The complete example is listed below.

Running the example creates a line plot showing learning rates over updates for different decay values.

We can see that in all cases, the learning rate starts at the initial value of 0.01. We can see that a small decay value of 1E-4 (red) has almost no effect, whereas a large decay value of 1E-1 (blue) has a dramatic effect, reducing the learning rate to below 0.002 within 50 epochs (about one order of magnitude less than the initial value) and arriving at the final value of about 0.0004 (about two orders of magnitude less than the initial value).

We can see that the change to the learning rate is not linear. We can also see that changes to the learning rate are dependent on the batch size, after which an update is performed. In the example from the previous section, a default batch size of 32 across 500 examples results in 16 updates per epoch and 3,200 updates across the 200 epochs.

Using a decay of 0.1 and an initial learning rate of 0.01, we can calculate the final learning rate to be a tiny value of about 3.1E-05.

Line Plot of the Effect of Decay on Learning Rate Over Multiple Weight Updates

Line Plot of the Effect of Decay on Learning Rate Over Multiple Weight Updates

We can update the example from the previous section to evaluate the dynamics of different learning rate decay values.

Fixing the learning rate at 0.01 and not using momentum, we would expect that a very small learning rate decay would be preferred, as a large learning rate decay would rapidly result in a learning rate that is too small for the model to learn effectively.

The fit_model() function can be updated to take a “decay” argument that can be used to configure decay for the SGD class.

The updated version of the function is listed below.

We can evaluate the same four decay values of [1E-1, 1E-2, 1E-3, 1E-4] and their effect on model accuracy.

The complete example is listed below.

Running the example creates a single figure that contains four line plots for the different evaluated learning rate decay values. Classification accuracy on the training dataset is marked in blue, whereas accuracy on the test dataset is marked in orange.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the large decay values of 1E-1 and 1E-2 indeed decay the learning rate too rapidly for this model on this problem and result in poor performance. The smaller decay values do result in better performance, with the value of 1E-4 perhaps causing in a similar result as not using decay at all. In fact, we can calculate the final learning rate with a decay of 1E-4 to be about 0.0075, only a little bit smaller than the initial value of 0.01.

Line Plots of Train and Test Accuracy for a Suite of Decay Rates on the Blobs Classification Problem

Line Plots of Train and Test Accuracy for a Suite of Decay Rates on the Blobs Classification Problem

Drop Learning Rate on Plateau

The ReduceLROnPlateau will drop the learning rate by a factor after no change in a monitored metric for a given number of epochs.

We can explore the effect of different “patience” values, which is the number of epochs to wait for a change before dropping the learning rate. We will use the default learning rate of 0.01 and drop the learning rate by an order of magnitude by setting the “factor” argument to 0.1.

It will be interesting to review the effect on the learning rate over the training epochs. We can do that by creating a new Keras Callback that is responsible for recording the learning rate at the end of each training epoch. We can then retrieve the recorded learning rates and create a line plot to see how the learning rate was affected by drops.

We can create a custom Callback called LearningRateMonitor. The on_train_begin() function is called at the start of training, and in it we can define an empty list of learning rates. The on_epoch_end() function is called at the end of each training epoch and in it we can retrieve the optimizer and the current learning rate from the optimizer and store it in the list. The complete LearningRateMonitor callback is listed below.

The fit_model() function developed in the previous sections can be updated to create and configure the ReduceLROnPlateau callback and our new LearningRateMonitor callback and register them with the model in the call to fit.

The function will also take “patience” as an argument so that we can evaluate different values.

We will want to create a few plots in this example, so instead of creating subplots directly, the fit_model() function will return the list of learning rates as well as loss and accuracy on the training dataset for each training epochs.

The function with these updates is listed below.

The patience in the ReduceLROnPlateau controls how often the learning rate will be dropped.

We will test a few different patience values suited for this model on the blobs problem and keep track of the learning rate, loss, and accuracy series from each run.

At the end of the run, we will create figures with line plots for each of the patience values for the learning rates, training loss, and training accuracy for each patience value.

We can create a helper function to easily create a figure with subplots for each series that we have recorded.

Tying these elements together, the complete example is listed below.

Running the example creates three figures, each containing a line plot for the different patience values.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The first figure shows line plots of the learning rate over the training epochs for each of the evaluated patience values. We can see that the smallest patience value of two rapidly drops the learning rate to a minimum value within 25 epochs, the largest patience of 15 only suffers one drop in the learning rate.

From these plots, we would expect the patience values of 5 and 10 for this model on this problem to result in better performance as they allow the larger learning rate to be used for some time before dropping the rate to refine the weights.

Line Plots of Learning Rate Over Epochs for Different Patience Values Used in the ReduceLROnPlateau Schedule

Line Plots of Learning Rate Over Epochs for Different Patience Values Used in the ReduceLROnPlateau Schedule

The next figure shows the loss on the training dataset for each of the patience values.

The plot shows that the patience values of 2 and 5 result in a rapid convergence of the model, perhaps to a sub-optimal loss value. In the case of a patience level of 10 and 15, loss drops reasonably until the learning rate is dropped below a level that large changes to the loss can be seen. This occurs halfway for the patience of 10 and nearly the end of the run for patience 15.

Line Plots of Training Loss Over Epochs for Different Patience Values Used in the ReduceLROnPlateau Schedule

Line Plots of Training Loss Over Epochs for Different Patience Values Used in the ReduceLROnPlateau Schedule

The final figure shows the training set accuracy over training epochs for each patience value.

We can see that indeed the small patience values of 2 and 5 epochs results in premature convergence of the model to a less-than-optimal model at around 65% and less than 75% accuracy respectively. The larger patience values result in better performing models, with the patience of 10 showing convergence just before 150 epochs, whereas the patience 15 continues to show the effects of a volatile accuracy given the nearly completely unchanged learning rate.

These plots show how a learning rate that is decreased a sensible way for the problem and chosen model configuration can result in both a skillful and converged stable set of final weights, a desirable property in a final model at the end of a training run.

Line Plots of Training Accuracy Over Epochs for Different Patience Values Used in the ReduceLROnPlateau Schedule

Line Plots of Training Accuracy Over Epochs for Different Patience Values Used in the ReduceLROnPlateau Schedule

Effect of Adaptive Learning Rates

Learning rates and learning rate schedules are both challenging to configure and critical to the performance of a deep learning neural network model.

Keras provides a number of different popular variations of stochastic gradient descent with adaptive learning rates, such as:

  • Adaptive Gradient Algorithm (AdaGrad).
  • Root Mean Square Propagation (RMSprop).
  • Adaptive Moment Estimation (Adam).

Each provides a different methodology for adapting learning rates for each weight in the network.

There is no single best algorithm, and the results of racing optimization algorithms on one problem are unlikely to be transferable to new problems.

We can study the dynamics of different adaptive learning rate methods on the blobs problem. The fit_model() function can be updated to take the name of an optimization algorithm to evaluate, which can be specified to the “optimizer” argument when the MLP model is compiled. The default parameters for each method will then be used. The updated version of the function is listed below.

We can explore the three popular methods of RMSprop, AdaGrad and Adam and compare their behavior to simple stochastic gradient descent with a static learning rate.

We would expect the adaptive learning rate versions of the algorithm to perform similarly or better, perhaps adapting to the problem in fewer training epochs, but importantly, to result in a more stable model.

Tying these elements together, the complete example is listed below.

Running the example creates a single figure that contains four line plots for the different evaluated optimization algorithms. Classification accuracy on the training dataset is marked in blue, whereas accuracy on the test dataset is marked in orange.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Again, we can see that SGD with a default learning rate of 0.01 and no momentum does learn the problem, but requires nearly all 200 epochs and results in volatile accuracy on the training data and much more so on the test dataset. The plots show that all three adaptive learning rate methods learning the problem faster and with dramatically less volatility in train and test set accuracy.

Both RMSProp and Adam demonstrate similar performance, effectively learning the problem within 50 training epochs and spending the remaining training time making very minor weight updates, but not converging as we saw with the learning rate schedules in the previous section.

Line Plots of Train and Test Accuracy for a Suite of Adaptive Learning Rate Methods on the Blobs Classification Problem

Line Plots of Train and Test Accuracy for a Suite of Adaptive Learning Rate Methods on the Blobs Classification Problem

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Posts

Papers

Books

API

Articles

Summary

In this tutorial, you discovered the effects of the learning rate, learning rate schedules, and adaptive learning rates on model performance.

Specifically, you learned:

  • How large learning rates result in unstable training and tiny rates result in a failure to train.
  • Momentum can accelerate training and learning rate schedules can help to converge the optimization process.
  • Adaptive learning rates can accelerate training and alleviate some of the pressure of choosing a learning rate and learning rate schedule.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Develop Better Deep Learning Models Today!

Better Deep Learning

Train Faster, Reduce Overftting, and Ensembles

...with just a few lines of python code

Discover how in my new Ebook:
Better Deep Learning

It provides self-study tutorials on topics like:
weight decay, batch normalization, dropout, model stacking and much more...

Bring better deep learning to your projects!

Skip the Academics. Just Results.

See What's Inside

64 Responses to Understand the Impact of Learning Rate on Neural Network Performance

  1. Avatar
    tang February 10, 2019 at 1:46 pm #

    Thanks for your post, and i have a question. I use adam as the optimizer, and I use the LearningRateMonitor CallBack to record the lr on each epoch. the result is always 0.001. Is that because adam is adaptive for each parameter of the model??

    • Avatar
      Jason Brownlee February 11, 2019 at 7:55 am #

      Correct.

      • Avatar
        tang February 12, 2019 at 5:38 pm #

        Is that means we can’t record the change of learning rates when we use adam as optimizer?

        • Avatar
          Jason Brownlee February 13, 2019 at 7:54 am #

          Correct. Use SGD.

          • Avatar
            Bryan October 8, 2021 at 2:47 pm #

            Dear Dr. Jason Brownlee,

            What a pleasure to read your blog. Your content is very informative and useful. Thank you very much for your hard work.

            I have a question regarding the adaptive optimizer. As you said, it is better to use SGD to evaluate the best learning rate (lr) on our model. In this case, for example, if I found $lr = 0.01$ using SGD, and I found Adagrad is the best optimizer for my model, is that means I should use $Adagrad$ with $lr = 0.01. If that is the case, I wonder why we need to define the learning rate for adaptive optimizers since they are supposed to be adapative.

            I look forward to reading your answer.
            Thank you very much

          • Adrian Tam
            Adrian Tam October 13, 2021 at 5:17 am #

            Adagrad needs to have an initial learning rate. And it is less sensitive to the learning rate, but not totally insensitive about it. You may find changing the learning rate have slight impact to the result in this case.

  2. Avatar
    Corentin February 12, 2019 at 9:04 am #

    Great tutorial Jason, as usual.

    Could you write a blog post about hyper parameter tuning using “hpsklearn” and/or hyperopt?

    That would be awesome!

    Thanks.

  3. Avatar
    Wonbin February 15, 2019 at 10:39 pm #

    Fantastic post Jason, Thanks!

    Does it make sense or could we expect an improved performance from doing learning rate decay with adaptive learning decay methods like Adam?

    Thanks.

    • Avatar
      Jason Brownlee February 16, 2019 at 6:19 am #

      Not really as each weight has its own learning rate.

  4. Avatar
    abbas March 15, 2019 at 1:34 pm #

    jason! again the post was awesome,while running the code
    from sklearn.datasets.samples_generator from keras.layers import Dense

    i got the error
    File “”, line 2
    from sklearn.datasets.samples_generator from keras.layers import Dense
    ^
    SyntaxError: invalid syntax

    • Avatar
      Jason Brownlee March 15, 2019 at 2:31 pm #

      Thanks.

      Perhaps double check that you copied all of the code, and with the correct indenting.

  5. Avatar
    abbas March 15, 2019 at 2:08 pm #

    we cant change learning rate and momentum for Adam and Rmsprop right?its mean they are pre-defined and fix?i just want to know if they adapt themselve according to the model??

  6. Avatar
    sukhpal March 16, 2019 at 10:07 pm #

    sir how we can plot in a single plot instead of showing results in various subplot

  7. Avatar
    sukhpal April 2, 2019 at 4:57 pm #

    sir please provide the code for plot of various optimizer on single plot

  8. Avatar
    td April 8, 2019 at 10:53 am #

    When lr is decayed by 10 (e.g., when training a CIFAR-10 ResNet), the accuracy increases suddenly. Can you please tell me what exactly happens to the weights when the lr is decayed? For example, one would think that the step size is decreasing, so the weights would change more slowly. But at the same time, the gradient value likely increased rapidly (since the loss plateaus before the lr decay — which means that the training process was likely at some kind of local minima or a saddle point; hence, gradient values would be small and the loss is oscillating around some value). So, my question is, when lr decays by 10, do the CNN weights change rapidly or slowly??

    Thanks. Any thoughts would be greatly appreciated!

    • Avatar
      Jason Brownlee April 8, 2019 at 1:57 pm #

      When the lr is decayed, less updates are performed to model weights – it’s very simple.

      When you say 10, do you mean a factor of 10?

      If you subtract 10 fro, 0.001, you will get a large negative number, which is a bad idea for a learning rate.

      • Avatar
        td April 9, 2019 at 12:17 am #

        Thanks for the response. I meant a factor of 10 of course.

        • Avatar
          Jason Brownlee April 9, 2019 at 6:28 am #

          A decay on the learning rate means smaller changes to the weights, and in turn model performance.

  9. Avatar
    sukhpal April 9, 2019 at 3:47 pm #

    sir please provide the code for single plot for various subplot

  10. Avatar
    James August 23, 2019 at 1:55 am #

    Thanks for the great tutorial! Would you mind explaining how to decide which metric to monitor when you using ReduceLROnPlateau? For example, what are advantage/disadvantage to monitor val_loss vs val_acc? Thanks!

  11. Avatar
    Mark November 12, 2019 at 6:28 am #

    I appreciate your blog. Tnx, Mark

  12. Avatar
    Agamemnon December 6, 2019 at 4:52 am #

    Just a typo suggestion: I believe “weight decay” should read “learning rate decay”.

    Thanks for the article!

  13. Avatar
    fethiye February 1, 2020 at 10:29 pm #

    Hi, great blog thanks. I have a question. You initialize model in for loop with model = Sequential. Is it enough for initializing. Why don’t you use keras.backend.clear_session() for clear everything for backend?

  14. Avatar
    Hassaan Ghalib February 5, 2020 at 10:42 pm #

    Hi, Thanks for the amazing post. Learned a lot!
    Please make a minor spelling correction in the below line in Learning Rate Schedule
    section.

    “model.fig(…, callbacks=[rlrop])”

  15. Avatar
    Mark May 25, 2020 at 8:05 am #

    Do you have a tutorial on specifying a user defined cost function for a keras NN, I am particularly interested in how you present it to the system.

    • Avatar
      Jason Brownlee May 25, 2020 at 1:23 pm #

      I do not, sorry.

      This will give you ideas based on a custom metric:
      https://machinelearningmastery.com/custom-metrics-deep-learning-keras-python/

      • Avatar
        Mark Littlewood May 25, 2020 at 6:05 pm #

        Interesting link, one prthe custom loss required problem I ran into was that the custom loss required tensors as its data and I was not up to scratch on representing data as tensors but your piece suggests you use ‘backend’ to get keras to somehow convert them ?

        • Avatar
          Jason Brownlee May 26, 2020 at 6:18 am #

          Yes, you can manipulate the tensors using the backend functions.

      • Avatar
        Mark Littlewood May 25, 2020 at 6:11 pm #

        I will give this a try

        import tensorflow.keras.backend as K
        import numpy as np

        a = np.array([1,2,3])
        b = K.constant(a)
        print(b)

        #

        print(K.eval(b))

        # array([1., 2., 3.], dtype=float32)

  16. Avatar
    N Kumaran June 29, 2020 at 9:42 pm #

    Learning rate of CNN optimizer is 0.0001, corresponding batch size is 16 and the training time efficiency is 1.82 ms. For RNN the learning rate is 0.001, the batch size is 1 and time efficiency?

    Any one can say efficiency of RNN, where it is learning rate is 0.001 and batch size is one

    • Avatar
      Jason Brownlee June 30, 2020 at 6:27 am #

      RNN are not super efficient, but often more capable.

  17. Avatar
    Peter July 19, 2020 at 6:04 pm #

    Nice post sir!
    It was really explanatory .

    Noticed the function in the LearningRateScheduler code block lacks a colon.

    Thanks
    Peter.

  18. Avatar
    Pedro August 4, 2020 at 5:14 am #

    Nice post!

    When using Adam, is it legit or recommended to change the learning rate once the model reaches a plateu to see if there is a better performance?

    For example, if the model starts with a lr of 0.001 and after 200 epochs it converges to some point. Then, compile the model again with a lower learning rate, load the best weights and then run the model again to see what can be obtained.

    • Avatar
      Jason Brownlee August 4, 2020 at 6:44 am #

      No. Adam adapts the rate for you. That is the benefit of the method.

  19. Avatar
    Abhi Bhagat September 11, 2020 at 3:18 pm #

    Section : Learning Rate Decay

    The Highlighted word in :

    We can see that the large decay values of 1E-1 and 1E-2 indeed decay the learning rate too rapidly for this model on this problem and result in poor performance. The **larger** decay values do result in better performance, with the value of 1E-4 perhaps causing in a similar result as not using decay at all. In fact, we can calculate the final learning rate with a decay of 1E-4 to be about 0.0075, only a little bit smaller than the initial value of 0.01.

    Typo there : **larger** must me changed to “smaller” .

    please fix.

  20. Avatar
    Firas Obeid October 11, 2020 at 2:19 am #

    Jason,
    Do we decrease LR and increase epochs proportionally same as we treat number of trees and LR in ensemble models? As an overview…

    • Avatar
      Jason Brownlee October 11, 2020 at 6:54 am #

      It might help. Not always. Try on your model/data and see if it helps.

  21. Avatar
    Shobi February 9, 2021 at 7:20 am #

    Hi Jason,

    Thank you so much for your nice article.

    Is learning rate decay a regularization technique? If so, how?

    Thank you!

    • Avatar
      Jason Brownlee February 9, 2021 at 7:48 am #

      Not really, it slows down learning but I would not consider it a regularization technique per se.

      • Avatar
        Shobi February 10, 2021 at 2:00 am #

        Thank you so much for your kind response. What about Reduce Learning rate on Plateau?

        • Avatar
          Jason Brownlee February 10, 2021 at 8:11 am #

          What about it? Do you mean how to do it or do you mean is it a regularization method?

          • Avatar
            Shuaib February 15, 2021 at 2:07 am #

            Hi Jason,

            I meant, would you consider Reduce Learning rate on Plateau as a regularization method?

            Thanks!

          • Avatar
            Jason Brownlee February 15, 2021 at 5:48 am #

            Yes, I guess so.

  22. Avatar
    aditya June 22, 2021 at 8:27 am #

    Amazing blog, thank you!

  23. Avatar
    Hussein Kassem June 27, 2023 at 8:45 pm #

    This sentence “A learning rate that is too large can cause the model to converge too quickly to a suboptimal solution, whereas a learning rate that is too small can cause the process to get stuck.” is incorrect.
    When the learning rate is set too high, the algorithm can overshoot the minimum and oscillate around it or even diverge. This means that the algorithm may not be able to reach convergence at all, let alone converge to a suboptimal solution.

    • Avatar
      James Carmichael June 28, 2023 at 9:58 am #

      Thank you for your feedbak Hussein!

Leave a Reply