The choice of optimization algorithm for your deep learning model can mean the difference between good results in minutes, hours, and days.

The Adam optimization algorithm is an extension to stochastic gradient descent that has recently seen broader adoption for deep learning applications in computer vision and natural language processing.

In this post, you will get a gentle introduction to the Adam optimization algorithm for use in deep learning.

After reading this post, you will know:

- What the Adam algorithm is and some benefits of using the method to optimize your models.
- How the Adam algorithm works and how it is different from the related methods of AdaGrad and RMSProp.
- How the Adam algorithm can be configured and commonly used configuration parameters.

Let’s get started.

## What is the Adam optimization algorithm?

Adam is an optimization algorithm that can used instead of the classical stochastic gradient descent procedure to update network weights iterative based in training data.

Adam was presented by Diederik Kingma from OpenAI and Jimmy Ba from the University of Toronto in their 2015 ICLR paper (poster) titled “Adam: A Method for Stochastic Optimization“. I will quote liberally from their paper in this post, unless stated otherwise.

The algorithm is called Adam. It is not an acronym and is not written as “ADAM”.

… the name Adam is derived from adaptive moment estimation.

When introducing the algorithm, the authors list the attractive benefits of using Adam on non-convex optimization problems, as follows:

- Straightforward to implement.
- Computationally efficient.
- Little memory requirements.
- Invariant to diagonal rescale of the gradients.
- Well suited for problems that are large in terms of data and/or parameters.
- Appropriate for non-stationary objectives.
- Appropriate for problems with very noisy/or sparse gradients.
- Hyper-parameters have intuitive interpretation and typically require little tuning.

## How Does Adam Work?

Adam is different to classical stochastic gradient descent.

Stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates and the learning rate does not change during training.

A learning rate is maintained for each network weight (parameter) and separately adapted as learning unfolds.

The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.

The authors describe Adam as combining the advantages of two other extensions of stochastic gradient descent. Specifically:

**Adaptive Gradient Algorithm**(AdaGrad) that maintains a per-parameter learning rate that improves performance on problems with sparse gradients (e.g. natural language and computer vision problems).**Root Mean Square Propagation**(RMSProp) that also maintains per-parameter learning rates that are adapted based on the average of recent magnitudes of the gradients for the weight (e.g. how quickly it is changing). This means the algorithm does well on online and non-stationary problems (e.g. noisy).

Adam realizes the benefits of both AdaGrad and RMSProp.

Instead of adapting the parameter learning rates based on the average first moment (the mean) as in RMSProp, Adam also makes use of the average of the second moments of the gradients (the uncentered variance).

Specifically, the algorithm calculates an exponential moving average of the gradient and the squared gradient, and the parameters beta1 and beta2 control the decay rates of these moving averages.

The initial value of the moving averages and beta1 and beta2 values close to 1.0 (recommended) result in a bias of moment estimates towards zero. This bias is overcome by first calculating the biased estimates before then calculating bias-corrected estimates.

The paper is quite readable and I would encourage you to read it if you are interested in the specific implementation details.

## Adam is Effective

Adam is a popular algorithm in the field of deep learning because it achieves good results fast.

Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods.

In the original paper, Adam was demonstrated empirically to show that convergence meets the expectations of the theoretical analysis. Adam was applied to the logistic regression algorithm on the MNIST character recognition and IMDB sentiment analysis datasets, a Multilayer Perceptron algorithm on the MNIST dataset and Convolutional Neural Networks on the CIFAR-10 image recognition dataset. They conclude:

Using large models and datasets, we demonstrate Adam can efficiently solve practical deep learning problems.

Sebastian Ruder developed a comprehensive review of modern gradient descent optimization algorithms titled “An overview of gradient descent optimization algorithms” published first as a blog post, then a technical report in 2016.

The paper is basically a tour of modern methods. In his section titled “*Which optimizer to use?*“, he recommends using Adam.

Insofar, RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. […] its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice.

In the Stanford course on deep learning for computer vision titled “CS231n: Convolutional Neural Networks for Visual Recognition” developed by Andrej Karpathy, et al., the Adam algorithm is again suggested as the default optimization method for deep learning applications.

In practice Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSProp. However, it is often also worth trying SGD+Nesterov Momentum as an alternative.

And later stated more plainly:

The two recommended updates to use are either SGD+Nesterov Momentum or Adam.

Adam is being adapted for benchmarks in deep learning papers.

For example, it was used in the paper “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention” on attention in image captioning and “DRAW: A Recurrent Neural Network For Image Generation” on image generation.

Do you know of any other examples of Adam? Let me know in the comments.

## Adam Configuration Parameters

**alpha**. Also referred to as the learning rate or step size. The proportion that weights are updated (e.g. 0.001). Larger values (e.g. 0.3) results in faster initial learning before the rate is updated. Smaller values (e.g. 1.0E-5) slow learning right down during training**beta1**. The exponential decay rate for the first moment estimates (e.g. 0.9).**beta2**. The exponential decay rate for the second-moment estimates (e.g. 0.999). This value should be set close to 1.0 on problems with a sparse gradient (e.g. NLP and computer vision problems).**epsilon**. Is a very small number to prevent any division by zero in the implementation (e.g. 10E-8).

Further, learning rate decay can also be used with Adam. The paper uses a decay rate alpha = alpha/sqrt(t) updted each epoch (t) for the logistic regression demonstration.

The Adam paper suggests:

Good default settings for the tested machine learning problems are alpha=0.001, beta1=0.9, beta2=0.999 and epsilon=10−8

The TensorFlow documentation suggests some tuning of epsilon:

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.

We can see that the popular deep learning libraries generally use the default parameters recommended by the paper.

- TensorFlow: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08.

Keras: lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0. - Blocks: learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-08, decay_factor=1.
- Lasagne: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
- Caffe: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08
- MxNet: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8
- Torch: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8

Do you know of any other standard configurations for Adam? Let me know in the comments.

## Further Reading

This section lists resources to learn more about the Adam optimization algorithm.

- Adam: A Method for Stochastic Optimization, 2015
- Stochastic gradient descent on Wikipedia
- An overview of gradient descent optimization algorithms, 2016
- ADAM: A Method for Stochastic Optimization (a review)
- Optimization for Deep Networks (slides)
- Adam: A Method for Stochastic Optimization (slides)

Do you know of any other good resources on Adam? Let me know in the comments.

## Summary

In this post, you discovered the Adam optimization algorithm for deep learning.

Specifically, you learned:

- Adam is a replacement optimization algorithm for stochastic gradient descent for training deep learning models.
- Adam combines the best properties of the AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems.
- Adam is relatively easy to configure where the default configuration parameters do well on most problems.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The name thing is a little strange. What was so wrong with AdaMomE? The abbreviated name is only useful if it encapsulates the name, adaptive moment estimation. I think part of the process of writing useful papers is coming up with an abbreviation that will not irritate others in the field, such as anyone named Adam.

Adam is catchy.

Using it already for a year , don’t see any reason to use anything different . My main issue with deep learning remains the fact that a lot of efficiency is lost due to the fact that neural nets have a lot of redundant symmetry built in that leads to multiple equivalent local optima . There must be a way to address this mathematically . It puzzles me that nobody had done anything about . If you did this in combinatorics (Traveling Salesman Problems type of problems ), this would qualify as a horrendous model formulation .

It would be great to see what you can dig up on the topic. Neural nets have been studied for a long time by some really bright people.

Those bright people may excel in statistics , but non linear non convex optimization is a very specialized field where other very bright people excel . The same applies to Integer and Combinatorial optimization : very specialized field .The days of “homo universalis” are long gone . By the way , although I am impressed by recent results in deep learning , I am not so deeply impressed by the technology . It’s just an unconstrained very big non linear optimization problem , so what ? And the thing is , you should not even try to find the true optimum , because that is 100% sure to overfit . So , in the end , we have to conclude that true learning aka generalization is not the same as optimizing some objective function , Basically , we still don’t know what “learning is” , but we know that iit s not “deep learning” . I have a hunch that this (deep learning) approach to “general AI” will fail .

I would argue deep learning methods only address the perception part of AI.

But what you describe is a result of using to many nodes, you fear over-fitting.

With the proper amount of nodes they dont become ‘beasts’ of redundant logic.

But i guess a lot of people are missing the point about what to train, with what data, and with the best neural network for that task.

I just red an article in which someone improved natural language to text, because he thought about those thinks, and as a result he didnt require deep nets , he was also able to train easily for any language (as in contrast to the most common 5). With a better speech to text score.

I think with the advancement in hardware people forget often about the ‘beauty’ of properly efficient coding, the same counts for neural network designs

I often say that the points of biggest leverage are in the framing of the problem.

Hi Jason. Thanks for you amazing tutorials. I have already read some, and already putting some into practice as well. As many other blogs on the net, I found yours by searching on google “how to predict data after training a model”, since I am trying to work on a personal project using LSTM. Surely enough I ran into your great informational blog.

One thing I wanted to comment on, is the fact that you mention about not being necessary to get a phd to become a master in machine learning, which I find to be a biased proposition all depending on the goal of the reader. However, most phd graduates I have found online – to mention some, yourself, Sebastian as you recommended in this post, Andrew Ng, Matt Mazur, Michael Nielsen, Adrian Rosebrock, some of the people I follow and write amazing content all have phd’s. Excluding Siraj, a current youtube blogger that makes amazing videos on machine learning – one of the few I have seen thus far that does not hold a phd, not even a bachelors. My point and question to you is.. Without a phd, would you have had the skills to make all this content found in your website? And I’m not referring to just being adept in the topic of AI, but also writing amazing in depth topics, creating amazing design for the site, book for others to read, and even clean codes in python?

As a different note, about me, for the past ten years, my profession has been in Information technology. I currently work as a systems administrator for a medium size enterprise, but for the past three years since I started college, I grew this passion toward programming, which eventually grew into machine learning. I became obsessed with Neural Networks and its back prop, and currently are now obsessed with learning more about LSTM’s. I have been testing with one of your codes. Although I still struggle with knowing how to predict data. Without being able to predict data, I feel lost. For example, most articles I find, including yours (Sorry if I haven’t found my answer yet in your site), only show how to train data, and test data. So for example, this is what I find;

x= 0001 y= 0010

x= 0010 y= 0011

x[m1,,,,,,m]

y[m1,,,,,,m]

—Usually the output I get printer—

But to this day, I haven’t learned how to feed unknown data to a network and it to predict the next unknown output such as;

if x== 0100, then, what will ‘y’ be? that is, without feeding the network the next possible, rather its suppose to tell me based on the pattern learned before.

If a training set == m, and test set also == m, then I should be able to ask for a result == n. Maybe you can guide towards the right direction?

I am currently in the first semester of a bachelor in Computer Science, and always have in the back of my head in pursuing all the way towards a phd, this is, to become an amazing writer of my own content in the field of machine learning – not Just become a “so so” data scientist, although I am still very far from getting to that level. Frankly, what really calls my attention in pursuing a higher degree, is the fact that the math learned in school, is harder to pick up as a hobby. Which is my case; this is my every day hobby.

Thanks for everything Jason, its now time to continue reading through your blog… :-p.

Making a site and educational material like this is not the same as delivering results with ML at work.

The same as the difference from a dev and a college professor teaching development. Very different skill sets.

Consider this post on finalizing a model in order to make predictions:

http://machinelearningmastery.com/train-final-machine-learning-model/

Does that help?

Hey Jason! What about Nadam vs Adam?

Nadam si a Keras optimizer which is essentially ‘Adam’+ Nesterov momentum.

Also what is Nesterov momentum?

And how can we figure out a good epsilon for a particular problem?

Thanks a lot!

Great question. I don’t know much about it sorry.

https://i.stack.imgur.com/wBIpz.png

hope this helps 🙂

here http://cs229.stanford.edu/proj2015/054_report.pdf you can find the paper

Thanks for sharing.

Thank you for your great article. If I use Adam as an optimizer, do I still need to do learning rate scheduleing during the training?

Thanks a lot

No. I don’t believe so.

As a prospective author who very likely will suggest a gentleman named Adam as a possible reviewer, I reject the author’s spelling of “Adam” and am using ADAM, which I call an optimization, “Algorithm to Decay by Average Moments” which uses the original authors’ term “decay” for what Tensorflow calls “loss.”

The variance here seems incorrect. I don’t mean incorrect as in different from the paper; I mean that it doesn’t truly seem to resemble variance; shouldn’t variance take into account the mean as well?

Here it appears the variance will continue to grow throughout the entire process of training. Wouldn’t we want the variance to shrink when we encounter hyper-surfaces with little change and growing variance on hyper-surfaces that are volatile?