Crash Course in Recurrent Neural Networks for Deep Learning

Another type of neural network is dominating difficult machine learning problems involving sequences of inputs: recurrent neural networks.

Recurrent neural networks have connections that form loops, adding feedback and memory to the network over time. This memory allows this type of network to learn and generalize across sequences of inputs rather than individual patterns.

A powerful type of Recurrent Neural Network called the Long Short-Term Memory Network has been shown to be particularly effective when stacked into a deep configuration, achieving state-of-the-art results on a diverse array of problems from language translation to automatic captioning of images and videos.

In this post, you will get a crash course in recurrent neural networks for deep learning, acquiring just enough understanding to start using LSTM networks in Python with Keras.

After reading this post, you will know:

  • The limitations of Multilayer Perceptrons that are addressed by recurrent neural networks
  • The problems that must be addressed to make Recurrent Neural networks useful
  • The details of the Long Short-Term Memory networks used in applied deep learning

Kick-start your project with my new book Long Short-Term Memory Networks With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Photo by Martin Fisch, some rights reserved.

Support for Sequences in Neural Networks

Some problem types are best framed with a sequence as either the input or the output.

For example, consider a univariate time series problem, like the price of a stock over time. This dataset can be framed as a prediction problem for a classical feed-forward multilayer perceptron network by defining a window size (e.g., 5) and training the network to make short-term predictions from the fixed-size window of inputs.

This would work but is very limited. The window of inputs adds memory to the problem but is limited to just a fixed number of points and must be chosen with sufficient knowledge of the problem. A naive window would not capture the broader trends over minutes, hours, and days that might be relevant to making a prediction. From one prediction to the next, the network only knows about the specific inputs it is provided.
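As a concrete sketch of this window framing (the function name and the toy price series below are made up for illustration, not from the post), a window size of 5 turns a univariate series into fixed-size input rows paired with next-step targets:

```python
import numpy as np

def make_windows(series, window=5):
    """Frame a univariate series as (samples, window) inputs and next-step targets."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])   # fixed-size window of past values
        y.append(series[i + window])     # the next value, to be predicted
    return np.array(X), np.array(y)

prices = np.arange(10, 20, dtype=float)  # toy "stock price" series
X, y = make_windows(prices, window=5)
print(X.shape, y.shape)  # (5, 5) (5,)
print(X[0], y[0])        # [10. 11. 12. 13. 14.] 15.0
```

Every choice of `window` bakes in an assumption about how much history matters, which is exactly the limitation described above.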

Univariate time series prediction is important, but there are even more interesting problems that involve sequences.

Consider the following taxonomy of sequence problems that require mapping an input to output (taken from Andrej Karpathy).

  • One-to-Many: sequence output for image captioning
  • Many-to-One: sequence input for sentiment classification
  • Many-to-Many: sequence in and out for machine translation
  • Synced Many-to-Many: synced sequences in and out for video classification

You can also see that the remaining one-to-one case, a single input mapped to a single output, is the classical feed-forward neural network used for a prediction task like image classification.
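One way to keep the taxonomy straight is to think in terms of input and output tensor shapes. The sketch below uses the (samples, time steps, features) convention that Keras uses for sequence data; all of the sizes are made-up examples, not values from the post:

```python
# Illustrative (samples, time steps, features) shapes for each mapping.
# All sizes are invented for the example.
one_to_one   = {"in": (1, 1, 784),  "out": (1, 1, 10)}    # image classification
one_to_many  = {"in": (1, 1, 4096), "out": (1, 20, 50)}   # image captioning
many_to_one  = {"in": (1, 50, 300), "out": (1, 1, 2)}     # sentiment classification
many_to_many = {"in": (1, 30, 300), "out": (1, 25, 300)}  # machine translation

for name, m in [("one_to_one", one_to_one), ("one_to_many", one_to_many),
                ("many_to_one", many_to_one), ("many_to_many", many_to_many)]:
    print(name, "in:", m["in"], "out:", m["out"])
```

The time-step dimension (the middle number) is what distinguishes the cases: it is 1 on whichever side is not a sequence.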

Sequence problems are an important class of problem and one where deep learning has recently shown impressive results. State-of-the-art results have been achieved using a type of network specifically designed for sequence problems called the recurrent neural network.

Need help with LSTMs for Sequence Prediction?

Take my free 7-day email course and discover 6 different LSTM architectures (with code).

Click to sign-up and also get a free PDF Ebook version of the course.

Recurrent Neural Networks

Recurrent neural networks or RNNs are a special type of neural network designed for sequence problems.

Given a standard feed-forward multilayer perceptron network, a recurrent neural network can be thought of as the addition of loops to the architecture. For example, in a given layer, each neuron may pass its signal laterally (sideways) in addition to forward to the next layer. The output of the network may feed back as an input to the network with the next input vector. And so on.

The recurrent connections add state or memory to the network and allow it to learn broader abstractions from the input sequences.

The field of recurrent neural networks is well established. Before the techniques could be effective on real problems, however, two major issues needed to be resolved:

  1. How to train the network with backpropagation
  2. How to stop gradients vanishing or exploding during training

1. How to Train Recurrent Neural Networks

The staple technique for training feed-forward neural networks is to backpropagate error and update the network weights.

Backpropagation breaks down in a recurrent neural network because of the recurrent or loop connections.

This was addressed with a modification of the backpropagation technique called Backpropagation Through Time or BPTT.

Instead of performing backpropagation on the recurrent network as-is, the structure of the network is unrolled: copies are created of the neurons that have recurrent connections. For example, a single neuron with a connection to itself (A->A) could be represented as two neurons with the same weight values (A->B).

This allows the cyclic graph of a recurrent neural network to be turned into an acyclic graph like a classic feed-forward neural network, and backpropagation can be applied.
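To make the unrolling concrete, here is a minimal sketch of the forward pass of a single recurrent neuron unrolled over three time steps. The weights `w_x` and `w_h` and the tanh activation are illustrative choices, not prescribed by the post; the key point is that every unrolled copy shares the same weights:

```python
import numpy as np

# One recurrent neuron unrolled over an input sequence. The same two
# weights are reused at every step, just as A->A becomes A->B with
# identical weight values.
w_x, w_h = 0.5, 0.8          # input weight and recurrent (loop) weight
h = 0.0                      # initial state carried by the loop connection
states = []
for x in [1.0, 0.0, 1.0]:    # one unrolled "copy" of the neuron per time step
    h = np.tanh(w_x * x + w_h * h)
    states.append(h)
print(states)
```

Because the unrolled graph is acyclic, ordinary backpropagation can flow backward through the copies, which is what Backpropagation Through Time does.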

2. How to Have Stable Gradients During Training

When backpropagation is used in very deep neural networks and unrolled recurrent neural networks, the gradients that are calculated to update the weights can become unstable.

They can become very large numbers (the exploding gradient problem) or very small numbers (the vanishing gradient problem). Either way, these unstable values are used to update the weights in the network, making training unstable and the network unreliable.
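The instability is easy to see with a back-of-the-envelope sketch: backpropagating through many unrolled time steps repeatedly scales the gradient by (roughly) the recurrent weight, so the result collapses toward zero or blows up depending on whether that weight is below or above 1. The numbers below are illustrative only:

```python
# Rough illustration: a gradient passed back through `steps` unrolled
# time steps is scaled by the recurrent weight at every step.
def gradient_after(weight, steps):
    return weight ** steps

print(gradient_after(0.5, 50))   # vanishes toward zero (~8.9e-16)
print(gradient_after(1.5, 50))   # explodes (~6.4e+08)
```

Fifty time steps is modest for a real sequence, yet the gradient has already either vanished or exploded by roughly sixteen orders of magnitude.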

This problem is alleviated in deep multilayer perceptron networks through the use of the rectified linear (ReLU) transfer function and, more exotically, through the now less popular approach of unsupervised pre-training of layers.

In recurrent neural network architectures, this problem has been alleviated using a new type of architecture called the Long Short-Term Memory network, which allows deep recurrent networks to be trained.

Long Short-Term Memory Networks

The Long Short-Term Memory or LSTM network is a recurrent neural network that is trained using Backpropagation Through Time and that overcomes the vanishing gradient problem.

As such, it can be used to create large (stacked) recurrent networks that, in turn, can be used to address difficult sequence problems in machine learning and achieve state-of-the-art results.

Instead of neurons, LSTM networks have memory blocks connected into layers.

A block has components that make it smarter than a classical neuron, including a memory for recent sequences. A block contains gates that manage the block's state and output. A block (also called a unit) operates upon an input sequence, and each gate within it uses a sigmoid activation function to control whether it is triggered, making the change of state and the addition of information flowing through the block conditional.

There are three types of gates within a memory unit:

  • Forget Gate: conditionally decides what information to discard from the unit.
  • Input Gate: conditionally decides which values from the input will update the memory state.
  • Output Gate: conditionally decides what to output based on input and the memory of the unit.

Each unit is like a mini state machine where the gates of the units have weights that are learned during the training procedure.
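The three gates can be sketched directly from this description. Below is a minimal NumPy implementation of a single LSTM step; the stacked parameter layout, the variable names, and all sizes are illustrative assumptions rather than the post's notation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W, U, b hold the four gates' parameters stacked together."""
    z = W @ x + U @ h + b                  # pre-activations, shape (4 * units,)
    n = len(h)
    f = sigmoid(z[0*n:1*n])                # forget gate: what to discard from c
    i = sigmoid(z[1*n:2*n])                # input gate: which new values to write
    o = sigmoid(z[2*n:3*n])                # output gate: what to expose as output
    g = np.tanh(z[3*n:4*n])                # candidate values for the memory
    c = f * c + i * g                      # conditionally update the unit's state
    h = o * np.tanh(c)                     # conditionally emit the unit's output
    return h, c

rng = np.random.default_rng(0)
units, n_in = 3, 2
W = rng.normal(size=(4 * units, n_in))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)
h, c = np.zeros(units), np.zeros(units)
for x in [np.array([1.0, 0.0]), np.array([0.0, 1.0])]:
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)  # (3,) (3,)
```

Note that the state `c` is only ever changed through the sigmoid-gated terms, which is what makes the memory update conditional.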

You can see how a layer of LSTM units can achieve sophisticated learning and memory, and it is not hard to imagine how higher-order abstractions may be achieved by stacking multiple such layers.


You have covered a lot of ground in this post. Below are some resources that you can use to go deeper into the topic of recurrent neural networks for deep learning.

  • Resources to learn more about recurrent neural networks and LSTMs
  • Popular tutorials for implementing LSTMs
  • Primary sources on LSTMs
  • People to follow doing great work with LSTMs


In this post, you discovered sequence problems and recurrent neural networks that can be used to address them.

Specifically, you learned:

  • The limitations of classical feed-forward neural networks and how recurrent neural networks can overcome these problems
  • The practical problems in training recurrent neural networks and how they are overcome
  • The Long Short-Term Memory network used to create deep recurrent neural networks

Do you have any questions about deep recurrent neural networks, LSTMs, or this post? Ask your question in the comments, and I will do my best to answer.

Develop LSTMs for Sequence Prediction Today!

Long Short-Term Memory Networks with Python

Develop Your Own LSTM models in Minutes

...with just a few lines of Python code

Discover how in my new Ebook:
Long Short-Term Memory Networks with Python

It provides self-study tutorials on topics like:
CNN LSTMs, Encoder-Decoder LSTMs, generative models, data preparation, making predictions and much more...

Finally Bring LSTM Recurrent Neural Networks to Your Sequence Prediction Projects

Skip the Academics. Just Results.

See What's Inside

29 Responses to Crash Course in Recurrent Neural Networks for Deep Learning

  1. Marcel, August 8, 2016 at 8:34 pm

    Thanks a million Jason! Machine Learning Mastery is exactly what I need! 🙂

  2. Supriya, June 16, 2017 at 9:29 pm

    I am doing chatbot implementation using Keras. In that from internet I found four LSTM layers were added in sequential model. Can you guide me how many LSTM nodes are needed for chatbot implementation? Is it good to have more LSTM nodes in the algorithm? How do I test its accuracy?

  3. Muhammad Shaban, August 25, 2018 at 9:43 pm

    This post is not really a crash course. It is just an introduction to RNN and LSTM. Moreover following statement in introduction section is also misleading. There are only few references to some tutorial and that’s all.

    “In this post you will get a crash course in recurrent neural networks for deep learning, acquiring just enough understanding to start using LSTM networks in Python with Keras.”

  4. Sahil, September 27, 2018 at 7:26 pm

    Amazing post. But I have a question
    You wrote, ” the structure of the network is unrolled”. How this is done?

    • Jason Brownlee, September 28, 2018 at 6:07 am

      In Keras, you can specify the “unroll” argument to the LSTM layer.

  5. Jas, October 13, 2018 at 8:31 pm

    Great work, Jason. Avid follower of all your blog posts!

  6. Vinit, October 17, 2018 at 1:29 pm

    Question regarding the BPTT algorithm: if we unroll A->A into A->B with the same weights, how do we avoid an infinite regress getting A->B->C->D->E…?

    Do we limit the number of times to look back, e.g. to 5? If we do that, then we’re just adding a hyper-parameter similar to the window size, right?

    If we don’t do that, we are forced to unfold it till the first item in the sequence, say the month’s data, or from the start of the sentence. In that case, do we set the initial value of the feedback to 0s? Also, what do we do when we have extremely long sequences? Do we need a way to localize the sequences to the immediate neighbours?

    Been trying to wrap my head around RNNs for a while, but I always get stuck at some of these implementational details (trying to build one without frameworks to really understand the workings)…

  7. Suraj, April 12, 2019 at 3:59 am

    Is the ResNet (residual networks) different from RNNs or LSTM? Or ResNet is another variant of RNN?

    • Jason Brownlee, April 12, 2019 at 7:53 am

      ResNet is a way of designing a deep CNN. It is a model architecture, rather than a type of neural net.

  8. Robert Feyerharm, August 31, 2019 at 12:19 am

    Question: What are the advantages of modeling a recurrent neural network vs. using another form of predictive model (such as a gradient boosting machine) with time lag variables (y = X_t + X_t-1 + X_t-2 + . . .?

    • Jason Brownlee, August 31, 2019 at 6:11 am

      It can help on problems where the output is a nonlinear function of recent inputs. More than just an autoregression, e.g. something lumpy where the output may be conditional on one or more prior inputs, but we don't know which or at which times. Like words in a sentence.

  9. Pete, December 11, 2019 at 8:16 pm

    I was hoping to get the know-how of neural networks, but again I found the incomprehensible jargon. Could you please explain what do you mean by:
    – ‘passing signal latterly’,
    – ‘the recurrent connections add state or memory’ – connections of what?,and what are ‘recurrent connections’?, are they different from normal connections between neurons?,
    – ‘the structure of the network is unrolled’ – can you also unfold the network?, how about squeezing?, collapsing?
    – ‘the gradients’ – do you mean weights?
    – ‘Rectifier transfer function’ – the formula would help,
    – ‘blocks’ – aren’t they just groups of neurons,
    – ‘a unit operates upon an input’ – Jesus Christ, a unit of what?
    and so on..

    Really, you should consider who do write for. If you write for your colleagues, who know already what you’re saying, then it makes no sense. If you write for non-specialist, newbies, you should really work on your language.

    • Jason Brownlee, December 12, 2019 at 6:16 am

      Thanks for your feedback. Perhaps I should better spell out the audience, clearly this post assumes a background in simple neural nets and I did not make that clear.

      I generally don’t write for absolute beginners, instead for practitioners.

  10. Rawan, February 17, 2020 at 11:47 am

    I am working on traffic flow prediction online Machine Learning Algorithms project and i want to choose a suitable DNN model for my project. which one should i use to give me accurate output???
    note: my data set in CSV format and as you wrote MLP is used for these issues. But is there other more accurate model to use??

    Thank you :))

    • Rawan, February 17, 2020 at 11:50 am

      Also, my choice must depend on complexity and time training.

    • Jason Brownlee, February 17, 2020 at 1:31 pm

      Perhaps start with an MLP and try a suite of model configurations to discover what works best. E.g. different numbers of layers/nodes/activation functions, etc.

  11. HRSH, September 10, 2020 at 8:17 pm

    Thank You … Very Good

  12. Mohsin Khan, September 25, 2020 at 7:36 pm

    Hello Jason, can we say that simple RNN is a traditional and shallow neural network, but LSTM RNN is a deep learning one ???

  13. arbiter007, June 1, 2023 at 1:52 am

    btw, I think “latterly” should be “laterally” i.e., “sideways”

    Otherwise, great stuff!


    • James Carmichael, June 1, 2023 at 5:10 am

      Thank you for your feedback and support!
