Gentle Introduction to Models for Sequence Prediction with Recurrent Neural Networks

Sequence prediction is a problem that involves using historical sequence information to predict the next value or values in the sequence.

The sequence may be symbols like letters in a sentence or real values like those in a time series of prices. Sequence prediction may be easiest to understand in the context of time series forecasting, as that version of the problem is already generally well understood.

In this post, you will discover the standard sequence prediction models that you can use to frame your own sequence prediction problems.

After reading this post, you will know:

  • How sequence prediction problems are modeled with recurrent neural networks.
  • The 4 standard sequence prediction models used by recurrent neural networks.
  • The 2 most common misunderstandings made by beginners when applying sequence prediction models.

Let’s get started.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. Sequence Prediction with Recurrent Neural Networks
  2. Models for Sequence Prediction
  3. Cardinality from Timesteps not Features
  4. Two Common Misunderstandings by Practitioners

Sequence Prediction with Recurrent Neural Networks

Recurrent Neural Networks, like Long Short-Term Memory (LSTM) networks, are designed for sequence prediction problems.

In fact, at the time of writing, LSTMs achieve state-of-the-art results in challenging sequence prediction problems like neural machine translation (translating English to French).

LSTMs work by learning a function (f(…)) that maps input sequence values (X) onto output sequence values (y).

The learned mapping function is static and may be thought of as a program that takes input variables and uses internal variables. Internal variables are represented by an internal state maintained by the network and built up or accumulated over each value in the input sequence.

… RNNs combine the input vector with their state vector with a fixed (but learned) function to produce a new state vector. This can in programming terms be interpreted as running a fixed program with certain inputs and some internal variables.

— Andrej Karpathy, The Unreasonable Effectiveness of Recurrent Neural Networks, 2015

The static mapping function may be defined with a different number of inputs or outputs, as we will review in the next section.
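
To make the program analogy concrete, the recurrent step can be sketched in a few lines of Python. This is a minimal vanilla RNN cell rather than the full LSTM machinery, and the weight matrix names are illustrative:

import numpy as np

def rnn_step(x, h, W_xh, W_hh, W_hy):
    # Combine the input vector (x) with the state vector (h) using a
    # fixed (but learned) function to produce a new state vector.
    h_new = np.tanh(np.dot(W_xh, x) + np.dot(W_hh, h))
    # Map the new state onto an output value.
    y = np.dot(W_hy, h_new)
    return y, h_new

The same function (the same weights) is applied at every time step; only the internal state changes.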


Models for Sequence Prediction

In this section, we will review the 4 primary models for sequence prediction.

We will use the following terminology:

  • X: The input sequence value; may be delimited by a time step, e.g. X(1).
  • u: The hidden state value; may be delimited by a time step, e.g. u(1).
  • y: The output sequence value; may be delimited by a time step, e.g. y(1).

One-to-One Model

A one-to-one model produces one output value for each input value.

One-to-One Sequence Prediction Model

The internal state for the first time step is zero; from that point onward, the internal state is accumulated over the prior time steps.

One-to-One Sequence Prediction Model Over Time

In the case of sequence prediction, this model would produce one forecast time step for each observed time step received as input.

This is a poor use of RNNs as the model has no chance to learn across input or output time steps (e.g. with backpropagation through time, or BPTT). If you find yourself implementing this model for sequence prediction, you may really intend to use a many-to-one model instead.
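
For reference, a minimal sketch of a one-to-one model in Keras might look as follows; the layer sizes are illustrative, and a stateful configuration would be needed to actually carry state across successive one-step calls:

from keras.models import Sequential
from keras.layers import LSTM, Dense

# One-to-one: one input time step (with one feature) maps to one output value.
model = Sequential()
model.add(LSTM(10, input_shape=(1, 1)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')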

One-to-Many Model

A one-to-many model produces multiple output values for one input value.

One-to-Many Sequence Prediction Model

The internal state is accumulated as each value in the output sequence is produced.

This model can be used for image captioning, where one image is provided as input and a sequence of words is generated as output.
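
One way to sketch a one-to-many model in Keras is to repeat the single input so that a decoder LSTM can emit one value per output time step; the sizes below are illustrative, not prescriptive:

from keras.models import Sequential
from keras.layers import LSTM, Dense, RepeatVector, TimeDistributed

n_out = 5  # illustrative number of output time steps

# One-to-many: one input value is repeated n_out times, and the LSTM
# produces one output value per output time step.
model = Sequential()
model.add(RepeatVector(n_out, input_shape=(1,)))
model.add(LSTM(10, return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.compile(optimizer='adam', loss='mse')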

Many-to-One Model

A many-to-one model produces one output value after receiving multiple input values.

Many-to-One Sequence Prediction Model

The internal state is accumulated with each input value before a final output value is produced.

In the case of time series, this model would use a sequence of recent observations to forecast the next time step. This architecture would represent the classical autoregressive time series model.
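
A minimal many-to-one sketch in Keras might look like the following, assuming a univariate series and illustrative layer sizes:

from keras.models import Sequential
from keras.layers import LSTM, Dense

n_in = 10  # illustrative number of input time steps

# Many-to-one: a sequence of n_in time steps (one feature each)
# maps to a single output value.
model = Sequential()
model.add(LSTM(10, input_shape=(n_in, 1)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')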

Many-to-Many Model

A many-to-many model produces multiple outputs after receiving multiple input values.

Many-to-Many Sequence Prediction Model

As with the many-to-one case, state is accumulated until the first output is created, but in this case multiple time steps are output.

Importantly, the number of input time steps does not have to match the number of output time steps. Think of the input and output time steps as operating at different rates.

In the case of time series forecasting, this model would use a sequence of recent observations to make a multi-step forecast.

In a sense, it combines the capabilities of the many-to-one and one-to-many models.
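
One common way to realize this in Keras is the encoder-decoder pattern; the sketch below is one possible framing with illustrative sizes, and it makes the independence of input and output lengths explicit:

from keras.models import Sequential
from keras.layers import LSTM, Dense, RepeatVector, TimeDistributed

n_in, n_out = 10, 5  # input and output lengths need not match

# Many-to-many: an encoder LSTM reads the input sequence into a fixed-length
# state, which is repeated n_out times and decoded into the output sequence.
model = Sequential()
model.add(LSTM(10, input_shape=(n_in, 1)))   # encoder
model.add(RepeatVector(n_out))               # bridge to the output length
model.add(LSTM(10, return_sequences=True))   # decoder
model.add(TimeDistributed(Dense(1)))         # one value per output time step
model.compile(optimizer='adam', loss='mse')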

Cardinality from Timesteps (not Features!)

A common point of confusion is to conflate the sequence mapping models above, which are defined in terms of time steps, with models that have multiple input and output features.

A sequence may be comprised of single values, one for each time step.

Alternatively, a sequence could just as easily represent a vector of multiple observations at each time step. Each item in the vector for a time step may be thought of as its own separate time series. This does not affect the description of the models above.

For example, a model that takes as input one time step of temperature and pressure and predicts one time step of temperature and pressure is a one-to-one model, not a many-to-many model.

Multiple-Feature Sequence Prediction Model

The model does take two values as input and predicts two values, but there is only a single sequence time step expressed for the input and predicted as output.

The cardinality of the sequence prediction models defined above refers to time steps, not features (e.g. univariate or multivariate sequences).
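
As a quick sketch of the temperature-and-pressure example above (sizes illustrative), note that only the feature count changes while the model remains one-to-one:

from keras.models import Sequential
from keras.layers import LSTM, Dense

# One-to-one with two features: one time step of (temperature, pressure)
# in, one time step of (temperature, pressure) out.
model = Sequential()
model.add(LSTM(10, input_shape=(1, 2)))  # 1 time step, 2 features
model.add(Dense(2))                      # 2 features at the single output step
model.compile(optimizer='adam', loss='mse')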

Two Common Misunderstandings by Practitioners

The confusion of features vs. time steps leads practitioners to two main misunderstandings when implementing recurrent neural networks:

1. Timesteps as Input Features

Observations at previous timesteps are framed as input features to the model.

This is the classical fixed-window approach to framing sequence prediction problems used by Multilayer Perceptrons. Instead, the sequence should be fed in one time step at a time.

This confusion may lead you to think you have implemented a many-to-one or many-to-many sequence prediction model when in fact you only have a single vector input for one time step.
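
The difference is visible in how the input array is shaped. A small sketch with made-up numbers:

import numpy as np

window = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # 5 past observations (illustrative)

# Misunderstanding: 5 past observations framed as 5 features at 1 time step,
# i.e. a single vector input.
X_as_features = window.reshape(1, 1, 5)  # (samples, time steps, features)

# Intended: 5 past observations framed as 5 time steps of 1 feature,
# i.e. a many-to-* model the LSTM can learn over with BPTT.
X_as_timesteps = window.reshape(1, 5, 1)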

2. Timesteps as Output Features

Predictions at multiple future time steps are framed as output features to the model.

This is the classical fixed-window approach to making multi-step predictions used by Multilayer Perceptrons and other machine learning algorithms. Instead, the sequence predictions should be generated one time step at a time.

This confusion may lead you to think you have implemented a one-to-many or many-to-many sequence prediction model when in fact you only have a single vector output for one time step (e.g. seq2vec not seq2seq).
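
The same contrast on the output side, as a rough Keras sketch with illustrative sizes:

from keras.models import Sequential
from keras.layers import LSTM, Dense, RepeatVector, TimeDistributed

# seq2vec: 5 future time steps framed as a 5-element vector output
# predicted at a single step.
vec = Sequential()
vec.add(LSTM(10, input_shape=(10, 1)))
vec.add(Dense(5))

# seq2seq: 5 future time steps generated one output time step at a time.
seq = Sequential()
seq.add(LSTM(10, input_shape=(10, 1)))
seq.add(RepeatVector(5))
seq.add(LSTM(10, return_sequences=True))
seq.add(TimeDistributed(Dense(1)))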

Note: framing timesteps as features in sequence prediction problems is a valid strategy, and could lead to improved performance even when using recurrent neural networks (try it!). The important point here is to understand the common pitfalls and not trick yourself when framing your own prediction problems.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered the standard models for sequence prediction with recurrent neural networks.

Specifically, you learned:

  • How sequence prediction problems are modeled with recurrent neural networks.
  • The 4 standard sequence prediction models used by recurrent neural networks.
  • The 2 most common misunderstandings made by beginners when applying sequence prediction models.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.



37 Responses to Gentle Introduction to Models for Sequence Prediction with Recurrent Neural Networks

  1. Raan July 19, 2017 at 4:34 am #

    Thanks for the article. This is very useful. Do you have any examples of forecasting multivariate time series using RNN?

    • Jason Brownlee July 19, 2017 at 8:30 am #

      I should have one on the blog soon, it has been scheduled.

  2. mriazi July 20, 2017 at 10:27 am #

    Hi Jason,

    Thank you very much for your great article and the fabulous blog. I've been following your blog for a few months now and have read most of your articles on RNNs.
    Like you have mentioned above, I'm struggling to correctly model my time-series prediction problem. It'll be great if you can help me with this.
    I have samples of sensor readings, each a vector of 64 timesteps. I would like to use an LSTM to learn the structure of the series and predict the next 64 timesteps.
    I think I will need to use a Many-to-Many model so that the model learns the input and predicts the output (64 values) based on what it has learned. I'm trying to use an LSTM for an unsupervised anomaly detection problem. I guess what I'm struggling with is that I want my model to learn the most common structure in my long time series and I'm kind of confused about how my input should be structured.
    Sorry, for the long description.
    Many thanks

    • Jason Brownlee July 21, 2017 at 9:26 am #

      I would recommend modeling it as a many-to-many supervised learning problem.

      Sorry, I don’t have experience using LSTMs for unsupervised problems, I need to do some reading.

  3. Paul August 2, 2017 at 3:22 pm #

    Hi, Jason. I’m always thankful that you posted great examples and posts.
    I have simple question.
    For predicting/forecasting time series data, are Multilayer NN and RNN (LSTM) techniques the best way to forecast future data?

    Thank you in advance.

    Best,
    Paul

    • Jason Brownlee August 3, 2017 at 6:43 am #

      There is no best way, I would encourage you to evaluate a suite of methods and see what works best for your problem.

  4. Gustavo August 12, 2017 at 5:16 am #

    Is sequence learning the same as online learning? What are the differences?

    • Jason Brownlee August 12, 2017 at 6:54 am #

      Hi Gustavo,

      No, a sequence is the structure of the data and prediction problem.

      Learning can be online or offline for sequence prediction, the same as for simpler regression and classification.

      Does that help?

      • Gustavo August 14, 2017 at 10:30 pm #

        It helps indeed, thanks. Best regards.

  5. hirohi August 21, 2017 at 12:18 pm #

    In the case of Many2Many and One2Many in this post, how do you compute the hidden states at the time steps when there is no input? Specifically, in One2Many, how do you compute “u(1)” despite the lack of “X(2)”? I think we can only compute Y(1), Y(2), Y(3) as a vector. If I'm wrong, could you tell me why, with examples such as image captioning or machine translation?

    • Jason Brownlee August 21, 2017 at 4:23 pm #

      Great question!

      It is common to teach the model with “start seq” and “end seq” inputs at the beginning and end of sequences to kick-off or close-off the sequence input or output.

      I have used this approach myself with image captioning models and translation.

      • hirohi August 22, 2017 at 11:33 am #

        I investigated many2many (encoder-decoder). As you said, we feed “start” to the LSTM to compute “u(1)”. My question included what input is necessary to compute “u(2)”. As a result of my investigation, we have to feed “y(2)” to compute “u(2)”.

        The below image is more accurate, right?
        http://suriyadeepan.github.io/img/seq2seq/seq2seq1.png

        • Jason Brownlee August 23, 2017 at 6:38 am #

          Yes, that is one way.

          Remember to explore many different framings of the problem to see what works best for your specific data.

          • hirohi August 23, 2017 at 12:30 pm #

            OK, thanks! I’ll try it!

  6. mrresearcher September 6, 2017 at 11:38 pm #

    I'm facing a problem of one-to-many sequence prediction, where given a set of input parameters for a program, the model should generate values of resource usage as a function of time (CPU, memory, etc.). I have some examples from real-world programs and I have already tried simple feed-forward networks, but now I'm trying to find a state-of-the-art solution for the one-to-many sequence generation problem. Until now I've only found the image captioning example, but it is tailored for predicting words instead of real values. Are you aware of any state-of-the-art solutions for generating one-to-many sequences? If you are, I would be grateful for any references. Thanks!

    • Jason Brownlee September 7, 2017 at 12:56 pm #

      Caption generation would provide a good model or starting point for your problem.

      No CNN front end of course, a big MLP perhaps instead.

      Does that help? I’m eager to hear how you go.

  7. Sama November 30, 2017 at 10:26 am #

    Dear Dr, please, I have an important question. Can an RNN accumulate knowledge? For example, can I continuously train the network to build up its knowledge, or is it trained only once? And if it can continuously learn, how can I do that?

    • Jason Brownlee November 30, 2017 at 2:46 pm #

      Good question.

      You can update the model after it is trained.

  8. Sharan December 29, 2017 at 2:54 pm #

    Jason,

    I am trying to apply ML for a specific problem I want to solve.

    Below is the problem statement:

    I have a system that is made of many functional blocks. These communicate with each other through events. When the system runs, a log of the event history is generated.

    From past experience, I know what the interesting sequences are. I would now like to parse through this event log and see if any of the sequences fall into the interesting category that is known a priori. One thing to note is that the time durations can vary while the sequence is intact.
    For example: event1 t1 event2 t2 event3. Between the example and an actual sequence, the values of t1 and t2 can vary, but the sequence of events (event1 -> event2 -> event3) remains.

    Manually doing this is tedious as there can be millions of such events when the system runs.

    Can you suggest the best approach to solve this problem?

  9. Arnold Loaiza April 9, 2018 at 11:02 am #

    Hello Jason, I have a query about a sequence prediction problem where an author used an LSTM with a dense layer because of the potential of this combination.
    The problem is to use 20 units of time from the past to predict T units of time, for example, to predict the sequence of the next 5 units of time. So each sample has 20 units of time, where each unit of time is a vector with 10 characteristics.

    X = ( samples, 20, 10)
    Y = (50)

    As you can see, the respective “Y” for each sample is a vector of 50 units, which represents the units of time to predict: one time step with its respective vector of 10 characteristics concatenated with the remaining 4 time steps, 50 in total. In Keras it would be presented in this way:

    model = Sequential()
    model.add(LSTM(500, input_shape=(20, 10)))
    model.add(Dense(10*5))  # 5 time steps with a vector of 10 characteristics each
    model.compile(optimizer='rmsprop', loss='mse')

    According to what I read in this post, it would be a form of vector output, because the LSTM is sending its last internal state H as an output, and that is being used as a characteristic vector trained against the desired outputs of the following 5 time steps. The amazing thing is that this architecture learns; it is not the best, but it gets very close and it beats methods like SAE and ANN. Finally, I tested this with my dataset with different output sequences of 10, 15, and 20 time steps into the future, just by increasing the number of output neurons; it's like magic.

    What would your opinion be? Is it a Seq-to-Vector model? Can it be done in a more effective way? Thank you very much.

  10. Shubhashis June 1, 2018 at 12:04 am #

    Hello Jason,

    I’m confused with the figure of “One-to-One Sequence Prediction Model Over Time”, and “Many-to-Many Sequence Prediction Model”.

    For one to one model, here is a Keras code snippet –

    model = Sequential()
    model.add(LSTM(….., input_shape=(1, ….)))
    model.add(Dense(1))

    Now, according to the figure of “One-to-One Sequence Prediction Model Over Time”, I’m assuming the Keras implementation will be –

    model = Sequential()
    model.add(LSTM(….., input_shape=(time_steps, ….), return_sequences=True))

    Now this seems oddly similar to “Many to Many Sequence Prediction”, where the number of input features is equal to the number of output features.

    Please let me know where I misunderstood. Also, for the figure, “One-to-One Sequence Prediction Model Over Time”, what would be the correct implementation with Keras?

    Thanks.

    Btw, Great article on the Time Series prediction 🙂

    • Jason Brownlee June 1, 2018 at 8:21 am #

      The “over time” is just the application of the same model to each time step. No difference to the model, just the data.

      • Shubhashis June 1, 2018 at 1:30 pm #

        So, if there are multiple time steps for a one-to-one model, you are saying that the model would be the same, that is, the model would be –

        model = Sequential()
        model.add(LSTM(….., input_shape=(1, ….)))
        model.add(Dense(1))

        But, this means that there is only 1 time step. How would multiple time steps fit into this?

        • Jason Brownlee June 1, 2018 at 2:49 pm #

          I see, I believe you are describing a many to many model.

          • Shubhashis June 2, 2018 at 12:23 am #

            Ok, if so, then I think the figure that you've shown for “One-to-One Sequence Prediction Model Over Time” should be a “Many to Many” model instead.

            Because the only logical Keras implementation I could think for that is –

            model = Sequential()
            model.add(LSTM(….., input_shape=(n, ….), return_sequences=True))

            Which does not seem like a “one-to-one” model. Rather a “many-to-many” instead.

            Please let me know if this is clear.

            I can mail you in detail if you think the question that I’m asking is not sufficient to describe the problem.

          • Jason Brownlee June 2, 2018 at 6:32 am #

            Your code is a many to many, not one to one.

  11. Joe wang June 16, 2018 at 3:11 am #

    Hi Jason,

    Thank you for the blog; it is very helpful. I have a question regarding the many-to-one structure: when we try to use a many-to-one model to do the prediction, do we also need to have a sequence as the input (containing the same number of time steps as the training data)? Do I understand correctly? Or could we just feed the features at one time stamp to get the predictions?

    • Jason Brownlee June 16, 2018 at 7:31 am #

      It means multiple time steps as input and then multiple time steps as output.

      It could be an actual time series, or words in a sentence, or other observations that are ordered by time.

  12. Victor September 11, 2018 at 11:15 pm #

    Hi Jason

    Thank you very much for your wonderful article.

    I am pretty new to the field and I am sure I have not yet fully understood everything.

    If I want to use the power of NNs to predict the temperature, for example, using the time sequences of temperature, pressure, humidity, etc. at each time frame as input, what network is it? Is it best to use an LSTM RNN?

    The architecture of the model that I am considering is.

    1. time sequence value of temperature, T[], which produces a temporary output O1 at time t
    2. time sequence value of pressure, P[], which produces a temporary output O2 at time t
    3. time sequence value of humidity, H[], which produces a temporary output O3 at time t
    4. finally, O1, O2, O3 will be used to generate the final output at time t, which is the model prediction of the temperature.

    Do I actually need to have 4 independent NNs? Or only 1 which takes all the time sequence features?

    And do I really need an RNN? I don't think I need to feed my prediction back into the network, as I can keep feeding the latest measurements as input.

    I much appreciate your time in answering my question.

  13. Tunay October 3, 2018 at 7:02 pm #

    Hi Jason,

    can you please suggest some reading on “strategies on framing timesteps as features in sequence prediction problems” ?

    I am having a hard time finding relevant literature 🙂

    • Jason Brownlee October 4, 2018 at 6:14 am #

      No literature needed, it’s a simple change in code from using past observations as time steps to instead using them as features on a time step.

      • Tunay October 7, 2018 at 12:05 pm #

        Oh I see. I actually wanted to use the observations at the timesteps only as output features, without using RNNs.

        To elaborate on that: all the input features are for t=0, and these inputs are a different kind of data than the output feature. There is only one kind of output feature, and it varies over time.
        So I have:
        X_1, X_2, … , X_n for t=0 and
        y_t=0, y_t=1, …, y_t=m

        I thought of employing a one-to-many RNN (I am not sure if this is a valid case for it!?), but then I thought maybe I can also frame the different timesteps as different output features and develop a simple feedforward network with backpropagation, without using an RNN at all.

        Do you think this is a valid strategy?

        • Jason Brownlee October 8, 2018 at 9:22 am #

          Interesting.

          Yes, I’d recommend trying a one-to-many RNN and see how it compares to an MLP or CNN.
