Demonstration of Memory with a Long Short-Term Memory Network in Python

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning over long sequences.

This differentiates them from regular multilayer neural networks that do not have memory and can only learn a mapping between input and output patterns.

It is important to understand the capabilities of complex neural networks like LSTMs on small contrived problems as this understanding will help you scale the network up to large and even very large problems.

In this tutorial, you will discover the capability of LSTMs to remember and recall.

After completing this tutorial, you will know:

  • How to define a small sequence prediction problem that only an RNN such as an LSTM can solve using memory.
  • How to transform the problem representation so that it is suitable for learning by LSTMs.
  • How to design an LSTM to solve the problem correctly.


Let’s get started.

Figure: A Demonstration of Memory in a Long Short-Term Memory Network. Photo by crazlei, some rights reserved.


Environment

This tutorial assumes you have a working Python 2 or 3 environment with SciPy and Keras 2.0 or higher, using either the TensorFlow or Theano backend.

For help setting up your Python environment, see the post:

Sequence Problem Description

The problem is to predict values of a sequence one at a time.

Given one value in the sequence, the model must predict the next value in the sequence. For example, given a value of “0” as an input, the model must predict the value “1”.

There are two different sequences that the model must learn and correctly predict.

A wrinkle is that there is conflicting information between the two sequences and that the model must know the context of each one-step prediction (e.g. the sequence it is currently predicting) in order to correctly predict each full sequence.

This wrinkle is important to prevent the model from memorizing each single-step input-output pair of values in each sequence, as a sequence-unaware model may be inclined to do.

The two sequences to be learned are as follows:

  • 3, 0, 1, 2, 3
  • 4, 0, 1, 2, 4

We can see that the first value of the sequence is repeated as the last value of the sequence. This is the indicator that provides context to the model as to which sequence it is working on.

The conflict is the transition from the second-to-last item to the last item in each sequence. In sequence one, a “2” is given as an input and a “3” must be predicted, whereas in sequence two, a “2” is given as input and a “4” must be predicted.

This is a problem that a multilayer Perceptron and other non-recurrent neural networks cannot learn.
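The conflict can be made concrete by collecting every "next value" observed for each input across both sequences. The short sketch below is only an illustration (the variable names are assumptions, not part of the tutorial code). It shows that the input 2 maps to two different outputs, which is why a model that sees only the current value cannot get both sequences right.

```python
from collections import defaultdict

# the two sequences from the problem description
sequences = [[3, 0, 1, 2, 3], [4, 0, 1, 2, 4]]

# collect the set of next values seen after each input value
targets = defaultdict(set)
for seq in sequences:
    for current, following in zip(seq[:-1], seq[1:]):
        targets[current].add(following)

# keep only the inputs that have more than one possible next value
conflicts = {value: sorted(nexts) for value, nexts in targets.items() if len(nexts) > 1}
print(conflicts)  # {2: [3, 4]}
```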

This is a simplified version of “Experiment 2” used to demonstrate LSTM long-term memory capabilities in Hochreiter and Schmidhuber’s 1997 paper Long Short-Term Memory.


Problem Representation

This section is divided into 3 parts; they are:

  1. One Hot Encoding
  2. Input-Output Pairs
  3. Reshape Data

One Hot Encoding

We will use a one hot encoding to represent the learning problem for the LSTM.

That is, each input and output value will be represented as a binary vector with 5 elements, because the alphabet of the problem is 5 unique values.

For example, the 5 values of [0, 1, 2, 3, 4] are represented as the following 5 binary vectors:

    0: [1, 0, 0, 0, 0]
    1: [0, 1, 0, 0, 0]
    2: [0, 0, 1, 0, 0]
    3: [0, 0, 0, 1, 0]
    4: [0, 0, 0, 0, 1]

We can do this with a simple function that will take a sequence and return a list of binary vectors for each value in the sequence. The function encode() below implements this behavior.

We can test it on the first sequence and print the resulting list of binary vectors. The complete example is listed below.
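A minimal sketch of encode() and the test on the first sequence might look like the following. The function name comes from the text; the parameter names pattern and n_unique are assumptions made for this illustration.

```python
def encode(pattern, n_unique):
    """Return one binary (one hot) vector per value in the pattern."""
    encoded = list()
    for value in pattern:
        row = [0.0 for _ in range(n_unique)]
        row[value] = 1.0  # mark the position that corresponds to this value
        encoded.append(row)
    return encoded


# one hot encode the first sequence and print each vector
seq1 = [3, 0, 1, 2, 3]
encoded = encode(seq1, 5)
for vector in encoded:
    print(vector)
# [0.0, 0.0, 0.0, 1.0, 0.0]
# [1.0, 0.0, 0.0, 0.0, 0.0]
# [0.0, 1.0, 0.0, 0.0, 0.0]
# [0.0, 0.0, 1.0, 0.0, 0.0]
# [0.0, 0.0, 0.0, 1.0, 0.0]
```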

Running the example prints each binary vector. Note that we use the floating point values 0.0 and 1.0 because they will be used as inputs and outputs for the model.

Input-Output Pairs

The next step is to split a sequence of encoded values into input-output pairs.

This is a supervised learning representation of the problem such that machine learning algorithms can learn how to map an input pattern (X) to an output pattern (y).

For example, the first sequence has the following input-output pairs to be learned:

    X, y
    3, 0
    0, 1
    1, 2
    2, 3

Instead of the raw numbers, we must create these mapping pairs from the one hot encoded binary vectors.

For example, the first input-output pair, 3->0, would be:

    X: [0, 0, 0, 1, 0]
    y: [1, 0, 0, 0, 0]

Below is a function named to_xy_pairs() that will create lists of X and y patterns given a list of encoded binary vectors.

We can put this together with the one hot encoding function above and print the encoded input and output pairs for the first sequence.
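One possible version of to_xy_pairs(), combined with the encoding step, is sketched below. The function names come from the text; the bodies are a reconstruction under the assumption that each input is simply the encoded value one step before its output.

```python
def encode(pattern, n_unique):
    """Return one binary (one hot) vector per value in the pattern."""
    encoded = list()
    for value in pattern:
        row = [0.0 for _ in range(n_unique)]
        row[value] = 1.0
        encoded.append(row)
    return encoded


def to_xy_pairs(encoded):
    """Split a list of encoded vectors into input (X) and output (y) lists."""
    X, y = list(), list()
    for i in range(1, len(encoded)):
        X.append(encoded[i - 1])  # the previous step is the input
        y.append(encoded[i])      # the current step is the output
    return X, y


seq1 = [3, 0, 1, 2, 3]
encoded = encode(seq1, 5)
X, y = to_xy_pairs(encoded)
for i in range(len(X)):
    print(X[i], y[i])
# [0.0, 0.0, 0.0, 1.0, 0.0] [1.0, 0.0, 0.0, 0.0, 0.0]
# [1.0, 0.0, 0.0, 0.0, 0.0] [0.0, 1.0, 0.0, 0.0, 0.0]
# [0.0, 1.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0]
# [0.0, 0.0, 1.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0]
```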

Running the example prints the input and output pairs for each step in the sequence.

Reshape Data

The final step is to reshape the data so that it can be used by the LSTM network directly.

The Keras LSTM expects input patterns (X) as a three-dimensional NumPy array with the dimensions [samples, timesteps, features].

In the case of one sequence of input data, the dimensions will be [4, 1, 5] because we have 4 rows of data, 1 time step for each row, and 5 columns in each row.

We can create a 2D NumPy array from our list of X patterns, then reshape it into the required 3D format. For example:
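The reshape step could be sketched as follows, using the four encoded input patterns of the first sequence written out by hand. The variable names are illustrative assumptions.

```python
import numpy

# the four one hot encoded input patterns of the first sequence (3, 0, 1, 2)
X = [[0.0, 0.0, 0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0, 0.0]]

# 2D array with shape (4, 5): 4 rows and 5 columns
lstmX = numpy.array(X)
# 3D array with shape [samples, timesteps, features]
lstmX = lstmX.reshape(lstmX.shape[0], 1, lstmX.shape[1])
print(lstmX.shape)  # (4, 1, 5)
```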

We must also convert the list of output patterns (y) into a 2D NumPy Array.

Below is a function named to_lstm_dataset() that takes a sequence as an input and the size of the sequence alphabet and returns an X and y dataset ready for use with an LSTM. It performs the required conversions of the sequence to a one-hot encoding and to input-output pairs before reshaping the data.

This function can be called with each sequence as follows:
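A sketch of to_lstm_dataset() and how it might be called for both sequences is given below. It reuses the assumed helper definitions from earlier in this section; the names seq1X, seq1Y, seq2X and seq2Y follow the names used elsewhere in this post.

```python
import numpy


def encode(pattern, n_unique):
    """Return one binary (one hot) vector per value in the pattern."""
    encoded = list()
    for value in pattern:
        row = [0.0 for _ in range(n_unique)]
        row[value] = 1.0
        encoded.append(row)
    return encoded


def to_xy_pairs(encoded):
    """Split a list of encoded vectors into input (X) and output (y) lists."""
    X, y = list(), list()
    for i in range(1, len(encoded)):
        X.append(encoded[i - 1])
        y.append(encoded[i])
    return X, y


def to_lstm_dataset(sequence, n_unique):
    """Convert a raw sequence into X [samples, timesteps, features] and y [samples, features]."""
    encoded = encode(sequence, n_unique)
    X, y = to_xy_pairs(encoded)
    lstmX = numpy.array(X)
    lstmX = lstmX.reshape(lstmX.shape[0], 1, lstmX.shape[1])
    lstmY = numpy.array(y)
    return lstmX, lstmY


seq1 = [3, 0, 1, 2, 3]
seq2 = [4, 0, 1, 2, 4]
n_unique = len(set(seq1 + seq2))  # 5 distinct values in the alphabet
seq1X, seq1Y = to_lstm_dataset(seq1, n_unique)
seq2X, seq2Y = to_lstm_dataset(seq2, n_unique)
print(seq1X.shape, seq1Y.shape)  # (4, 1, 5) (4, 5)
```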

We now have all of the pieces to prepare the data for the LSTM.

Learn Sequences with an LSTM

In this section, we will define the LSTM to learn the input sequences.

This section is divided into 4 parts; they are:

  1. LSTM Configuration
  2. LSTM Training
  3. LSTM Evaluation
  4. LSTM Complete Example

LSTM Configuration

We want the LSTM to make one-step predictions, which we have defined in the format and shape of our dataset. We also want the LSTM to be updated with errors after each time step, which means we will need to use a batch size of one.

Keras LSTMs are not stateful between batches by default. We can make them stateful by setting the stateful argument on the LSTM layer to True and managing the training epochs manually to ensure that the internal state of the LSTM is reset after each sequence.

We must define the shape of the batch using the batch_input_shape argument with 3 dimensions [batch size, time steps, and features] which will be 1, 1, and 5 respectively.

The network topology will be configured with one hidden LSTM layer with 20 units and a normal Dense layer with 5 outputs for each of the 5 columns in an output pattern. A sigmoid (logistic) activation function will be used on the output layer because of the binary outputs and the default tanh (hyperbolic tangent) activation function will be used on the LSTM layer.

A log (cross entropy) loss function will be optimized when fitting the network because of the binary outputs and the efficient ADAM optimization algorithm will be used with all default parameters.

The Keras code to define the LSTM network for this problem is listed below.
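The configuration described above could be written roughly as follows. This is a sketch that assumes the Keras 2.x Sequential API this tutorial targets; more recent Keras releases may expect the batch shape to be declared differently. The build_model() wrapper and its parameter names are assumptions added for illustration, and Keras is imported inside the function so that the definition can be loaded even where Keras is not installed.

```python
def build_model(n_neurons=20, n_batch=1, n_features=5, n_unique=5):
    """Define a stateful LSTM with one hidden layer and a sigmoid output layer."""
    # assumption: a Keras 2.x installation that accepts batch_input_shape
    from keras.models import Sequential
    from keras.layers import Dense, LSTM

    model = Sequential()
    # batch_input_shape is [batch size, time steps, features]
    model.add(LSTM(n_neurons, batch_input_shape=(n_batch, 1, n_features), stateful=True))
    # one output per column of the one hot encoded output pattern
    model.add(Dense(n_unique, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam')
    return model
```

Calling build_model() with its defaults would give the 20-unit, batch-size-one network described in this section.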

LSTM Training

We must fit the model manually one epoch at a time.

Within one epoch we can fit the model on each sequence, being sure to reset state after each sequence.

The model does not need to be trained for long given the simplicity of the problem; in this case only 250 epochs are required.

Below is an example of how the model can be fit on each sequence across all epochs.

I like to see some feedback on the loss function when fitting a network, so verbose output is turned on for one of the sequences, but not the other.
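The manual training loop might be sketched like this. The fit_sequences() wrapper is a hypothetical helper introduced here; it assumes a compiled stateful Keras 2.x model that provides fit() and reset_states(), and datasets named as elsewhere in this post.

```python
def fit_sequences(model, seq1X, seq1Y, seq2X, seq2Y, n_epoch=250, n_batch=1):
    """Fit a stateful model on both sequences, one epoch at a time."""
    for _ in range(n_epoch):
        # fit on sequence one with loss reporting turned on
        model.fit(seq1X, seq1Y, epochs=1, batch_size=n_batch, verbose=1, shuffle=False)
        model.reset_states()  # clear internal state before the next sequence
        # fit on sequence two silently
        model.fit(seq2X, seq2Y, epochs=1, batch_size=n_batch, verbose=0, shuffle=False)
        model.reset_states()
```

Setting shuffle=False keeps the steps of each sequence in order, which matters because the state carried from one step to the next is what provides the context.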

LSTM Evaluation

Next, we can evaluate the fit model by predicting each step of the learned sequences.

We can do this by predicting the outputs for each sequence.

The predict_classes() function can be used on the LSTM model that will predict the class directly. It does this by performing an argmax() on the output binary vector and returning the index of the predicted column with the largest output. The output indices map perfectly onto the integers used in the sequence (by careful design above). An example of making a prediction is listed below:
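To illustrate what the argmax step does, the sketch below applies numpy.argmax() to a hand-written array standing in for the model's output; the numbers are made up for illustration and are not real results. In more recent Keras releases, where predict_classes() is no longer available, applying argmax to the output of predict() in this way gives the same class indices.

```python
import numpy

# hypothetical model outputs for the four steps of sequence one;
# each row is the 5-element output vector for one time step
yhat = numpy.array([
    [0.91, 0.03, 0.02, 0.05, 0.04],
    [0.02, 0.88, 0.06, 0.01, 0.03],
    [0.01, 0.07, 0.90, 0.02, 0.02],
    [0.03, 0.02, 0.05, 0.85, 0.10],
])

# index of the largest output in each row, which is the predicted integer
result = numpy.argmax(yhat, axis=-1)
print(result)  # [0 1 2 3]
```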

We can make a prediction, then print the result in the context of the input pattern and the expected output pattern for each step of the sequence.
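Printing each prediction next to its input and expected output could be done along these lines. The helper format_predictions() is a hypothetical name, and the predicted classes passed in are placeholder values rather than real model output.

```python
def format_predictions(sequence, predicted):
    """Return one line per step showing the input, expected output and prediction."""
    lines = list()
    for i in range(len(predicted)):
        lines.append('X=%.1f y=%.1f, yhat=%.1f' % (sequence[i], sequence[i + 1], predicted[i]))
    return lines


# placeholder predictions for sequence one (not real model output)
seq1 = [3, 0, 1, 2, 3]
result = [0, 1, 2, 3]
print('Sequence 1')
for line in format_predictions(seq1, result):
    print(line)
```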

LSTM Complete Example

We can now tie the whole tutorial together.

The complete code listing is provided below.

First, the data is prepared, then the model is fit and the predictions of both sequences are printed.
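A complete listing along these lines is sketched below. It is a reconstruction from the descriptions in this tutorial, not the original listing, and it assumes a Keras 2.x installation. The run() wrapper is an assumption added so that the data preparation helpers can be used on their own; call run() to train and evaluate. The argmax over predict() stands in for predict_classes().

```python
import numpy


def encode(pattern, n_unique):
    """Return one binary (one hot) vector per value in the pattern."""
    encoded = list()
    for value in pattern:
        row = [0.0 for _ in range(n_unique)]
        row[value] = 1.0
        encoded.append(row)
    return encoded


def to_xy_pairs(encoded):
    """Split a list of encoded vectors into input (X) and output (y) lists."""
    X, y = list(), list()
    for i in range(1, len(encoded)):
        X.append(encoded[i - 1])
        y.append(encoded[i])
    return X, y


def to_lstm_dataset(sequence, n_unique):
    """Convert a raw sequence into X [samples, timesteps, features] and y [samples, features]."""
    X, y = to_xy_pairs(encode(sequence, n_unique))
    lstmX = numpy.array(X)
    lstmX = lstmX.reshape(lstmX.shape[0], 1, lstmX.shape[1])
    return lstmX, numpy.array(y)


def run(n_neurons=20, n_batch=1, n_epoch=250):
    # assumption: Keras 2.x; imported here so the helpers above work without it
    from keras.models import Sequential
    from keras.layers import Dense, LSTM

    # prepare the data
    seq1 = [3, 0, 1, 2, 3]
    seq2 = [4, 0, 1, 2, 4]
    n_unique = len(set(seq1 + seq2))
    seq1X, seq1Y = to_lstm_dataset(seq1, n_unique)
    seq2X, seq2Y = to_lstm_dataset(seq2, n_unique)

    # define the stateful LSTM
    model = Sequential()
    model.add(LSTM(n_neurons, batch_input_shape=(n_batch, 1, n_unique), stateful=True))
    model.add(Dense(n_unique, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam')

    # fit one epoch at a time, resetting state after each sequence
    for _ in range(n_epoch):
        model.fit(seq1X, seq1Y, epochs=1, batch_size=n_batch, verbose=1, shuffle=False)
        model.reset_states()
        model.fit(seq2X, seq2Y, epochs=1, batch_size=n_batch, verbose=0, shuffle=False)
        model.reset_states()

    # predict each sequence and print the results in context
    for name, seq, X in (('Sequence 1', seq1, seq1X), ('Sequence 2', seq2, seq2X)):
        print(name)
        yhat = numpy.argmax(model.predict(X, batch_size=n_batch, verbose=0), axis=-1)
        model.reset_states()
        for i in range(len(yhat)):
            print('X=%.1f y=%.1f, yhat=%.1f' % (seq[i], seq[i + 1], yhat[i]))
```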

Running the example provides feedback regarding the model’s loss on the first sequence each epoch.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

At the end of the run, each sequence is printed in the context of the predictions.

The results show two important things:

  • That the LSTM correctly learned each sequence one step at a time.
  • That the LSTM used the context of each sequence to correctly resolve the conflicting input pairs.

In essence, the LSTM was able to remember the input pattern at the beginning of the sequence 3 time steps ago to correctly predict the last value in the sequence.

This memory and ability of LSTMs to relate observations distant in time is the key capability that makes LSTMs so powerful and why they are so widely used.

Although the example is trivial, LSTMs are able to demonstrate this same capability across 100s, and even 1000s, of time steps.


Extensions

This section lists ideas for extensions to the examples in this tutorial.

  • Tuning. The configurations for the LSTM (epochs, units, etc.) were chosen after some trial and error. It is possible that a much simpler configuration can achieve the same result on this problem. Some search of parameters is required.
  • Arbitrary Alphabets. The alphabet of 5 integers was chosen arbitrarily. This could be changed to other symbols and larger alphabets.
  • Long Sequences. The sequences used in this example were very short. The LSTM is able to demonstrate the same capability on much longer sequences of 100s and 1000s of time steps.
  • Random Sequences. The sequences used in this tutorial were linearly increasing. New sequences of random values can be created, allowing the LSTM to devise a generalized solution rather than one specialized to the two sequences used in this tutorial.
  • Batch Learning. Updates were made to the LSTM after each time step. Explore using batch updates to see if this improves learning or not.
  • Shuffle Epoch. The sequences were shown in the same order each epoch during training and again during evaluation. Randomize the order of the sequences so that sequence 1 and 2 are fit within an epoch, which might improve the generalization of the model to new unseen sequences with the same alphabet.

Did you explore any of these extensions?
Share your results in the comments below. I’d love to see what you came up with.

Further Reading

I strongly recommend reading the original 1997 LSTM paper by Hochreiter and Schmidhuber; it is very good.


Summary

In this tutorial, you discovered a key capability of LSTMs in their ability to remember over multiple time steps.

Specifically, you learned:

  • How to define a small sequence prediction problem that only an RNN such as an LSTM can solve using memory.
  • How to transform the problem representation so that it is suitable for learning by LSTMs.
  • How to design an LSTM to solve the problem correctly.

Do you have any questions?
Post your questions in the comments below and I will do my best to answer them.


36 Responses to Demonstration of Memory with a Long Short-Term Memory Network in Python

  1. Tim May 13, 2017 at 4:00 pm

    Thank you for the tutorial! What’s the point in making the LSTM stateful if you reset it after every input sequence?

    • Jason Brownlee May 14, 2017 at 7:26 am

      Great question,

      I made the LSTM stateful so that I can control exactly when that reset occurs. You are correct in that I could also contrive the num samples/batches so that resets occurred at appropriate times.

      • Pedro Albuquerque Santos June 20, 2020 at 8:42 am

        First of all, thanks for the amazing tutorial and website you have here.

        Could you please provide an example on how to contrive the num samples/batches so that resets occurred at appropriate times when using stateful=false?

        • Jason Brownlee June 21, 2020 at 5:59 am

          You’re welcome.

          Thanks for the suggestion, perhaps in the future.

  2. Birkey June 3, 2017 at 2:55 pm

    Hi Jason,

    If we want num_lag = 2, that is:
    x1,x2 -> y
    (3, 0) -> 1
    (0, 1) -> 2
    (1, 2) -> 3

    the to_xy_pairs function needs to be updated to:

    def to_xy_pairs(encoded):
        X, y = list(), list()
        for i in range(2, len(encoded)):
            X.append(encoded[i - 2])  # <-- num_lag = 2
            X.append(encoded[i - 1])
            y.append(encoded[i])
        return X, y

    correspondingly, to_lstm_dataset() updated as:
    lstmX = lstmX.reshape(int(lstmX.shape[0] / 2), 2, lstmX.shape[1]) # 1.0 | 1.0
    (0.0, 1.0) -> 2.0 | 2.0
    (1.0, 2.0) -> 3.0 | 3.0

    In preprocessing, can we just use function series_to_supervised() (post on May 8, 2017) directly?

    I guess not.
    For it generates data like:
    – var1(t-2) var1(t-1) var1(t)
    you have to frame them as a sequence:
    – X: [var1(t-2), var1(t-1), …]
    – y: [var1(t), …]

    So is the function series_to_supervised() 1) for demonstration, or 2) for producing intermediate results?

    • Jason Brownlee June 4, 2017 at 7:48 am

      Perhaps, you could try it.

      In this case, we are providing the sequence one-time step at a time, rather than BPTT over the whole sequence. A truly amazing demonstration of the memory capability of LSTMs.

      • Birkey June 5, 2017 at 7:38 pm

        Agreed: for one-time step, no BPTT, but for two-time step, LSTM does learn through BPTT.

        Also, for the network topology:
        Layer (type) Output Shape Param # Connected to
        ts_input (InputLayer) (None, 2, 18) 0

        When lag = 2, the output shape of the input layer is a 3-D tensor (with 2 time steps), just like learning a sentence made up of a sequence of words.

  3. Birkey June 3, 2017 at 3:01 pm

    (continued 🙂


    now proceeds:

    X = [[0.0, 0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0, 0.0]]
    y = [[0.0, 1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0, 0.0]]
    lstmX.shape = (3, 2, 5)
    lstmY.shape = (3, 5)

    – X: 3 batches, __each batch contains 2 time steps__, each time step input a vector of 5 dimensions
    – y: 3 results, correspondingly.

    So I guess:
    1) for uni-variable series-to-supervised, to_xy_pairs() is enough.
    2) for multi-variable series-to-supervised, series_to_supervised() + to_xy_pairs() is enough.

  4. Danil June 5, 2017 at 4:46 pm

    Great article, thank you.

    Is it beneficial to select batch_size > 1 for a stateful LSTM?
    If yes, how can I reshape any DataFrame (dfX) to a [batch_size, 1, num_features] input suitable for a stateful LSTM?

    batch_size = 128
    tsteps = 1

    # X_train.shape = (107904, 10)
    # Y_train.shape = (107904, 2)
    # 107904 % 128 == 0
    # X_train and Y_train are both DataFrames

    model = Sequential()
    input_shape=(tsteps, Y_train.shape[1]),
    model.compile(loss='mse', optimizer='rmsprop')

    I’m getting an error:

    ValueError: Error when checking input: expected lstm_1_input to have 3 dimensions, but got array with shape (107904, 10)

    • Jason Brownlee June 6, 2017 at 9:23 am

      No reshaping is needed for a batch size of 1; it is more a matter of how you run your epochs.

    • Birkey June 6, 2017 at 12:26 pm

      Hi, Danil,

      you need to reshape your input x to 3-D tensor as [num_samples, num_time_steps, feature_dimensions], then you can choose any batch_size

  5. Danilo July 4, 2017 at 4:36 pm

    # test LSTM on sequence 1
    print('Sequence 1')
    result = model.predict_classes(seq1X, batch_size=n_batch, verbose=0)
    for i in range(len(result)):
        print('X=%.5f y=%.5f, yhat=%.5f' % (seq1[i], seq1[i+1], result[i]))

    Is there any way to ask this network to predict the next output after result[i]? This only predicts the data that it already knows from the training set.

    thank you.

    • Jason Brownlee July 6, 2017 at 10:12 am

      Yes, fit the model on all the data, then call model.predict()

  6. Dima October 11, 2017 at 4:18 am

    Thank you for this course! Why doesn’t it converge if n_batch = 4 or 2? Does it make any difference in this example?

    • Jason Brownlee October 11, 2017 at 7:57 am

      In this case, we are constrained to fixed batch sizes because the model is stateful.

      • Dima October 11, 2017 at 10:26 am

        Hi Jason, thanks for your response.
        If we set n_batch = 4, it will not converge and results in either 0123 for both sequences or 0124.
        It behaves as if it doesn’t keep state …
        The only difference I see is that with a batch size of 1 it makes 4 weight updates while passing through the sequence before going to the second sequence, whereas with a batch size of 4 it makes just one weight update, then resets state and goes to the second sequence.

        I also tried to concatenate both sequences so I could run 1 sequence of 8 pairs and I still get the same result: it memorizes the whole sequence correctly if batch size == 1 (i.e. 01230124), otherwise if I set batch size = 4 or 8 it results in either 01230123 or 01240124 (i.e. it doesn’t converge).

        What am I missing here?

        I also tried other examples in your course, one of them being “Understanding Stateful LSTM Recurrent Neural Networks in Python with Keras”, where it learned the alphabet successfully when increasing the batch size from 1 to the size of the training dataset, batch_size=len(dataX)=26.

  7. Dima October 11, 2017 at 10:52 pm

    It would be really helpful to see this example in stateless mode.
    What would the input shape look like?

  8. Dima October 19, 2017 at 1:05 am

    Here is a very simple example of an LSTM in stateless mode, trained on the very simple sequences [0->1] and [0->2]. Any idea why it won’t converge in stateless mode?

    We have a batch of size 2 with 2 samples and it is supposed to keep the state inside the batch.

    from keras.models import Sequential
    from keras.layers import Dense
    from keras.layers import LSTM
    import numpy
    # define sequences
    seq = [0, 1, 0, 2]
    # convert sequences into required data format
    seqX=numpy.array([[( 1. , 0. , 0.)], [( 1. , 0. , 0.)]])
    seqY=numpy.array([( 0. , 1. , 0.) , ( 0. , 0. , 1.)])

    # define LSTM configuration
    n_unique = len(set(seq))
    n_neurons = 20
    n_batch = 2
    n_epoch = 300
    n_features = n_unique
    # create LSTM
    model = Sequential()
    model.add(LSTM(n_neurons, input_shape=( 1, n_features) ))
    model.add(Dense(n_unique, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='Adam')
    # train LSTM
    model.fit(seqX, seqY, epochs=n_epoch, batch_size=n_batch, verbose=2, shuffle=False)

    result = model.predict_classes(seqX, batch_size=n_batch, verbose=0)
    for i in range(2):
        print('X=%.1f y=%.1f, yhat=%.1f' % (0, i+1, result[i]))

  9. Christopher September 19, 2018 at 12:51 am

    Hi Jason,
    This is a great tutorial, but there are some things in it that I do not understand.
    Refer to the following snippets from your code. Could you explain why you fit seq1X and seq2X separately instead of jointly [see what I propose below] in the loop?

    I am thinking about the following:

    I tried it and printed seq1, seq2, and seq3. Here is the print-out. I can see the difference, but I do not know why. Could you explain it? Note that the print-outs are not consistent.
    Here is the printout from run #1:
    Sequence 1
    X=3.0 y=0.0, yhat=0.0
    X=0.0 y=1.0, yhat=1.0
    X=1.0 y=2.0, yhat=2.0
    X=2.0 y=3.0, yhat=2.0
    Sequence 2
    X=4.0 y=0.0, yhat=0.0
    X=0.0 y=1.0, yhat=1.0
    X=1.0 y=2.0, yhat=3.0
    X=2.0 y=4.0, yhat=3.0
    Sequence 3
    X=3.0 y=0.0, yhat=0.0
    X=0.0 y=1.0, yhat=1.0
    X=1.0 y=2.0, yhat=2.0
    X=2.0 y=3.0, yhat=2.0
    X=3.0 y=4.0, yhat=4.0
    X=4.0 y=0.0, yhat=0.0
    X=0.0 y=1.0, yhat=1.0
    X=1.0 y=2.0, yhat=2.0
    X=2.0 y=4.0, yhat=4.0

    Print-out from run #2:
    Sequence 1
    X=3.0 y=0.0, yhat=0.0
    X=0.0 y=1.0, yhat=1.0
    X=1.0 y=2.0, yhat=2.0
    X=2.0 y=3.0, yhat=3.0
    Sequence 2
    X=4.0 y=0.0, yhat=0.0
    X=0.0 y=1.0, yhat=1.0
    X=1.0 y=2.0, yhat=2.0
    X=2.0 y=4.0, yhat=3.0
    Sequence 3
    X=3.0 y=0.0, yhat=0.0
    X=0.0 y=1.0, yhat=1.0
    X=1.0 y=2.0, yhat=2.0
    X=2.0 y=3.0, yhat=3.0
    X=3.0 y=4.0, yhat=4.0
    X=4.0 y=0.0, yhat=0.0
    X=0.0 y=1.0, yhat=1.0
    X=1.0 y=2.0, yhat=2.0
    X=2.0 y=4.0, yhat=4.0

    Print-out from run #3:
    Sequence 1
    X=3.0 y=0.0, yhat=0.0
    X=0.0 y=1.0, yhat=1.0
    X=1.0 y=2.0, yhat=2.0
    X=2.0 y=3.0, yhat=3.0
    Sequence 2
    X=4.0 y=0.0, yhat=0.0
    X=0.0 y=1.0, yhat=1.0
    X=1.0 y=2.0, yhat=3.0
    X=2.0 y=4.0, yhat=3.0
    Sequence 3
    X=3.0 y=0.0, yhat=0.0
    X=0.0 y=1.0, yhat=1.0
    X=1.0 y=2.0, yhat=2.0
    X=2.0 y=3.0, yhat=3.0
    X=3.0 y=4.0, yhat=4.0
    X=4.0 y=0.0, yhat=0.0
    X=0.0 y=1.0, yhat=1.0
    X=1.0 y=2.0, yhat=2.0
    X=2.0 y=4.0, yhat=4.0


    • Jason Brownlee September 19, 2018 at 6:24 am

      I’m not sure I follow. It was a long time ago when I wrote this, and I barely remember the details.

      What are you interested in demonstrating exactly?

      • Christopher September 20, 2018 at 7:55 pm

        Hi Jason,
        Long story short, why would you fit the model on seq1X and seq2X separately in two different fit statements like this?

        Why would you not join seq1X and seq2X and fit the model on the joined training dataset, in my example which is seq3X, shown below:

        • Jason Brownlee September 21, 2018 at 6:28 am

          It comes down to how I defined the test problem. I wanted to reset states between samples and demo that the model can remember the value from the first time step over n time steps.

          Concatenating the two sequences is a different problem.

          Perhaps re-read the definition of the problem above?

  10. Sahil September 27, 2018 at 8:34 pm

    Great tutorial. I just have a question: what is the importance of resetting the model after each step?

    • Jason Brownlee September 28, 2018 at 6:13 am

      In this example we are only interested in state across one sample.

  11. Mounika January 18, 2019 at 10:48 pm

    Is it possible to dynamically remember chat conversations using RNNs and LSTMs? If yes, please suggest an article.

    • Jason Brownlee January 19, 2019 at 5:41 am

      Why would a model need to remember whole chat conversations?

  12. Jezreel Ramos January 3, 2020 at 10:16 am

    What does the state function do? What is the difference between stateful=True and stateful=False?

    • Jason Brownlee January 4, 2020 at 8:15 am

      When stateful=True, the internal state of the LSTM units is not reset until you manually reset it.

      Otherwise the internal state of the units is reset at the end of each batch.

  13. Anastasios Dimitriou October 4, 2020 at 10:12 am

    Is there any way to calculate the outputs using the model weights? I tried it but I got very different results!

    • Jason Brownlee October 4, 2020 at 2:57 pm

      Yes, it is a weighted sum of the inputs and the model weights.

      • Anastasios Dimitriou October 4, 2020 at 6:19 pm

        That’s great, though please correct me.
        I got the weights for both layers like that:

  14. Anastasios Dimitriou October 4, 2020 at 6:20 pm

    And then I did something like that :

    • Jason Brownlee October 5, 2020 at 6:50 am

      Does it give the same result?

      • Anastasios Dimitriou October 10, 2020 at 6:32 pm

        Actually it did. The problem was that the rows of the weight matrices were actually the columns!
