How to Learn to Echo Random Integers with Long Short-Term Memory Recurrent Neural Networks

Long Short-Term Memory (LSTM) Recurrent Neural Networks are able to learn the order dependence in long sequence data.

They are a fundamental technique used in a range of state-of-the-art results, such as image captioning and machine translation.

They can also be difficult to understand, specifically how to frame a problem to get the most out of this type of network.

In this tutorial, you will discover how to develop a simple LSTM recurrent neural network to learn how to echo back the number in an ad hoc sequence of random integers. Although a trivial problem, developing this network will provide the skills needed to apply LSTM on a range of sequence prediction problems.

After completing this tutorial, you will know:

  • How to develop an LSTM for the simpler problem of echoing any given input.
  • How to avoid the beginner’s mistake when applying LSTMs to sequence problems like echoing integers.
  • How to develop a robust LSTM to echo the last observation from ad hoc sequences of random integers.

Let’s get started.

Photo by Franck Michel, some rights reserved.


This tutorial is divided into 4 parts; they are:

  1. Generate and Encode Random Sequences
  2. Echo Current Observation
  3. Echo Lag Observation Without Context (Beginner’s Mistake)
  4. Echo Lag Observation


This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this example.

This tutorial assumes you have Keras v2.0 or higher installed with either the TensorFlow or Theano backend. You do not need a GPU for this tutorial; all code will easily run on a CPU.

This tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help setting up your Python environment, see this post:

Generate and Encode Random Sequences

The first step is to write some code to generate a random sequence of integers and encode them for the network.

Generate Random Sequence

We can generate random integers in Python using the randint() function that takes two parameters indicating the range of integers from which to draw values.

In this tutorial, we will define the problem as having integer values between 0 and 99 with 100 unique values.

We can put this in a function called generate_sequence() that will generate a sequence of random integers of the desired length, with the default length set to 25 elements.

This function is listed below.
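A minimal sketch of such a function, assuming randint() refers to Python's standard-library random.randint, whose lower and upper bounds are both inclusive:

```python
from random import randint

def generate_sequence(length=25):
    """Return a list of `length` random integers in the range 0..99."""
    return [randint(0, 99) for _ in range(length)]

sequence = generate_sequence()
print(sequence)
```

Because both bounds are inclusive, randint(0, 99) can produce each of the 100 unique values described above.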

One Hot Encode Random Sequence

Once we have generated sequences of random integers, we need to transform them into a format that is suitable for training an LSTM network.

One option would be to rescale the integer to the range [0,1]. This would work and would require that the problem be phrased as regression.

I am interested in predicting the right number, not a number close to the expected value. This means I would prefer to frame the problem as classification rather than regression, where the expected output is a class and there are 100 possible class values.

In this case, we can use a one hot encoding of the integer values, where each value is represented by a 100-element binary vector that is all “0” values except at the index of the integer, which is marked 1.

The function below called one_hot_encode() defines how to iterate over a sequence of integers and create a binary vector representation for each and returns the result as a 2-dimensional array.
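One way such a function might look is sketched below. The n_unique parameter is an assumption added here so that the vector length is not hard-coded; it defaults to the 100 values used in this tutorial.

```python
import numpy as np

def one_hot_encode(sequence, n_unique=100):
    """Encode each integer as a binary vector with a single 1 at its index."""
    encoding = []
    for value in sequence:
        vector = [0] * n_unique
        vector[value] = 1
        encoding.append(vector)
    # 2-dimensional array: one row per integer, one column per possible value
    return np.array(encoding)

encoded = one_hot_encode([0, 3, 99])
print(encoded.shape)
```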

We also need to decode the encoded values so that we can make use of the predictions, in this case, just review them.

The one hot encoding can be inverted by using the argmax() NumPy function that returns the index of the value in the vector with the largest value.

The function below, named one_hot_decode(), will decode an encoded sequence and can be used to later decode predictions from our network.
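A minimal sketch of the decoding step, assuming the input is any sequence of vectors (binary encodings or predicted probability distributions):

```python
import numpy as np

def one_hot_decode(encoded_seq):
    """Map each vector back to the index of its largest value."""
    return [int(np.argmax(vector)) for vector in encoded_seq]

# Works on exact one hot vectors and on "soft" vectors such as predictions.
example = np.array([[0.0, 1.0, 0.0], [0.1, 0.2, 0.7]])
decoded_example = one_hot_decode(example)
print(decoded_example)
```

Because argmax() only looks for the largest value, the same function can decode softmax outputs that are not exactly 0 or 1.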

Complete Example

We can tie all of this together.

Below is the complete code listing for generating a sequence of 25 random integers and encoding each integer as a binary vector.
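The three pieces can be combined as in the sketch below. It follows the descriptions above; the compact NumPy indexing inside one_hot_encode() is a stylistic choice made here, not necessarily how the original listing was written.

```python
from random import randint
import numpy as np

def generate_sequence(length=25):
    """Random integers in 0..99."""
    return [randint(0, 99) for _ in range(length)]

def one_hot_encode(sequence, n_unique=100):
    """One binary row per integer, with a 1 at the integer's index."""
    encoding = np.zeros((len(sequence), n_unique), dtype=int)
    encoding[np.arange(len(sequence)), sequence] = 1
    return encoding

def one_hot_decode(encoded_seq):
    """Invert the encoding with argmax."""
    return [int(np.argmax(vector)) for vector in encoded_seq]

sequence = generate_sequence()
print(sequence)
encoded = one_hot_encode(sequence)
print(encoded)  # NumPy truncates the printed view of large arrays
decoded = one_hot_decode(encoded)
print(decoded)
```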

Running the example first prints the list of 25 random integers, followed by a truncated view of the binary representations of all integers in the sequence, one vector per line, then the decoded sequence again.

You may get different results as different random integers are generated each time the code is run.

Now that we know how to prepare and represent random sequences of integers, we can look at using LSTMs to learn them.

Echo Current Observation

Let’s start out by looking at a simpler echo problem.

In this section, we will develop an LSTM to echo the current observation. That is, given a random integer as input, return the same integer as output.

Or slightly more formally stated as:

    yhat(t) = f(X(t))

That is, the model is to predict the value at the current time (yhat(t)) as a function (f()) of the observed value at the current time (X(t)).

It is a simple problem because no memory is required, just a function to map an input to an identical output.

It is a trivial problem and will demonstrate a few useful things:

  • How to use the problem representation machinery above.
  • How to use LSTMs in Keras.
  • The capacity of an LSTM required to learn such a trivial problem.

This will lay the foundation for the echo of lag observations next.

First, we will develop a function to prepare a random sequence ready to train or evaluate an LSTM. This function must first generate a random sequence of integers, use a one hot encoding, then transform the input data to be a 3-dimensional array.

LSTMs require a 3D input comprised of the dimensions [samples, timesteps, features]. Our problem will be comprised of 25 examples per sequence, 1 time step, and 100 features for the one hot encoding.

This function is listed below, named generate_data().
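A sketch of generate_data() under the shapes described above (25 samples, 1 time step, 100 features). The helper functions are compact restatements of the ones from the previous section.

```python
from random import randint
import numpy as np

def generate_sequence(length=25):
    return [randint(0, 99) for _ in range(length)]

def one_hot_encode(sequence, n_unique=100):
    encoding = np.zeros((len(sequence), n_unique), dtype=int)
    encoding[np.arange(len(sequence)), sequence] = 1
    return encoding

def generate_data():
    """Return X shaped [samples, timesteps, features] and y shaped [samples, features]."""
    sequence = generate_sequence()
    encoded = one_hot_encode(sequence)
    # Insert a time step axis of length 1 between samples and features.
    X = encoded.reshape(encoded.shape[0], 1, encoded.shape[1])
    # For the echo-current problem the target is the input itself.
    y = encoded
    return X, y

X, y = generate_data()
print(X.shape, y.shape)
```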

Next, we can define our LSTM model.

The model must specify the expected dimensionality of the input data. In this case, in terms of timesteps (1) and features (100). We will use a single hidden layer LSTM with 15 memory units.

The output layer is a fully connected layer (Dense) with 100 neurons for the 100 possible integers that may be output. A softmax activation function is used on the output layer to allow the network to learn and output the distribution over the possible output values.

The network will use the log loss function while training, suitable for multi-class classification problems, and the efficient ADAM optimization algorithm. The accuracy metric will be reported each training epoch to give an idea of the skill of the model in addition to the loss.
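Keras computes the softmax and the log loss internally. The NumPy sketch below only illustrates what the two calculations do for a single prediction; the function names and the made-up scores are illustrative assumptions, not part of the tutorial's model code.

```python
import numpy as np

def softmax(scores):
    """Turn raw output-layer scores into a probability distribution."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

def log_loss(probabilities, one_hot_target, eps=1e-12):
    """Categorical cross-entropy for one one-hot-encoded target."""
    return float(-np.sum(one_hot_target * np.log(probabilities + eps)))

# Made-up scores for the 100 classes, with class 42 scored highest.
scores = np.zeros(100)
scores[42] = 5.0
probabilities = softmax(scores)

target_right = np.zeros(100)
target_right[42] = 1
target_wrong = np.zeros(100)
target_wrong[7] = 1

loss_right = log_loss(probabilities, target_right)
loss_wrong = log_loss(probabilities, target_wrong)
print(int(np.argmax(probabilities)), loss_right, loss_wrong)
```

The loss is small when most of the probability mass sits on the expected integer and large when it sits elsewhere, which is what pushes the network toward predicting the exact class.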

We will fit the model manually by running each epoch by hand with a new generated sequence. The model will be fit for 500 epochs, or stated another way, trained on 500 randomly generated sequences.

This will encourage the network to learn to reproduce the actual input rather than memorizing a fixed training dataset.

Once the model is fit, we will make a prediction on a new sequence and compare the predicted output to the expected output.

The complete example is listed below.

Running the example prints the log loss and accuracy each epoch.

The network is a little over-prescribed, having more memory units and training epochs than is required for such a simple problem, and you can see this by the fact that the network quickly achieves 100% accuracy.

At the end of the run, the predicted sequence is compared to a randomly generated sequence and the two look identical.

Your specific results may vary, but your network should converge to 100% accuracy because the network was larger and trained longer than the problem requires.

Now that we know how to use the tools to create and represent random sequences and to fit an LSTM to learn to echo the current observation, let’s look at how we can use LSTMs to learn how to echo a past observation.

Echo Lag Observation Without Context
(The Beginner’s Mistake)

The problem of predicting a lag observation can be more formally defined as follows:

    yhat(t) = f(X(t-n))

Where the expected output for the current time step yhat(t) is defined as a function (f()) of a specific previous observation (X(t-n)).

The promise of LSTMs suggests that you can show examples to the network one at a time and that the network will use internal state to learn and to sufficiently remember prior observations in order to solve this problem.

Let’s try this out.

First, we must update the generate_data() function and re-define the problem.

Rather than using the same sequence for input and output, we will use a shifted version of the encoded sequence as input and a truncated version of the encoded sequence as output.

These changes are required in order to take a sequence of numbers, such as [1, 2, 3, 4], and turn them into a supervised learning problem with input (X) and output (y) components, such as:

    X,      y
    1,      NaN
    2,      1
    3,      2
    4,      3
    NaN,    4

In this example, you can see that the first and last rows do not contain sufficient data for the network to learn. This could be marked as a “no data” value and masked, but a simpler solution is to simply remove it from the dataset.

The updated generate_data() function is listed below:
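A sketch of the updated function, assuming the "shifted" input is the encoded sequence with its first element dropped and the "truncated" output is the encoded sequence with its last element dropped. That reading gives the 24 X,y pairs used later in this section. The short loop at the end prints the decoded pairs as a quick check.

```python
from random import randint
import numpy as np

def generate_sequence(length=25):
    return [randint(0, 99) for _ in range(length)]

def one_hot_encode(sequence, n_unique=100):
    encoding = np.zeros((len(sequence), n_unique), dtype=int)
    encoding[np.arange(len(sequence)), sequence] = 1
    return encoding

def one_hot_decode(encoded_seq):
    return [int(np.argmax(vector)) for vector in encoded_seq]

def generate_data():
    """Frame the sequence so that y(t) is the observation before X(t)."""
    sequence = generate_sequence()
    encoded = one_hot_encode(sequence)
    X = encoded[1:, :]   # shifted: observations 1..24
    y = encoded[:-1, :]  # truncated: observations 0..23
    # Reshape the input to [samples, timesteps, features].
    X = X.reshape(X.shape[0], 1, X.shape[1])
    return X, y

X, y = generate_data()
decoded_X = [one_hot_decode(sample)[0] for sample in X]
decoded_y = one_hot_decode(y)
for x_value, y_value in zip(decoded_X, decoded_y):
    print(x_value, "=>", y_value)
```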

We must test out this updated representation of the data to confirm it does what we expect. To do this, we can generate a sequence and review the decoded X and y values over the sequence.

The complete code listing for this sanity check is provided below.

Running the example prints the X and y components of the problem framing.

We can see that the first pattern will be hard (impossible) for the network to learn given the cold start. We can also see that the expected pattern of yhat(t) == X(t-1) holds down through the rest of the data.

The network design is similar, but with one small change.

Observations are shown to the network one at a time and a weight update is performed. Because we expect the state between observations to carry the information required to learn the prior observation, we need to ensure that this state is not reset after each batch (in this case, one batch is one training observation). We can do this by making the LSTM layer stateful and manually managing when the state is reset.

This involves setting the stateful argument to True on the LSTM layer and defining the input shape using the batch_input_shape argument that includes the dimensions [batchsize, timesteps, features].

There are 24 X,y pairs for a given random sequence, therefore a batch size of 6 was used (4 batches of 6 samples = 24 samples). Remember, a sequence is broken down into samples, and samples can be shown to the network in batches before an update to the network weights is performed. A large network size of 50 memory units is used, again to over-prescribe the capacity needed for the problem.

Next, after each epoch (one iteration of a randomly generated sequence), the internal state of the network can be manually reset. The model is fit for 2,000 training epochs and care is taken not to shuffle the samples within a sequence.

Putting this all together, the complete example is listed below.

Running the example gives a surprising result.

The problem cannot be learned and training ends with a model with 0% accuracy of echoing the last observation in the sequence.

How can this be?

The Beginner’s Mistake

This is a common mistake made by beginners, and if you have been around the block with RNNs or LSTMs, then you would have spotted this error above.

Specifically, the power of LSTMs does come from the learned internal state maintained, but this state is only powerful if it is trained as a function over past observations.

Stated another way, you must provide the network the context for the prediction (e.g. the observations that may contain the temporal dependence) as time steps on the input.

The above formulation trained the network to learn the output as a function only of the current input value, as in the first example:

    yhat(t) = f(X(t))

Not as a function of the last n observations, or even just the previous observation, as we require:

    yhat(t) = f(X(t-1))

The LSTM does only need one input at a time in order to learn this unknown temporal dependence, but it must perform backpropagation over the sequence in order to learn this dependence. You must provide the past observations of the sequence as context.

You are not defining a window (as in the case of Multilayer Perceptron where each past observation is a weighted input); instead, you are defining an extent of historical observations from which the LSTM will attempt to learn the temporal dependence (f(X(t-1), … X(t-n))).

To be clear, this is the beginner’s mistake when using LSTMs in Keras, and not necessarily in general.

Echo Lag Observation

Now that we have navigated around a common pitfall for beginners, we can develop an LSTM to echo the previous observation.

The first step is to reformulate the definition of the problem, again.

We know that the network only requires the last observation as input in order to make correct predictions. But we want the network to learn which of the past observations to echo in order to correctly solve this problem. Therefore, we will provide a subsequence of the last 5 observations as context.

Specifically, if our sequence contains: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], the X,y pairs would look as follows:

In this case, you can see that the first 5 rows and the last 1 row do not contain enough data, so in this case, we will remove them.

We will use the Pandas shift() function to create shifted versions of the sequence and the Pandas concat() function to recombine the shifted sequences back together. We will then manually exclude the rows that are not viable.

The updated generate_data() function is listed below.
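One possible version is sketched below. The exact row slicing is an assumption chosen to match the numbers given in this section: 5 time steps of context, 20 X,y pairs per sequence of 25, and an output equal to the second-to-last value of each input window, that is X(t-1). The n_context parameter is a hypothetical addition.

```python
from random import randint
import numpy as np
from pandas import DataFrame, concat

def generate_sequence(length=25):
    return [randint(0, 99) for _ in range(length)]

def one_hot_encode(sequence, n_unique=100):
    encoding = np.zeros((len(sequence), n_unique), dtype=int)
    encoding[np.arange(len(sequence)), sequence] = 1
    return encoding

def one_hot_decode(encoded_seq):
    return [int(np.argmax(vector)) for vector in encoded_seq]

def generate_data(n_context=5):
    """Return X as [samples, n_context, features] and y as the X(t-1) observation."""
    sequence = generate_sequence()
    encoded = one_hot_encode(sequence)
    df = DataFrame(encoded)
    # Column blocks ordered oldest to newest: X(t-4), X(t-3), ..., X(t).
    shifted = [df.shift(i) for i in range(n_context - 1, -1, -1)]
    df = concat(shifted, axis=1)
    # Drop the leading rows so that no window contains missing values.
    values = df.values[n_context:, :]
    X = values.reshape(len(values), n_context, encoded.shape[1])
    # Align the targets so that each y is one step behind the end of its window.
    y = encoded[n_context - 1:-1, :]
    return X, y

X, y = generate_data()
for i in range(len(X)):
    print(one_hot_decode(X[i]), "=>", one_hot_decode([y[i]])[0])
```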

Again, we can sanity check this updated function by generating a sequence and comparing the decoded X,y pairs. The complete example is listed below.

Running the example shows the context of the last 5 values as input and the last prior observation (X(t-1)) as output.

We can now develop an LSTM to learn this problem.

There are 20 X,y pairs for a given sequence; therefore, a batch size of 5 was chosen (4 batches of 5 examples = 20 samples).

The same structure was used with 50 memory units in the LSTM hidden layer and 100 neurons in the output layer. The network was fit for 2,000 epochs with internal state reset after each epoch.

The complete code listing is provided below.

Running the example shows that the network can fit the problem and correctly learn to return the X(t-1) observation as prediction within the context of 5 prior observations.

Example output is provided below; your specific outputs may differ given different random sequences.


Extensions

This section lists some extensions to the experiments in this tutorial.

  • Ignore Internal State. Care was taken to preserve internal state of the LSTMs across samples within a sequence by manually resetting state at the end of the epoch. We know that the network already has all the context and state required within each sample via timesteps. Explore whether the additional cross-sample state adds any benefit to the skill of the model.
  • Mask Missing Data. During data preparation, rows with missing data were removed. Explore the use of marking missing values with a special value (e.g. -1) and seeing whether the LSTM can learn from these examples. Also explore the use of a Masking layer as input and explore masking out missing values.
  • Entire Sequence as Timesteps. Only the last 5 observations were provided as context from which to learn to echo. Explore using the entire random sequence as context for each sample, built up as the sequence unfolds. This may require padding and even masking of missing values to meet the expectation of fixed-sized inputs to the LSTM.
  • Echo Different Lag Value. A specific lag value (t-1) was used in the echo example. Explore using a different lag value in the echo and how this affects properties such as model skill, training time, and LSTM layer size. I would expect that each lag could be learned using the same model structure.
  • Echo Lag Sequence. The network was trained to echo a specific lag observation. Explore variations where a lag sequence is echoed. This may require the use of the TimeDistributed layer on the output of the network to achieve sequence to sequence prediction.
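As a starting point for the "Echo Different Lag Value" extension, the data preparation could be parameterized by lag, as in the sketch below. The function name generate_lag_data() and its parameters are hypothetical, and the windowing here uses plain NumPy slicing rather than Pandas, so it keeps every complete window (21 samples for a sequence of 25) instead of the 20 used in the tutorial.

```python
import numpy as np

def generate_lag_data(length=25, n_unique=100, n_context=5, lag=1, rng=None):
    """Build windows of n_context observations; the target is `lag` steps before the window end."""
    if not 0 <= lag < n_context:
        raise ValueError("lag must fall inside the context window")
    rng = np.random.default_rng() if rng is None else rng
    sequence = rng.integers(0, n_unique, size=length)
    # One hot encode by selecting rows of an identity matrix.
    encoded = np.eye(n_unique, dtype=int)[sequence]
    # One window of n_context consecutive observations per sample.
    windows = [encoded[t - n_context + 1:t + 1] for t in range(n_context - 1, length)]
    X = np.array(windows)
    # The last position in a window is X(t), so X(t-lag) sits lag positions earlier.
    y = X[:, n_context - 1 - lag, :]
    return X, y

X, y = generate_lag_data(lag=2, rng=np.random.default_rng(0))
print(X.shape, y.shape)
```

With lag=1 this reproduces the X(t-1) framing used in the tutorial; other values within the window change only which time step supplies the target.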

Did you explore any of these extensions?
Share your findings in the comments below.


In this tutorial, you discovered how to develop an LSTM to address the problem of echoing a lag observation from a random sequence of integers.

Specifically, you learned:

  • How to generate and encode test data for the problem.
  • How to avoid the beginner’s mistake when attempting to address this and similar problems with LSTMs.
  • How to develop a robust LSTM to echo integers in an ad hoc sequence of random integers.

Do you have any questions?
Ask your questions in the comments and I will do my best to answer.

2 Responses to How to Learn to Echo Random Integers with Long Short-Term Memory Recurrent Neural Networks

  1. Shirish Ranade June 9, 2017 at 12:22 pm #


    That is neat. 100% accuracy !!

    Will this work with binomial classification problem as well?

    • Jason Brownlee June 10, 2017 at 8:12 am #

      This is a special case of a well defined small problem.
