# How to Learn to Echo Random Integers with LSTMs in Keras

Last Updated on August 27, 2020

Long Short-Term Memory (LSTM) Recurrent Neural Networks are able to learn the order dependence in long sequence data.

They are a fundamental technique used in a range of state-of-the-art results, such as image captioning and machine translation.

They can also be difficult to understand, specifically how to frame a problem to get the most out of this type of network.

In this tutorial, you will discover how to develop a simple LSTM recurrent neural network to learn how to echo back the number in an ad hoc sequence of random integers. Although a trivial problem, developing this network will provide the skills needed to apply LSTM on a range of sequence prediction problems.

After completing this tutorial, you will know:

• How to develop an LSTM for the simpler problem of echoing any given input.
• How to avoid the beginner’s mistake when applying LSTMs to sequence problems like echoing integers.
• How to develop a robust LSTM to echo the last observation from ad hoc sequences of random integers.

Let’s get started.

• Update Jan/2020: Updated API for Keras 2.3 and TensorFlow 2.0.

## Overview

This tutorial is divided into 4 parts; they are:

1. Generate and Encode Random Sequences
2. Echo Current Observation
3. Echo Lag Observation Without Context (The Beginner’s Mistake)
4. Echo Lag Observation

### Environment

This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this example.

This tutorial assumes you have Keras v2.0 or higher installed with either the TensorFlow or Theano backend. You do not need a GPU for this tutorial; all of the code will run easily on a CPU.

This tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

## Generate and Encode Random Sequences

The first step is to write some code to generate a random sequence of integers and encode them for the network.

### Generate Random Sequence

We can generate random integers in Python using the randint() function that takes two parameters indicating the range of integers from which to draw values.

In this tutorial, we will define the problem as having integer values between 0 and 99 with 100 unique values.

We can put this in a function called generate_sequence() that will generate a sequence of random integers of the desired length, with the default length set to 25 elements.

This function is listed below.
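As a rough illustration, one way such a function could be written is sketched here. It assumes the standard library `random.randint()`, whose two bounds are both inclusive, and a hypothetical `n_unique` parameter for the number of distinct values.

```python
from random import randint

# generate a sequence of random integers in [0, n_unique - 1]
def generate_sequence(length=25, n_unique=100):
    # randint() includes both end points, so the upper bound is n_unique - 1
    return [randint(0, n_unique - 1) for _ in range(length)]

sequence = generate_sequence()
print(sequence)
```

Calling `generate_sequence(length=10)` would return a shorter sequence drawn from the same range.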

### One Hot Encode Random Sequence

Once we have generated sequences of random integers, we need to transform them into a format that is suitable for training an LSTM network.

One option would be to rescale the integer to the range [0,1]. This would work and would require that the problem be phrased as regression.

I am interested in predicting the right number, not a number close to the expected value. This means I would prefer to frame the problem as classification rather than regression, where the expected output is a class and there are 100 possible class values.

In this case, we can use a one hot encoding of the integer values, where each value is represented by a 100-element binary vector that is all “0” values except at the index of the integer, which is marked “1”.

The function below called one_hot_encode() defines how to iterate over a sequence of integers and create a binary vector representation for each and returns the result as a 2-dimensional array.
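A minimal sketch of such an encoding function follows. The `n_unique` parameter is an assumption added for illustration, and NumPy is used only to return the result as a 2-dimensional array.

```python
import numpy as np

# one hot encode a sequence of integers as a 2D array [len(sequence), n_unique]
def one_hot_encode(sequence, n_unique=100):
    encoding = []
    for value in sequence:
        # start with all zeros, then mark the index of the integer
        vector = [0] * n_unique
        vector[value] = 1
        encoding.append(vector)
    return np.array(encoding)

encoded = one_hot_encode([3, 0, 99])
print(encoded.shape)  # (3, 100)
```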

We also need to decode the encoded values so that we can make use of the predictions, in this case, just review them.

The one hot encoding can be inverted by using the argmax() NumPy function that returns the index of the value in the vector with the largest value.

The function below, named one_hot_decode(), will decode an encoded sequence and can be used to later decode predictions from our network.
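The decoding step can be sketched as follows, assuming the encoded sequence is a 2-dimensional array with one vector per row. The same function would also work on softmax outputs, because `argmax()` simply picks the index with the largest value.

```python
import numpy as np

# decode a one hot encoded sequence (or rows of probabilities) back to integers
def one_hot_decode(encoded_seq):
    return [int(np.argmax(vector)) for vector in encoded_seq]

# rows 3, 0 and 99 of the identity matrix are the one hot vectors for 3, 0 and 99
encoded = np.eye(100)[[3, 0, 99]]
print(one_hot_decode(encoded))  # [3, 0, 99]
```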

### Complete Example

We can tie all of this together.

Below is the complete code listing for generating a sequence of 25 random integers and encoding each integer as a binary vector.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example first prints the list of 25 random integers, followed by a truncated view of the binary representations of all integers in the sequence, one vector per line, then the decoded sequence again.

Now that we know how to prepare and represent random sequences of integers, we can look at using LSTMs to learn them.

## Echo Current Observation

Let’s start out by looking at a simpler echo problem.

In this section, we will develop an LSTM to echo the current observation. That is, given a random integer as input, the network returns the same integer as output.

Or, stated slightly more formally:

    yhat(t) = f(X(t))

That is, the model is to predict the value at the current time (yhat(t)) as a function (f()) of the observed value at the current time (X(t)).

It is a simple problem because no memory is required, just a function to map an input to an identical output.

It is a trivial problem and will demonstrate a few useful things:

• How to use the problem representation machinery above.
• How to use LSTMs in Keras.
• The capacity of an LSTM required to learn such a trivial problem.

This will lay the foundation for the echo of lag observations next.

First, we will develop a function to prepare a random sequence ready to train or evaluate an LSTM. This function must first generate a random sequence of integers, use a one hot encoding, then transform the input data to be a 3-dimensional array.

LSTMs require a 3D input comprised of the dimensions [samples, timesteps, features]. Our problem will be comprised of 25 examples per sequence, 1 time step, and 100 features for the one hot encoding.

This function is listed below, named generate_data().
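A sketch of this data preparation step is given here. To keep the block self-contained, it one hot encodes by indexing rows of an identity matrix rather than calling a separate helper; the parameter names are assumptions for illustration.

```python
from random import randint
import numpy as np

# generate one random sequence prepared for the LSTM
def generate_data(length=25, n_unique=100):
    # generate a random sequence of integers
    sequence = [randint(0, n_unique - 1) for _ in range(length)]
    # one hot encode: row i of the identity matrix is the vector for integer i
    encoded = np.eye(n_unique, dtype=int)[sequence]
    # reshape the input to [samples, timesteps, features]
    X = encoded.reshape(length, 1, n_unique)
    # the target is the same encoded sequence (echo the current observation)
    y = encoded
    return X, y

X, y = generate_data()
print(X.shape, y.shape)  # (25, 1, 100) (25, 100)
```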

Next, we can define our LSTM model.

The model must specify the expected dimensionality of the input data. In this case, in terms of timesteps (1) and features (100). We will use a single hidden layer LSTM with 15 memory units.

The output layer is a fully connected layer (Dense) with 100 neurons for the 100 possible integers that may be output. A softmax activation function is used on the output layer to allow the network to learn and output the distribution over the possible output values.

The network will use the log loss function (categorical cross-entropy) while training, which is suitable for multi-class classification problems, and the efficient Adam optimization algorithm. The accuracy metric will be reported each training epoch to give an idea of the skill of the model in addition to the loss.

We will fit the model manually by running each epoch by hand with a new generated sequence. The model will be fit for 500 epochs, or stated another way, trained on 500 randomly generated sequences.

This will encourage the network to learn to reproduce the actual input rather than memorizing a fixed training dataset.

Once the model is fit, we will make a prediction on a new sequence and compare the predicted output to the expected output.

The complete example is listed below.
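The model and training loop described above could be sketched roughly as follows, assuming the `tensorflow.keras` API from TensorFlow 2.x. The helper names `build_model()`, `fit_model()` and `echo()` are hypothetical and used only to organize the illustration; nothing is trained until they are called.

```python
from random import randint
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def generate_data(length=25, n_unique=100):
    # random sequence -> one hot -> [samples, timesteps, features]
    sequence = [randint(0, n_unique - 1) for _ in range(length)]
    encoded = np.eye(n_unique)[sequence]
    return encoded.reshape(length, 1, n_unique), encoded

def build_model(n_units=15, n_unique=100):
    model = Sequential()
    model.add(Input(shape=(1, n_unique)))             # 1 time step, 100 features
    model.add(LSTM(n_units))                          # single hidden LSTM layer
    model.add(Dense(n_unique, activation='softmax'))  # one output per integer
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model

def fit_model(model, n_epochs=500):
    # each epoch uses one freshly generated random sequence
    for _ in range(n_epochs):
        X, y = generate_data()
        model.fit(X, y, epochs=1, verbose=0)
    return model

def echo(model):
    # predict on a new sequence and decode expected and predicted integers
    X, y = generate_data()
    yhat = model.predict(X, verbose=0)
    expected = [int(np.argmax(v)) for v in y]
    predicted = [int(np.argmax(v)) for v in yhat]
    return expected, predicted
```

Calling `model = fit_model(build_model())` followed by `echo(model)` would train on 500 random sequences and return the expected and predicted integers for a new one. Passing `verbose=2` to `fit()` instead of `verbose=0` would print the loss and accuracy for each epoch.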

Running the example prints the log loss and accuracy each epoch.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The network is a little over-prescribed, having more memory units and training epochs than is required for such a simple problem, and you can see this by the fact that the network quickly achieves 100% accuracy.

At the end of the run, the predicted sequence is compared to a randomly generated sequence and the two look identical.

Now that we know how to use the tools to create and represent random sequences and to fit an LSTM to learn to echo the current observation, let’s look at how we can use LSTMs to learn how to echo a past observation.

## Echo Lag Observation Without Context (The Beginner’s Mistake)

The problem of predicting a lag observation can be more formally defined as follows:

    yhat(t) = f(X(t-n))

Where the expected output for the current time step yhat(t) is defined as a function (f()) of a specific previous observation (X(t-n)).

The promise of LSTMs suggests that you can show examples to the network one at a time and that the network will use internal state to learn and to sufficiently remember prior observations in order to solve this problem.

Let’s try this out.

First, we must update the generate_data() function and re-define the problem.

Rather than using the same sequence for input and output, we will use a shifted version of the encoded sequence as input and a truncated version of the encoded sequence as output.

These changes are required in order to take a sequence of numbers, such as [1, 2, 3, 4], and turn them into a supervised learning problem with input (X) and output (y) components, such as:

    X,   y
    1,   NaN
    2,   1
    3,   2
    4,   3
    NaN, 4

In this example, you can see that the first and last rows do not contain sufficient data for the network to learn. This could be marked as a “no data” value and masked, but a simpler solution is to simply remove it from the dataset.

The updated generate_data() function is listed below:
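One way to express this framing is sketched here, assuming the lag is a single step so that each target is the observation immediately before the input. The one hot encoding again uses rows of an identity matrix to keep the block self-contained.

```python
from random import randint
import numpy as np

# generate one random sequence framed as X(t) -> X(t-1)
def generate_data(length=25, n_unique=100):
    sequence = [randint(0, n_unique - 1) for _ in range(length)]
    encoded = np.eye(n_unique, dtype=int)[sequence]
    # input: drop the first observation (the shifted version)
    X = encoded[1:, :]
    # output: drop the last observation (the truncated version)
    y = encoded[:-1, :]
    # reshape the input to [samples, timesteps, features]
    X = X.reshape(X.shape[0], 1, n_unique)
    return X, y

X, y = generate_data()
# decode each pair; every y should equal the X printed on the previous line
for i in range(len(X)):
    print('X=%d y=%d' % (np.argmax(X[i, 0]), np.argmax(y[i])))
```

With a 25-element sequence this yields 24 X,y pairs, because the rows that lack either an input or an output are dropped.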

We must test out this updated representation of the data to confirm it does what we expect. To do this, we can generate a sequence and review the decoded X and y values over the sequence.

The complete code listing for this sanity check is provided below.

Running the example prints the X and y components of the problem framing.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the first pattern will be hard (impossible) for the network to learn given the cold start. We can also see the expected pattern of yhat(t) == X(t-1) down through the data.

The network design is similar, but with one small change.

Observations are shown to the network one at a time and a weight update is performed. Because we expect the state between observations to carry the information required to learn the prior observation, we need to ensure that this state is not reset after each batch (in this case, one batch is one training observation). We can do this by making the LSTM layer stateful and manually managing when the state is reset.

This involves setting the stateful argument to True on the LSTM layer and defining the input shape using the batch_input_shape argument that includes the dimensions [batchsize, timesteps, features].

There are 24 X,y pairs for a given random sequence, therefore a batch size of 6 was used (4 batches of 6 samples = 24 samples). Remember, a sequence is broken down into samples, and samples can be shown to the network in batches before an update to the network weights is performed. A large network size of 50 memory units is used, again to over-prescribe the capacity needed for the problem.

Next, after each epoch (one iteration of a randomly generated sequence), the internal state of the network can be manually reset. The model is fit for 2,000 training epochs and care is taken not to shuffle the samples within a sequence.

Putting this all together, the complete example is listed below.

Running the example gives a surprising result.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The problem cannot be learned and training ends with a model with 0% accuracy of echoing the last observation in the sequence.

How can this be?

### The Beginner’s Mistake

This is a common mistake made by beginners, and if you have been around the block with RNNs or LSTMs, then you would have spotted this error above.

Specifically, the power of LSTMs does come from the learned internal state maintained, but this state is only powerful if it is trained as a function over past observations.

Stated another way, you must provide the network the context for the prediction (e.g. the observations that may contain the temporal dependence) as time steps on the input.

The above formulation trained the network to learn the output as a function only of the current input value, as in the first example:

    yhat(t) = f(X(t))

Not as a function of the last n observations, or even just the previous observation, as we require:

    yhat(t) = f(X(t-1))

The LSTM does only need one input at a time in order to learn this unknown temporal dependence, but it must perform backpropagation over the sequence in order to learn this dependence. You must provide the past observations of the sequence as context.

You are not defining a window (as in the case of Multilayer Perceptron where each past observation is a weighted input); instead, you are defining an extent of historical observations from which the LSTM will attempt to learn the temporal dependence (f(X(t-1), … X(t-n))).
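The difference between the two input layouts can be illustrated with array shapes alone. This is only a sketch with randomly generated integers; the sizes (20 samples, 5 time steps, 100 features) are assumptions chosen to match the rest of the tutorial.

```python
import numpy as np

n_samples, n_steps, n_features = 20, 5, 100
rng = np.random.default_rng(1)

# 20 samples, each holding a context of 5 random integers
integers = rng.integers(0, n_features, size=(n_samples, n_steps))

# LSTM-style input: [samples, timesteps, features], steps are read in order
X_lstm = np.eye(n_features)[integers]

# MLP-style window: the 5 steps are flattened into one 500-element vector
X_mlp = X_lstm.reshape(n_samples, n_steps * n_features)

print(X_lstm.shape, X_mlp.shape)  # (20, 5, 100) (20, 500)
```

In the flattened layout every past observation becomes a separate block of weighted inputs, whereas in the 3-dimensional layout the same 100 features are presented to the LSTM once per time step.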

To be clear, this is the beginner’s mistake when using LSTMs in Keras, and not necessarily in general.

## Echo Lag Observation

Now that we have navigated around a common pitfall for beginners, we can develop an LSTM to echo the previous observation.

The first step is to reformulate the definition of the problem, again.

We know that the network only requires the last observation as input in order to make correct predictions. But we want the network to learn which of the past observations to echo in order to correctly solve this problem. Therefore, we will provide a subsequence of the last 5 observations as context.

Specifically, if our sequence contains: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], the X,y pairs would look as follows:

In this case, you can see that the first 5 rows and the last 1 row do not contain enough data, so in this case, we will remove them.

We will use the Pandas shift() function to create shifted versions of the sequence and the Pandas concat() function to recombine the shifted sequences back together. We will then manually exclude the rows that are not viable.

The updated generate_data() function is listed below.
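A sketch of this step is shown here. The `shift()`, `concat()` and row-slicing lines follow the snippet quoted by a reader in the comments; the target slice is an assumption, chosen so that each y is the X(t-1) observation of its window, and the `n_steps` parameter is hypothetical.

```python
from random import randint
import numpy as np
from pandas import DataFrame, concat

# generate one random sequence framed as 5 time steps of context -> X(t-1)
def generate_data(length=25, n_unique=100, n_steps=5):
    sequence = [randint(0, n_unique - 1) for _ in range(length)]
    encoded = np.eye(n_unique, dtype=int)[sequence]
    # create lag inputs: column groups hold X(t-4), X(t-3), X(t-2), X(t-1), X(t)
    df = DataFrame(encoded)
    df = concat([df.shift(i) for i in range(n_steps - 1, -1, -1)], axis=1)
    # remove the first 5 rows, which include every row with missing values
    values = df.values[n_steps:, :]
    # convert to 3D for input: [samples, timesteps, features]
    X = values.reshape(len(values), n_steps, n_unique)
    # target: the observation just before the newest one in each window
    y = encoded[n_steps - 1:-1, :]
    return X, y

X, y = generate_data()
for i in range(len(X)):
    print('X=%s, y=%d' % ([int(np.argmax(v)) for v in X[i]], np.argmax(y[i])))
```

Note that with this framing the value to echo is itself one of the five time steps in each input sample, which gives 20 X,y pairs for a 25-element sequence.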

Again, we can sanity check this updated function by generating a sequence and comparing the decoded X,y pairs. The complete example is listed below.

Running the example shows the context of the last 5 values as input and the last prior observation (X(t-1)) as output.

We can now develop an LSTM to learn this problem.

There are 20 X,y pairs for a given sequence; therefore, a batch size of 5 was chosen (4 batches of 5 examples = 20 samples).

The same structure was used with 50 memory units in the LSTM hidden layer and 100 neurons in the output layer. The network was fit for 2,000 epochs with internal state reset after each epoch.

The complete code listing is provided below.

Running the example shows that the network can fit the problem and correctly learn to return the X(t-1) observation as prediction within the context of 5 prior observations.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Example output is provided below.

## Extensions

This section lists some extensions to the experiments in this tutorial.

• Ignore Internal State. Care was taken to preserve internal state of the LSTMs across samples within a sequence by manually resetting state at the end of the epoch. We know that the network already has all the context and state required within each sample via timesteps. Explore whether the additional cross-sample state adds any benefit to the skill of the model.
• Mask Missing Data. During data preparation, rows with missing data were removed. Explore the use of marking missing values with a special value (e.g. -1) and seeing whether the LSTM can learn from these examples. Also explore the use of a Masking layer as input and explore masking out missing values.
• Entire Sequence as Timesteps. Only the last 5 observations were provided as context from which to learn to echo. Explore using the entire random sequence as context for each sample, built up as the sequence unfolds. This may require padding and even masking of missing values to meet the expectation of fixed-sized inputs to the LSTM.
• Echo Different Lag Value. A specific lag value (t-1) was used in the echo example. Explore using a different lag value in the echo and how this affects properties such as model skill, training time, and LSTM layer size. I would expect that each lag could be learned using the same model structure.
• Echo Lag Sequence. The network was trained to echo a specific lag observation. Explore variations where a lag sequence is echoed. This may require the use of the TimeDistributed layer on the output of the network to achieve sequence to sequence prediction.

Did you explore any of these extensions?

## Summary

In this tutorial, you discovered how to develop an LSTM to address the problem of echoing a lag observation from a random sequence of integers.

Specifically, you learned:

• How to generate and encode test data for the problem.
• How to avoid the beginner’s mistake when attempting to address this and similar problems with LSTMs.
• How to develop a robust LSTM to echo integers in an ad hoc sequence of random integers.

Do you have any questions?

### 37 Responses to How to Learn to Echo Random Integers with LSTMs in Keras

1. Shirish Ranade June 9, 2017 at 12:22 pm #

Wow,

That is neat. 100% accuracy !!

Will this work with binomial classification problem as well?

• Jason Brownlee June 10, 2017 at 8:12 am #

This is a special case of a well defined small problem.

2. MOHD SAIFUL BAHRI IBRAHIM October 4, 2017 at 12:18 pm #

Very good …thank u

• Jason Brownlee October 4, 2017 at 3:37 pm #

Thanks.

3. Scott November 7, 2017 at 8:59 am #

Jason, would you clarify the remark, “You are not defining a window (as in the case of Multilayer Perceptron where each past observation is a weighted input); instead, you are defining an extent of historical observations”? It seems that defining a window of weighted inputs is exactly what we are doing: We are feeding in a series of windows, and the LSTM is learning to pick out one element from that window.

• Jason Brownlee November 7, 2017 at 9:56 am #

Not quite, the LSTM processes time steps one at a time, not a weighting of all lag obs as in an MLP with a window.

4. Aditya February 22, 2018 at 8:39 pm #

Hi Jason

The post is really insightful, especially the part which covers the Beginner’s mistake.
I have one lingering question though. In the last section, where we are trying to echo the lag observation, how does an LSTM model provide an advantage over the Multilayer Perceptron?
We could have considered the time steps to be input features for an MLP and then trained it to learn the correct weights for those inputs and get the correct predicted output.

Basically, I want to understand the case where using an LSTM based RNN will be advantageous since an MLP model would be unsuccessful. The echo lag example doesn’t really seem to show the advantage of an LSTM model to me.

• Jason Brownlee February 23, 2018 at 11:55 am #

The MLP must be exposed to all time steps as features at once, where as the LSTM sees only one time step at a time and accumulates state over time steps in order to create an output.

That is the key difference.

• Aditya February 24, 2018 at 7:05 pm #

Thanks Jason, I understand the difference in the implementation. However, since the echo lag problem could also be solved by using an MLP model with all time steps as features, I wanted to understand what advantage does an LSTM offer over it?
Specifically, are there any sequences which can not be learnt with an MLP model (by using time steps as input features)?

• Jason Brownlee February 25, 2018 at 7:44 am #

The fact that each input time step is provided one at a time and the LSTM must “remember” what to echo using internal state demonstrate the difference. The MLP MUST have all time steps provided as input at once.

Perhaps this post on BPTT will clear things up further for you:
https://machinelearningmastery.com/gentle-introduction-backpropagation-time/

5. anurag October 30, 2018 at 12:48 am #

awesome article.. thnx

• Jason Brownlee October 30, 2018 at 6:03 am #

I’m happy it helped.

6. Scholes December 8, 2018 at 3:14 am #

Very helpeful article, i thank you sir for the efforts.

I’m bit new to LSTM but i kinda understood how it works, But i have a problem in my interpretation :

– I’d like to predict the next random output from “yhat”, so i did a little loop which will predict and append the list of test each time (So it will predict more future outputs) in other meaning each “yhat” list becomes “X”list in every iteration which i thought it means it will predict the next output.

But my problem is it stays blocked in the same X list that i gaved in the begening :

The list of predicted numbers gets smaller despite i did append each time, so i cant have future predictions.

Would you like i show the piece of code of that and output? it will be so helpfull for me what you did

Thank you again

• Jason Brownlee December 8, 2018 at 7:12 am #

Nice work!

• Scholes December 11, 2018 at 1:40 am #

Good morning, I guess you didnt understand me lol sorry i dont explain well

In every iteration each X_list is the previous yhat predicted list. means :

X(n iteration) = yhat(n-1) , here’s the execution so you can have an idea about my problem -> :

(‘Interation = ‘, 6)
X file: [20, 38, 19, 48, 28, 43, 50, 14, 25, 28, 20, 19, 44, 16, 9]
Predicted: [48, 28, 43, 50, 14, 25, 28, 20, 19, 44, 16, 9, 25, 13, 15]
(‘Interation = ‘, 7)
X file: [28, 43, 50, 14, 25, 28, 20, 19, 44, 16]
Predicted: [14, 25, 28, 20, 19, 44, 16, 9, 25, 13]
(‘Interation = ‘, 8)
X file: [25, 28, 20, 19, 44]
Predicted: [19, 44, 16, 9, 25]
(‘Interation = ‘, 9)
X file: []
Predicted: []

My idea was to keep predict future outputs each time we give it a predicted number, but unfortunately it shows a small vicious cycle list . if you can explain me please for my educational project. i let show you the piece of code i did about that :

    for i in range(1,10): #ITERATE 10 TIMES
        yhat_list = one_hot_decode(yhat) #Read last predicted list as an input for next prediction
        encoded = one_hot_encode(yhat_list)
        # create lag inputs
        df = DataFrame(encoded)
        df = concat([df.shift(4), df.shift(3), df.shift(2), df.shift(1), df], axis=1)
        # remove non-viable rows
        values = df.values
        values = values[5:,:]
        # convert to 3d for input
        X = values.reshape(len(values), 5, 100)
        yhat = model.predict(X, batch_size=5)
        print("Interation = ", i)
        print('X file: %s' % one_hot_decode(X))
        print('Predicted: %s' % one_hot_decode(yhat))

THANK YOU SO MUCH FOR YOUR EXPLICATIONS

• Jason Brownlee December 11, 2018 at 7:48 am #

I don’t follow. What is the problem you’re having exactly?

• Scholes December 13, 2018 at 12:49 am #

I want to continue predicting the next 100 outputs of “yhat” , so i pass it each time as X list in each iteration. i thought it

will predict new outputs in each time.

But unfortunately the loop never progress and keep getting smaller and predicting already known numbers. (as i showed you in the previous example )

    for i in range(1,100): #ITERATE 10 TIMES
        yhat_list = one_hot_decode(yhat) #Read last predicted list as an input for next prediction
        encoded = one_hot_encode(yhat_list)
        # create lag inputs
        df = DataFrame(encoded)
        df = concat([df.shift(4), df.shift(3), df.shift(2), df.shift(1), df], axis=1)
        # remove non-viable rows
        values = df.values
        values = values[5:,:]
        # convert to 3d for input
        X = values.reshape(len(values), 5, 100)
        yhat = model.predict(X, batch_size=5)
        print("Interation = ", i)
        print('X file: %s' % one_hot_decode(X))
        print('Predicted: %s' % one_hot_decode(yhat))

• Jason Brownlee December 13, 2018 at 7:54 am #

I have a number of posts on multi-step predicting with LSTMs, perhaps start here:
https://machinelearningmastery.com/start-here/#deep_learning_time_series

7. Scholes December 13, 2018 at 11:33 pm #

ok thank you jason i’ll check that out , best regards

8. brandy January 31, 2019 at 8:19 am #

Good afternoon Jason, I want to replace the random generator with my own dataset and iterate through it and get the most common numbers, it’s in a csv file how would I go about doing the in the code? Thanks in advanced

• Jason Brownlee January 31, 2019 at 2:15 pm #

I cannot write a custom example for you. What problem are you having exactly?

9. jmaidagan April 2, 2019 at 8:46 pm #

I think the solution to the interesting problem that you have raised has nothing to do with the long LSTM memory. The result you are looking for is being supplied in axis 2 of the input. You can verify it through ‘stateful = False’ on line 47 of your code: the convergence to acc = 100% is even faster!
On the other hand, what you call “a common pitfall for beginners” is exactly the way that is advised in https://keras.io/examples/lstm_stateful/. The calculation (tsteps = 2 lahead = 1) shows that, although the problem is not solved, there is an appreciable difference between stateful = True / False.
From my point of view, the crucial question is:
It is possible to train an LSTM so that (in production regime) it is capable of producing the echo receiving only a random sequence, a number at each time?
This problem would illuminate the LSTM advantage over other networks, since it is impossible to solve it without memory.
Regards

• Jason Brownlee April 3, 2019 at 6:42 am #

I believe it could, yes.

10. Christophe May 8, 2019 at 3:19 pm #

Hi Jason – May I ask why you used a custom-coded one-hot encoding rather than the keras function to_categorical?

Also what would be the implementation if the n_unique value is extremely large?

Thanks.

• Jason Brownlee May 9, 2019 at 6:35 am #

I don’t recall why, sorry.

How large? It is common to one hot encode on NLP problems up to tens of thousands of tokens, or more.

Also, embeddings work amazingly well for large cardinality categorical variables.

11. aswin June 20, 2020 at 5:17 am #

how to predict next number using this program?

• Jason Brownlee June 20, 2020 at 6:19 am #

Call: model.predict()

• Shubham Chauhan July 2, 2020 at 10:50 pm #

Thanks for the wonderful code Sir , but I am not able to predict next number that’s my code from your above article :

Xnew = array([[1,36, 0, 36, 21,2]])
ynew = model.predict(Xnew)

print(“X=%s, Predicted=%s” % (Xnew, ynew))

then it shows ——-

ValueError: Error when checking input: expected lstm_26_input to have 3 dimensions, but got array with shape (1, 6)

But after reshaping it into 3d it shows error

Xnew = array([[1,36, 0, 36, 21,2]])

Xnew=Xnew.reshape(Xnew.shape,Xnew.shape , 1)

ynew = model.predict(Xnew)

print(“X=%s, Predicted=%s” % (Xnew, ynew)) then it shows

ValueError: Error when checking input: expected lstm_26_input to have shape (5, 340) but got array with shape (6, 1)

Can you write the code in the reply plzzz because most of the people are facing this problem ??????

12. Shubham Chauhan July 3, 2020 at 12:40 pm #

sir your code is working perfectly but I am not able to predict next number if I want to give my own choice of X . By using model.predict () Suppose if I want to give 34 then what will be the next number ? Sir plzzz its a two line of code plzz write in the comment section

• Jason Brownlee July 3, 2020 at 2:24 pm #

13. João Victor November 16, 2020 at 12:33 am #

But, how can I predict the future random numbers? That’s not clear to me. Because we already put the random numbers as input, so how to predict the next one?

14. Konradino December 14, 2020 at 2:37 am #

If I want to limit the number of generated numbers from 25 to 5, or even to 6 – what other factors should be modified? Change just in amount of elements in generated numbers generates an error. BTW: going to buy your e-book, thanks!

• Jason Brownlee December 14, 2020 at 6:20 am #

Yes, change the definition of the problem, and the encoding.

You might need to tune the model and learning hyperparameters for the change in difficulty of the problem.

15. Darius.Nguyen March 8, 2021 at 6:19 pm #

First of all, your post gives me more knowledge. But I have a question? When I test your model with [a,b,c,d,e], I try to change e with many numbers and keep a,b,c,d but your model can predict the right number d. In my mind, I think when we change some number, d of predict should be changed? Is it right?

• Jason Brownlee March 9, 2021 at 5:16 am #

Perhaps the model is overfit.