How to Learn to Echo Random Integers with LSTMs in Keras

By Jason Brownlee on August 27, 2020 in Long Short-Term Memory Networks

Long Short-Term Memory (LSTM) Recurrent Neural Networks are able to learn the order dependence in long sequence data.

They are a fundamental technique used in a range of state-of-the-art results, such as image captioning and machine translation.

They can also be difficult to understand, specifically how to frame a problem to get the most out of this type of network.

In this tutorial, you will discover how to develop a simple LSTM recurrent neural network to learn how to echo back the number in an ad hoc sequence of random integers. Although a trivial problem, developing this network will provide the skills needed to apply LSTM on a range of sequence prediction problems.

After completing this tutorial, you will know:

How to develop an LSTM for the simpler problem of echoing any given input.
How to avoid the beginner’s mistake when applying LSTMs to sequence problems like echoing integers.
How to develop a robust LSTM to echo the last observation from ad hoc sequences of random integers.

Let’s get started.

Update Jan/2020: Updated API for Keras 2.3 and TensorFlow 2.0.

How to Learn to Echo Random Integers with Long Short-Term Memory Recurrent Neural Networks
Photo by Franck Michel, some rights reserved.

Overview

This tutorial is divided into 4 parts; they are:

Generate and Encode Random Sequences
Echo Current Observation
Echo Lag Observation Without Context (Beginner’s Mistake)
Echo Lag Observation

Environment

This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this example.

This tutorial assumes you have Keras v2.0 or higher installed with either the TensorFlow or Theano backend. You do not need a GPU for this tutorial; all code will easily run on a CPU.

This tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help setting up your Python environment, see this post:

How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda

Generate and Encode Random Sequences

The first step is to write some code to generate a random sequence of integers and encode them for the network.

Generate Random Sequence

We can generate random integers in Python using the randint() function from the random module. It takes two parameters that define the range of integers from which to draw values, and both endpoints are included.

In this tutorial, we will define the problem as having integer values between 0 and 99 inclusive, giving 100 unique values.

randint(0, 99)

We can put this in a function called generate_sequence() that will generate a sequence of random integers of the desired length, with the default length set to 25 elements.

This function is listed below.

# generate a sequence of random numbers in [0, 99]
def generate_sequence(length=25):
	return [randint(0, 99) for _ in range(length)]
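
As a quick check of the range, the sketch below seeds Python's random module and confirms that every generated value falls in [0, 99]. The seed value is arbitrary and is only there so that repeated runs draw the same sequence; it is an assumption for illustration, not part of the tutorial's code.

```python
from random import randint, seed

# generate a sequence of random numbers in [0, 99]
def generate_sequence(length=25):
    return [randint(0, 99) for _ in range(length)]

# fix the seed so repeated runs draw the same sequence (seed value is arbitrary)
seed(1)
first = generate_sequence()
seed(1)
second = generate_sequence()

# check the length and that both endpoints of the range are respected
print(len(first), min(first) >= 0, max(first) <= 99)
# the same seed should reproduce the same sequence
print(first == second)
```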

One Hot Encode Random Sequence

Once we have generated sequences of random integers, we need to transform them into a format that is suitable for training an LSTM network.

One option would be to rescale the integer to the range [0,1]. This would work and would require that the problem be phrased as regression.

I am interested in predicting the right number, not a number close to the expected value. This means I would prefer to frame the problem as classification rather than regression, where the expected output is a class and there are 100 possible class values.
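
To make the difference concrete, here is a minimal sketch of the regression framing. The helper names scale() and unscale() and the 0.01 error are made up for illustration; the point is only that a small error in the rescaled value can round to a different integer.

```python
# rescale an integer in [0, 99] to [0, 1] and back (illustrative helpers)
def scale(value, max_value=99):
    return value / max_value

def unscale(scaled, max_value=99):
    return int(round(scaled * max_value))

# a perfect regression output recovers the original integer
exact = unscale(scale(37))
# a hypothetical output that is off by only 0.01 in scaled units
near_miss = unscale(scale(37) + 0.01)
print(exact, near_miss)
```

With classification, a prediction is either the right class or it is not, which matches the goal of echoing the exact integer.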

In this case, we can use a one hot encoding of the integer values, where each value is represented by a 100-element binary vector that is all “0” values except at the index of the integer, which is marked “1”.

The function below called one_hot_encode() defines how to iterate over a sequence of integers and create a binary vector representation for each and returns the result as a 2-dimensional array.

# one hot encode sequence
def one_hot_encode(sequence, n_unique=100):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)
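
For comparison, the same encoding can be sketched by selecting rows of an identity matrix with NumPy. This is an alternative formulation, not the method used in this tutorial, and the name one_hot_encode_eye() is hypothetical. A small alphabet of 5 values is used so the result is easy to read.

```python
from numpy import array, eye, array_equal

# loop-based encoder, as defined above
def one_hot_encode(sequence, n_unique=100):
    encoding = list()
    for value in sequence:
        vector = [0 for _ in range(n_unique)]
        vector[value] = 1
        encoding.append(vector)
    return array(encoding)

# alternative: pick rows of an identity matrix (hypothetical helper)
def one_hot_encode_eye(sequence, n_unique=100):
    return eye(n_unique, dtype=int)[sequence]

small = [3, 0, 4]
print(one_hot_encode(small, n_unique=5))
# both approaches should produce the same array
same = array_equal(one_hot_encode(small, 5), one_hot_encode_eye(small, 5))
print(same)
```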

We also need to decode the encoded values so that we can make use of the predictions, in this case, just review them.

The one hot encoding can be inverted by using the argmax() NumPy function that returns the index of the value in the vector with the largest value.

The function below, named one_hot_decode(), will decode an encoded sequence and can be used to later decode predictions from our network.

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]
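
Because argmax() only looks for the largest value, the same function should also work on predicted probabilities, not just on exact 0/1 vectors. The sketch below uses made-up probabilities over 4 classes to show this.

```python
from numpy import array, argmax

# decode by taking the index of the largest value in each vector
def one_hot_decode(encoded_seq):
    return [argmax(vector) for vector in encoded_seq]

# made-up "predictions" over 4 classes; each row sums to 1
probabilities = array([
    [0.10, 0.70, 0.10, 0.10],
    [0.05, 0.05, 0.15, 0.75],
    [0.60, 0.20, 0.10, 0.10],
])
decoded = one_hot_decode(probabilities)
print(decoded)
```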

Complete Example

We can tie all of this together.

Below is the complete code listing for generating a sequence of 25 random integers and encoding each integer as a binary vector.

from random import randint
from numpy import array
from numpy import argmax

# generate a sequence of random numbers in [0, 99]
def generate_sequence(length=25):
	return [randint(0, 99) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique=100):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# generate random sequence
sequence = generate_sequence()
print(sequence)
# one hot encode
encoded = one_hot_encode(sequence)
print(encoded)
# one hot decode
decoded = one_hot_decode(encoded)
print(decoded)

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example first prints the list of 25 random integers, followed by a truncated view of the binary representations of all integers in the sequence, one vector per line, then the decoded sequence again.

[37, 99, 40, 98, 44, 27, 99, 18, 52, 97, 46, 39, 60, 13, 66, 29, 26, 4, 65, 85, 29, 88, 8, 23, 61]
[[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 1]
[0 0 0 ..., 0 0 0]
...,
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]
[0 0 0 ..., 0 0 0]]
[37, 99, 40, 98, 44, 27, 99, 18, 52, 97, 46, 39, 60, 13, 66, 29, 26, 4, 65, 85, 29, 88, 8, 23, 61]

Now that we know how to prepare and represent random sequences of integers, we can look at using LSTMs to learn them.

Echo Current Observation

Let’s start out by looking at a simpler echo problem.

In this section, we will develop an LSTM to echo the current observation. That is, given a random integer as input, the model must return the same integer as output.

Or slightly more formally stated as:

yhat(t) = f(X(t))

That is, the model is to predict the value at the current time (yhat(t)) as a function (f()) of the observed value at the current time (X(t)).

It is a simple problem because no memory is required, just a function to map an input to an identical output.

It is a trivial problem and will demonstrate a few useful things:

How to use the problem representation machinery above.
How to use LSTMs in Keras.
The capacity of an LSTM required to learn such a trivial problem.

This will lay the foundation for the echo of lag observations next.

First, we will develop a function to prepare a random sequence ready to train or evaluate an LSTM. This function must first generate a random sequence of integers, use a one hot encoding, then transform the input data to be a 3-dimensional array.

LSTMs require a 3D input comprised of the dimensions [samples, timesteps, features]. Our problem will be comprised of 25 examples per sequence, 1 time step, and 100 features for the one hot encoding.

This function is listed below, named generate_data().

# generate data for the lstm
def generate_data():
	# generate sequence
	sequence = generate_sequence()
	# one hot encode
	encoded = one_hot_encode(sequence)
	# convert to 3d for input
	X = encoded.reshape(encoded.shape[0], 1, encoded.shape[1])
	return X, encoded
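
The reshape step can be checked in isolation. The sketch below uses an array of zeros as a stand-in for one encoded sequence (an assumption, to avoid repeating the helper functions) and prints the shape before and after.

```python
from numpy import zeros

# stand-in for one encoded sequence: 25 observations, 100 one hot features
encoded = zeros((25, 100), dtype=int)
# insert a time step axis of length 1 between samples and features
X = encoded.reshape(encoded.shape[0], 1, encoded.shape[1])
print(encoded.shape, X.shape)
```

The output array keeps its 2D shape because the Dense output layer produces one 100-element vector per sample.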

Next, we can define our LSTM model.

The model must specify the expected dimensionality of the input data. In this case, in terms of timesteps (1) and features (100). We will use a single hidden layer LSTM with 15 memory units.

The output layer is a fully connected layer (Dense) with 100 neurons for the 100 possible integers that may be output. A softmax activation function is used on the output layer to allow the network to learn and output the distribution over the possible output values.

The network will use the log loss function while training, suitable for multi-class classification problems, and the efficient ADAM optimization algorithm. The accuracy metric will be reported each training epoch to give an idea of the skill of the model in addition to the loss.

# define model
model = Sequential()
model.add(LSTM(15, input_shape=(1, 100)))
model.add(Dense(100, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
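
To give a feel for what the softmax activation and the log loss compute, here is a minimal pure-Python sketch for a single sample. The raw scores are made up and the helper names are hypothetical; Keras performs the equivalent calculation internally on batches.

```python
from math import exp, log

# softmax: exponentiate raw scores and normalize so they sum to 1
def softmax(scores):
    top = max(scores)
    # subtracting the maximum does not change the result but avoids overflow
    exps = [exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# log loss for one sample: negative log of the probability of the true class
def log_loss(probabilities, true_index):
    return -log(probabilities[true_index])

scores = [2.0, 0.5, -1.0]  # made-up raw outputs for 3 classes
probs = softmax(scores)
print(probs, sum(probs))
# the loss is small when the true class has high probability, large otherwise
print(log_loss(probs, 0), log_loss(probs, 2))
```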

We will fit the model manually by running each epoch by hand with a new generated sequence. The model will be fit for 500 epochs, or stated another way, trained on 500 randomly generated sequences.

This will encourage the network to learn to reproduce the actual input rather than memorizing a fixed training dataset.

# fit model
for i in range(500):
	X, y = generate_data()
	model.fit(X, y, epochs=1, batch_size=1, verbose=2)

Once the model is fit, we will make a prediction on a new sequence and compare the predicted output to the expected output.

# evaluate model on new data
X, y = generate_data()
yhat = model.predict(X)
print('Expected:  %s' % one_hot_decode(y))
print('Predicted: %s' % one_hot_decode(yhat))
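
Comparing two printed lists by eye becomes tedious, so a small helper can report the fraction of matching positions instead. The function name sequence_accuracy() is hypothetical and the example lists are made up.

```python
# fraction of positions where the predicted integer matches the expected one
def sequence_accuracy(expected, predicted):
    matches = sum(1 for e, p in zip(expected, predicted) if e == p)
    return matches / len(expected)

expected = [18, 41, 49, 56]
predicted = [18, 41, 7, 56]  # made-up prediction with one error
print(sequence_accuracy(expected, predicted))
```

In the tutorial's code, the two arguments would be one_hot_decode(y) and one_hot_decode(yhat).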

The complete example is listed below.

from random import randint
from numpy import array
from numpy import argmax
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense

# generate a sequence of random numbers in [0, 99]
def generate_sequence(length=25):
	return [randint(0, 99) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique=100):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# generate data for the lstm
def generate_data():
	# generate sequence
	sequence = generate_sequence()
	# one hot encode
	encoded = one_hot_encode(sequence)
	# convert to 3d for input
	X = encoded.reshape(encoded.shape[0], 1, encoded.shape[1])
	return X, encoded

# define model
model = Sequential()
model.add(LSTM(15, input_shape=(1, 100)))
model.add(Dense(100, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
for i in range(500):
	X, y = generate_data()
	model.fit(X, y, epochs=1, batch_size=1, verbose=2)
# evaluate model on new data
X, y = generate_data()
yhat = model.predict(X)
print('Expected:  %s' % one_hot_decode(y))
print('Predicted: %s' % one_hot_decode(yhat))

Running the example prints the log loss and accuracy each epoch.

The network is a little over-prescribed, having more memory units and training epochs than is required for such a simple problem, and you can see this by the fact that the network quickly achieves 100% accuracy.

At the end of the run, the predicted sequence is compared to a randomly generated sequence and the two look identical.

...
0s - loss: 0.0895 - acc: 1.0000
Epoch 1/1
0s - loss: 0.0785 - acc: 1.0000
Epoch 1/1
0s - loss: 0.0789 - acc: 1.0000
Epoch 1/1
0s - loss: 0.0832 - acc: 1.0000
Epoch 1/1
0s - loss: 0.0927 - acc: 1.0000
Expected: [18, 41, 49, 56, 86, 25, 96, 3, 75, 24, 57, 95, 81, 44, 2, 22, 76, 34, 41, 4, 69, 47, 1, 97, 57]
Predicted: [18, 41, 49, 56, 86, 25, 96, 3, 75, 24, 57, 95, 81, 44, 2, 22, 76, 34, 41, 4, 69, 47, 1, 97, 57]

Now that we know how to use the tools to create and represent random sequences and to fit an LSTM to learn to echo the current observation, let’s look at how we can use LSTMs to learn how to echo a past observation.

Echo Lag Observation Without Context (The Beginner’s Mistake)

The problem of predicting a lag observation can be more formally defined as follows:

yhat(t) = f(X(t-n))

Where the expected output for the current time step yhat(t) is defined as a function (f()) of a specific previous observation (X(t-n)).

The promise of LSTMs suggests that you can show examples to the network one at a time and that the network will use internal state to learn and to sufficiently remember prior observations in order to solve this problem.

Let’s try this out.

First, we must update the generate_data() function and re-define the problem.

Rather than using the same sequence for input and output, we will use a shifted version of the encoded sequence as input and a truncated version of the encoded sequence as output.

These changes are required in order to take a sequence of numbers, such as [1, 2, 3, 4], and turn them into a supervised learning problem with input (X) and output (y) components, such as:

X, y
1, NaN
2, 1
3, 2
4, 3
NaN, 4

In this example, you can see that the first and last rows do not contain sufficient data for the network to learn. These could be marked with a “no data” value and masked, but a simpler solution is to remove them from the dataset.
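
The framing above can be sketched on the toy sequence with plain list slicing, before any one hot encoding is involved. Dropping the first value from the input and the last value from the output leaves only the complete rows.

```python
sequence = [1, 2, 3, 4]
# input drops the first value, output drops the last value
X = sequence[1:]
y = sequence[:-1]
# each pair is (current observation, previous observation)
pairs = list(zip(X, y))
print(pairs)
```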

The updated generate_data() function is listed below:

# generate data for the lstm
def generate_data():
	# generate sequence
	sequence = generate_sequence()
	# one hot encode
	encoded = one_hot_encode(sequence)
	# drop first value from X
	X = encoded[1:, :]
	# convert to 3d for input
	X = X.reshape(X.shape[0], 1, X.shape[1])
	# drop last value from y
	y = encoded[:-1, :]
	return X, y

We must test out this updated representation of the data to confirm it does what we expect. To do this, we can generate a sequence and review the decoded X and y values over the sequence.

X, y = generate_data()
for i in range(len(X)):
	a, b = argmax(X[i,0]), argmax(y[i])
	print(a, b)

The complete code listing for this sanity check is provided below.

from random import randint
from numpy import array
from numpy import argmax
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense

# generate a sequence of random numbers in [0, 99]
def generate_sequence(length=25):
	return [randint(0, 99) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique=100):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# generate data for the lstm
def generate_data():
	# generate sequence
	sequence = generate_sequence()
	# one hot encode
	encoded = one_hot_encode(sequence)
	# drop first value from X
	X = encoded[1:, :]
	# convert to 3d for input
	X = X.reshape(X.shape[0], 1, X.shape[1])
	# drop last value from y
	y = encoded[:-1, :]
	return X, y

# test data generator
X, y = generate_data()
for i in range(len(X)):
	a, b = argmax(X[i,0]), argmax(y[i])
	print(a, b)

Running the example prints the X and y components of the problem framing.

We can see that the first pattern will be hard (impossible) for the network to learn given the cold start. We can also see the expected pattern of yhat(t) == X(t-1) down through the rest of the data.

78 65
7 78
16 7
11 16
23 11
99 23
39 99
53 39
82 53
6 82
18 6
17 18
49 17
4 49
34 4
77 34
46 77
22 46
40 22
76 40
85 76
87 85
17 87
75 17

The network design is similar, but with one small change.

Observations are shown to the network one at a time and a weight update is performed. Because we expect the state between observations to carry the information required to learn the prior observation, we need to ensure that this state is not reset after each batch (in this case, one batch is one training observation). We can do this by making the LSTM layer stateful and manually managing when the state is reset.

This involves setting the stateful argument to True on the LSTM layer and defining the input shape using the batch_input_shape argument that includes the dimensions [batchsize, timesteps, features].

There are 24 X,y pairs for a given random sequence, therefore a batch size of 6 was used (4 batches of 6 samples = 24 samples). Remember, a sequence is broken down into samples, and samples can be shown to the network in batches before an update to the network weights is performed. A large network size of 50 memory units is used, again to over-prescribe the capacity needed for the problem.

model.add(LSTM(50, batch_input_shape=(6, 1, 100), stateful=True))
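
It can help to see which samples share state under this configuration. Keras documents that, with stateful=True, the final state for the sample at index i in one batch is used as the initial state for the sample at index i in the following batch. The sketch below only does the index arithmetic for 24 unshuffled samples and a batch size of 6; it does not call Keras, and state_chains() is a hypothetical helper name.

```python
# group sample indices by the position they occupy within each batch
def state_chains(n_samples, batch_size):
    # a stateful layer with a fixed batch size expects full batches
    if n_samples % batch_size != 0:
        raise ValueError('n_samples must be divisible by batch_size')
    return [list(range(i, n_samples, batch_size)) for i in range(batch_size)]

chains = state_chains(24, 6)
# each printed list is one chain of samples that pass state along
for chain in chains:
    print(chain)
```

Under this reading, the sample at index 0 passes its state to the sample at index 6 rather than to the sample at index 1.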

Next, after each epoch (one iteration of a randomly generated sequence), the internal state of the network can be manually reset. The model is fit for 2,000 training epochs and care is taken not to shuffle the samples within a sequence.

# fit model
for i in range(2000):
	X, y = generate_data()
	model.fit(X, y, epochs=1, batch_size=6, verbose=2, shuffle=False)
	model.reset_states()

Putting this all together, the complete example is listed below.

from random import randint
from numpy import array
from numpy import argmax
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense

# generate a sequence of random numbers in [0, 99]
def generate_sequence(length=25):
	return [randint(0, 99) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique=100):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# generate data for the lstm
def generate_data():
	# generate sequence
	sequence = generate_sequence()
	# one hot encode
	encoded = one_hot_encode(sequence)
	# drop first value from X
	X = encoded[1:, :]
	# convert to 3d for input
	X = X.reshape(X.shape[0], 1, X.shape[1])
	# drop last value from y
	y = encoded[:-1, :]
	return X, y

# define model
model = Sequential()
model.add(LSTM(50, batch_input_shape=(6, 1, 100), stateful=True))
model.add(Dense(100, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
for i in range(2000):
	X, y = generate_data()
	model.fit(X, y, epochs=1, batch_size=6, verbose=2, shuffle=False)
	model.reset_states()
# evaluate model on new data
X, y = generate_data()
yhat = model.predict(X, batch_size=6)
print('Expected:  %s' % one_hot_decode(y))
print('Predicted: %s' % one_hot_decode(yhat))

Running the example gives a surprising result.

The problem cannot be learned, and training ends with a model that has 0% accuracy at echoing the previous observation.

...
Epoch 1/1
0s - loss: 4.6042 - acc: 0.0417
Epoch 1/1
0s - loss: 4.6215 - acc: 0.0000e+00
Epoch 1/1
0s - loss: 4.5802 - acc: 0.0000e+00
Epoch 1/1
0s - loss: 4.6023 - acc: 0.0000e+00
Epoch 1/1
0s - loss: 4.6071 - acc: 0.0000e+00
Expected: [71, 44, 6, 11, 91, 23, 55, 37, 53, 4, 42, 15, 81, 6, 57, 97, 49, 69, 56, 86, 70, 12, 61, 48]
Predicted: [49, 49, 49, 87, 49, 96, 96, 96, 96, 96, 85, 96, 96, 96, 96, 96, 96, 96, 49, 49, 87, 96, 49, 49]
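
A useful reference point for the loss values above is the log loss of a model that spreads its probability evenly over all 100 classes. The sketch below computes that value. The reported losses of roughly 4.6 sit very close to it, which suggests the predictions are no better than chance.

```python
from math import log

n_classes = 100
# log loss when every class gets equal probability 1/100
chance_loss = -log(1.0 / n_classes)
print(round(chance_loss, 4))  # 4.6052
```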

How can this be?

The Beginner’s Mistake

This is a common mistake made by beginners, and if you have been around the block with RNNs or LSTMs, then you would have spotted this error above.

Specifically, the power of LSTMs does come from the learned internal state maintained, but this state is only powerful if it is trained as a function over past observations.

Stated another way, you must provide the network the context for the prediction (e.g. the observations that may contain the temporal dependence) as time steps on the input.

The above formulation trained the network to learn the output as a function only of the current input value, as in the first example:

yhat(t) = f(X(t))

Not as a function of the last n observations, or even just the previous observation, as we require:

yhat(t) = f(X(t-1))

The LSTM does only need one input at a time in order to learn this unknown temporal dependence, but it must perform backpropagation over the sequence in order to learn this dependence. You must provide the past observations of the sequence as context.

You are not defining a window (as in the case of Multilayer Perceptron where each past observation is a weighted input); instead, you are defining an extent of historical observations from which the LSTM will attempt to learn the temporal dependence (f(X(t-1), … X(t-n))).

To be clear, this is the beginner’s mistake when using LSTMs in Keras, and not necessarily in general.

Echo Lag Observation

Now that we have navigated around a common pitfall for beginners, we can develop an LSTM to echo the previous observation.

The first step is to reformulate the definition of the problem, again.

We know that the network only requires the last observation as input in order to make correct predictions. But we want the network to learn which of the past observations to echo in order to correctly solve this problem. Therefore, we will provide a subsequence of the last 5 observations as context.

Specifically, if our sequence contains: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], the X,y pairs would look as follows:

X, 							y
NaN, NaN, NaN, NaN, NaN, 	NaN
NaN, NaN, NaN, NaN, 1, 		NaN
NaN, NaN, NaN, 1, 2, 		1
NaN, NaN, 1, 2, 3, 			2
NaN, 1, 2, 3, 4, 			3
1, 2, 3, 4, 5, 				4
2, 3, 4, 5, 6, 				5
3, 4, 5, 6, 7, 				6
4, 5, 6, 7, 8, 				7
5, 6, 7, 8, 9, 				8
6, 7, 8, 9, 10, 			9
7, 8, 9, 10, NaN, 			10

You can see that the first 5 rows and the last row do not contain enough data, so we will remove them.

We will use the Pandas shift() function to create shifted versions of the sequence and the Pandas concat() function to recombine the shifted sequences back together. We will then manually exclude the rows that are not viable.

The updated generate_data() function is listed below.

# generate data for the lstm
def generate_data():
	# generate sequence
	sequence = generate_sequence()
	# one hot encode
	encoded = one_hot_encode(sequence)
	# create lag inputs
	df = DataFrame(encoded)
	df = concat([df.shift(4), df.shift(3), df.shift(2), df.shift(1), df], axis=1)
	# remove non-viable rows
	values = df.values
	values = values[5:,:]
	# convert to 3d for input
	X = values.reshape(len(values), 5, 100)
	# drop last value from y
	y = encoded[4:-1,:]
	print(X.shape, y.shape)
	return X, y
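
The same windowing can be sketched on the toy sequence with plain list slicing, without Pandas or one hot encoding. This is an illustrative equivalent rather than the tutorial's implementation. Like generate_data() above, it starts from the window ending at index 5, so the first complete window [1, 2, 3, 4, 5] is skipped.

```python
sequence = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
n_steps = 5

# windows ending at index 5 onward, mirroring values[5:] in generate_data()
X = [sequence[i - n_steps + 1:i + 1] for i in range(n_steps, len(sequence))]
# targets mirror encoded[4:-1]: the observation one step before each window's end
y = sequence[n_steps - 1:-1]
for window, target in zip(X, y):
    print(window, target)
```

Each target is the second-to-last value of its window, so the context needed to echo the previous observation is present in every sample.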

Again, we can sanity check this updated function by generating a sequence and comparing the decoded X,y pairs. The complete example is listed below.

from random import randint
from numpy import array
from numpy import argmax
from pandas import concat
from pandas import DataFrame

# generate a sequence of random numbers in [0, 99]
def generate_sequence(length=25):
	return [randint(0, 99) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique=100):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded sequence
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# generate data for the lstm
def generate_data():
	# generate sequence
	sequence = generate_sequence()
	# one hot encode
	encoded = one_hot_encode(sequence)
	# create lag inputs
	df = DataFrame(encoded)
	df = concat([df.shift(4), df.shift(3), df.shift(2), df.shift(1), df], axis=1)
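	# drop the first 5 rows (rows 0-3 contain NaN from the shifts); 20 rows remain for a length of 25
	values = df.values
	values = values[5:,:]
	# convert to 3d for input: [samples, timesteps, features]
	X = values.reshape(len(values), 5, 100)
	# targets are the t-1 observations: drop the first 4 values and the last value
	y = encoded[4:-1,:]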
	return X, y

# test data generator
X, y = generate_data()
for i in range(len(X)):
	a, b, c, d, e, f = argmax(X[i,0]), argmax(X[i,1]), argmax(X[i,2]), argmax(X[i,3]), argmax(X[i,4]), argmax(y[i])
	print(a, b, c, d, e, f)

Running the example prints one row per sample: the 5 most recent values as input, followed by the prior observation (X(t-1)) as output. Your specific values will differ because the sequence is random.

57 96 99 77 44 77
96 99 77 44 45 44
99 77 44 45 28 45
77 44 45 28 70 28
44 45 28 70 73 70
45 28 70 73 74 73
28 70 73 74 73 74
70 73 74 73 64 73
73 74 73 64 29 64
74 73 64 29 15 29
73 64 29 15 94 15
64 29 15 94 98 94
29 15 94 98 89 98
15 94 98 89 52 89
94 98 89 52 96 52
98 89 52 96 46 96
89 52 96 46 46 46
52 96 46 46 85 46
96 46 46 85 49 85
46 46 85 49 59 49
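
The relationship between input and output can also be checked in code rather than by eye. The sketch below copies the first five rows of the output above into a list (the variable names are illustrative) and checks that the output always equals the fourth input value, and that each input window is the previous one shifted by a single step.

```python
# first five rows of the printed output: five inputs followed by the output
rows = [
    (57, 96, 99, 77, 44, 77),
    (96, 99, 77, 44, 45, 44),
    (99, 77, 44, 45, 28, 45),
    (77, 44, 45, 28, 70, 28),
    (44, 45, 28, 70, 73, 70),
]

# the output is the t-1 observation: the fourth of the five inputs
echo_ok = all(row[5] == row[3] for row in rows)

# consecutive windows overlap by four values
slide_ok = all(nxt[0:4] == cur[1:5] for cur, nxt in zip(rows, rows[1:]))

print(echo_ok, slide_ok)
```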

We can now develop an LSTM to learn this problem.

There are 20 X,y pairs for a given sequence; therefore, a batch size of 5 was chosen (4 batches of 5 examples = 20 samples).
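
The sample and batch counts follow directly from the slicing in generate_data(). The short sketch below works through that arithmetic for the default sequence length of 25; the variable names are illustrative only.

```python
sequence_length = 25   # default length in generate_sequence()
n_timesteps = 5        # observations per input window
n_features = 100       # width of each one hot vector
batch_size = 5

# values[5:, :] drops the first 5 rows of the lagged frame
n_samples = sequence_length - 5
# encoded[4:-1, :] drops the first 4 values and the last value
n_targets = sequence_length - 4 - 1

n_batches, leftover = divmod(n_samples, batch_size)

print((n_samples, n_timesteps, n_features))  # shape of X
print((n_targets, n_features))               # shape of y
print(n_batches, leftover)
```

Because the number of samples divides evenly by the batch size, every batch passed to the network is full.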

The same structure was used with 50 memory units in the LSTM hidden layer and 100 neurons in the output layer. The network was fit for 2,000 epochs with internal state reset after each epoch.

The complete code listing is provided below.

from random import randint
from numpy import array
from numpy import argmax
from pandas import concat
from pandas import DataFrame
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense

# generate a sequence of random numbers in [0, 99]
def generate_sequence(length=25):
	return [randint(0, 99) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique=100):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded sequence
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# generate data for the lstm
def generate_data():
	# generate sequence
	sequence = generate_sequence()
	# one hot encode
	encoded = one_hot_encode(sequence)
	# create lag inputs
	df = DataFrame(encoded)
	df = concat([df.shift(4), df.shift(3), df.shift(2), df.shift(1), df], axis=1)
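	# drop the first 5 rows (rows 0-3 contain NaN from the shifts); 20 rows remain for a length of 25
	values = df.values
	values = values[5:,:]
	# convert to 3d for input: [samples, timesteps, features]
	X = values.reshape(len(values), 5, 100)
	# targets are the t-1 observations: drop the first 4 values and the last value
	y = encoded[4:-1,:]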
	return X, y

# define model
model = Sequential()
model.add(LSTM(50, batch_input_shape=(5, 5, 100), stateful=True))
model.add(Dense(100, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
for i in range(2000):
	X, y = generate_data()
	model.fit(X, y, epochs=1, batch_size=5, verbose=2, shuffle=False)
	model.reset_states()
# evaluate model on new data
X, y = generate_data()
yhat = model.predict(X, batch_size=5)
print('Expected:  %s' % one_hot_decode(y))
print('Predicted: %s' % one_hot_decode(yhat))

Running the example shows that the network can fit the problem and correctly learns to return the X(t-1) observation as its prediction, given the context of the 5 most recent observations.

Example output is provided below.

...
Epoch 1/1
0s - loss: 0.1763 - acc: 1.0000
Epoch 1/1
0s - loss: 0.2393 - acc: 0.9500
Epoch 1/1
0s - loss: 0.1674 - acc: 1.0000
Epoch 1/1
0s - loss: 0.1256 - acc: 1.0000
Epoch 1/1
0s - loss: 0.1539 - acc: 1.0000
Expected: [24, 49, 86, 73, 51, 6, 6, 52, 34, 32, 0, 14, 83, 16, 37, 75, 41, 40, 80, 33]
Predicted: [24, 49, 86, 73, 51, 6, 6, 52, 34, 32, 0, 14, 83, 16, 37, 75, 41, 40, 80, 33]
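
The accuracy figures in the log can be reproduced by comparing the decoded lists directly. This is a small sketch in plain Python using the lists printed above; the helper name sequence_accuracy is hypothetical and not part of the tutorial code.

```python
def sequence_accuracy(expected, predicted):
    """Fraction of positions where the prediction matches the expected value."""
    matches = sum(1 for e, p in zip(expected, predicted) if e == p)
    return matches / float(len(expected))

expected = [24, 49, 86, 73, 51, 6, 6, 52, 34, 32, 0, 14, 83, 16, 37, 75, 41, 40, 80, 33]
predicted = [24, 49, 86, 73, 51, 6, 6, 52, 34, 32, 0, 14, 83, 16, 37, 75, 41, 40, 80, 33]
print(sequence_accuracy(expected, predicted))

# a single miss out of 20 samples gives the 0.95 seen in one of the epochs
one_miss = list(predicted)
one_miss[0] = (one_miss[0] + 1) % 100
print(sequence_accuracy(expected, one_miss))
```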

Extensions

This section lists some extensions to the experiments in this tutorial.

Ignore Internal State. Care was taken to preserve internal state of the LSTMs across samples within a sequence by manually resetting state at the end of the epoch. We know that the network already has all the context and state required within each sample via timesteps. Explore whether the additional cross-sample state adds any benefit to the skill of the model.
Mask Missing Data. During data preparation, rows with missing data were removed. Explore marking missing values with a special value (e.g. -1) and see whether the LSTM can learn from these examples. Also explore using a Masking layer on the input to mask out missing values.
Entire Sequence as Timesteps. Only the last 5 observations were provided as context from which to learn to echo. Explore using the entire random sequence as context for each sample, built up as the sequence unfolds. This may require padding and even masking of missing values to meet the expectation of fixed-sized inputs to the LSTM.
Echo Different Lag Value. A specific lag value (t-1) was used in the echo example. Explore using a different lag value in the echo and how this affects properties such as model skill, training time, and LSTM layer size. I would expect that each lag could be learned using the same model structure.
Echo Lag Sequence. The network was trained to echo a specific lag observation. Explore variations where a lag sequence is echoed. This may require the use of the TimeDistributed layer on the output of the network to achieve sequence to sequence prediction.
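
As a starting point for the lag-related extensions, the sketch below builds X,y pairs for an arbitrary lag using plain Python lists of integers. The function name make_lag_pairs and its parameters are hypothetical, and it keeps every complete window, so a sequence of 25 values yields 21 pairs rather than the 20 produced by generate_data(). One hot encoding and reshaping would still be needed before fitting an LSTM.

```python
def make_lag_pairs(sequence, n_timesteps=5, lag=1):
    """Return (X, y) where each X is a window and y is the value `lag` steps back."""
    if not 0 <= lag < n_timesteps:
        raise ValueError('lag must fall inside the window')
    X, y = [], []
    for end in range(n_timesteps, len(sequence) + 1):
        window = sequence[end - n_timesteps:end]
        X.append(window)
        # window[-1] is time t, so window[-1 - lag] is time t-lag
        y.append(window[-1 - lag])
    return X, y

sequence = list(range(1, 11))
X, y = make_lag_pairs(sequence, lag=1)
print(X[0], y[0])

X3, y3 = make_lag_pairs(sequence, lag=3)
print(X3[0], y3[0])
```

A lag of 0 reduces to echoing the current observation, while lags closer to the window size ask the network to hold a value for more time steps.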

Did you explore any of these extensions?
Share your findings in the comments below.

Summary

In this tutorial, you discovered how to develop an LSTM to address the problem of echoing a lag observation from a random sequence of integers.

Specifically, you learned:

How to generate and encode test data for the problem.
How to avoid the beginner’s mistake when attempting to address this and similar problems with LSTMs.
How to develop a robust LSTM to echo integers in an ad hoc sequence of random integers.

Do you have any questions?
Ask your questions in the comments and I will do my best to answer.

42 Responses to How to Learn to Echo Random Integers with LSTMs in Keras

Shirish Ranade June 9, 2017 at 12:22 pm #

Wow,

That is neat. 100% accuracy !!

Will this work with binomial classification problem as well?

Reply
- Jason Brownlee June 10, 2017 at 8:12 am #
  
  This is a special case of a well defined small problem.
  
  Reply
MOHD SAIFUL BAHRI IBRAHIM October 4, 2017 at 12:18 pm #

Very good …thank u

Reply
- Jason Brownlee October 4, 2017 at 3:37 pm #
  
  Thanks.
  
  Reply
Scott November 7, 2017 at 8:59 am #

Jason, would you clarify the remark, “You are not defining a window (as in the case of Multilayer Perceptron where each past observation is a weighted input); instead, you are defining an extent of historical observations”? It seems that defining a window of weighted inputs is exactly what we are doing: We are feeding in a series of windows, and the LSTM is learning to pick out one element from that window.

Reply
- Jason Brownlee November 7, 2017 at 9:56 am #
  
  Not quite, the LSTM processes time steps one at a time, not a weighting of all lag obs as in an MLP with a window.
  
  See this post:
  https://machinelearningmastery.com/gentle-introduction-backpropagation-time/
  
  Reply
Aditya February 22, 2018 at 8:39 pm #

Hi Jason

The post is really insightful, especially the part which covers the Beginner’s mistake.
I have one lingering question though. In the last section, where we are trying to echo the lag observation, how does an LSTM model provide an advantage over the Multilayer Perceptron?
We could have considered the time steps to be input features for an MLP and then trained it to learn the correct weights for those inputs and get the correct predicted output.

Basically, I want to understand the case where using an LSTM based RNN will be advantageous since an MLP model would be unsuccessful. The echo lag example doesn’t really seem to show the advantage of an LSTM model to me.

Reply
- Jason Brownlee February 23, 2018 at 11:55 am #
  
  The MLP must be exposed to all time steps as features at once, whereas the LSTM sees only one time step at a time and accumulates state over time steps in order to create an output.
  
  That is the key difference.
  
  Reply
  - Aditya February 24, 2018 at 7:05 pm #
    
    Thanks Jason, I understand the difference in the implementation. However, since the echo lag problem could also be solved by using an MLP model with all time steps as features, I wanted to understand what advantage does an LSTM offer over it?
    Specifically, are there any sequences which can not be learnt with an MLP model (by using time steps as input features)?
    
    Reply
    - Jason Brownlee February 25, 2018 at 7:44 am #
      
      The fact that each input time step is provided one at a time and the LSTM must “remember” what to echo using internal state demonstrates the difference. The MLP MUST have all time steps provided as input at once.
      
      Perhaps this post on BPTT will clear things up further for you:
      https://machinelearningmastery.com/gentle-introduction-backpropagation-time/
      
      Reply
anurag October 30, 2018 at 12:48 am #

awesome article.. thnx

Reply
- Jason Brownlee October 30, 2018 at 6:03 am #
  
  I’m happy it helped.
  
  Reply
Scholes December 8, 2018 at 3:14 am #

Very helpful article, thank you sir for your efforts.

I’m a bit new to LSTMs but I more or less understand how they work. However, I have a problem with my interpretation:

– I’d like to predict the next random output from “yhat”, so I wrote a small loop that predicts and appends to the test list each time (so that it predicts more future outputs). In other words, each “yhat” list becomes the “X” list in the next iteration, which I thought meant it would predict the next output.

But my problem is that it stays stuck on the same X list that I gave at the beginning:

The list of predicted numbers gets smaller even though I append each time, so I can’t get future predictions.

Would you like me to show that piece of code and its output? It would be very helpful for me.

Thank you again

Reply
- Jason Brownlee December 8, 2018 at 7:12 am #
  
  Nice work!
  
  Reply
  - Scholes December 11, 2018 at 1:40 am #
    
    Good morning, I guess you didn’t understand me, sorry, I don’t explain well.
    
    In every iteration each X_list is the previously predicted yhat list, meaning:
    
    X(n iteration) = yhat(n-1). Here’s the execution so you can get an idea of my problem:
    
    (‘Interation = ‘, 6)
    X file: [20, 38, 19, 48, 28, 43, 50, 14, 25, 28, 20, 19, 44, 16, 9]
    Predicted: [48, 28, 43, 50, 14, 25, 28, 20, 19, 44, 16, 9, 25, 13, 15]
    (‘Interation = ‘, 7)
    X file: [28, 43, 50, 14, 25, 28, 20, 19, 44, 16]
    Predicted: [14, 25, 28, 20, 19, 44, 16, 9, 25, 13]
    (‘Interation = ‘, 8)
    X file: [25, 28, 20, 19, 44]
    Predicted: [19, 44, 16, 9, 25]
    (‘Interation = ‘, 9)
    X file: []
    Predicted: []
    
    My idea was to keep predicting future outputs each time we give it a predicted number, but unfortunately the list just keeps shrinking in a vicious cycle. Could you please explain this for my educational project? Here is the piece of code I wrote:
    
    for i in range(1, 10):  # iterate 9 times (i = 1..9)
        # use the last predicted list as input for the next prediction
        yhat_list = one_hot_decode(yhat)
        encoded = one_hot_encode(yhat_list)
        # create lag inputs
        df = DataFrame(encoded)
        df = concat([df.shift(4), df.shift(3), df.shift(2), df.shift(1), df], axis=1)
        # remove non-viable rows
        values = df.values
        values = values[5:,:]
        # convert to 3d for input
        X = values.reshape(len(values), 5, 100)
        yhat = model.predict(X, batch_size=5)
        print("Interation = ", i)
        print('X file: %s' % one_hot_decode(X))
        print('Predicted: %s' % one_hot_decode(yhat))
    
    Thank you so much for your explanations.
    
    Reply
    - Jason Brownlee December 11, 2018 at 7:48 am #
      
      I don’t follow. What is the problem you’re having exactly?
      
      Reply
      - Scholes December 13, 2018 at 12:49 am #
        
        I want to continue predicting the next 100 outputs of “yhat”, so I pass it as the X list in each iteration. I thought it would predict new outputs each time.
        
        But unfortunately the loop never progresses; the list keeps getting smaller and it keeps predicting already known numbers (as I showed you in the previous example).
        
        for i in range(1, 100):  # iterate 99 times (i = 1..99)
            # use the last predicted list as input for the next prediction
            yhat_list = one_hot_decode(yhat)
            encoded = one_hot_encode(yhat_list)
            # create lag inputs
            df = DataFrame(encoded)
            df = concat([df.shift(4), df.shift(3), df.shift(2), df.shift(1), df], axis=1)
            # remove non-viable rows
            values = df.values
            values = values[5:,:]
            # convert to 3d for input
            X = values.reshape(len(values), 5, 100)
            yhat = model.predict(X, batch_size=5)
            print("Interation = ", i)
            print('X file: %s' % one_hot_decode(X))
            print('Predicted: %s' % one_hot_decode(yhat))
      - Jason Brownlee December 13, 2018 at 7:54 am #
        
        I have a number of posts on multi-step predicting with LSTMs, perhaps start here:
        https://machinelearningmastery.com/start-here/#deep_learning_time_series
Scholes December 13, 2018 at 11:33 pm #

ok thank you jason i’ll check that out , best regards

Reply
brandy January 31, 2019 at 8:19 am #

Good afternoon Jason, I want to replace the random generator with my own dataset, iterate through it, and get the most common numbers. It’s in a CSV file; how would I go about doing that in the code? Thanks in advance.

Reply
- Jason Brownlee January 31, 2019 at 2:15 pm #
  
  I cannot write a custom example for you. What problem are you having exactly?
  
  Reply
jmaidagan April 2, 2019 at 8:46 pm #

Your blog is very interesting; I learned a lot reading it.

I think the solution to the interesting problem you have raised has nothing to do with the LSTM’s long memory. The result you are looking for is already supplied in axis 2 of the input. You can verify this by setting ‘stateful = False’ on line 47 of your code: convergence to acc = 100% is even faster!
On the other hand, what you call “a common pitfall for beginners” is exactly the approach advised in https://keras.io/examples/lstm_stateful/. The calculation (tsteps = 2, lahead = 1) shows that, although the problem is not solved, there is an appreciable difference between stateful = True and stateful = False.
From my point of view, the crucial question is:
Is it possible to train an LSTM so that, in production, it can produce the echo while receiving only a random sequence, one number at a time?
This problem would illuminate the advantage of the LSTM over other networks, since it is impossible to solve without memory.
Regards

Reply
- Jason Brownlee April 3, 2019 at 6:42 am #
  
  I believe it could, yes.
  
  See this post:
  https://machinelearningmastery.com/memory-in-a-long-short-term-memory-network/
  
  Reply
Christophe May 8, 2019 at 3:19 pm #

Hi Jason – May I ask why you used a custom-coded one-hot encoding rather than the keras function to_categorical?

Also what would be the implementation if the n_unique value is extremely large?

Thanks.

Reply
- Jason Brownlee May 9, 2019 at 6:35 am #
  
  I don’t recall why, sorry.
  
  How large? It is common to one hot encode on NLP problems up to tens of thousands of tokens, or more.
  
  Also, embeddings work amazingly well for large cardinality categorical variables.
  
  Reply
aswin June 20, 2020 at 5:17 am #

how to predict next number using this program?

Reply
- Jason Brownlee June 20, 2020 at 6:19 am #
  
  Call: model.predict()
  
  Perhaps this will help:
  https://machinelearningmastery.com/how-to-make-classification-and-regression-predictions-for-deep-learning-models-in-keras/
  
  Reply
- Shubham Chauhan July 2, 2020 at 10:50 pm #
  
  Thanks for the wonderful code, Sir, but I am not able to predict the next number. This is my code, based on your article above:
  
  Xnew = array([[1, 36, 0, 36, 21, 2]])
  ynew = model.predict(Xnew)
  
  print("X=%s, Predicted=%s" % (Xnew[0], ynew[0]))
  
  It then shows:
  
  ValueError: Error when checking input: expected lstm_26_input to have 3 dimensions, but got array with shape (1, 6)
  
  After reshaping it into 3D, it shows another error:
  
  Xnew = array([[1, 36, 0, 36, 21, 2]])
  
  Xnew = Xnew.reshape(Xnew.shape[0], Xnew.shape[1], 1)
  
  ynew = model.predict(Xnew)
  
  print("X=%s, Predicted=%s" % (Xnew[0], ynew[0]))
  
  It then shows:
  
  ValueError: Error when checking input: expected lstm_26_input to have shape (5, 340) but got array with shape (6, 1)
  
  Could you please write the code in a reply? Many people are facing this problem.
  
  Reply
  - Jason Brownlee July 3, 2020 at 6:16 am #
    
    Sorry to hear that, this may help:
    https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
    
    Reply
Shubham Chauhan July 3, 2020 at 12:40 pm #

Sir, your code is working perfectly, but I am not able to predict the next number when I supply my own X using model.predict(). Suppose I give 34; what will the next number be? Sir, please, it is only two lines of code; please write it in the comment section.

Reply
- Jason Brownlee July 3, 2020 at 2:24 pm #
  
  Perhaps start with the working code and adapt it for your required change.
  
  Reply
João Victor November 16, 2020 at 12:33 am #

But, how can I predict the future random numbers? That’s not clear to me. Because we already put the random numbers as input, so how to predict the next one?

Reply
- Jason Brownlee November 16, 2020 at 6:28 am #
  
  We are not predicting future random numbers in this tutorial, we are learning how to use an LSTM to echo input.
  
  For prediction, perhaps start here:
  https://machinelearningmastery.com/start-here/#deep_learning_time_series
  
  Reply
Konradino December 14, 2020 at 2:37 am #

If I want to limit the number of generated numbers from 25 to 5, or even to 6 – what other factors should be modified? Changing only the number of elements in the generated sequence produces an error. BTW: going to buy your e-book, thanks!

Reply
- Jason Brownlee December 14, 2020 at 6:20 am #
  
  Yes, change the definition of the problem, and the encoding.
  
  You might need to tune the model and learning hyperparameters for the change in difficulty of the problem.
  
  Reply
Darius.Nguyen March 8, 2021 at 6:19 pm #

First of all, your post taught me a lot. But I have a question. When I test your model with [a,b,c,d,e], I replace e with many different numbers while keeping a,b,c,d fixed, and your model still predicts the correct number d. I thought that when we change one of the numbers, the predicted d should change too. Is that right?

Reply
- Jason Brownlee March 9, 2021 at 5:16 am #
  
  Perhaps the model is overfit.
  
  Reply
Marcos Berti October 1, 2022 at 4:38 am #

Dear Jason,

I have already bought two of your e-books and learned a lot; both were on sequence and LSTM applications. Following your emails, I found this Echo Lag Observation tutorial fantastic. I’ve executed the Python code and it is really interesting: at around 500 epochs the accuracy reaches 1. But I need your help, because I tried to predict the next value of the sequence and I really don’t have the skills to do it. The program predicts exactly what is presented as the X input. I need to predict the next number in that sequence, and I really don’t know how to do it. Please, can you send me the code to predict the next value in the sequence?

The last statements are:

yhat = model.predict(X, batch_size=5)
print('Expected: %s' % one_hot_decode(y))
print('Predicted: %s' % one_hot_decode(yhat))

Best regards,

Marcos

Reply
- James Carmichael October 1, 2022 at 6:58 am #
  
  Hi Marcos…You may want to investigate sequence to sequence prediction for this purpose:
  
  https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/
  
  Reply
Marcos Berti October 1, 2022 at 11:57 pm #

Thank you James.
I want to use the Echo Random approach described in this post. The only thing I don’t know how to do is, given a sequence of numbers, predict the next number of that sequence. The code shown by Jason predicts exactly what was presented to the algorithm. What I need to know is the code to predict the next number of the sequence. In all the examples I see, the algorithm predicts exactly the same sequence. I just need to get the next number of the sequence, and I don’t know how.

Reply
Tariq A April 5, 2024 at 12:47 pm #

Thank you for the blog! Indeed, it works great for any length of random integers.

I wanted to ask: if I use the conventional method and apply an LSTM, I split the dataset into training and testing sets, and after prediction I can validate the training and also predict the next values in the sequence.

In this case of random integers, if I input a certain random sequence to this model, how do I predict the next values in the sequence?

Reply
- James Carmichael April 7, 2024 at 7:16 am #
  
  Hi Tariq…Predicting the next values in a sequence of random integers using an LSTM (Long Short-Term Memory) model is an intriguing task because random sequences by definition do not have predictable patterns or dependencies that traditional time series or sequence prediction models exploit. However, if the sequence is pseudo-random or contains hidden patterns or dependencies not immediately apparent, LSTM models might be able to capture some of these characteristics.
  
  Here’s a general approach to how you might attempt to predict the next values in a sequence of integers using an LSTM model:
  
  ### 1. **Preparing the Dataset**
  – **Sequence Creation**: Convert the sequence of random integers into a supervised learning problem. This typically involves creating input-output pairs where the inputs are sequences of integers and the outputs are the next integer(s) in the sequence. For example, from a sequence \([x_1, x_2, x_3, x_4, …]\), you can create input-output pairs like \(([x_1, x_2, x_3], x_4)\).
  – **Normalization**: Depending on the range of integers, you might need to normalize or scale your data to help the LSTM model perform better.
  
  ### 2. **Model Design**
  – **Input Layer**: Design your LSTM with an input layer that matches the dimension of your data. For a sequence input, this is typically the sequence length and number of features (e.g., \((sequence\_length, 1)\) for a single-feature sequence).
  – **LSTM Layers**: Add one or more LSTM layers. The complexity of the model can be adjusted depending on the dataset size and the computational resources available.
  – **Output Layer**: Since you are predicting integers, the output layer could be a dense layer with a linear activation function (if predicting the next integer as a regression problem) or a softmax activation layer (if classifying into categories).
  
  ### 3. **Training the Model**
  – **Loss Function**: Use MSE (Mean Squared Error) for regression problems or cross-entropy for classification problems.
  – **Optimizer**: Common choices include Adam or SGD (Stochastic Gradient Descent).
  – **Epochs and Batches**: Choose appropriate values based on your dataset size and overfitting behavior.
  
  ### 4. **Prediction Phase**
  – **Using Last Known Values**: To predict the next integer(s) in the sequence, input the last known values into the model. For instance, if your last input sequence during training was \([x_{n-2}, x_{n-1}, x_n]\), to predict \(x_{n+1}\), you feed \([x_{n-2}, x_{n-1}, x_n]\) into the model.
  – **Sequential Prediction**: If you want to predict several future steps, you can use the predictions as new inputs. For example, predict \(x_{n+1}\), then use \([x_{n-1}, x_n, x_{n+1}]\) to predict \(x_{n+2}\), and so forth.
  
  ### 5. **Evaluation**
  – Evaluate the model’s performance using appropriate metrics (e.g., RMSE for regression).
  – Perform diagnostics to check if the model is merely memorizing the training data or actually capturing useful patterns.
  
  ### 6. **Challenges with Random Data**
  – **True Randomness**: If the data is truly random, you will likely find that the LSTM model does not perform well in predicting future values since there are no patterns to learn.
  – **Pseudo-random Patterns**: If there are underlying patterns or the sequence is generated through deterministic pseudo-random algorithms, the LSTM might capture some of these.
  
  Using an LSTM to predict future values in a sequence of random integers is more of an experimental than a practical approach, given the nature of randomness. If you’re exploring this as a theoretical exercise or for learning purposes, it can provide valuable insights into sequence modeling and LSTM capabilities.
  
  Reply
