How to Develop an Encoder-Decoder Model for Sequence-to-Sequence Prediction in Keras

By Jason Brownlee on August 27, 2020 in Long Short-Term Memory Networks 390

The encoder-decoder model provides a pattern for using recurrent neural networks to address challenging sequence-to-sequence prediction problems such as machine translation.

Encoder-decoder models can be developed in the Keras Python deep learning library and an example of a neural machine translation system developed with this model has been described on the Keras blog, with sample code distributed with the Keras project.

This example can provide the basis for developing encoder-decoder LSTM models for your own sequence-to-sequence prediction problems.

In this tutorial, you will discover how to develop a sophisticated encoder-decoder recurrent neural network for sequence-to-sequence prediction problems with Keras.

After completing this tutorial, you will know:

How to correctly define a sophisticated encoder-decoder model in Keras for sequence-to-sequence prediction.
How to define a contrived yet scalable sequence-to-sequence prediction problem that you can use to evaluate the encoder-decoder LSTM model.
How to apply the encoder-decoder LSTM model in Keras to address the scalable integer sequence-to-sequence prediction problem.

Kick-start your project with my new book Long Short-Term Memory Networks With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Update Jan/2020: Updated API for Keras 2.3 and TensorFlow 2.0.

How to Develop an Encoder-Decoder Model for Sequence-to-Sequence Prediction in Keras
Photo by Björn Groß, some rights reserved.

Tutorial Overview

This tutorial is divided into 3 parts; they are:

Encoder-Decoder Model in Keras
Scalable Sequence-to-Sequence Problem
Encoder-Decoder LSTM for Sequence Prediction

Python Environment

This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this tutorial.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this post:

How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda

Encoder-Decoder Model in Keras

The encoder-decoder model is a way of organizing recurrent neural networks for sequence-to-sequence prediction problems.

It was originally developed for machine translation problems, although it has proven successful at related sequence-to-sequence prediction problems such as text summarization and question answering.

The approach involves two recurrent neural networks, one to encode the source sequence, called the encoder, and a second to decode the encoded source sequence into the target sequence, called the decoder.

The Keras deep learning Python library provides an example of how to implement the encoder-decoder model for machine translation (lstm_seq2seq.py) described by the libraries creator in the post: “A ten-minute introduction to sequence-to-sequence learning in Keras.”

For a detailed breakdown of this model see the post:

How to Define an Encoder-Decoder Sequence-to-Sequence Model for Neural Machine Translation in Keras

For more information on the use of return_state, which might be new to you, see the post:

Understand the Difference Between Return Sequences and Return States for LSTMs in Keras

For more help getting started with the Keras Functional API, see the post:

How to Use the Keras Functional API for Deep Learning

Using the code in that example as a starting point, we can develop a generic function to define an encoder-decoder recurrent neural network. Below is this function named define_models().

# returns train, inference_encoder and inference_decoder models
def define_models(n_input, n_output, n_units):
	# define training encoder
	encoder_inputs = Input(shape=(None, n_input))
	encoder = LSTM(n_units, return_state=True)
	encoder_outputs, state_h, state_c = encoder(encoder_inputs)
	encoder_states = [state_h, state_c]
	# define training decoder
	decoder_inputs = Input(shape=(None, n_output))
	decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)
	decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
	decoder_dense = Dense(n_output, activation='softmax')
	decoder_outputs = decoder_dense(decoder_outputs)
	model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
	# define inference encoder
	encoder_model = Model(encoder_inputs, encoder_states)
	# define inference decoder
	decoder_state_input_h = Input(shape=(n_units,))
	decoder_state_input_c = Input(shape=(n_units,))
	decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
	decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
	decoder_states = [state_h, state_c]
	decoder_outputs = decoder_dense(decoder_outputs)
	decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)
	# return all models
	return model, encoder_model, decoder_model

# returns train, inference_encoder and inference_decoder models

def define_models(n_input, n_output, n_units):

# define training encoder

encoder_inputs = Input(shape=(None, n_input))

encoder = LSTM(n_units, return_state=True)

encoder_outputs, state_h, state_c = encoder(encoder_inputs)

encoder_states = [state_h, state_c]

# define training decoder

decoder_inputs = Input(shape=(None, n_output))

decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)

decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)

decoder_dense = Dense(n_output, activation='softmax')

decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# define inference encoder

encoder_model = Model(encoder_inputs, encoder_states)

# define inference decoder

decoder_state_input_h = Input(shape=(n_units,))

decoder_state_input_c = Input(shape=(n_units,))

decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)

decoder_states = [state_h, state_c]

decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

# return all models

return model, encoder_model, decoder_model

The function takes 3 arguments, as follows:

n_input: The cardinality of the input sequence, e.g. number of features, words, or characters for each time step.
n_output: The cardinality of the output sequence, e.g. number of features, words, or characters for each time step.
n_units: The number of cells to create in the encoder and decoder models, e.g. 128 or 256.

The function then creates and returns 3 models, as follows:

train: Model that can be trained given source, target, and shifted target sequences.
inference_encoder: Encoder model used when making a prediction for a new source sequence.
inference_decoder Decoder model use when making a prediction for a new source sequence.

The model is trained given source and target sequences where the model takes both the source and a shifted version of the target sequence as input and predicts the whole target sequence.

For example, one source sequence may be [1,2,3] and the target sequence [4,5,6]. The inputs and outputs to the model during training would be:

Input1: ['1', '2', '3']
Input2: ['_', '4', '5']
Output: ['4', '5', '6']

Input1: ['1', '2', '3']

Input2: ['_', '4', '5']

Output: ['4', '5', '6']

The model is intended to be called recursively when generating target sequences for new source sequences.

The source sequence is encoded and the target sequence is generated one element at a time, using a “start of sequence” character such as ‘_’ to start the process. Therefore, in the above case, the following input-output pairs would occur during training:

t, 	Input1,				Input2,		Output
1,  ['1', '2', '3'],	'_',		'4'
2,  ['1', '2', '3'],	'4',		'5'
3,  ['1', '2', '3'],	'5',		'6'

t, Input1, Input2, Output

1, ['1', '2', '3'], '_', '4'

2, ['1', '2', '3'], '4', '5'

3, ['1', '2', '3'], '5', '6'

Here you can see how the recursive use of the model can be used to build up output sequences.

During prediction, the inference_encoder model is used to encode the input sequence once which returns states that are used to initialize the inference_decoder model. From that point, the inference_decoder model is used to generate predictions step by step.

The function below named predict_sequence() can be used after the model is trained to generate a target sequence given a source sequence.

# generate target given source sequence
def predict_sequence(infenc, infdec, source, n_steps, cardinality):
	# encode
	state = infenc.predict(source)
	# start of sequence input
	target_seq = array([0.0 for _ in range(cardinality)]).reshape(1, 1, cardinality)
	# collect predictions
	output = list()
	for t in range(n_steps):
		# predict next char
		yhat, h, c = infdec.predict([target_seq] + state)
		# store prediction
		output.append(yhat[0,0,:])
		# update state
		state = [h, c]
		# update target sequence
		target_seq = yhat
	return array(output)

# generate target given source sequence

def predict_sequence(infenc, infdec, source, n_steps, cardinality):

# encode

state = infenc.predict(source)

# start of sequence input

target_seq = array([0.0 for _ in range(cardinality)]).reshape(1, 1, cardinality)

# collect predictions

output = list()

for t in range(n_steps):

# predict next char

yhat, h, c = infdec.predict([target_seq] + state)

# store prediction

output.append(yhat[0,0,:])

# update state

state = [h, c]

# update target sequence

target_seq = yhat

return array(output)

This function takes 5 arguments as follows:

infenc: Encoder model used when making a prediction for a new source sequence.
infdec: Decoder model use when making a prediction for a new source sequence.
source:Encoded source sequence.
n_steps: Number of time steps in the target sequence.
cardinality: The cardinality of the output sequence, e.g. the number of features, words, or characters for each time step.

The function then returns a list containing the target sequence.

Scalable Sequence-to-Sequence Problem

In this section, we define a contrived and scalable sequence-to-sequence prediction problem.

The source sequence is a series of randomly generated integer values, such as [20, 36, 40, 10, 34, 28], and the target sequence is a reversed pre-defined subset of the input sequence, such as the first 3 elements in reverse order [40, 36, 20].

The length of the source sequence is configurable; so is the cardinality of the input and output sequence and the length of the target sequence.

We will use source sequences of 6 elements, a cardinality of 50, and target sequences of 3 elements.

Below are some more examples to make this concrete.

Source,						Target
[13, 28, 18, 7, 9, 5]		[18, 28, 13]
[29, 44, 38, 15, 26, 22]	[38, 44, 29]
[27, 40, 31, 29, 32, 1]		[31, 40, 27]
...

Source, Target

[13, 28, 18, 7, 9, 5] [18, 28, 13]

[29, 44, 38, 15, 26, 22] [38, 44, 29]

[27, 40, 31, 29, 32, 1] [31, 40, 27]

...

You are encouraged to explore larger and more complex variations. Post your findings in the comments below.

Let’s start off by defining a function to generate a sequence of random integers.

We will use the value of 0 as the padding or start of sequence character, therefore it is reserved and we cannot use it in our source sequences. To achieve this, we will add 1 to our configured cardinality to ensure the one-hot encoding is large enough (e.g. a value of 1 maps to a ‘1’ value in index 1).

For example:

n_features = 50 + 1

1	n_features = 50 + 1

We can use the randint() python function to generate random integers in a range between 1 and 1-minus the size of the problem’s cardinality. The generate_sequence() below generates a sequence of random integers.

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(1, n_unique-1) for _ in range(length)]

# generate a sequence of random integers

def generate_sequence(length, n_unique):

return [randint(1, n_unique-1) for _ in range(length)]

Next, we need to create the corresponding output sequence given the source sequence.

To keep thing simple, we will select the first n elements of the source sequence as the target sequence and reverse them.

# define target sequence
target = source[:n_out]
target.reverse()

# define target sequence

target = source[:n_out]

target.reverse()

We also need a version of the output sequence shifted forward by one time step that we can use as the mock target generated so far, including the start of sequence value in the first time step. We can create this from the target sequence directly.

# create padded input target sequence
target_in = [0] + target[:-1]

1 2	# create padded input target sequence target_in = [0] + target[:-1]

Now that all of the sequences have been defined, we can one-hot encode them, i.e. transform them into sequences of binary vectors. We can use the Keras built in to_categorical() function to achieve this.

We can put all of this into a function named get_dataset() that will generate a specific number of sequences that we can use to train a model.

# prepare data for the LSTM
def get_dataset(n_in, n_out, cardinality, n_samples):
	X1, X2, y = list(), list(), list()
	for _ in range(n_samples):
		# generate source sequence
		source = generate_sequence(n_in, cardinality)
		# define target sequence
		target = source[:n_out]
		target.reverse()
		# create padded input target sequence
		target_in = [0] + target[:-1]
		# encode
		src_encoded = to_categorical([source], num_classes=cardinality)
		tar_encoded = to_categorical([target], num_classes=cardinality)
		tar2_encoded = to_categorical([target_in], num_classes=cardinality)
		# store
		X1.append(src_encoded)
		X2.append(tar2_encoded)
		y.append(tar_encoded)
	return array(X1), array(X2), array(y)

# prepare data for the LSTM

def get_dataset(n_in, n_out, cardinality, n_samples):

X1, X2, y = list(), list(), list()

for _ in range(n_samples):

# generate source sequence

source = generate_sequence(n_in, cardinality)

# define target sequence

target = source[:n_out]

target.reverse()

# create padded input target sequence

target_in = [0] + target[:-1]

# encode

src_encoded = to_categorical([source], num_classes=cardinality)

tar_encoded = to_categorical([target], num_classes=cardinality)

tar2_encoded = to_categorical([target_in], num_classes=cardinality)

# store

X1.append(src_encoded)

X2.append(tar2_encoded)

y.append(tar_encoded)

return array(X1), array(X2), array(y)

Finally, we need to be able to decode a one-hot encoded sequence to make it readable again.

This is needed for both printing the generated target sequences but also for easily comparing whether the full predicted target sequence matches the expected target sequence. The one_hot_decode() function will decode an encoded sequence.

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# decode a one hot encoded string

def one_hot_decode(encoded_seq):

return [argmax(vector) for vector in encoded_seq]

We can tie all of this together and test these functions.

A complete worked example is listed below.

from random import randint
from numpy import array
from numpy import argmax
from keras.utils import to_categorical

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(1, n_unique-1) for _ in range(length)]

# prepare data for the LSTM
def get_dataset(n_in, n_out, cardinality, n_samples):
	X1, X2, y = list(), list(), list()
	for _ in range(n_samples):
		# generate source sequence
		source = generate_sequence(n_in, cardinality)
		# define target sequence
		target = source[:n_out]
		target.reverse()
		# create padded input target sequence
		target_in = [0] + target[:-1]
		# encode
		src_encoded = to_categorical([source], num_classes=cardinality)
		tar_encoded = to_categorical([target], num_classes=cardinality)
		tar2_encoded = to_categorical([target_in], num_classes=cardinality)
		# store
		X1.append(src_encoded)
		X2.append(tar2_encoded)
		y.append(tar_encoded)
	return array(X1), array(X2), array(y)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# configure problem
n_features = 50 + 1
n_steps_in = 6
n_steps_out = 3
# generate a single source and target sequence
X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 1)
print(X1.shape, X2.shape, y.shape)
print('X1=%s, X2=%s, y=%s' % (one_hot_decode(X1[0]), one_hot_decode(X2[0]), one_hot_decode(y[0])))

from random import randint

from numpy import array

from numpy import argmax

from keras.utils import to_categorical

# generate a sequence of random integers

def generate_sequence(length, n_unique):

return [randint(1, n_unique-1) for _ in range(length)]

# prepare data for the LSTM

def get_dataset(n_in, n_out, cardinality, n_samples):

X1, X2, y = list(), list(), list()

for _ in range(n_samples):

# generate source sequence

source = generate_sequence(n_in, cardinality)

# define target sequence

target = source[:n_out]

target.reverse()

# create padded input target sequence

target_in = [0] + target[:-1]

# encode

src_encoded = to_categorical([source], num_classes=cardinality)

tar_encoded = to_categorical([target], num_classes=cardinality)

tar2_encoded = to_categorical([target_in], num_classes=cardinality)

# store

X1.append(src_encoded)

X2.append(tar2_encoded)

y.append(tar_encoded)

return array(X1), array(X2), array(y)

# decode a one hot encoded string

def one_hot_decode(encoded_seq):

return [argmax(vector) for vector in encoded_seq]

# configure problem

n_features = 50 + 1

n_steps_in = 6

n_steps_out = 3

# generate a single source and target sequence

X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 1)

print(X1.shape, X2.shape, y.shape)

print('X1=%s, X2=%s, y=%s' % (one_hot_decode(X1[0]), one_hot_decode(X2[0]), one_hot_decode(y[0])))

Running the example first prints the shape of the generated dataset, ensuring the 3D shape required to train the model matches our expectations.

The generated sequence is then decoded and printed to screen demonstrating both that the preparation of source and target sequences matches our intention and that the decode operation is working.

(1, 6, 51) (1, 3, 51) (1, 3, 51)
X1=[32, 16, 12, 34, 25, 24], X2=[0, 12, 16], y=[12, 16, 32]

1 2	(1, 6, 51) (1, 3, 51) (1, 3, 51) X1=[32, 16, 12, 34, 25, 24], X2=[0, 12, 16], y=[12, 16, 32]

We are now ready to develop a model for this sequence-to-sequence prediction problem.

Encoder-Decoder LSTM for Sequence Prediction

In this section, we will apply the encoder-decoder LSTM model developed in the first section to the sequence-to-sequence prediction problem developed in the second section.

The first step is to configure the problem.

# configure problem
n_features = 50 + 1
n_steps_in = 6
n_steps_out = 3

# configure problem

n_features = 50 + 1

n_steps_in = 6

n_steps_out = 3

Next, we must define the models and compile the training model.

# define model
train, infenc, infdec = define_models(n_features, n_features, 128)
train.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# define model

train, infenc, infdec = define_models(n_features, n_features, 128)

train.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Next, we can generate a training dataset of 100,000 examples and train the model.

# generate training dataset
X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 100000)
print(X1.shape,X2.shape,y.shape)
# train model
train.fit([X1, X2], y, epochs=1)

# generate training dataset

X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 100000)

print(X1.shape,X2.shape,y.shape)

# train model

train.fit([X1, X2], y, epochs=1)

Once the model is trained, we can evaluate it. We will do this by making predictions for 100 source sequences and counting the number of target sequences that were predicted correctly. We will use the numpy array_equal() function on the decoded sequences to check for equality.

# evaluate LSTM
total, correct = 100, 0
for _ in range(total):
	X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 1)
	target = predict_sequence(infenc, infdec, X1, n_steps_out, n_features)
	if array_equal(one_hot_decode(y[0]), one_hot_decode(target)):
		correct += 1
print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))

# evaluate LSTM

total, correct = 100, 0

for _ in range(total):

X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 1)

target = predict_sequence(infenc, infdec, X1, n_steps_out, n_features)

if array_equal(one_hot_decode(y[0]), one_hot_decode(target)):

correct += 1

print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))

Finally, we will generate some predictions and print the decoded source, target, and predicted target sequences to get an idea of whether the model is working as expected.

Putting all of these elements together, the complete code example is listed below.

from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from keras.utils import to_categorical
from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from keras.layers import Dense

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(1, n_unique-1) for _ in range(length)]

# prepare data for the LSTM
def get_dataset(n_in, n_out, cardinality, n_samples):
	X1, X2, y = list(), list(), list()
	for _ in range(n_samples):
		# generate source sequence
		source = generate_sequence(n_in, cardinality)
		# define padded target sequence
		target = source[:n_out]
		target.reverse()
		# create padded input target sequence
		target_in = [0] + target[:-1]
		# encode
		src_encoded = to_categorical([source], num_classes=cardinality)
		tar_encoded = to_categorical([target], num_classes=cardinality)
		tar2_encoded = to_categorical([target_in], num_classes=cardinality)
		# store
		X1.append(src_encoded)
		X2.append(tar2_encoded)
		y.append(tar_encoded)
	return array(X1), array(X2), array(y)

# returns train, inference_encoder and inference_decoder models
def define_models(n_input, n_output, n_units):
	# define training encoder
	encoder_inputs = Input(shape=(None, n_input))
	encoder = LSTM(n_units, return_state=True)
	encoder_outputs, state_h, state_c = encoder(encoder_inputs)
	encoder_states = [state_h, state_c]
	# define training decoder
	decoder_inputs = Input(shape=(None, n_output))
	decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)
	decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
	decoder_dense = Dense(n_output, activation='softmax')
	decoder_outputs = decoder_dense(decoder_outputs)
	model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
	# define inference encoder
	encoder_model = Model(encoder_inputs, encoder_states)
	# define inference decoder
	decoder_state_input_h = Input(shape=(n_units,))
	decoder_state_input_c = Input(shape=(n_units,))
	decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
	decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
	decoder_states = [state_h, state_c]
	decoder_outputs = decoder_dense(decoder_outputs)
	decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)
	# return all models
	return model, encoder_model, decoder_model

# generate target given source sequence
def predict_sequence(infenc, infdec, source, n_steps, cardinality):
	# encode
	state = infenc.predict(source)
	# start of sequence input
	target_seq = array([0.0 for _ in range(cardinality)]).reshape(1, 1, cardinality)
	# collect predictions
	output = list()
	for t in range(n_steps):
		# predict next char
		yhat, h, c = infdec.predict([target_seq] + state)
		# store prediction
		output.append(yhat[0,0,:])
		# update state
		state = [h, c]
		# update target sequence
		target_seq = yhat
	return array(output)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# configure problem
n_features = 50 + 1
n_steps_in = 6
n_steps_out = 3
# define model
train, infenc, infdec = define_models(n_features, n_features, 128)
train.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# generate training dataset
X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 100000)
print(X1.shape,X2.shape,y.shape)
# train model
train.fit([X1, X2], y, epochs=1)
# evaluate LSTM
total, correct = 100, 0
for _ in range(total):
	X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 1)
	target = predict_sequence(infenc, infdec, X1, n_steps_out, n_features)
	if array_equal(one_hot_decode(y[0]), one_hot_decode(target)):
		correct += 1
print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))
# spot check some examples
for _ in range(10):
	X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 1)
	target = predict_sequence(infenc, infdec, X1, n_steps_out, n_features)
	print('X=%s y=%s, yhat=%s' % (one_hot_decode(X1[0]), one_hot_decode(y[0]), one_hot_decode(target)))

100

101

102

103

104

105

106

107

108

109

110

from random import randint

from numpy import array

from numpy import argmax

from numpy import array_equal

from keras.utils import to_categorical

from keras.models import Model

from keras.layers import Input

from keras.layers import LSTM

from keras.layers import Dense

# generate a sequence of random integers

def generate_sequence(length, n_unique):

return [randint(1, n_unique-1) for _ in range(length)]

# prepare data for the LSTM

def get_dataset(n_in, n_out, cardinality, n_samples):

X1, X2, y = list(), list(), list()

for _ in range(n_samples):

# generate source sequence

source = generate_sequence(n_in, cardinality)

# define padded target sequence

target = source[:n_out]

target.reverse()

# create padded input target sequence

target_in = [0] + target[:-1]

# encode

src_encoded = to_categorical([source], num_classes=cardinality)

tar_encoded = to_categorical([target], num_classes=cardinality)

tar2_encoded = to_categorical([target_in], num_classes=cardinality)

# store

X1.append(src_encoded)

X2.append(tar2_encoded)

y.append(tar_encoded)

return array(X1), array(X2), array(y)

# returns train, inference_encoder and inference_decoder models

def define_models(n_input, n_output, n_units):

# define training encoder

encoder_inputs = Input(shape=(None, n_input))

encoder = LSTM(n_units, return_state=True)

encoder_outputs, state_h, state_c = encoder(encoder_inputs)

encoder_states = [state_h, state_c]

# define training decoder

decoder_inputs = Input(shape=(None, n_output))

decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)

decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)

decoder_dense = Dense(n_output, activation='softmax')

decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# define inference encoder

encoder_model = Model(encoder_inputs, encoder_states)

# define inference decoder

decoder_state_input_h = Input(shape=(n_units,))

decoder_state_input_c = Input(shape=(n_units,))

decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)

decoder_states = [state_h, state_c]

decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

# return all models

return model, encoder_model, decoder_model

# generate target given source sequence

def predict_sequence(infenc, infdec, source, n_steps, cardinality):

# encode

state = infenc.predict(source)

# start of sequence input

target_seq = array([0.0 for _ in range(cardinality)]).reshape(1, 1, cardinality)

# collect predictions

output = list()

for t in range(n_steps):

# predict next char

yhat, h, c = infdec.predict([target_seq] + state)

# store prediction

output.append(yhat[0,0,:])

# update state

state = [h, c]

# update target sequence

target_seq = yhat

return array(output)

# decode a one hot encoded string

def one_hot_decode(encoded_seq):

return [argmax(vector) for vector in encoded_seq]

# configure problem

n_features = 50 + 1

n_steps_in = 6

n_steps_out = 3

# define model

train, infenc, infdec = define_models(n_features, n_features, 128)

train.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# generate training dataset

X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 100000)

print(X1.shape,X2.shape,y.shape)

# train model

train.fit([X1, X2], y, epochs=1)

# evaluate LSTM

total, correct = 100, 0

for _ in range(total):

X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 1)

target = predict_sequence(infenc, infdec, X1, n_steps_out, n_features)

if array_equal(one_hot_decode(y[0]), one_hot_decode(target)):

correct += 1

print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))

# spot check some examples

for _ in range(10):

X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 1)

target = predict_sequence(infenc, infdec, X1, n_steps_out, n_features)

print('X=%s y=%s, yhat=%s' % (one_hot_decode(X1[0]), one_hot_decode(y[0]), one_hot_decode(target)))

Running the example first prints the shape of the prepared dataset.

(100000, 6, 51) (100000, 3, 51) (100000, 3, 51)

1	(100000, 6, 51) (100000, 3, 51) (100000, 3, 51)

Next, the model is fit. You should see a progress bar and the run should take less than one minute on a modern multi-core CPU.

100000/100000 [==============================] - 50s - loss: 0.6344 - acc: 0.7968

1	100000/100000 [==============================] - 50s - loss: 0.6344 - acc: 0.7968

Next, the model is evaluated and the accuracy printed.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the model achieves 100% accuracy on new randomly generated examples.

Accuracy: 100.00%

1	Accuracy: 100.00%

Finally, 10 new examples are generated and target sequences are predicted. Again, we can see that the model correctly predicts the output sequence in each case and the expected value matches the reversed first 3 elements of the source sequences.

X=[22, 17, 23, 5, 29, 11] y=[23, 17, 22], yhat=[23, 17, 22]
X=[28, 2, 46, 12, 21, 6] y=[46, 2, 28], yhat=[46, 2, 28]
X=[12, 20, 45, 28, 18, 42] y=[45, 20, 12], yhat=[45, 20, 12]
X=[3, 43, 45, 4, 33, 27] y=[45, 43, 3], yhat=[45, 43, 3]
X=[34, 50, 21, 20, 11, 6] y=[21, 50, 34], yhat=[21, 50, 34]
X=[47, 42, 14, 2, 31, 6] y=[14, 42, 47], yhat=[14, 42, 47]
X=[20, 24, 34, 31, 37, 25] y=[34, 24, 20], yhat=[34, 24, 20]
X=[4, 35, 15, 14, 47, 33] y=[15, 35, 4], yhat=[15, 35, 4]
X=[20, 28, 21, 39, 5, 25] y=[21, 28, 20], yhat=[21, 28, 20]
X=[50, 38, 17, 25, 31, 48] y=[17, 38, 50], yhat=[17, 38, 50]

X=[22, 17, 23, 5, 29, 11] y=[23, 17, 22], yhat=[23, 17, 22]

X=[28, 2, 46, 12, 21, 6] y=[46, 2, 28], yhat=[46, 2, 28]

X=[12, 20, 45, 28, 18, 42] y=[45, 20, 12], yhat=[45, 20, 12]

X=[3, 43, 45, 4, 33, 27] y=[45, 43, 3], yhat=[45, 43, 3]

X=[34, 50, 21, 20, 11, 6] y=[21, 50, 34], yhat=[21, 50, 34]

X=[47, 42, 14, 2, 31, 6] y=[14, 42, 47], yhat=[14, 42, 47]

X=[20, 24, 34, 31, 37, 25] y=[34, 24, 20], yhat=[34, 24, 20]

X=[4, 35, 15, 14, 47, 33] y=[15, 35, 4], yhat=[15, 35, 4]

X=[20, 28, 21, 39, 5, 25] y=[21, 28, 20], yhat=[21, 28, 20]

X=[50, 38, 17, 25, 31, 48] y=[17, 38, 50], yhat=[17, 38, 50]

You now have a template for an encoder-decoder LSTM model that you can apply to your own sequence-to-sequence prediction problems.

Summary

In this tutorial, you discovered how to develop an encoder-decoder recurrent neural network for sequence-to-sequence prediction problems with Keras.

Specifically, you learned:

How to correctly define a sophisticated encoder-decoder model in Keras for sequence-to-sequence prediction.
How to define a contrived yet scalable sequence-to-sequence prediction problem that you can use to evaluate the encoder-decoder LSTM model.
How to apply the encoder-decoder LSTM model in Keras to address the scalable integer sequence-to-sequence prediction problem.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

390 Responses to How to Develop an Encoder-Decoder Model for Sequence-to-Sequence Prediction in Keras

Alex November 2, 2017 at 7:06 pm #

Is this model suited for sequence regression too? For example the shampoo sales problem

Reply
- Jason Brownlee November 3, 2017 at 5:15 am #
  
  You could try, but generally LSTMs do not perform well on autoregression problems:
  https://machinelearningmastery.com/suitability-long-short-term-memory-networks-time-series-forecasting/
  
  Reply
  - rui December 3, 2020 at 8:42 pm #
    
    Hello, can this example be applied to the prediction of floating point numbers?For example [0.123,0.234,0.345], the target is [0.234,0.123],
    
    Reply
    - Jason Brownlee December 4, 2020 at 6:41 am #
      
      Yes, see an example here:
      https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
      
      Reply
      - cubayang August 10, 2021 at 5:40 pm #
        
        Hi, thank you for your post. I am new to LSTM. It looks like the target should always be a subset of the source, could we use LSTM to predict the target which is not a subset of the source? For example, the source is [0.123,0.234,0.345], and the target is [0.564, 0.923, 0.667]? Thanks in advance.
      - Adrian Tam August 11, 2021 at 6:40 am #
        
        Yes, you can.
Teimour November 2, 2017 at 9:55 pm #

Hi. is it possible to have multi layers of LSTM in encoder and decoder in this code? thank you for your great blog

Reply
- Jason Brownlee November 3, 2017 at 5:16 am #
  
  Yes, but I don’t have an example. For this specific case it would require some careful re-design.
  
  Reply
  - Dk January 17, 2019 at 5:18 pm #
    
    Hi, maybe this time you got any idea of multi-layer of LSTM autoencoder?
    
    Reply
    - Jason Brownlee January 18, 2019 at 5:30 am #
      
      Perhaps this will help as a start:
      https://machinelearningmastery.com/lstm-autoencoders/
      
      Reply
    - James Wanga February 21, 2019 at 5:54 pm #
      
      @Teimour @Dk @jason Brownlee Here is a great example of a multilayer encoder-decoder using the Keras functional API.
      
      Reply
  - JJ December 18, 2019 at 11:21 pm #
    
    Could you give some more detail as to how it would need to be modified to have multiple LSTM layers in the decoder? I am working on something similar and looking to add layers to the decoder.
    
    Reply
Kyu November 3, 2017 at 12:06 am #

How can I extract the bottleneck layer to extract the important features with sequence data?

Reply
- Jason Brownlee November 3, 2017 at 5:18 am #
  
  You could access the returned states to get the context vector, but it does not help you understand which input features are relevant/important.
  
  Reply
Thabet November 3, 2017 at 4:09 am #

Thank you Jason!

Reply
- Jason Brownlee November 3, 2017 at 5:21 am #
  
  You’re welcome.
  
  Reply

Harry Garrison November 18, 2017 at 3:30 am #

Thanks for the wonderful tutorial, Jason!
I am facing an issue, though: I tried to execute your code as is (copy-pasted it), but it throws an error:

Using TensorFlow backend.
Traceback (most recent call last):
File “C:\Users\User\Documents\pystuff\keras_auto.py”, line 91, in
train, infenc, infdec = define_models(n_features, n_features, 128)
File “C:\Users\User\Documents\pystuff\keras_auto.py”, line 40, in define_models
encoder = LSTM(n_units, return_state=True)
File “C:\Users\User\Anaconda3\envs\py35\lib\site-packages\keras\legacy\interfaces.py”, line 88, in wrapper
return func(*args, **kwargs)
File “C:\Users\User\Anaconda3\envs\py35\lib\site-packages\keras\layers\recurrent.py”, line 949, in __init__
super(LSTM, self).__init__(**kwargs)
File C:\Users\User\Anaconda3\envs\py35\lib\site-packages\keras\layers\recurrent.py”, line 191, in __init__
super(Recurrent, self).__init__(**kwargs)
File “C:\Users\User\Anaconda3\envs\py35\lib\site-packages\keras\engine\topology.py”, line 281, in __init__
raise TypeError(‘Keyword argument not understood:’, kwarg)
TypeError: (‘Keyword argument not understood:’, ‘return_state’)

I am using an anaconda environment (python 3.5.3). What could have possibly gone wrong?

Jason Brownlee November 18, 2017 at 10:23 am #

Perhaps confirm that you have the most recent version of Keras and TensorFlow installed.

Carolyn December 8, 2017 at 12:34 pm #

I had the same problem, and updated Keras (to version 2.1.2) and TensorFlow (to version 1.4.0). The problem above was solved. However, I now see that the shapes of X1, X2, and y are ((100000, 1, 6, 51), (100000, 1, 3, 51), (100000, 1, 3, 51)) instead of ((100000, 6, 51), (100000, 3, 51), (100000, 3, 51)). Why could this be?

Jason Brownlee December 8, 2017 at 2:29 pm #

I’m not sure, perhaps related to recent API changes in Keras?

Carolyn December 9, 2017 at 1:36 am #

Here’s how I fixed the problem.

At the top of the code, include this line (before any ‘from numpy import’ statements:

import numpy as np

Change the get_dataset() function to the following:

def get_dataset(n_in, n_out, cardinality, n_samples):
        X1, X2, y = list(), list(), list()
        for _ in range(n_samples):
                # generate source sequence
                source = generate_sequence(n_in, cardinality)
                # define padded target sequence
                target = source[:n_out]
                target.reverse()
                # create padded input target sequence
                target_in = [0] + target[:-1]
                # encode
                src_encoded = to_categorical([source], num_classes=cardinality)
                tar_encoded = to_categorical([target], num_classes=cardinality)
                tar2_encoded = to_categorical([target_in], num_classes=cardinality)
                # store
                X1.append(src_encoded)
                X2.append(tar2_encoded)
                y.append(tar_encoded)
        X1 = np.squeeze(array(X1), axis=1) 
        X2 = np.squeeze(array(X2), axis=1) 
        y = np.squeeze(array(y), axis=1) 
        return X1, X2, y

def get_dataset(n_in, n_out, cardinality, n_samples):

X1, X2, y = list(), list(), list()

for _ in range(n_samples):

# generate source sequence

source = generate_sequence(n_in, cardinality)

# define padded target sequence

target = source[:n_out]

target.reverse()

# create padded input target sequence

target_in = [0] + target[:-1]

# encode

src_encoded = to_categorical([source], num_classes=cardinality)

tar_encoded = to_categorical([target], num_classes=cardinality)

tar2_encoded = to_categorical([target_in], num_classes=cardinality)

# store

X1.append(src_encoded)

X2.append(tar2_encoded)

y.append(tar_encoded)

X1 = np.squeeze(array(X1), axis=1)

X2 = np.squeeze(array(X2), axis=1)

y = np.squeeze(array(y), axis=1)

return X1, X2, y

Notice instead of returning array(X1), array(X2), array(y), we now return arrays that have been squeezed – one axis has been removed. We remove axis 1 because it’s the wrong shape for what we need.

The output is now as it should be (although I’m getting 98% accuracy instead of 100%).

Jason Brownlee December 9, 2017 at 5:42 am #

Thanks for sharing.

Perhaps confirm that you have updated Keras to 2.1.2, it fixes bugs with to_categorical()?
Salem Shaikh January 29, 2020 at 4:19 am #

yes, we have to reshape the model and then everything works fine
what is(n_in, cardinality)?
I am facing difficulty in using it with a user-defined variable.for random what should be the input and input shape for the arguments
Jason Brownlee January 29, 2020 at 6:47 am #

See this:
https://machinelearningmastery.com/faq/single-faq/what-is-the-difference-between-samples-timesteps-and-features-for-lstm-input
AMARESH PATNAIK March 1, 2020 at 9:09 pm #

I removed the brackets [ ] in the parameters to to_categorical instead. I think because parameters like target etc are already arrays the brackets are not necessary.
Jason Brownlee March 2, 2020 at 6:16 am #

Fair enough.

Kushagra September 6, 2023 at 7:52 pm #

I was facing the same issue, your comment helped me solve it! Thanks!

Reply

George Orfanidis February 14, 2019 at 8:09 pm #

As I was facing the same issue I think the simpler solution is just use:
to_categorical(source, num_classes=cardinality)

instead of

to_categorical([source], num_classes=cardinality).shape

for the 3 lists.

Reply
- Björn Lindqvist June 9, 2020 at 1:51 am #
  
  I also had to do that to get the data in the right format. There’s also a function tensorflow.one_hot which I prefer over to_categorical. They do the same thing.
  
  Reply

Thabet November 26, 2017 at 8:47 am #

Hi Jason!
Are the encoder-decoder networks suitable for time series classification?

Reply
- Jason Brownlee November 27, 2017 at 5:41 am #
  
  In my experience LSTMs have not proven effective at autoregression compared to MLPs.
  
  See this post for more details:
  https://machinelearningmastery.com/suitability-long-short-term-memory-networks-time-series-forecasting/
  
  Reply
- angelfight September 19, 2018 at 5:41 pm #
  
  for people later, you don’ t need to use squeeze,
  
  Let ‘s try:
  from
  src_encoded = to_categorical([source], num_classes=cardinality)
  tar_encoded = to_categorical([target], num_classes=cardinality)
  tar2_encoded = to_categorical([target_in], num_classes=cardinality)
  
  to:
  
  src_encoded = to_categorical(source, num_classes=cardinality)
  tar_encoded = to_categorical(target, num_classes=cardinality)
  tar2_encoded = to_categorical(target_in, num_classes=cardinality)
  
  Reply
  - Jason Brownlee September 20, 2018 at 7:51 am #
    
    Thanks.
    
    Reply
    - r kant October 19, 2019 at 1:28 am #
      
      Thanks Jason, for writing blogs for machinelearningmastery.com. I have seen no. of blogs.
      
      I have one query – –
      
      In my dataset —-
      no. of columns are 5 (inputs),
      No. of outputs – 3 (feature , which I want to predict) and
      number of rows – 34079.
      Then how to put these parameters in your code –
      
      n_features = 34079
      n_steps_in = 5
      n_steps_out = 3
      
      # generate training dataset
      X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 1)
      print(X1.shape,X2.shape,y.shape)
      
      I am getting error like ——–>>> Your session crashed after using all available RAM. View runtime logs. I am using 14Gb RAM. please suggest me how to arrange my parameters as per your code.. thanks in advance.
      
      Reply
      - Jason Brownlee October 19, 2019 at 6:47 am #
        
        Sounds like you might need to reduce your dataset size or use a machine with more RAM?
      - r kant October 19, 2019 at 4:45 pm #
        
        Thanks Jason for for reply,
        but I think, I may not be keeping parameters properly in the variables.
        
        In my dataset —-
        no. of columns are 5 (inputs),
        No. of outputs – 3 (feature , which I want to predict) and
        number of rows – 34079.
        
        then how to put in your variables —
        
        n_features = ???
        n_steps_in = 5
        n_steps_out = 3
        sample = ???
        
        thanks
      - Jason Brownlee October 20, 2019 at 6:17 am #
        
        Ah I see, this will help you map your problem onto the terminology of samples, time steps, features:
        https://machinelearningmastery.com/faq/single-faq/what-is-the-difference-between-samples-timesteps-and-features-for-lstm-input
        
        Does that help?
Python November 30, 2017 at 11:40 pm #

Hi Jason,
running the get_dataset the function returns one additional row in the Arrays X1, X2, y:
(1,1,6,51) (1,1,3,51)(1,1,3,51).
This results form numpy.array(), before the X1, X2, y are lists with the correct sizes:
(1,6,51) (1,3,51)(1,3,51).
It seems the comand array() adds an additional dimension. Could you help on solving this problem?

Reply
nandu December 1, 2017 at 6:42 pm #

how to train the keras ,it has to identify capital letter and small letter has to same

Please suggest any tactics for it.

Reply
- Jason Brownlee December 2, 2017 at 8:50 am #
  
  Start by clearly defining your problem:
  https://machinelearningmastery.com/how-to-define-your-machine-learning-problem/
  
  Reply
Pritish Yuvraj December 7, 2017 at 4:34 am #

Could you please give me a tutorial for a case where we are input to the seq2seq model is word embeddings and outputs is also word embeddings. I find it frustrating to see a Dense layer at the end. this is what is stopping a fully seq2seq model with input as well as output word embeddings.

Reply
- Jason Brownlee December 7, 2017 at 8:08 am #
  
  Why would we have a word embedding at the output layer?
  
  Reply
  - Uthman Apatira March 2, 2018 at 6:59 pm #
    
    If embeddings are coming in, we’d want to have embeddings going out (auto-encoder). Imagine input = category (cat, dog, frog, wale, human). These animals are quite different, so we represent w/ embedding. Rather than having a dense output of 5 OHE, if an embedding is used for output, the assumption is that, especially if weights are shared between inputs + outputs, we could train the net with and give it more detail about what exactly each class is… rather than use something like cross entropy.
    
    Reply
    - Jason Brownlee March 3, 2018 at 8:08 am #
      
      Ok.
      
      Reply
      - Joe May 25, 2018 at 9:12 am #
        
        Is there such a tutorial yet? Sounds interesting.
      - Jason Brownlee May 25, 2018 at 9:36 am #
        
        I’ve not written one, I still don’t get/see benefit in the approach. Happy to be proven wrong.
  - Ashima March 30, 2019 at 4:06 am #
    
    Hi Jason,
    
    I am trying to put embedding layer at encoder and decoder input with dense layer as it as you mentioned in the above code.
    
    Later in the prediction function call, when I give 3D input to target_seq, it throws dimensional error: Error when checking input: expected Decoder_input to have 2 dimensions, but got array with shape (1, 1, 341), here 341 is my original feature dimension which is compressed to 32 by embedding layer.
    
    Could you please advice, what’s going wrong.
    
    Reply
    - Jason Brownlee March 30, 2019 at 6:32 am #
      
      I cannot debug your changes, sorry. Perhaps post your code and error to stackoverflow?
      
      Reply
      - Ashima April 5, 2019 at 2:33 am #
        
        Thanks Jason. Just wanted to get an idea in scenarios when decoder_input and decoder_output have different dimensions, what happens to your prediction function (target_seq) shape?
        
        Could you please guide.
      - Jason Brownlee April 5, 2019 at 6:22 am #
        
        Not sure off the cuff, I think you will need to experiment.
Dinter December 13, 2017 at 12:00 am #

Hi ! When adding dropout and recurrent_dropout as LSTM arguments on line 40 of the last complete code example with everything else being the same, the code went wrong. So how can I add dropout in this case? Thanks!

Reply
- Jason Brownlee December 13, 2017 at 5:38 am #
  
  I give a worked example of dropout with LSTMs here:
  https://machinelearningmastery.com/use-dropout-lstm-networks-time-series-forecasting/
  
  Reply
  - Dipesh January 11, 2018 at 6:20 am #
    
    Hi Jason! Thanks for your great effort to put encoder decoder implementations here. As Dinter mentioned, when dropout is added, the code runs well for training phase but gives following error during prediction.
    
    InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor ‘lstm_1/keras_learning_phase’ with dtype bool
    [[Node: lstm_1/keras_learning_phase = Placeholder[dtype=DT_BOOL, shape=, _device=”/job:localhost/replica:0/task:0/cpu:0″]()]]
    
    How to fix the problem in this particular implementation?
    
    [Note: Your worked example of dropout however worked for me, but the difference is you are adding layer by layer in sequential model there which is different than this example of encoder decoder]
    
    Reply
    - Jason Brownlee January 12, 2018 at 5:45 am #
      
      Sorry to hear that, perhaps it’s a bug? See if you can reproduce the fault on a small standalone network?
      
      Reply
      - Dipesh Gautam January 17, 2018 at 6:57 am #
        
        When I tried with exactly the same network in the example you presented and added dropout=0.0 in line 10 and line 15 of define_models function, the program runs but for other values of dropout it gives error. Also changing the size of network, for example, number of units to 5, 10, 20 does give the same error.
        
        Any idea to add dropout?
        
        line 5: encoder = LSTM(n_units, return_state=True,dropout=0.0)
        
        line 10: decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True,dropout=0.0)
      - Jason Brownlee January 17, 2018 at 10:01 am #
        
        See this tutorial:
        https://machinelearningmastery.com/use-dropout-lstm-networks-time-series-forecasting/
Huzefa Calcuttawala January 23, 2018 at 8:40 pm #

Hi Jason,
what is the difference between specifying the model input:
Model( [decoder_inputs] + decoder_states_inputs,….)
and this
Model([decoder_inputs,decoder_states_inputs],…)

Does the 1st version add the elements of decoder_states_inputs array to corresponding elements of decoder_inputs

Reply
Alfredo January 24, 2018 at 10:26 pm #

Hi Jason, Thanks for sharing this tutorial. I am only confused when you are defining the model. This is the line:

train, infenc, infdec = define_models(n_features, n_features, 128)

It is a silly question but why n_features in this case is used for the n_input and for the n_output instead of n_input equal to 6 and n_output equal to 3 ?

I look forward to hearing from you soon.

Thanks

Reply
- Jason Brownlee January 25, 2018 at 5:56 am #
  
  Good question, because the model only does one time step per call, so we walk the input/output time steps manually.
  
  Reply
  - Bajj June 11, 2020 at 5:43 pm #
    
    Hi Jason, I got the same doubt as above. Can you please explain once again. Sorry, I didn’t understand your reply
    
    Reply
Dat February 22, 2018 at 8:15 pm #

I would like to see the hidden states vector. Because there are 96 training samples, there would be 96 of these (each as a vector of length 4).

I added the “return_sequences=True” in the LSTM

model = Sequential()
model.add( LSTM(4, input_shape=(1, look_back), return_sequences=True ) )
model.add(Dense(1))
model.compile(loss=’mean_squared_error’, optimizer=’adam’)
model.fit(trainX, trainY, epochs=20, batch_size=20, verbose=2)

But, I get the error

Traceback (most recent call last):
File “”, line 1, in
File “E:\ProgramData\Anaconda3\lib\site-packages\keras\models.py”, line 965, in fit
validation_steps=validation_steps)
File “E:\ProgramData\Anaconda3\lib\site-packages\keras\engine\training.py”, line 1593, in fit
batch_size=batch_size)
File “E:\ProgramData\Anaconda3\lib\site-packages\keras\engine\training.py”, line 1430, in _standardize_user_data
exception_prefix=’target’)
File “E:\ProgramData\Anaconda3\lib\site-packages\keras\engine\training.py”, line 110, in _standardize_input_data
‘with shape ‘ + str(data_shape))
ValueError: Error when checking target: expected dense_24 to have 3 dimensions, but got array with shape (94, 1)

How can I make this model work, and also how can I view the hidden states for each of the input samples (should be 96 hidden states).

Thank you.

Reply
- Jason Brownlee February 23, 2018 at 11:54 am #
  
  Return sequences does not return the hidden state, but instead the outcome from each time step.
  
  This post might clear things up for you regarding outputs and hidden states:
  https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/
  
  Reply

D. Liebman March 3, 2018 at 12:24 am #

Hi Jason,
I like your blog posts. I have some code based on this post, but I get this error message all the time. There’s something wrong with my dense layer. Can you point it out? Here units is 300 and tokens_per_sentence is 25.

an error message:

ValueError: Error when checking target: expected dense_layer_b to have 2 dimensions, but got array with shape (1, 25, 300)

this is some code:

        def model_lstm():
            x_shape = (None,units)
            valid_word_a = Input(shape=x_shape)
            valid_word_b = Input(shape=x_shape)
            ### encoder for training ###
            lstm_a = LSTM(units=tokens_per_sentence,input_shape=x_shape, return_state=True)
            recurrent_a, lstm_a_h, lstm_a_c = lstm_a(valid_word_a)
            lstm_a_states = [lstm_a_h , lstm_a_c]
            print(lstm_a_h)
            ### decoder for training ###
            lstm_b = LSTM(units=tokens_per_sentence ,return_state=True ,input_shape=x_shape)
            recurrent_b, inner_lstmb_h, inner_lstmb_c = lstm_b(valid_word_b, initial_state=lstm_a_states)
            print(inner_lstmb_h.shape, inner_lstmb_c.shape,'h c'
            dense_b = Dense(units  , activation='softmax', name='dense_layer_b')
            decoder_b = dense_b(recurrent_b)
            model = Model([valid_word_a,valid_word_b], decoder_b) 
            return model

def model_lstm():

x_shape = (None,units)

valid_word_a = Input(shape=x_shape)

valid_word_b = Input(shape=x_shape)

### encoder for training ###

lstm_a = LSTM(units=tokens_per_sentence,input_shape=x_shape, return_state=True)

recurrent_a, lstm_a_h, lstm_a_c = lstm_a(valid_word_a)

lstm_a_states = [lstm_a_h , lstm_a_c]

print(lstm_a_h)

### decoder for training ###

lstm_b = LSTM(units=tokens_per_sentence ,return_state=True ,input_shape=x_shape)

recurrent_b, inner_lstmb_h, inner_lstmb_c = lstm_b(valid_word_b, initial_state=lstm_a_states)

print(inner_lstmb_h.shape, inner_lstmb_c.shape,'h c'

dense_b = Dense(units , activation='softmax', name='dense_layer_b')

decoder_b = dense_b(recurrent_b)

model = Model([valid_word_a,valid_word_b], decoder_b)

return model

Jason Brownlee March 3, 2018 at 8:12 am #

Perhaps the data does not match the model, you could change one or the other.

I’m eager to help, but I cannot debug the code for you sorry.

Reply
- D. Liebman March 4, 2018 at 6:30 am #
  
  hi. so I think I needed to set ‘return_sequences’ to True for both lstm_a and lstm_b.
  
  Reply

Jia Yee March 16, 2018 at 2:24 pm #

Dear Jason,

Do you think that this algorithm works for weather prediction? For example, by having the input integer variables as dew point, humidity, and temperature, to predict rainfall as output

Reply
- Jason Brownlee March 16, 2018 at 2:26 pm #
  
  Only for demonstration purposes.
  
  In practice, weather forecasts are performed using simulations of physics models and are more accurate than small machine learning models.
  
  Reply
Lukas March 20, 2018 at 10:10 pm #

Hi Jason.

I would like tou ask you, how could I use this model with float numbers? Yours training data seems like this:
[[
[0,0,0,0,1,0,0]
[0,0,1,0,0,1,0]
[0,1,0,0,0,0,0]]
.
.
.
]]

I would need something like this:
[[
[0.12354,0.9854,5875, 0.0659]
[0.12354,0.9854,5875, 0.0659]
[0.12354,0.9854,5875, 0.0659]
[0.12354,0.9854,5875, 0.0659]
]]

Whan i run your model with float numbers, the network doesn’t learn. Should I use some different LOSS function?

Thank you

Reply
- Jason Brownlee March 21, 2018 at 6:33 am #
  
  The loss function is related to the output, if you have a real-valued output, consider mse or mae loss functions.
  
  Reply
Jugal March 31, 2018 at 4:03 pm #

How to add bidirectional layer in encoder decoder architecture?

Reply
- Jason Brownlee April 1, 2018 at 5:44 am #
  
  Use it directly on the encoder or decoder.
  
  Here’s how to in Keras:
  https://machinelearningmastery.com/develop-bidirectional-lstm-sequence-classification-python-keras/
  
  Reply
Luke April 3, 2018 at 9:51 pm #

Hi.
I would like to use your model with word embedding. I was inspired by https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html ->features -> word-level model

I decoded words in sentences as integer numbers. My input is list of sentences with 15 words. [[1,5,7,5,6,4,5, 10,15,12,11,10,8,1,2], […], [….], …]

My model seems:

encoder_inputs = Input(shape=(None,))
x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
x, state_h, state_c = LSTM(latent_dim, return_state=True)(x)
encoder_states = [state_h, state_c]

# Set up the decoder, using encoder_states as initial state.
decoder_inputs = Input(shape=(None,))
x = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
x = LSTM(latent_dim, return_sequences=True)(x, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation=’softmax’)(x)

# Define the model that will turn
# encoder_input_data & decoder_input_data into decoder_target_data
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

Whan i try to train the model, i get this error: expected dense_1 to have 3 dimensions, but got array with shape (10, 15). Could you help me with that please?

Thank you

Reply
- Fatemeh August 8, 2018 at 4:52 am #
  
  Hello,
  I also have the same issue. could you solve your problem?
  
  Reply
Sunil April 4, 2018 at 6:44 am #

Hi ,

I am trying to do seq2seq problem using Keras – LSTM. Predicted output words matches with most frequent words of the vocabulary built using the dataset. Not sure what could be the reason. After training is completed while trying to predict the output for the given input sequence model is predicting same output irrespective of the input seq. Also using separate embedding layer for both encoder and decoder.

Can you help me what could be the reason ?

Question elaborated here:
https://stackoverflow.com/questions/49638685/keras-seq2seq-model-predicting-same-output-for-all-test-inputs

Thanks.

Reply
- Jason Brownlee April 5, 2018 at 5:42 am #
  
  It suggests the model has not learned the problem.
  
  Here are a list of ideas to try:
  https://machinelearningmastery.com/improve-deep-learning-performance/
  
  Reply
Phetsa Ndlangamandla April 9, 2018 at 9:45 pm #

Hi Jason,

In your current setup, how would you add a pre-trained embedding matrix, like glove?

Reply
- Jason Brownlee April 10, 2018 at 6:18 am #
  
  Here are some examples of loading a word embedding:
  https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
  
  Reply
Lukas April 11, 2018 at 2:11 am #

Hi,

Could you help me with my problem? I think that a lot of required logic is implemented in your code but I am very beginner in python. I want to predict numbers according to input test sequence (seq2seq) so output of my decoder should by sequence of 6 numbers (1-7). U can imagine it as lottery prediction. I have very long input vector of numbers 1-7 (contains sequenses of 6 numbers) so I don’t need to generate test data. I just need to predict next 6 numbers which should be generated. 1,1,2,2,3 -> 3,4,4,5,5,6

Thank you

Reply
- Jason Brownlee April 11, 2018 at 6:40 am #
  
  Perhaps here would be a good place for you to get stared with LSTMs:
  https://machinelearningmastery.com/start-here/#lstm
  
  Reply
  - Lukas April 13, 2018 at 8:48 pm #
    
    Thank you Jason, it was very helpful for me. And can you give me advice or link how to correctly initiate and train network with multiple input sequences?
    
    Reply
    - Jason Brownlee April 14, 2018 at 6:42 am #
      
      They can be side by side:
      https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/
      
      Or separate inputs entirely:
      https://machinelearningmastery.com/keras-functional-api-deep-learning/
      
      Reply
Jameshwart Lopez April 17, 2018 at 11:41 am #

Thank you Jason for this tutorial it gets me started. I tried to use larger sequence where input is 400 series of numbers and output is 40 numbers. I have problem with categorical because of index error and i don’t know how to set or get the the value for cardinality/n_features. Can you give me an idea on this?

Im also not clear if this is a classification type of model. Can you please confirm. Thanks

Reply
- Jason Brownlee April 17, 2018 at 2:50 pm #
  
  You can integer encode or one hot encode categorial input features. Here is a tutorial on the topic:
  https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
  
  This post will make the distinction between classification and regression clear:
  https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/
  
  Reply
Kadirou April 17, 2018 at 8:15 pm #

Hi,
thank you for this totu, please i receive this error when i run your example :
ValueError: Error when checking input: expected input_1 to have 3 dimensions, but got array with shape (100000, 1, 6, 51)

Reply
- Claudiu April 23, 2018 at 10:21 pm #
  
  Hi! I have the same problem.. Did you solve it ? Thanks!
  
  Reply
  - Gary April 30, 2018 at 1:18 am #
    
    change this:
    
    src_encoded = to_categorical([source], num_classes=cardinality)[0]
    tar_encoded = to_categorical([target], num_classes=cardinality)[0]
    tar2_encoded = to_categorical([target_in], num_classes=cardinality)[0]
    
    Reply
  - cumparaturi online February 27, 2023 at 7:37 am #
    
    Did you tried Gary’s answer?
    
    Reply
Jonathan K April 25, 2018 at 3:32 am #

Hi Jason, thank you very much for the tutorial. Is it possible to decode many sequences simultaneously instead of decoding a single sequence one character at a time? The inference for my project takes too long and I thought that doing it in larger batches may help.

Reply
- Jason Brownlee April 25, 2018 at 6:37 am #
  
  You could use copies of the model to make prediction in parallel for different inputs.
  
  Reply
Jameshwart Lopez April 26, 2018 at 4:08 pm #

Hi im just wondering why do we need to use to_categorical if the sequence is already numbers. On my case i have a series of input features(numbers) and another series of output features(number). Should i still use to_categorical method?

Reply
- Jason Brownlee April 27, 2018 at 5:59 am #
  
  Good question, it will one hot encode the numbers:
  https://keras.io/utils/#to_categorical
  
  Reply
chunky April 28, 2018 at 10:16 pm #

Hi Jason,

I am working on word boundary detection problem where dataset containing .wav files in which a sentence is spoken are given, and corresponding to each .wav file a .wrd file is also given which contains the words spoken in a sentence and also its boundaries (starting and end boundaries).
Our task is to identify word boundaries in test .wav file (words spoken will also be given).
I want to do this with sequential models .

What I have tried is:

I have read .wav files using librosa module in numpy array (made it equal to max size using padding)

Its output is like 3333302222213333022221333302222133333 (for i/p ex: I am hero)
where (0:start of word, 1:end of word, 2:middle, 3:space)

means I want to solve this as supervised learning problem, can I train such model with RNN?

Reply
- Jason Brownlee April 29, 2018 at 6:27 am #
  
  Sounds like a great project.
  
  I don’t have examples of working with audio data sorry, I cannot give you good off the cuff advice.
  
  Perhaps find some code bases on related modeling problems and use them for inspiration?
  
  Reply
Lukas April 30, 2018 at 1:03 am #

Hello.
I’ve tried this example for word-level chatbot. Everything works great, on a small data (5000 sentences)

When I use dataset of 50 000 sentences something is wrong. Model accuracy is 95% but when i try to chat with this chatbot, responses are generated randomly. Chatbot is capable of learning sentences from dataset, but it use them randomly to respond to users questions.

How it is possible, when accuracy is so hight?
Thanks

Reply
- Jason Brownlee April 30, 2018 at 5:37 am #
  
  Perhaps it memorized the training data? E.g. overfitting?
  
  Reply
ricky May 3, 2018 at 12:06 am #

Sir, how to create confusion matrix, evaluated and the accuracy printed for this model :

# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard encoder_outputs and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using encoder_states as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don’t use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation=’softmax’)
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# encoder_input_data & decoder_input_data into decoder_target_data
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.compile(optimizer=’rmsprop’, loss=’categorical_crossentropy’, metrics=[‘accuracy’])
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
batch_size=batch_size,
epochs=epochs,
validation_split=0.2)

model.summary()

Reply
- Jason Brownlee May 3, 2018 at 6:35 am #
  
  Here is information on how to calculate a confusion matrix:
  https://machinelearningmastery.com/confusion-matrix-machine-learning/
  
  Reply
Chandra Sutrisno May 8, 2018 at 7:54 pm #

Hi Jason,

Thank you for this awesome tutorial, so useful. I have one simple question. Is there any specific
reason why use 50+1 as n_features?

Please advise

Reply
- Jason Brownlee May 9, 2018 at 6:19 am #
  
  The +1 is to leave room for the “0” value, for “no data”.
  
  Reply
Kunwar May 15, 2018 at 11:28 pm #

Hi Jason,

Thanks for all the wonderful tutorial ..great work.

i have question,

in time series prediction does multivariate give better result then uni-variate.
eg.- for “Beijing PM2.5 Data Set” we have multivariate data will the multivariate give better results, or by taking the single pollution data for uni-variate will give better result.

2 – what is better encoder-decoder or normal RNN for time series prediction.

Reply
- Jason Brownlee May 16, 2018 at 6:04 am #
  
  For both questions, it depends on the specific problem.
  
  Reply
matt May 20, 2018 at 3:44 am #

Hi Jason, here it looks like one time step in the model. What do I have to change to add here more time steps in the model ?

# returns train, inference_encoder and inference_decoder models
def define_models(n_input, n_output, n_units):
# define training encoder
encoder_inputs = Input(shape=(None, n_input))
encoder = LSTM(n_units, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]
# define training decoder
decoder_inputs = Input(shape=(None, n_output))
decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(n_output, activation=’softmax’)
decoder_outputs = decoder_dense(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
# define inference encoder
encoder_model = Model(encoder_inputs, encoder_states)
# define inference decoder
decoder_state_input_h = Input(shape=(n_units,))
decoder_state_input_c = Input(shape=(n_units,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)
# return all models
return model, encoder_model, decoder_model

Reply
Skye May 21, 2018 at 6:09 pm #

Hi Jason,

Thank you very much! I would like to ask a question which seems silly…When training the model, we store the weights in training model. Are the inference encoder and inference decoder empty? At predict stage, the training model is not used, so how are the training model weights used to predict?

Looking forward to your reply. Thank you!

Reply
- Jason Brownlee May 22, 2018 at 6:24 am #
  
  No silly questions here!
  
  The state of the model is reset at the end of each batch.
  
  Reply
Skye May 24, 2018 at 10:36 am #

Get it. Thank you a lot!

Reply
- Jason Brownlee May 24, 2018 at 1:50 pm #
  
  Glad to hear it.
  
  Reply
- Ruzbeh January 12, 2019 at 8:18 am #
  
  Can you kindly explain this a bit more. I don’t seem to understand how model.fit relates to inference_encoder/decoder. Thank you!
  
  Reply
  - Bill January 15, 2019 at 1:46 am #
    
    (great post, btw) I could also use a bit more. Here is my misunderstanding, 3 models were instantiated (train, infenc, infdec). One was trained (train) and developed weights that (hopefully) would generate accurate predictions.
    
    Then the other two models are used to test the predictive power, but I do not understand where the weights from train were parsed and transferred to infenc and infdec, respectively. (and, I suspect I fundamentally am missing something about the keras infrastructure which informs this)
    
    Reply
    - John August 30, 2020 at 10:51 am #
      
      Same question here, was wondering how the weights from train are used in infenc
      
      Reply
      - Jason Brownlee August 31, 2020 at 6:06 am #
        
        Good question.
        
        I recommend reviewing the define_models() function, you can see that model weights are shared between the models – but each model provides a different use case.
Joe May 25, 2018 at 9:28 am #

Can this model described in this blog post be used when we have variable length input? How.

Reply
- Joe May 25, 2018 at 9:29 am #
  
  And also variable length output.
  
  Reply
- Jason Brownlee May 25, 2018 at 9:37 am #
  
  Yes, it processes time steps one at a time, therefore supports variable length by default.
  
  Reply
michel June 9, 2018 at 3:32 am #

Hi Jason how to deal and implement this model if you have time series data with 3D shape (samples, timesteps, features) and can not /want not one hot encode them? Thank you in advance.

Reply
- Jason Brownlee June 9, 2018 at 6:57 am #
  
  You can integer encode the inputs and use an embedding layer on the input.
  
  Reply
  - michel June 10, 2018 at 8:54 am #
    
    how would that look like ? And can I not do it with the inference part later only ? Sorry my questions could sound a bit stupid, I try to understand the topics.
    
    Reply
    - Jason Brownlee June 11, 2018 at 6:04 am #
      
      I provide a ton more help on preparing data for LSTMs here:
      https://machinelearningmastery.com/faq/single-faq/how-do-i-prepare-my-data-for-an-lstm
      
      Reply
mlguy June 17, 2018 at 2:41 pm #

Hi Jason,
thank you for sharing your codes. I used this for my own problem and it works, but I still get good prediction results on values that are far away from the ones used in the training. What could my problem be ? I made a regression using mse for loss, do I need a different loss function ?

Reply
- Jason Brownlee June 18, 2018 at 6:39 am #
  
  I have some suggestions for improving model skill here that may give you some ideas:
  https://machinelearningmastery.com/improve-deep-learning-performance/
  
  Reply
patrick June 18, 2018 at 8:27 am #

Hi Jason, great contribution! when I use timeseries data for this model, can I also use not the shfitet case in the target data for y, so: model.fit([input, output], output) , so output=input.reversed() and than would this make sense as well? because I want to use sliding windows for input and output; and than shfiting by one for the output being would not make sense in my eyes.

Reply
- Jason Brownlee June 18, 2018 at 3:05 pm #
  
  Not sure I follow sorry.
  
  Perhaps this post will make things clearer:
  https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/
  
  Reply
simon June 18, 2018 at 9:56 am #

lets say I trained with shape (1000,20,1) but I want to predict with (20000,20,1) than this would not work, because the sample size is bigger. how do I have to adjust this ?

output=list()
for t in range(2):
output_tokens, h, c = decoder_model.predict([target_seqs] + states_values)
output.append(output_tokens[0,0,:])
states_values = [h,c]
target_seq = output_tokens

Reply
- Jason Brownlee June 18, 2018 at 3:06 pm #
  
  Why would it not work?
  
  Reply
  - simon June 19, 2018 at 3:32 am #
    
    it says the following: index 1000 is out of bounds for axis 0 with size 1000,
    so the axis of the second one needs to be in the same length. But I will not train with the same length of data that I am predicting. How can this be solved?
    
    Reply
    - Jason Brownlee June 19, 2018 at 6:37 am #
      
      Perhaps double check your data.
      
      Reply
      - simon June 24, 2018 at 6:54 pm #
        
        still not working Jason, Unfortunately it seems like variable sequence length is not possible ? I trained my model with a part of the whole data and wanted to use in the inference the whole data ?
      - simon June 24, 2018 at 7:20 pm #
        
        I correct myself: to I have to padd my data with zeros in the beginning of my training data since I want to use the last state of the encoder as initial state ?
Sarah June 20, 2018 at 1:19 am #

I have 2 questions:
1- This example is using teacher forcing, right? I’m trying to re-implement it without teacher forcing but it fails. When I define the train model as (line 49 of your example): model = Model(encoder_inputs, decoder_outputs), it gives me this error: RuntimeError: Graph disconnected: cannot obtain value for tensor Tensor(“decoder_inputs:0”, shape=(?, n_units, n_output_features), dtype=float32) at layer “decoder_inputs”. The following previous layers were accessed without issue: [‘encoder_inputs’, ‘encoder_LSTM’]

2- Can you please explain a bit more about how you defined the decoder model? I don’t get what is happening in the part between line 52 to 60 of the code (copied below)? Why do you need to re-define the decoder_outputs? How the model is defined?

# define inference decoder
decoder_state_input_h = Input(shape=(n_units,))
decoder_state_input_c = Input(shape=(n_units,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)
# return all models

Reply
- Jason Brownlee June 20, 2018 at 6:30 am #
  
  Correct. If performance is poor without teacher forcing, then don’t train without it. I’m not sure why you would want to remove it?!?
  
  The decoder is a new model that uses parts of the encoder. Does that help?
  
  Reply
dave June 29, 2018 at 5:38 am #

HI Jason, I looked at different pages and found that a seq2seq is not possible with variable length without manipulating the data with zeros etc. So in your example, is it right that you can only use in inference data with shape (100000,timesteps,features), so not having variable length? If not how you can change that ? thanks a lot for responding!

Reply
- Jason Brownlee June 29, 2018 at 6:16 am #
  
  Generally, variable length inputs can be truncated or padded and a masking layer can be used to ignore the padding.
  
  Alternately, the above model can be used time-step wise, allowing truely multivariate inputs.
  
  Reply
  - dave June 29, 2018 at 4:11 pm #
    
    nice explanation, a last question: my timesteps woud be the same in all cases,just the sample size woud be different. The keras padding sequences shows only padding for timesteps or am I wrong ? in that case I would need to pad my samples
    
    Reply
    - Jason Brownlee June 30, 2018 at 6:03 am #
      
      Good question. Only the time steps and feature dimensions are fixed, the number of samples can vary.
      
      Reply
DrCam June 29, 2018 at 8:09 pm #

Jason,

These are really awesome little tutorials and you do a great job explaining them.

Just as an FYI, to help out other readers, I had to use tensorflow v1.7 to get this to run with Keras 2.2.0. With older versions I got a couple of errors while initializing the models and with the newer v1.8 I got some session errors with TF.

Reply
- Jason Brownlee June 30, 2018 at 6:07 am #
  
  Thanks for the tip.
  
  Reply
Yasir Hussain July 16, 2018 at 5:53 pm #

Hello Jason, Your work has always been awesome. I have learned a lot form your tutorials.
I am trying a simple encoder-decoder model following your example but facing some problem.

my issue is vocabulary size where it is around 5600, which makes my one hot encoder really big and puts my pc to freeze. can you give some simple example in which I don’t need to one hot encode my data? I know about sparse_categorical_crossentropy but I am facing trouble implementing it. Maybe my approach is wrong. your guidance can help me a lot.

Thanks once again for such great tutorials…..

Reply
- Jason Brownlee July 17, 2018 at 6:12 am #
  
  Perhaps try a word embedding instead?
  
  Or, perhaps try training on an EC2 instance?
  
  Reply
George Kibirige July 24, 2018 at 1:32 am #

Hi Jason
Why I got this shape ((100000, 1, 6, 51), (100000, 1, 3, 51), (100000, 1, 3, 51)) when I run your code….Or Do I need to reshape or i did not copy correct

Because it shows you get ((100000,6, 51), (100000,3, 51), (100000,3, 51))

Reply
- Jason Brownlee July 24, 2018 at 6:21 am #
  
  Is your version of Keras up to date?
  
  Did you copy all of the code exactly?
  
  Reply
  - George July 24, 2018 at 4:49 pm #
    
    Hi Jason, the version is different i put this
    X1 = np.squeeze(array(X1), axis=1)
    X2 = np.squeeze(array(X2), axis=1)
    y = np.squeeze(array(y), axis=1)
    
    Its Ok now, What are you doing is called many to one encoding? and in decoding case is it called one to many or?
    
    another question I want to remove those one hot encode, I want to encode the exactly number and predict the exactly number
    
    Also to remove this option src_encoded = to_categorical([source], num_classes=cardinality)
    but I got a lot of error
    
    Reply
    - Jason Brownlee July 25, 2018 at 6:13 am #
      
      I have advice on changing between regression/classification in keras:
      https://machinelearningmastery.com/faq/single-faq/how-can-i-change-a-neural-network-from-regression-to-classification
      
      Reply
- C.Park February 12, 2019 at 3:51 am #
  
  Try removing the bracket in the to_categorical argument, that is, try using
  to_categorical(source, num_classes=cardinality) instead of
  to_categorical([source], num_classes=cardinality).
  
  Reply
  - Mariia Kunilovskaia May 14, 2020 at 12:27 pm #
    
    +500
    A bit annoying that this mishap is not fixed. My guess it is an intentional overlooking intended to enhance training value of the tutorial 🙂
    A teaching method of sorts…
    
    Reply
    - Jason Brownlee May 14, 2020 at 1:29 pm #
      
      I think the API changed. I need to fix it…
      
      Reply
broley July 27, 2018 at 12:27 am #

Hi Jason,

good explanations appreciate your sharings! I wanted to ask if in the line for the inference prediction
for t in range(n_steps):

you also could go with not the n_steps but also with one more ? Or why are you looping with n_steps? you also would get a result with for t in range(1), right= I hope you understood what I try to find out.

Reply
- Jason Brownlee July 27, 2018 at 5:55 am #
  
  Sorry, i don’t follow. Perhaps you can provide more details?
  
  Reply
broley July 27, 2018 at 6:10 am #

so in your code here below, you are predicting in a for loop with n_steps, what needs n_steps to be ? Can you also go with 1? or do you need that n_steps from the data shape (batch,n_steps,features) as timesteps? Because when you infdec.predict isnt the model is taking the whole target_seq for prediction at once why you need n_steps?

for t in range(n_steps):
# predict next char
yhat, h, c = infdec.predict([target_seq] + state)
# store prediction
output.append(yhat[0,0,:])
# update state
state = [h, c]
# update target sequence
target_seq = yhat
return array(output)

Reply
- Jason Brownlee July 27, 2018 at 11:02 am #
  
  Sure, you can set it to 1.
  
  Reply
  - broley July 28, 2018 at 7:16 pm #
    
    and what is the benefit of using 1 or a higher iteration number ?
    
    Reply
    - Jason Brownlee July 29, 2018 at 6:10 am #
      
      for multi-step forecasts.
      
      Reply
guofeng July 30, 2018 at 8:41 pm #

In the “def define_models()”, there are three models: model, encoder_model, decoder_model,

what is the function of encoder_model and decoder_model? Can I use only the “model” for

prediction? Looking forward to your reply

Reply
- Jason Brownlee July 31, 2018 at 6:01 am #
  
  You can learn more about the architecture here:
  https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/
  
  Reply
Rahul Sattar July 31, 2018 at 6:28 pm #

How do we use Normalized Discounted Cumulative Rate (NDCR) for evaluating the model?
Is it necessary to use NDCR or we can live with accuracy as a performance metric?

Reply
- Jason Brownlee August 1, 2018 at 7:41 am #
  
  What is NDCR exactly? I’ve never heard of it.
  
  Reply
Fatemeh August 8, 2018 at 7:53 am #

Hi, thank you for your great description. why you didn’t use “rmsprop” optimizer?

Reply
- Jason Brownlee August 8, 2018 at 9:39 am #
  
  I find I get great results from Adam, an extension of RMSProp:
  https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/
  
  Reply
Fatemeh August 10, 2018 at 1:02 am #

Hi,
I added the embediing matrix to your code and received an error related to matrix dimentions same as what Luke mentioned in the previous comments. do you have any post for adding embedding matrix to encoder-decoder?

Reply
- Jason Brownlee August 10, 2018 at 6:19 am #
  
  I have some suggestions here:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
  Reply
star August 15, 2018 at 1:57 am #

Hi Jason,
I have a low-level question. why did you train the model in only “one” epoch?wasn’t better to choose the higher number?

Reply
- Jason Brownlee August 15, 2018 at 6:08 am #
  
  Because we have a very large number of random samples.
  
  Perhaps re-read the tutorial?
  
  Reply
mohammad H August 15, 2018 at 11:37 pm #

Thank you, Jason. How can we save the results of our model to use them later on?

Reply
- Jason Brownlee August 16, 2018 at 6:06 am #
  
  See this tutorial:
  https://machinelearningmastery.com/save-load-keras-deep-learning-models/
  
  Reply
  - Thiago September 25, 2019 at 5:54 am #
    
    In your code, I suppose the 3 models should be saved (i.e., the 3 returned by defined_models):
    
    train, infenc, infdec = define_models(n_features, n_features, 128)
    
    Reply
    - Jason Brownlee September 25, 2019 at 6:07 am #
      
      Yes, perhaps try using the save() function to save each.
      
      Reply
khan August 24, 2018 at 4:28 am #

Hi Jason, I was using your code to solve a regression problem. I have the data defined exactly like you have but instead of casting them to one-hot-vectors I wish to leave them integers or floats.
I modified the code as follows:
def define_models(n_input, n_output, n_units): # define training encoder encoder_inputs = Input(shape=(None, n_input)) encoder = LSTM(n_units, return_state=True) encoder_outputs, state_h, state_c = encoder(encoder_inputs) encoder_states = [state_h, state_c] # define training decoder decoder_inputs = Input(shape=(None, n_output)) decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True) decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states) decoder_dense = Dense(n_output, activation='relu') decoder_outputs = decoder_dense(decoder_outputs) model = Model([encoder_inputs, decoder_inputs], decoder_outputs) # define inference encoder encoder_model = Model(encoder_inputs, encoder_states) # define inference decoder decoder_state_input_h = Input(shape=(n_units,)) decoder_state_input_c = Input(shape=(n_units,)) decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c] decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs) decoder_states = [state_h, state_c] decoder_outputs = decoder_dense(decoder_outputs) decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states) # return all models return model, encoder_model, decoder_model
However the model does not learn to predict for more than one future timesteps, gives correct prediction for only one ahead. Could you please suggest some solution. Much thanks!

I make the prediction then as follows:
def predict_sequence(infenc, infdec, source, n_steps):
state = infenc.predict(source)
output = list()
pred=np.array([0]).reshape(1,1,1)
for t in range(n_steps):
yhat, h, c = infdec.predict([pred]+state)
output.append(yhat[0,0,0])
state = [h, c]
pred = yhat
return array(output)

Reply
- Jason Brownlee August 24, 2018 at 6:15 am #
  
  I have some ideas here:
  https://machinelearningmastery.com/improve-deep-learning-performance/
  
  Reply
- Gilles Madi November 27, 2018 at 1:26 am #
  
  Hello Khan, I have a similar proble the network not predicting more than one step ahead. did you find a solution?
  
  Reply
Canh Bui September 5, 2018 at 1:15 pm #

Hi Jason, thanks for great serial about MLs,
In this tutorial, i have a bit complex (I’m a beginner in Python. I just have a little AI and ML theory).

y set (target sequence) is a part of X1. Clearly, y specified by X (reserve and subset). So,
1, In train model, Have you use y in X1 as input sequence ( sequence X1(y) => sequence y )
2, Can i define y separate with X1 (such as not subset, reserve…)?
Thanks you!

Reply
- Jason Brownlee September 5, 2018 at 2:42 pm #
  
  Not sure I follow. Perhaps this will help you define your problem:
  https://machinelearningmastery.com/how-to-define-your-machine-learning-problem/
  
  Reply
rahnema November 11, 2018 at 2:48 pm #

hi sir
i want to generate seq from pretrained embedding and in details i have a set of embedding list and its corresponding sentence.and i want to have a model to generate sentence with that embedding data. but i don’t know how to develop that model.

Reply
- Jason Brownlee November 12, 2018 at 5:35 am #
  
  You can use a model like an LSTM to generate text, but not an embedding alone.
  
  Reply
maxrutag November 13, 2018 at 11:18 pm #

Hi Jason,

Great work again! I am a big fan of your tutorials!

I am working on a water inflow forecast problem. I have access to weather predictions and I want to predict the water inflow, given the historical values of both weather and inflows.

I have tried a simple LSTM model with overlapping sequences (time series to supervised), taking weather predictions and past inflows as input, and outputting future inflows. This works pretty well.

Seeing all this Seq2Seq time series forecast trend, I have also tried it, by encoding the weather forecast and decoding them to water inflow (with past inflows as decoder inputs), expecting even better results.

But this Seq2Seq model is performing very poorly. Do you have an idea why? Should I give up this kind of models for my problem?

Reply
- Jason Brownlee November 14, 2018 at 7:29 am #
  
  Perhaps try improving the performance of the model:
  https://machinelearningmastery.com/improve-deep-learning-performance/
  
  Reply
trancy November 23, 2018 at 7:33 pm #

Hi Jason,

I have read couple of your blogs about seq2seq and lstm. Wondering how should I combine this encoder-decoder model with attention? My scenario is kind of translation.

Reply
- Jason Brownlee November 24, 2018 at 6:30 am #
  
  Sure, attention often improves the performance of an encoder-decoder model.
  
  Reply
Marcos Jimenez November 28, 2018 at 3:30 am #

hey Jason, I have 2 questions :
question 1 – how do you estimate a “confidence” for a whole sequence prediction in a neural translation model. i guess you could multiply the individual probabilities for each time-step. somehow i’m not sure that is a proper way to do it. another approach is to feed the output sequence probabilities into a binary classifier (confident/not-confident) and take its value as the confidence score.
question 2 – how do ensure that the whole sequence prediction in a neural translation model is the optimal sequence. if you have a greedy approach where you take the max at each time-step you end up with a sequence which itself is probably not the most likely one. i’ve heard of beam search and other approaches. do you have an idea of the state-of-the-art ?

Reply
- Jason Brownlee November 28, 2018 at 7:45 am #
  
  You can do a greedy search through the predicted probabilities to get the most likely word at each step, but it does not mean that it is the best sequence.
  
  Perhaps look into a beam search.
  
  Reply
Ashima December 6, 2018 at 2:15 am #

Hi Luke,

I am dealing with the same scenario of sequences of integers.

Before you input the sequence, you need to reshape the sequences to 3D and best to train the model in mini batches as it will reset the states after every iteration which works really well for LSTM based model.

Hope that helps.

Reply
Kushal December 20, 2018 at 11:30 am #

Why not apply TimeDistributed to the Decoder as in https://machinelearningmastery.com/develop-neural-machine-translation-system-keras/ ?

Reply
- Jason Brownlee December 20, 2018 at 1:58 pm #
  
  In this case we have a dynamic 1-step model – e.g. it’s not needed.
  
  Reply
uday December 27, 2018 at 4:54 pm #

Hi Jason !

Can this model be used to summarize the sentence ? Am working on abstractive summarizer and trying to build one using encoder decoder with attention.

Can you help me out.

Reply
- Jason Brownlee December 28, 2018 at 5:52 am #
  
  I have a number of posts on summarization:
  https://machinelearningmastery.com/?s=text+summarization&post_type=post&submit=Search
  
  Reply
Ruzbeh January 12, 2019 at 7:57 am #

Thank you for the great tutorial!! There is one thing I don’t understand, hopefully you can clarify:

We train a model as:

train, infenc, infdec = define_models(n_features, n_features, 128)
train.compile(optimizer=’adam’, loss=’categorical_crossentropy’, metrics=[‘acc’])
train.fit([X1, X2], y, epochs=1)

However, the inference encoder/decoder models (infenc and infdec) seem to still un-trained.

I don’t understand how, where, or when the information (weights, states, etc.) from the “train model” is transfer to the inference models.

At prediction time, you said that:

“During prediction, the inference_encoder model is used to encode the input sequence once which returns states that are used to initialize the inference_decoder model. From that point, the inference_decoder model is used to generate predictions step by step.”

But aren’t the inference_encoder and inference_decoder un-trained?

Reply
- Jason Brownlee January 13, 2019 at 5:38 am #
  
  The weights are trained, it is just we have two references to the same weights so we can use them different ways (training/inference).
  
  Reply
Ashima February 13, 2019 at 3:30 am #

Hi Jason,

Is it possible to apply Batchnormalization with the above code?

Reply
- Jason Brownlee February 13, 2019 at 8:03 am #
  
  Sure.
  
  Here’s some examples of using batchnorm that you can adapt:
  https://machinelearningmastery.com/how-to-accelerate-learning-of-deep-neural-networks-with-batch-normalization/
  
  Reply
  - Ashima February 16, 2019 at 2:37 am #
    
    Hi Jason,
    
    Since the above code doesn’t use Keras Sequential model, so it doesn’t allow to add Batchnormalization layer directly to the model.
    
    Could you please help if we want to apply Bn in the code above?
    
    Reply
    - Jason Brownlee February 16, 2019 at 6:21 am #
      
      You can add it to the model using the functional API:
      https://machinelearningmastery.com/keras-functional-api-deep-learning/
      
      Reply
      - Ashima February 18, 2019 at 9:54 pm #
        
        Thanks Jason. Your posts are really helpful !
Ashima February 16, 2019 at 1:48 am #

Thanks Jason!

Reply
- Jason Brownlee February 16, 2019 at 6:20 am #
  
  I’m glad it helped.
  
  Reply
Kyu February 24, 2019 at 5:51 pm #

Hi Jason,

I just wonder why you specified the “epochs” to 1. Is it just because the accuracy already reach 100%?
If it were not the accuracy of 100% with the “epochs” equal to 1, should I switch to any other epoch number to get increased accuracy?

Thank you in advance!
Kyu

Reply
- Jason Brownlee February 25, 2019 at 6:38 am #
  
  Not quite, it is because we defined a very large number of random examples that acted like a proxy for epochs.
  
  You could use fewer examples and more epochs to achieve the same effect.
  
  Reply
Tobs March 3, 2019 at 8:02 am #

Hi Jason,
nice post as always, I enjoy reading your blog. I got on question here. I might missed something, but why do we need the shifted target sequence as input for the decoder? and one more question: why do we need to shift it? Would be nice if you can clearify that for me. Thanks in advance 🙂

Cheers,
Tobs

Reply
- Jason Brownlee March 3, 2019 at 8:05 am #
  
  To transform it into a supervised learning problem, for example, see this:
  https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/
  
  Reply
Ashima March 9, 2019 at 2:29 am #

Hi Jason,

Could you please help with gru based inference model for autoencoder.

/Ashima

Reply
- Jason Brownlee March 9, 2019 at 6:30 am #
  
  You can use the LSTM autoencoder in this tutorial and change the layers to GRU:
  https://machinelearningmastery.com/lstm-autoencoders/
  
  Reply
Ashima March 11, 2019 at 9:48 pm #

Thanks Jason

Reply
- Jason Brownlee March 12, 2019 at 6:52 am #
  
  You’re welcome.
  
  Reply
Mostafa April 7, 2019 at 3:16 pm #

Hi Jason,
Thank you very much for your blog and its excellent content. Regarding this post, it seems that the encoder_outputs is not used. Why don’t we use this out-put? Can we put an auto-decoder on top of it to reproduce the input1? Would this help the network in predicting the sequences?
Bests
Mostafa

Reply
- Jason Brownlee April 8, 2019 at 5:54 am #
  
  It is used – the encoder output is connected to the decoder.
  
  Reply
  - Abhishek Shivkumar March 23, 2020 at 11:19 pm #
    
    Hi Jason, I see the encoder_outputs is not used at all. Any purpose capturing it in a variable? I see only the states being propagated to the decoder.
    
    Reply
    - Jason Brownlee March 24, 2020 at 6:03 am #
      
      Not really.
      
      Reply
Gunay April 11, 2019 at 2:48 am #

Hi Jason, Thanks for the tutorial! I want to add BatchNormalization and Dropout layer after the decoder LSTM layer. Do you have any idea how I can do it?

Regards,
Gunay

Reply
- Jason Brownlee April 11, 2019 at 6:44 am #
  
  Be careful with dropout on LSTMs, use LSTM specific dropout:
  https://machinelearningmastery.com/how-to-reduce-overfitting-with-dropout-regularization-in-keras/
  
  Also batch norm and dropout can interact, be careful not to use them side by side – always test.
  
  Reply
Hossein Shirazi April 18, 2019 at 3:50 am #

Hi Jason,

Thank you so much for your exciting blog. It’s great and I have a question.
What if we have more than one input series that are related to each other and want to predict the result?
For example, predicting the weather which we have more than one features and series?

Reply
- Jason Brownlee April 18, 2019 at 8:55 am #
  
  You can have parallel input features or have a multi-input model.
  
  I show how to do both here:
  https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
  
  Reply
ryh April 22, 2019 at 11:34 am #

Great, thanks a lot

Reply
Waqas Sheikh April 29, 2019 at 7:53 pm #

Hi Jason,
Thanks for the great post.

I need to ask one question that you may think is very silly question.

I have a sequence of integers indicating the locations visited by some user xyz having user type “category1”. I have many users belonging to one of the category and each user has visited some locations.

Given a sequence of some user, I want to predict on which location that user will go next. Also, I want to predict it by taking into consideration the user type/category.

The question is, how can we use the above post for the stated problem since in the post, the values are predicted in reverse order only.

Reply
- Jason Brownlee April 30, 2019 at 6:53 am #
  
  Perhaps there are multiple ways to frame the problem. I’d encourage you to brainstorm and then test a few to see what works.
  
  It sounds like a time series classification problem. I think some of the tutorials here will give you ideas:
  https://machinelearningmastery.com/start-here/#deep_learning_time_series
  
  Reply
Chandresh Kumar Maurya May 8, 2019 at 12:34 am #

@Jason, while giving free tips, mind that code should be working in the real world for large datasets. To_categorical is the worst thing to use and not keeping the memory footprint optimized. Hence, I would recommend re-edit of this post and using “sparse_softmax_cross_entropy_with_logits” as a proxy for “categorical_cross_entropy”. Seq2seq models are generally trained on large vocab datasets and your examples sadly don’t work. One has to go to other places for a solution. BTW, your posts are intuitive.

Reply
- Jason Brownlee May 8, 2019 at 6:45 am #
  
  Thanks for the suggestion.
  
  Reply
Sravan Malla May 15, 2019 at 3:37 pm #

Hi Jason, I am trying with machine translation and going through all your articles.

This post is an implementation of “Encoder-Decoder Model for Sequence-to-Sequence Prediction in Keras” with some sophisticated code in place.

I have seen one more post of your’s at

https://machinelearningmastery.com/develop-neural-machine-translation-system-keras/

and

https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/

which is also introucing to the Encoder-Decoder LSTMs for
sequence-to-sequence prediction with example Python code which is very simple

model = Sequential()
model.add(LSTM(…, input_shape=(…)))
model.add(RepeatVector(…))
model.add(LSTM(…, return_sequences=True))
model.add(TimeDistributed(Dense(…)))

Can you please let me know the exact difference between both of these ? is the later one with simple 5 lines of code the implementation of this sophistcated in a better way using LSTM’s? or which is better?

Reply
- Jason Brownlee May 16, 2019 at 6:25 am #
  
  I generally teach an approach to the encoder-decoder that uses an autoencoder architecture as it often gives the same or better results as the more advanced encoder-decoder.
  
  More here:
  https://machinelearningmastery.com/lstm-autoencoders/
  
  Reply
  - Sravan Malla May 22, 2019 at 6:16 pm #
    
    So the one in this post isn’t autoencoder but have teacher forcing in place, can we have teacher forcing in place for auto-encoder approaches even?
    
    Reply
    - Jason Brownlee May 23, 2019 at 5:56 am #
      
      Correct.
      
      Reply
Di June 28, 2019 at 3:25 pm #

Hi Jason, thank you for this tutorial!
I have a question about teacher forcing.

if I understand your example correctly, you are not using here any validation set (is it right?).

What if I add validation split during fitting, would the model use teacher forcing during validation phase as well?

Would this choice be correct, or should the model use greedy or beam search during the validation to simulate a more realistic performance?

Thank you!

Reply
- Jason Brownlee June 29, 2019 at 6:35 am #
  
  Not quite. I use walk forward validation.
  
  You can learn more about teacher forcing here:
  https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/
  
  Reply
Sandiel August 17, 2019 at 8:05 pm #

Hi Jason,

I have chunks (each chunk has different length, from 700 rows to 2000 rows) of data that include multivariate Inputs (3 time series) and dependent series. For example my first chunk of data has 700 rows of 3 input time series and dependent (one output) time series. I want to use the 3 inputs and predict (generate) the dependent series.
How should I do it?

Thank you in advance!

Reply
- Jason Brownlee August 18, 2019 at 6:42 am #
  
  Perhaps this post will help or give you ideas:
  https://machinelearningmastery.com/make-predictions-long-short-term-memory-models-keras/
  
  And this:
  https://machinelearningmastery.com/how-to-make-classification-and-regression-predictions-for-deep-learning-models-in-keras/
  
  Reply
jmc September 7, 2019 at 9:20 am #

Hi,

which would be the difference between this encoder-decoder vs. the one proposed here:
https://machinelearningmastery.com/how-to-develop-lstm-models-for-multi-step-time-series-forecasting-of-household-power-consumption/

the codification is done by the states of the LSTM in this case and just by the conv filters in the link?

Besides complexity, there is any case in which this architecture wouldn’t be a better pick than the other one?

Reply
- Jason Brownlee September 8, 2019 at 5:11 am #
  
  The example in this post meets the definition of encoder-decoder provided in the paper and is an example of a dynamic RNN.
  
  The example in the other post is a simpler encoder-decoder model based on the idea of an autoencoder.
  
  I generally recommend the latter approach because it is a lot simpler to train and use.
  
  Reply
jmc September 7, 2019 at 9:32 am #

Jason,

wouldn’t it be possible to input a sequence and output also a sequence instead of having to loop the inference decoder over the predicted outputs to obtain the whole predicted sequence?

any thoughts on what would be better?
my particular case is in predicting video frames using the approach of this post

Reply
- Jason Brownlee September 8, 2019 at 5:11 am #
  
  Yes, you could achieve this with a vector output or with a decoder model based on an LSTM autoencoder.
  
  Reply
jmc September 7, 2019 at 9:45 am #

The encoder architecture needs to be the same as the decoder or I could have more LSTM layers in one?

Reply
- Jason Brownlee September 8, 2019 at 5:11 am #
  
  No, they can vary.
  
  Reply
Alejandro Oñate Latorre September 9, 2019 at 2:49 am #

Hi!
I am trying to train my model with several gigabytes of data. My machine runs out of memory.
How can I do to train the batch model? I want to invoke the fit method with packages of 10,000 examples, but I don’t know how to do it.

You can help?

Reply
- Jason Brownlee September 9, 2019 at 5:18 am #
  
  You can use a data generator and progressively load your data into memory for the model.
  
  I give a few examples on the blog for loading large image and text datasets, for example:
  https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/
  
  Reply
Eli October 9, 2019 at 7:31 am #

Hey Jason,

I’m sure I’ll have questions come up here and there, but let me just say how amazed I am with the sheer number of topics you cover in your posts. This one in particular is so incredibly helpful. Thank you for all your work!

Eli

Reply
- Jason Brownlee October 9, 2019 at 8:21 am #
  
  Thanks Eli!
  
  Reply
Supergus October 14, 2019 at 4:16 am #

Great post, thanks

How should I think about when to use teacher-forcing models (which is what I think is described in this post), versus a simple vanilla LSTM like this:

https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/

Thank you

Reply
- Jason Brownlee October 14, 2019 at 8:10 am #
  
  Try a few framings of the problem and see what works/makes sense for your specific dataset.
  
  Reply
Lara October 18, 2019 at 9:52 pm #

Hi Jason,

first of all, I want to thank you for your awesome website. It has helped me a great deal with writing my thesis so far!

I am currently trying to implement an autoencoder with teacher forcing. As my data is purely continous, I have been trying to avoid one-hot encoding. However, I got stuck on the question, how I should mark the beginning of my target sequence. I have tried using a number which is inside or outside the range of the numbers used in the source/target sequences, but the results are poor. My network seems to repeat the same output (with only small alterations) no matter what I feed as an input sequence. I may have made another mistake in my implementation, but maybe you have a asuggestion for me regarding the start of sequence “character”, when no one-hot encoding is used?

Thank you very much in advance.

Reply
- Jason Brownlee October 19, 2019 at 6:33 am #
  
  Thanks.
  
  I use a token to mark the beginning and end of the sequence on problems with variable length output. An example is caption generation for photos:
  https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/
  
  Reply
  - Lara October 21, 2019 at 8:35 pm #
    
    Hi Jason,
    
    thanks for the quick reply.
    
    My output length is fixed. My question was rather regarding the type of data used. Let me try to rephrase it: Is their a standard way to mark the beginning and/or end of a sequence, if the sequence consists of continous (data type: double or float) data and I do NOT want to bin and one-hot encode it? An example of such a seuqnce could be a time series of sensor data.
    
    Thank you for your help.
    
    Reply
    - Jason Brownlee October 22, 2019 at 5:46 am #
      
      Yes, use a value out of the scope of normal values, like -1 or -999, etc.
      
      Reply
Asif Nawaz October 21, 2019 at 5:00 pm #

I have a sequence problem. Each sample is one trajectory having 200 points in chronological order. Each point exists in a 2D grid, where each grid cell is represented by a unique integer id. e,g a trajectory of 10 points passes through following grid cells –> [23, 32, 38, 18, 58, 41, 28, 28, 30, 48]. How to model this problem?

I have 8000 samples, and each sample has a sequence of 200 points, and their is only one feature (that is grid cell number). What should be the shape of X, if this is (8000,200,1), then their are 200 timesteps and each time step has only one entry. then the problem would be to generate shifted sequence in get_dataset.
or other way is to use single timestep for all samples like (8000,1,200). is this logical to solve sequence to sequence problem with single timestep?

Reply
- Jason Brownlee October 22, 2019 at 5:43 am #
  
  Perhaps this will help you to determine the shape:
  https://machinelearningmastery.com/faq/single-faq/what-is-the-difference-between-samples-timesteps-and-features-for-lstm-input
  
  Perhaps try a few different models and framings of the problem in order to see what works best.
  
  Reply
li November 15, 2019 at 12:56 pm #

Hello, Jason, I would like to ask how to extract the status returned by each time part of LSTM, for example, I want to extract the output of the first and second time steps.

Reply
- Jason Brownlee November 16, 2019 at 7:18 am #
  
  Perhaps use return_sequences=True on the layer of interest and use that layer as an output layer?
  
  Reply
  - li November 16, 2019 at 1:43 pm #
    
    Using return_sequences=True, the function I want to achieve is to splicing the two outputs close to each other, followed by a Dense(1), how to realize this function, expect to get your answer, thank you
    
    Reply
    - Jason Brownlee November 17, 2019 at 7:13 am #
      
      Sorry, I don’t understand.
      
      Perhaps try posting your question and code to stackoverflow?
      
      Reply
Adham November 26, 2019 at 11:35 pm #

hello Jason,

I tried to run the above code but got the following error:

ValueError Traceback (most recent call last)
in ()
—-> 1 train.fit([X1, X2], y, epochs=1)

2 frames
/usr/local/lib/python3.6/dist-packages/keras/engine/training_utils.py in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
129 ‘: expected ‘ + names[i] + ‘ to have ‘ +
130 str(len(shape)) + ‘ dimensions, but got array ‘
–> 131 ‘with shape ‘ + str(data_shape))
132 if not check_batch_axis:
133 data_shape = data_shape[1:]

ValueError: Error when checking input: expected input_9 to have 3 dimensions, but got array with shape (100000, 1, 6, 51)

can you please help me to solve it ?

Reply
- Jason Brownlee November 27, 2019 at 6:08 am #
  
  Sorry to hear that, I have some suggestions here:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
  Reply
Fred December 20, 2019 at 5:26 pm #

Hi and thanks a lot for writing this tutorial!
I’m going to have a stab at modifying this to predict/fill in the middle 1/3 of a long sentence.
Do you think that is possible and would you be able to give any pointers?

Reply
- Jason Brownlee December 21, 2019 at 7:07 am #
  
  Sounds fun!
  
  Yes, the tutorials on language modeling here will help:
  https://machinelearningmastery.com/start-here/#nlp
  
  Reply
John January 4, 2020 at 11:57 am #

Hey Jason I wanted to know. Can this work with a dataframe where the label is one number and the target is a sequence.
example
dataframe
label | Prediction
9000 | [5000,8000,9000]
Like a one to many but i would want the model to give me a sequence based off of its label
is that possible with encoder-decoder? if not can you direct me to a better setup.

Reply
- Jason Brownlee January 5, 2020 at 7:01 am #
  
  You can model that problem, but one number does not look like enough information to make a sequence prediction.
  
  Reply
  - John Daniel January 7, 2020 at 3:42 am #
    
    It seems that encoding a large number hits a memory error. a list of numbers higher than 100000
    is there a way to fix this issue
    
    in get_dataset(n_in, n_out, cardinality, n_samples)
    15 target_in.append(in_target)
    16 # encode
    —> 17 src_encoded = to_categorical(source)
    18 tar_encoded = to_categorical(target)
    19 tar2_encoded = to_categorical(target_in)
    
    ~/anaconda3/envs/amazonei_tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/keras/utils/np_utils.py in to_categorical(y, num_classes, dtype)
    46 num_classes = np.max(y) + 1
    47 n = y.shape[0]
    —> 48 categorical = np.zeros((n, num_classes), dtype=dtype)
    49 categorical[np.arange(n), y] = 1
    50 output_shape = input_shape + (num_classes,)
    
    MemoryError:
    
    Reply
    - Jason Brownlee January 7, 2020 at 7:25 am #
      
      Some ideas:
      
      – Use fewer categories.
      – Use a machine with more RAM
      – Use a label encoding instead of one hot encoding.
      
      More ideas:
      https://machinelearningmastery.com/faq/single-faq/how-do-i-handle-a-large-number-of-categories
      
      Reply
Jenna January 7, 2020 at 7:57 pm #

How to add attention to this dynamic encoder-decoder model?
Thank you very much for your help.

Reply
- Jason Brownlee January 8, 2020 at 8:21 am #
  
  Perhaps try using the TensorFlow attention layers?
  
  Reply
Jenna January 7, 2020 at 8:49 pm #

And sorry to bother you again.
You have used the keras layers to build seq2seq models many times in other blogs, such as:
model = Sequential()
model.add(LSTM(…, input_shape=(…)))
model.add(RepeatVector(…))
model.add(LSTM(…, return_sequences=True))
model.add(TimeDistributed(Dense(…)))
I am confused about what is the difference of this keras seq2seq from the above dynamic one in this blog?
Thank you!

Reply
- Jason Brownlee January 8, 2020 at 8:24 am #
  
  The approach I typically use is based on an LSTM autoencoder and is much easier to implement in Keras:
  https://machinelearningmastery.com/lstm-autoencoders/
  
  The dynamic lstm above is a closer fit to the original paper.
  
  Performance appears to be much the same for both models on a range of problems.
  
  Reply
  - Jenna January 8, 2020 at 2:26 pm #
    
    Thanks a lot!
    
    Reply
Markus January 30, 2020 at 3:18 am #

Hi

Looking at the folowing lines of code by the define_models function:

decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(n_output, activation=’softmax’)

Why there’s no need for a TimeDistributed wrapper?

That’s I would expect to see the following:

decoder_dense = TimeDistributed(Dense(n_output, activation=’softmax’))

As because we apply the Dense layer to each given timestep given by decoder_outputs (because of return_sequences=True)

Thanks

Reply
- Jason Brownlee January 30, 2020 at 6:55 am #
  
  It is a dynamic rnn that makes a single step prediction per call.
  
  Reply
  - Markus February 3, 2020 at 2:38 am #
    
    Thanks, but that code snippet belongs to the training and not inference phase… So we’re not in the predication step yet.
    
    Reply
  - Markus February 4, 2020 at 4:56 pm #
    
    Hi Jason
    
    Would really appreciate if you could shed some light on this.
    
    Thanks
    
    Reply
    - Markus February 5, 2020 at 6:05 am #
      
      Ah the puzzle solved:
      
      https://stackoverflow.com/questions/47305618/what-is-the-role-of-timedistributed-layer-in-keras/47309453#47309453
      
      Thanks
      
      Reply
    - Jason Brownlee February 5, 2020 at 8:03 am #
      
      Creates the output for the decoder by the look of things off the cuff.
      
      Reply
      - Markus February 5, 2020 at 9:15 am #
        
        Thanks for your reply. My english is not that good, so I did not understand your answer, could you elaborate it a bit please?
        
        In the meanwhile using the following code proved me that (since keras 2.0) the Dense layer can handle 3 dimensional inputs:
        
        from keras.layers import Dense, TimeDistributed
        from keras.models import Sequential
        
        from numpy import arange
        
        batch = 3
        steps = 10
        features = 16
        
        # model using TimeDistributed
        model = Sequential()
        model.add(TimeDistributed(Dense(8), input_shape=(steps, features)))
        model.compile(loss=’mean_squared_error’, optimizer=’adam’, metrics=[‘accuracy’])
        
        # model without TimeDistributed
        model2 = Sequential()
        model2.add(Dense(8, input_shape=(steps, features)))
        model2.compile(loss=’mean_squared_error’, optimizer=’adam’, metrics=[‘accuracy’])
        
        X = arange(batch * steps * features).reshape(batch, steps, features)
        y = model.predict(X)
        y2 = model.predict(X)
        
        # both outputs are of the same shape
        print(y.shape, y2.shape)
        
        Does this make sense?
      - Jason Brownlee February 5, 2020 at 1:40 pm #
        
        I don’t think so.
        
        I recommend reading this to understand the difference between samples, time steps and features:
        https://machinelearningmastery.com/faq/single-faq/what-is-the-difference-between-samples-timesteps-and-features-for-lstm-input
      - Markus February 7, 2020 at 6:49 am #
        
        Can you please point out the reason why you don’t agree. I have got also the same understanding with the link you provided.
        
        Following the code again to show what I mean. What is wrong with that (using keras 2.3.1)?
        
        from keras.layers import Dense, TimeDistributed
        from keras.models import Sequential
        from numpy import arange
        
        n_samples = 3
        n_steps = 10
        n_features = 16
        
        # model using TimeDistributed
        model = Sequential()
        model.add(TimeDistributed(Dense(8), batch_input_shape=(n_samples, n_steps, n_features)))
        model.compile(loss=’mean_squared_error’, optimizer=’adam’, metrics=[‘accuracy’])
        
        # model without TimeDistributed
        model2 = Sequential()
        model2.add(Dense(8, batch_input_shape=(n_samples, n_steps, n_features)))
        model2.compile(loss=’mean_squared_error’, optimizer=’adam’, metrics=[‘accuracy’])
        
        X = arange(n_samples * n_steps * n_features).reshape(n_samples, n_steps, n_features)
        y = model.predict(X)
        y2 = model2.predict(X)
        
        # both outputs are of the same shape: (3, 10, 8)
        print(y.shape, y2.shape)
        
        model.summary()
        model2.summary()
        
        Which outputs:
        
        (3, 10, 8) (3, 10, 8)
        Model: “sequential_1”
        _________________________________________________________________
        Layer (type) Output Shape Param #
        =================================================================
        time_distributed_1 (TimeDist (3, 10, 8) 136
        =================================================================
        Total params: 136
        Trainable params: 136
        Non-trainable params: 0
        _________________________________________________________________
        Model: “sequential_2”
        _________________________________________________________________
        Layer (type) Output Shape Param #
        =================================================================
        dense_2 (Dense) (3, 10, 8) 136
        =================================================================
        Total params: 136
        Trainable params: 136
        Non-trainable params: 0
        _________________________________________________________________
      - Jason Brownlee February 7, 2020 at 8:29 am #
        
        Sorry, I don’t have the capacity to review/debug this code.
Markus February 5, 2020 at 9:20 am #

Sorry for the typo above, the line:

y2 = model.predict(X)

Should be replaced with:

y2 = model2.predict(X)

Following the complete code again:

from keras.layers import Dense, TimeDistributed
from keras.models import Sequential
from numpy import arange

batch = 3
steps = 10
features = 16

# model using TimeDistributed
model = Sequential()
model.add(TimeDistributed(Dense(8), input_shape=(steps, features)))
model.compile(loss=’mean_squared_error’, optimizer=’adam’, metrics=[‘accuracy’])

# model without TimeDistributed
model2 = Sequential()
model2.add(Dense(8, input_shape=(steps, features)))
model2.compile(loss=’mean_squared_error’, optimizer=’adam’, metrics=[‘accuracy’])

X = arange(batch * steps * features).reshape(batch, steps, features)
y = model.predict(X)
y2 = model2.predict(X)

# both outputs are of the same shape: (3, 10, 8)
print(y.shape, y2.shape)

Reply
Ramsha Siddiqui February 16, 2020 at 3:12 am #

Hi! So, if my 2nd Dimension for Input is not defined like so:

encoder_inputs = Input(shape=(None, None))
encoder = LSTM(32, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]

I get an error:
TypeError: unsupported operand type(s) for +: ‘NoneType’ and ‘int’

Any solutions?

Reply
Salem Shaikh February 18, 2020 at 1:21 am #

After training what variables are required for prediction
Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 2 array(s), but instead got the following list of 1 array:

Reply
- Jason Brownlee February 18, 2020 at 6:22 am #
  
  Sounds like you need to provide 2 arrays to your model.
  
  Reply
  - Amjad February 19, 2020 at 2:03 am #
    
    thanks. I figured that out.
    This model is guessing the last three elements of the sequence i.e it is just copying the last three elements
    
    Reply
syed Rafee March 7, 2020 at 6:24 am #

Hi how can i save and load the model for inference in later?

Reply
- Jason Brownlee March 7, 2020 at 7:21 am #
  
  See this tutorial:
  https://machinelearningmastery.com/save-load-keras-deep-learning-models/
  
  Reply
syed Rafee March 10, 2020 at 8:23 am #

Hi

Is there anyway you can add attention mechanism in this seq2seq model?

Reply
- Jason Brownlee March 10, 2020 at 1:37 pm #
  
  Yes. You can add an attention layer. Sorry, I don’t have a modern example.
  
  Reply
Percep March 11, 2020 at 6:32 pm #

if i want predict the next sequence of six numbers. what should i do?

Reply
- Jason Brownlee March 12, 2020 at 8:43 am #
  
  call model.predict()
  
  Perhaps this will help:
  https://machinelearningmastery.com/make-predictions-long-short-term-memory-models-keras/
  
  Reply
Lucy March 14, 2020 at 11:36 pm #

How does seq2seq do multivariate prediction，like weather prediction

Reply
- Jason Brownlee March 15, 2020 at 6:14 am #
  
  It can be used for that purpose. See examples here:
  https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
  
  Reply
drishty April 16, 2020 at 12:51 am #

Hi Jason,

Can we also use encoder decoder model for univariate data and give infinite output length also? For example giving input as 50 data points and output as 1000 data point? and will this work and give good results? If not, what will be the best way to train encoder and decoder model for seq to seq univariate prediction with arbitrary input and output length?

Thank you.

Reply
- Jason Brownlee April 16, 2020 at 6:01 am #
  
  Yes.
  
  Good results depend on the specifics of the data and the model.
  
  Reply
drishty April 16, 2020 at 5:13 pm #

can you give me any example where you have trained the model with infinite output length of data points. My data is basically a univariate data of temperature program

Reply
- Jason Brownlee April 17, 2020 at 6:15 am #
  
  I don’t have an example, but the above model can achieve this.
  
  Reply
  - drishty April 20, 2020 at 4:50 pm #
    
    But isn’t it that the above model is used for categorical or classification task?
    
    Reply
    - Jason Brownlee April 21, 2020 at 5:46 am #
      
      It is a sequence prediction task.
      
      Reply
      - drishty April 21, 2020 at 7:14 pm #
        
        But I am not able to understand what is the categorical here for and cardinality?? and why you did one hot encoding instead of scaling the data
      - Jason Brownlee April 22, 2020 at 5:52 am #
        
        This is a contrived problem of modeling sequences of numerical characters.
      - drishty April 27, 2020 at 7:27 pm #
        
        and can this model also be used for regression seq-seq task ? if yes, how can I use it?
      - Jason Brownlee April 28, 2020 at 6:44 am #
        
        Perhaps try it and see?
John April 19, 2020 at 11:12 am #

Hi Jason!
Can I use seq2seq model for multi-step time series forecasting?

Reply
- Jason Brownlee April 19, 2020 at 1:17 pm #
  
  Sure, see the examples here:
  https://machinelearningmastery.com/how-to-develop-lstm-models-for-multi-step-time-series-forecasting-of-household-power-consumption/
  
  Reply
  - John April 20, 2020 at 5:54 pm #
    
    Thank you for your reply!
    But why the encoder-decoder in your link doesn’t use the decoder_inputs or inference model like in this post, instead of using the repeatvector() directly?
    
    Reply
    - Jason Brownlee April 21, 2020 at 5:49 am #
      
      There are two main approaches to the model, one reuses the internal state, the the other uses an auto-encoder structure.
      
      I prefer the latter as its much simpler and generally as effective.
      
      Reply
Sampath April 20, 2020 at 12:44 am #

Hi Jason,
Greetings,

Currently i am a java developer with 4 years of experience. I would like to move A&I.

But i am getting little bit confused to start or not.

Could you please guid me how can i proceed.

please send mail : peyyilasampath@gmail.com

Reply
- Jason Brownlee April 20, 2020 at 5:27 am #
  
  Start here:
  https://machinelearningmastery.com/start-here/#getstarted
  
  Reply
Abdessalem April 26, 2020 at 8:01 am #

What is the difference between the approach you took here, and the approach in this tutorial: https://machinelearningmastery.com/how-to-develop-lstm-models-for-multi-step-time-series-forecasting-of-household-power-consumption/

From what I understood here we train the model to predict only one step into the future, then using recursion we predict multiple steps, meanwhile the the lesson that I’ve linked you too, it trains the model to predict as many times as RepeatVector(n)

Is my understanding correct ?

Reply
- Jason Brownlee April 27, 2020 at 5:22 am #
  
  They are both encoder-decoder models but the linked model uses a simpler auto-encoder based architecture, whereas the example in this post is more complex but matches the original paper.
  
  Reply
Karthik S May 7, 2020 at 11:19 am #

Thank you for sharing. This is pretty comprehensive. I wonder how you write so many blogs on so many topics being a single person. Really helpful to understand the models!

Reply
- Jason Brownlee May 7, 2020 at 11:51 am #
  
  Thanks!
  
  This is a common question that I answer here:
  https://machinelearningmastery.com/faq/single-faq/how-do-you-write-so-much
  
  Reply
Law May 8, 2020 at 3:06 am #

Thanks Jason for this presentation, please kindly help out, im not sure i know how to give new input sequence to the model without having to retrain it.

Reply
- Jason Brownlee May 8, 2020 at 6:39 am #
  
  This tutorial will show you how to make predictions generally:
  https://machinelearningmastery.com/how-to-make-classification-and-regression-predictions-for-deep-learning-models-in-keras/
  
  This tutorial will show you how to make predictions with LSTMs:
  https://machinelearningmastery.com/make-predictions-long-short-term-memory-models-keras/
  
  Reply
Phil May 14, 2020 at 1:06 am #

Hi Jason. Great tutorial!

I modified the example for my application, it is fixing an input series of events each with a start and end point in seconds. The output sequence is a list of ‘fixes’ needed to repair the input sequence. The features/categorical values were high (upto 3000) so I converted/condensed the data into different formats first to experiment example shape below.

Test train split (198500, 164) (500, 164) (198500, 82) (500, 82) (198500, 82) (500, 82)
to_categorical (198500, 164, 121) (500, 164, 121) (198500, 82, 52) (500, 82, 52) (198500, 82, 52) (500, 82, 52)

It works ok, but I think the output sequence is so long that there is often a few mistakes in there which has a big outcome on the results, as they all go out of sync.

Is there a way to skip the one_hot_encoding and feed the integer values straight in? I tinkered but it seems to be always expecting 3 dimensions in.

Then I can try the raw data. At the moment the arrays would take up too much ram at higher categorical levels.

Thanks

Reply
- Jason Brownlee May 14, 2020 at 5:53 am #
  
  Well done.
  
  Yes, you can use a sparse cross entropy loss and use target integers directly instead of one hot. For input, you can use an embedding and pas integers directly in.
  
  Reply
  - Phil May 14, 2020 at 9:43 pm #
    
    Thanks Jason. Can you use the raw integers for output (y) also? Or do they always need to be one hot encoded? I updated the model and it fits ok, but having issues with the predict function.
    
    I had to make the decoder dense layer = n_units (128) rather than features which is now 1, so within the predict function it goes wrong because I now have 1 feature and it is expecting more. I.e the decoder is.
    
    “expected input_67 to have shape (82, 1) but got array with shape (82, 128)”
    
    Reply
    - Jason Brownlee May 15, 2020 at 6:01 am #
      
      One hot encoded allows the model to perform better. Use argmax to convert to integers:
      https://machinelearningmastery.com/argmax-in-machine-learning/
      
      Reply
Gagan May 24, 2020 at 6:55 am #

When saving the model for future use, I understand the train will need to be saved. But infenc and infdec need to be saved to make predictions as well right?

Reply
Philip June 1, 2020 at 7:34 pm #

Hi Jason.

I am trying to add a masking layer to this example, as my sequences are all variable length padded with zeros at the end. It still works with no masking, but would like to see what difference masking makes to the output. Do you have any good tutuorials covering masking?

I tried the below code, however it makes no difference to the output, but increases training time by about x4.

def define_models_mask(n_in, n_out, n_units):
# define encoder
enc_inputs = Input(shape=(None, n_in))
encoder_inputs = Masking(mask_value=0)(enc_inputs) #****** TEST *****
encoder = LSTM(n_units, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]

# define decoder
dec_inputs = Input(shape=(None, n_out))
decoder_inputs = Masking(mask_value=0)(dec_inputs) #****** TEST *****
decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(n_out, activation=’softmax’)
decoder_outputs = decoder_dense(decoder_outputs)
model = Model([enc_inputs, dec_inputs], decoder_outputs)

# define inference encoder
encoder_model = Model(enc_inputs, encoder_states)

# define inference decoder
decoder_state_input_h = Input(shape=(n_units,))
decoder_state_input_c = Input(shape=(n_units,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([dec_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

# return all models
return model, encoder_model, decoder_model

Reply
- Jason Brownlee June 2, 2020 at 6:12 am #
  
  This may help:
  https://machinelearningmastery.com/handle-missing-timesteps-sequence-prediction-problems-python/
  
  Reply
Sayddaa June 12, 2020 at 12:54 pm #

(100000, 1, 6, 51) (100000, 1, 3, 51) (100000, 1, 3, 51)
—————————————————————————
ValueError Traceback (most recent call last)
in ()
95 print(X1.shape,X2.shape,y.shape)
96 # train model
—> 97 train.fit([X1, X2], y, epochs=1)
98 # evaluate LSTM
99 total, correct = 100, 0

2 frames
/usr/local/lib/python3.6/dist-packages/keras/engine/training_utils.py in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
133 ‘: expected ‘ + names[i] + ‘ to have ‘ +
134 str(len(shape)) + ‘ dimensions, but got array ‘
–> 135 ‘with shape ‘ + str(data_shape))
136 if not check_batch_axis:
137 data_shape = data_shape[1:]

ValueError: Error when checking input: expected input_1 to have 3 dimensions, but got array with shape (100000, 1, 6, 51)

This is showing when I run this code

Reply
- Jason Brownlee June 12, 2020 at 1:39 pm #
  
  I’m sorry to hear that, this may help:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
  Reply
Hai Yisha June 12, 2020 at 4:54 pm #

As always, great post !

one thing to add is, I usually want to output some sequences with different length, (eg: a simple conversation model), so what I am usually doing is, add an feature on target data. so, n_features will be:
n_features = 50 + 2
0 is in target data,
1~50 are real features
51 will be

target_in = [0] + target[:-1]
will be
target_in = [0] + target

target will be:
target = target + [51]

not sure if this is the right idea though.

plus, attention layer is implemented in keras. I am having hard time in implementing it on this seq2seq method…

Reply
Atefeh June 16, 2020 at 7:15 pm #

Hello Mr Jason
thank you for your great posts.

my problem is to predict pen trajectory from a handwritting character image.
i have a feature vector extracted from the image which its size is (1,4096),(as my input vector).
i have a x and y coordinate vector which is the pen trajectory for the image,that its size is (1,20),(as my output vector).
the elements of these two vectors, have different values.I mean the elements of the output vectors are integers but the input vectors are real numbers or decimal.
i have 2000 train input vectors.
I am looking for an algorithm that after training it,for a specific input feature vector (1,4096),it could predict the output vector(1,20),pen trajectory coordinates.
can i use your “Encoder-Decoder Model for Sequence-to-Sequence Prediction”?!
it dose not matter that my input and output values are not in a same type?!

Reply
- Jason Brownlee June 17, 2020 at 6:21 am #
  
  That sounds like a fun project!
  
  I recommend first checking the literature for similar projects, then test a suite of models in order to discover what works well/best for your specific dataset.
  
  Reply
Asko June 18, 2020 at 9:27 pm #

Hi Jason,
Thank you for sharing this, I have two questions

1- can we use seq2seq for human activity recognition?
2- have you used attention with seq2seq ?

Thank you

Reply
- Jason Brownlee June 19, 2020 at 6:13 am #
  
  Yes.
  
  No.
  
  Reply
Philip Maxwell June 25, 2020 at 9:09 pm #

Hi Jason.

The purpose of the X_shifted input seems to be just to let the model where to start in the seq e.g. the first value. Why then do we feed in an exact shifted copy of X, and not just some sort of mask e.g. 0,1,1,1,1,1,1, etc?

Is there any other purpose to the X shifted input as the model works on through the seq past the first value? In my input sequences there are setions with a lot of repetative values, so not sure if this will cause issues.

Thanks
Phil

Reply
- Jason Brownlee June 26, 2020 at 5:34 am #
  
  Perhaps this will help in understanding how we represent sequence prediction as a supervised learning problem:
  https://machinelearningmastery.com/time-series-forecasting-supervised-learning/
  
  Reply
  - Philip Maxwell June 26, 2020 at 7:28 am #
    
    Thanks Jason, ill have a proper read of that tomorrow.
    
    Phil
    
    Reply
Giovanni Coral July 2, 2020 at 11:51 pm #

Hi Jason, I’m working on an autoencoder for anomaly detection. I’ve tried your other tutorial on lstm autoencoders but it doesn’t fit the data (I get only the mean value as reconstructed signal). However if use only the “train” model of this tutorial as autoencoder (so [X,X] as input encoder and input decoder respectively) the model works perfectly as an autoencoder, recostructing a simplified signal. But if I run the inference model it doesn’t reconstruc the signal. Why does this happen? Can I use only the “train” model as an autoencoder?

Reply
- Jason Brownlee July 3, 2020 at 6:18 am #
  
  We train an autoencoder and the decoder part of the model after training can be used to encode data.
  
  Reply
Natasha July 29, 2020 at 8:50 am #

Hi Jason! Thank you so much for your blog posts! I trained seq2seq network on one batch with 638 each with dimensions: 26*30 and predicts a sequence of 7 numbers using ‘mse’ as the loss function. However, when I calculate the mse loss and compare it to the one returned by history[‘loss’] they differ by ~0.2. Do you know how ‘mse’ is calculated for this input/output shape?

Reply
- Jason Brownlee July 29, 2020 at 1:39 pm #
  
  Yes, the loss reported during training is an average across batches in an epoch. I would recommend trusting your own loss calculations.
  
  Reply
Mairon August 5, 2020 at 10:47 pm #

Hi Jason, thank you for you great posts.
I am not sure if I am getting something wrong here, but I run the complete code of this example and got an error.
The shape of the dataset returned by method get_dataset is (n_samples, 1, 3, 51) instead of (n_samples, 3, 51), which I think is the expected one to the model.

Then I get the error “expected input_9 to have 3 dimensions, but got array with shape (2, 1, 6, 51)”. I just copy and paste your code. What am I missing?

Reply
- Jason Brownlee August 6, 2020 at 6:14 am #
  
  Perhaps confirm that your tensorlfow and keras are up to date?
  
  Reply
Atefeh August 11, 2020 at 2:04 am #

Hello Mr jason

I ask you to guide me for a code that it can predict an output sequence from an input sequence.
for example, for every sample we have:
X(input)={1 ,2, 3, …,10} , Y(output) ={ 80,10,64,79,60}
that the output sequence type is completely different from the input sequence (X).
our input sequence is the feature vector extracted from an image in a specified class.
and the output sequence for every input sample is the place of a spesific object in an image from a specified class.

also i have a little database which it contains just 4000 sample.
can i use one-shot learning algorithm for training the Lstm network.
would you please guide me a code to implement one shot for sequence to sequence Lstm?

I really thank you for your help.

Reply
- Jason Brownlee August 11, 2020 at 6:35 am #
  
  Perhaps start with the simple models here and adapt them for your project:
  https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
  
  Reply
Pierre-Henri SIMON August 12, 2020 at 11:50 pm #

Hi,

Thanks for all this material !

In few words what is the difference/advantages between the implementation above (dealing with state) and the simple one you propose as well here with RepaetVector layer to make the connection possible with seq of differente sizes:https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/ ?

BR.

Reply
- Jason Brownlee August 13, 2020 at 6:16 am #
  
  This example is simple and efficient and is based on an LSTM autoencoder.
  
  The other example is a lot more complex and is based on a dynamic rnn and a is a closer match to the original paper.
  
  Both appear to have similar performance in practice in my experience.
  
  Reply
Dan August 15, 2020 at 4:07 am #

I think there’s an isssue with the dimensions of the generated data.

The example code doesn’t run (for me anyway), because the data is four dimensional rather than three dimensional. I fixed it by taking the first element of src, tar and tar2 encoded. Alternatively, call to_categorical with ‘source’ instead of ‘[source]’

Took me a while to figure out 😉

Reply
- Jason Brownlee August 15, 2020 at 6:36 am #
  
  Interesting, are your libs up to date?
  
  Reply
Dan August 15, 2020 at 7:24 am #

Not sure about my libs. Keras version is 2.4.3, Tensorflow is 2.3.0

This issue is not new though. I had a look through the comments, it’s been reported as far back as Nov. 30 2017 (comment by ‘Python’).

‘Kadirou’ reports the same problem, April 30, 2018.

Gary (April 30 2018) offers exactly what I suggested:
[quote]change this:

src_encoded = to_categorical([source], num_classes=cardinality)[0]
tar_encoded = to_categorical([target], num_classes=cardinality)[0]
tar2_encoded = to_categorical([target_in], num_classes=cardinality)[0][/quote]

There are other instances of the same issue in the comments section, even fairly recent: see Mairon, August 5 2020.
[quote]Then I get the error “expected input_9 to have 3 dimensions, but got array with shape (2, 1, 6, 51)”. I just copy and paste your code. What am I missing?[quote]

Upshot:

The example code (in the call to ‘to_categorical’) wraps/may wrap an addittional array/list around the generated training examples where this is not expected by the model, and the code doesn’t run as a result.

I don’t know why or if it’s a platform issue. I think we can agree it’s not desirable behaviour for example code though.

Reply
- Jason Brownlee August 15, 2020 at 1:23 pm #
  
  Thanks Dan. Perhaps the API changed since the code was written and has not been updated to reflect the change.
  
  Reply
Dominique August 19, 2020 at 8:03 pm #

Dear Jason,

I have just finished your book “”Long Short Term Memory Networks With Python” and I have published a review here: http://questioneurope.blogspot.com/2020/08/long-short-term-memory-networks-with.html

Thanks for your very valuable knowledge that you share in your book.

Kind regards,
Dominique

Reply
- Jason Brownlee August 20, 2020 at 6:40 am #
  
  Well done on your progress, very impressive Dominique!
  
  Reply
Hamed September 24, 2020 at 9:49 am #

Dear Jason,
I have a problem and don’t know how to use LSTM for that.
The problem is arranging a set of actions, such as {B,D,F,A}, to the proper arrangement, i.e. (A,B,C,F). I have a database that includes the true arrangemetn of actions. Which architecture of the LSTM can I use?

Reply
- Jason Brownlee September 24, 2020 at 10:44 am #
  
  Perhaps you can use a seq2seq model with an encoder-decoder architecture. Try it and see if it is effective.
  
  Reply
  - Hamed September 24, 2020 at 12:53 pm #
    
    I used a seq2seq model as you described here:
    https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/
    But the accuracy is so low (around 50%). I used multiple layers and also increased epoches, but didn’t work out. Is there any other model that can be effective?
    
    Reply
    - Jason Brownlee September 24, 2020 at 2:37 pm #
      
      Perhaps try some of the techniques here to improve model performance:
      https://machinelearningmastery.com/start-here/#better
      
      Reply
najla November 13, 2020 at 8:40 am #

when a turn the code i have this error:
ValueError: Input 0 of layer lstm_2 is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: [32, 1, 6, 51]

Reply
- Jason Brownlee November 13, 2020 at 9:06 am #
  
  Sorry to hear that, this may help:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
  Reply
Preeti November 18, 2020 at 1:03 am #

Hello Jason, I want to know what is the significance of using X2 in this model. I didnt get this clear in the blog.

Reply
- Jason Brownlee November 18, 2020 at 6:43 am #
  
  See the end of section “Encoder-Decoder Model in Keras” where we explain how data is prepared and provided to the model.
  
  Reply
Andrija December 6, 2020 at 2:57 am #

Hi Jason.
I’ve been struggling with the sek2sek problem for a while.
I have multi feature input of timeseries, about 10 features(less after PCA, but it is not in main focus of the question) and I need to predict sequence of (0, 1) values as output.

Let’s say my data look like this:
A, B y
0.3 0.3 0
0.4 0.1 0
0.2 0.2 0
0.7 0.5 1
. . 1
. . .

For sake of simplicity I want to predict sequence of 2 outputs based on 3 inputs.

From the example above one training example has shape of :
(3, 2), 3=input_size, 2=num_of_features and it should predict 2 values ex. [1, 0]

I know how to make windows, reshape inputs, build encoder and decoder, train model, test model…

The main problem is what should I do with discrete output?

Maybe I can try to observe it as single number [0, 1] (binary) = 2
Maybe I can try to make some embedding of my output.
Maybe I can just convert output to float.

Reply
- Jason Brownlee December 6, 2020 at 7:07 am #
  
  Perhaps test a suite of different algorithms and algorithm configurations in order to discover what works well or best. This may help with examples to start with:
  https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
  
  It sounds like your problem is a binary classification, ensure you have a sigmoid activation function in the output layer and use binary cross entropy loss
  
  Reply
Luis Cordero December 29, 2020 at 8:44 am #

Hello, I have a problem that I thought could be solved by an autocoder-decoder, the problem is that I don’t know if it is possible to start with a set of carctarisricas(18) and the latent space between autoencoder-decoder can be a matrix of dimension nxm, then use the decoder to rebuild the features

Could you tell me your opinion about it? it’s possible?

Reply
- Jason Brownlee December 29, 2020 at 9:24 am #
  
  Not sure I follow, but my general advice is to try it and see – e.g. develop a small prototype and use it to learn more about how to frame the problem and what might work.
  
  Reply
Adam January 5, 2021 at 7:28 pm #

Hi Jason,
How can I improve the runtime of the network? In your example there are 2 for loops that take too much time.

Reply
- Jason Brownlee January 6, 2021 at 6:26 am #
  
  Good question, here are some ideas:
  https://machinelearningmastery.com/faq/single-faq/how-do-i-speed-up-the-training-of-my-model
  
  Reply
TTran January 16, 2021 at 2:36 am #

Hi Jason,
thank you for your posts and books. how would you modify the encoder-decoder if my input sequences consist of multiple features? For example, I have time series of temperature, pressure, humidity as input?
Thanks,

Reply
- Jason Brownlee January 16, 2021 at 6:59 am #
  
  You’re welcome.
  
  The model supports multiple features directly, perhaps this will help:
  https://machinelearningmastery.com/faq/single-faq/what-is-the-difference-between-samples-timesteps-and-features-for-lstm-input
  
  Reply
TTran January 29, 2021 at 12:48 am #

Jason,
another question if you don’t mind. I noticed you used teacher-forcing by shifting the target sequence by one time step. How would you modify the code to either removing teacher-forcing and using teacher forcing only, say 50% of the training steps?
Thank you

Reply
- Jason Brownlee January 29, 2021 at 6:05 am #
  
  You can remove teacher forcing by feeding predictions in as inputs on subsequent samples/time steps instead of true values.
  
  Perhaps this will help:
  https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/
  
  Reply
DP February 25, 2021 at 2:43 am #

Hello, I see all the helpful answers you are providing in relation to seq2seq models… There is one question I have that nobody seems to be able to show a practical way to do… In the Keras examples https://keras.io/examples/nlp/lstm_seq2seq/ (pretty similar to yours), it has a function decode_sequence that decodes 1 sequence at a time (like your example)… I want to be able to ‘batch’ predict the seq2seq as 1 by one is ‘very’ slow’, whereas the model can quickly be built and evaluate in batch very quickly. The goal would be to provide a list of the training examples that are not correct predicted, again in a batch rather than 1 by 1. Is it even possible to do this? If so, what would the changes in the decode_sequence routine look like to handle batches? Thanks in advance.

Reply
- Jason Brownlee February 25, 2021 at 5:36 am #
  
  I don’t believe that model can operate in the way you want.
  
  You must simplify the model to an autoencoder-based encoder-decoder and then you can operate in batches. I have tens of examples of this on the blog – you can use the search box to find them, perhaps start here:
  https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
  
  Reply
mike gt March 18, 2021 at 11:56 pm #

Hi Jason

when i run the code, in my notebook, i often get warning message, about retracing :

WARNING:tensorflow:11 out of the last 11 calls to <function Model.make_predict_function..predict_function at 0x000001B88FDEA268> triggered tf.function retracing

Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing.

Yes i can ignore but is make slow, any help to avoid this 😀

Thanks
Mike

Reply
- Jason Brownlee March 19, 2021 at 6:23 am #
  
  I believe you safely can ignore that warning for now.
  
  Reply
mike gt March 27, 2021 at 12:28 am #

Hi Jason.

Your rock 😀

Intuitively what is the benefit of this encoder-decoder model, I mean if compare using normal
LSTM/RNN design when predicting sequence?
Is it more accurate? sequence prediction using this architecture?

Thanks
Mike

Reply
- Jason Brownlee March 29, 2021 at 5:53 am #
  
  Thank you!
  
  The architecture lets you handle problems where the output sequence has a different length from the input sequence.
  
  Reply
Koen S April 7, 2021 at 6:15 am #

Thanks for this great article Jason! Very informative.

When running this code without making any changes I get the following error when trying to train the model: “ValueError: Input 0 is incompatible with layer model: expected shape=(None, None, 51), found shape=(32, 1, 6, 51)”

It seems the “to_categorical” function adds an empty dimension to the generated array, resulting in a shape of: (100000, 1, 6, 51). I managed to remove this dimension and successfully train the model with the following code added after line 29 to the “get_dataset” function:

# reshape
src_encoded = np.reshape(src_encoded,(n_in, cardinality))
tar_encoded = np.reshape(tar_encoded,(n_out, cardinality))
tar2_encoded = np.reshape(tar2_encoded,(n_out, cardinality))

There might be a way to do this more efficiently. I’m no Python hero 😉

Reply
- Jason Brownlee April 8, 2021 at 5:04 am #
  
  You’re welcome.
  
  Thanks for sharing!
  
  Reply
- jamie January 13, 2022 at 7:34 pm #
  
  you’re the real mvp
  
  Reply
Koen S April 13, 2021 at 5:59 am #

Dear Jason, do you know the reasoning behind choosing a ‘non informative’ start of sequence character for the decoder like 0, instead of using the last previous known value as a starting point?

Lets assume I’m trying to predict the path of a rolling ball, wouldn’t it make sense to provide the last vector of the ball in addition to the encoded state instead of a generic start index?

Reply
- Jason Brownlee April 13, 2021 at 6:12 am #
  
  Yes, it may not make sense in your case.
  
  Reply
  - Koen S April 14, 2021 at 2:30 am #
    
    Thanks for your reply!
    
    Reply
    - Jason Brownlee April 14, 2021 at 6:28 am #
      
      You’re welcome.
      
      Reply
Siam April 19, 2021 at 6:48 pm #

Can I use it as the next basket product prediction? Suppose

order 1 contains items[3, 4, 6]

order 2 contains items [4, 5, 1]

order 3 contains items [10, 29, 3, 1]

Now Want to find out the next 4th no order items[??????]. is it the right approach?

Reply
- Jason Brownlee April 20, 2021 at 5:56 am #
  
  Perhaps each order can be modelled as one sample with a sequence of items.
  
  Reply
Vivek Kumar Kandeyang April 23, 2021 at 4:17 am #

LSTM(n_units)
TensorFlow website mentioned that n_units refer dimensionality of the output vector space ( i.e. the number of feature for the hidden state ) and it does not refers to the number of LSTM unit created.
By default the no. of LSTM cell units is equal to the length of input Sequence

Reply
- Jason Brownlee April 23, 2021 at 5:06 am #
  
  Okay.
  
  Reply
Andrea July 2, 2021 at 6:33 pm #

Hi, thanks for you tutorial. I have a question: if I train the model with a seq length of, let’s say, 600 and then use it for variable lengths sequences (e.g., 300, 1200) how my model is doing inference?

Also, how should I choose the training seq length and what a Bidirectional LSTM is learning from these sequences??

Reply
- Jason Brownlee July 3, 2021 at 6:09 am #
  
  Perhaps experiment and discover what works well or best for your dataset.
  
  Reply
jianshun July 4, 2021 at 7:26 pm #

Hi Jason,

What if the my input sequence is a set of coordinates say [(3,0), (1,2), (4,5),(9,10),…] and my target sequence is a set of binary array say [0,1,1,0,…].

In this case, each time-step consists of 2 integers; (3, 0). In this case, how can i one-hot encode my input sequence?

Thanks

Reply
- Jason Brownlee July 5, 2021 at 5:08 am #
  
  Each integer would be encoded and the resulting vectors would be concatenated as input to the model.
  
  Reply
jianshun July 8, 2021 at 3:33 pm #

Thanks Jason!

Another question. I have variable sequence length for both input and output. My output is a binary array of 0s and1s.

The target has to be one time step ahead of the decoder input so i added a zero in front of the binary array. I also padded the binary array with zeros in order to create a fixed length length to the network.

Since my output is a binary arrays of 0s and1s, will the added zeros (starting value and padding values) confuse the network? Any suggestions?

Thank you very much.

Reply
- Jason Brownlee July 9, 2021 at 5:03 am #
  
  Perhaps compare results with the current approach to alternatives, such as using a different token to mark the start or end of the sequence.
  
  Reply
Rob July 21, 2021 at 8:52 am #

Hi!
Thank you for this tutorial. It was highly informative.

I have a question on the “predict_sequence” function:

Why do you use a target_seq which is all zeroes? I would have expected an initial target sequence of [1,0, 0…, n_features], which should represent a 0 in the decoded sequence. Am I missing something?

Thank you for everything, this site is amazing.

Reply
- Jason Brownlee July 22, 2021 at 5:34 am #
  
  You’re welcome.
  
  From memory, I think we feed in zeros at the start and then feed in the last outputs as inputs to prime the prediction of the next step.
  
  I believe zero is reserved for unknown/special, or we can teach the model to prime the process with anything we want – that would be my guess.
  
  Reply
  - Rob July 22, 2021 at 5:35 pm #
    
    Thank you for your answer. I’m still a little confused.
    
    You write:
    “We will use the value of 0 as the padding or start of sequence character, therefore it is reserved and we cannot use it in our source sequences. To achieve this, we will add 1 to our configured cardinality to ensure the one-hot encoding is large enough (e.g. a value of 1 maps to a ‘1’ value in index 1).”
    
    So if “0” is the , shouldn’t it be one-hot encoded too?
    “0” should be [1, 0, 0, …, 0].
    
    This is confusing because in the inference decoder you use an initial input character which is all zeroes [0, 0, 0, …, 0] instead of [1, 0, 0, …, 0]. Shouldn’t we feed the symbol to the decoder to start the inference?
    
    Reply
    - Jason Brownlee July 23, 2021 at 5:57 am #
      
      We could, but in this case, we chose to feed in a zero to start inference.
      
      Perhaps experiment with other reserved characters and see if it makes a difference.
      
      Reply
      - Rob July 23, 2021 at 6:46 pm #
        
        Ok thank you. I will experiment
      - Jason Brownlee July 24, 2021 at 5:13 am #
        
        You’re welcome.
jian September 11, 2021 at 11:19 am #

Hi Jason,

Is the decoder in your code being implemented as a Greedy Search decoder? If we want to use a beam search decoder, which part of the code should we change? Or is this question not relevant to this example? Thank you so much.

Reply
- Adrian Tam September 14, 2021 at 1:07 pm #
  
  If you’re interested in beam search decoder, this may help: https://machinelearningmastery.com/beam-search-decoder-natural-language-processing/
  
  Reply
Mary September 28, 2021 at 1:59 am #

Hi Jason,

I used an encoder-decoder model for generating summarization news, but the predicted sequence is like this:

actual: [[‘startseq as of thursday facebook allows users to edit comments rather than retype them each comment will show its editing history in a dropdown menu to give users context editing will be rolled out to users gradually over the next few days endseq’]]
predicted: [‘startseq the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the’]

As you can see it generated a lot of “the”, have you ever encountered such a problem?
do you have any suggestion?

Reply
- Adrian Tam September 28, 2021 at 9:44 am #
  
  That happens when your model is not well-trained, or when the model is too simple for this job so it cannot hold the correct state to produce the sentence.
  
  Reply
Mohamed Abdelwahab February 3, 2022 at 6:52 am #

In the line:
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

what’s the reason for adding decoder_states_inputs and decoder_states to the input and output, respectively? It may make sense for the input (to provide the initial state), but what’s the point in doing so with the output?

Also, does doing so ensure that the inference decoder_model will use the learned parameters in the decoder part of the trained model (the training decoder)? What I understand is that to use learned parameters from one model to another, we have to use the exact same input and output of the trained model, so that they are linked to the same computer graph (I’m not sure of the correctness of my understanding though)

Reply
- James Carmichael February 17, 2022 at 1:38 pm #
  
  Hi Mohamed…The following is a great resource for the practical application of encoder-decoder architectures for seq2seq prediction:
  
  https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346
  
  Reply
MUNA April 1, 2024 at 4:00 pm #

Can you please advise why am I getting these errors:

retval_ = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
^^^^^
ValueError: in user code:

ValueError: Input 0 of layer “model” is incompatible with the layer: expected shape=(None, None, 51), found shape=(32, 1, 6, 51)

Reply
- James Carmichael April 2, 2024 at 9:00 am #
  
  Hi Muna…Did you copy and paste the code or type it in? Also, you may wish to try your implemetation in Google Colab.
  
  Reply
Tariq A April 5, 2024 at 1:06 pm #

Hi Jason, Thanks for another amazing and very helpful blog!

My question is, can we use this method for predicting random sequences that are not the part of input?

For example, I have a csv file with random sequences. I give first 100 values as input and I wish to predict the next 20 (lets say). In such a case, will I need to replace the target vector with these ‘next 20’ values? And if so, will this be the value of X_target? or y?

Reply
- James Carmichael April 7, 2024 at 7:17 am #
  
  Hi Tariq…When explaining to your supervisor why a hybrid model combining CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) networks yields the best results for a skin cancer binary classification task, it’s important to discuss both the theoretical and practical aspects of your model architecture. Here’s a structured way to present your explanation academically:
  
  ### 1. **Explanation of the CNN Component**
  
  Start by discussing the role of the CNN layers in your hybrid model:
  
  – **Feature Extraction**: CNNs are highly effective in image recognition tasks because they can automatically detect and learn the most important features without any need for manual feature extraction. In the context of skin cancer images, CNN layers can effectively learn to identify key features such as shapes, edges, textures, and color patterns that are indicative of healthy vs. not healthy skin conditions.
  
  – **Spatial Hierarchy**: CNNs process images through multiple layers, capturing various levels of abstraction. Lower layers might capture generic features like edges and textures, while deeper layers can learn more complex patterns specific to the types of skin lesions present in your dataset.
  
  ### 2. **Explanation of the LSTM Component**
  
  Next, articulate the benefits of incorporating LSTM layers:
  
  – **Sequential Information Processing**: Although less traditional in pure image processing tasks where data is not inherently sequential, LSTMs can be extremely useful if the images can be treated as sequences — either as sequences of rows of pixels or patches of images, or if temporal or order-related information is relevant (such as in video frames or multi-angle shots in medical imaging).
  
  – **Contextual Data Handling**: LSTMs are designed to handle long-term dependencies in sequence data. By using an LSTM, your model might be capturing not just individual features but also the context or spatial relationships among features across the image.
  
  ### 3. **Integration Benefits of CNN and LSTM**
  
  Discuss how integrating these two types of networks leverages the strengths of both:
  
  – **Hybrid Architecture Efficiency**: The hybrid model combines the local feature extraction capabilities of CNNs with the sequence processing power of LSTMs. This can be particularly beneficial if your images have inherent sequential patterns or if interpreting parts of an image relative to others in a sequence-like manner gives better insights, which traditional CNNs might miss.
  
  – **Enhanced Learning Capability**: By merging CNN and LSTM, the model can learn both spatial hierarchies through CNNs and temporal or contextual dependencies through LSTMs. This could lead to a more robust understanding of the images, which is critical in complex classification tasks like distinguishing between healthy and cancerous skin.
  
  ### 4. **Practical Outcomes and Validation**
  
  Finally, validate your architecture choice with practical outcomes:
  
  – **Performance Metrics**: Highlight the accuracy (90%) and compare it with the performance of standalone models like VGG16 or other CNN architectures. Discuss how metrics like sensitivity, specificity, and possibly the AUC score have improved with the hybrid model.
  
  – **Parameter Efficiency**: With a total of 2,432,231 parameters, explain how this configuration was optimal for achieving high accuracy while keeping the computational load reasonable. Compare it with the complexity and parameter count of other tested models.
  
  ### 5. **Academic Contextualization**
  
  Link back to academic theories or previous studies that support your architecture choice:
  
  – **Literature Support**: Cite studies or theories that support the use of hybrid models or the application of sequential models to image data, highlighting any similar successful applications.
  
  – **Innovative Application**: Position your model as an innovative approach in the field of medical imaging, particularly in dermatology, where traditional models might fall short.
  
  By structuring your explanation in this way, you not only justify your architectural choices but also underscore the theoretical and practical benefits of your approach, thereby providing a comprehensive answer to your supervisor’s query.
  
  Reply

Navigation

How to Develop an Encoder-Decoder Model for Sequence-to-Sequence Prediction in Keras

Tutorial Overview

Python Environment

Encoder-Decoder Model in Keras

Scalable Sequence-to-Sequence Problem

Encoder-Decoder LSTM for Sequence Prediction

Further Reading

Related Posts

Keras Resources

Summary

Develop LSTMs for Sequence Prediction Today!

Develop Your Own LSTM models in Minutes

Finally Bring LSTM Recurrent Neural Networks to
Your Sequence Predictions Projects

More On This Topic

390 Responses to How to Develop an Encoder-Decoder Model for Sequence-to-Sequence Prediction in Keras

Leave a Reply Click here to cancel reply.

Navigation

Tutorial Overview

Python Environment

Encoder-Decoder Model in Keras

Scalable Sequence-to-Sequence Problem

Encoder-Decoder LSTM for Sequence Prediction

Further Reading

Related Posts

Keras Resources

Summary

Develop LSTMs for Sequence Prediction Today!

Develop Your Own LSTM models in Minutes

Finally Bring LSTM Recurrent Neural Networks to Your Sequence Predictions Projects

More On This Topic

390 Responses to How to Develop an Encoder-Decoder Model for Sequence-to-Sequence Prediction in Keras

Leave a Reply Click here to cancel reply.

Finally Bring LSTM Recurrent Neural Networks to
Your Sequence Predictions Projects