How to Develop an Encoder-Decoder Model with Attention in Keras

By Jason Brownlee on August 27, 2020 in Long Short-Term Memory Networks

The encoder-decoder architecture for recurrent neural networks is proving to be powerful on a host of sequence-to-sequence prediction problems in the field of natural language processing such as machine translation and caption generation.

Attention is a mechanism that addresses a limitation of the encoder-decoder architecture on long sequences, and that in general speeds up learning and lifts the skill of the model on sequence-to-sequence prediction problems.

In this tutorial, you will discover how to develop an encoder-decoder recurrent neural network with attention in Python with Keras.

After completing this tutorial, you will know:

How to design a small and configurable problem to evaluate encoder-decoder recurrent neural networks with and without attention.
How to design and evaluate an encoder-decoder network with and without attention for the sequence prediction problem.
How to robustly compare the performance of encoder-decoder networks with and without attention.

Kick-start your project with my new book Long Short-Term Memory Networks With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Note May/2020: The underlying APIs have changed and this tutorial may no longer be current. You may require older versions of Keras and TensorFlow, e.g. Keras 2 and TF 1.

How to Develop an Encoder-Decoder Model with Attention for Sequence-to-Sequence Prediction in Keras
Photo by Angela and Andrew, some rights reserved.

Tutorial Overview

This tutorial is divided into 6 parts; they are:

Encoder-Decoder with Attention
Test Problem for Attention
Encoder-Decoder without Attention
Custom Keras Attention Layer
Encoder-Decoder with Attention
Comparison of Models

Python Environment

This tutorial assumes you have a Python 3 SciPy environment installed.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this post:

How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda

Encoder-Decoder with Attention

The encoder-decoder model for recurrent neural networks is an architecture for sequence-to-sequence prediction problems.

It is comprised of two sub-models, as its name suggests:

Encoder: The encoder is responsible for stepping through the input time steps and encoding the entire sequence into a fixed length vector called a context vector.
Decoder: The decoder is responsible for stepping through the output time steps while reading from the context vector.

A problem with the architecture is that performance is poor on long input or output sequences. The cause is believed to be the fixed-size internal representation used by the encoder.

Attention is an extension to the architecture that addresses this limitation. It works by providing a richer context from the encoder to the decoder, together with a learning mechanism that lets the decoder learn where to pay attention in that richer encoding when predicting each time step of the output sequence.
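
To make the idea concrete, the following is a minimal NumPy sketch of one decoding step of additive (Bahdanau-style) attention. It is illustrative only and is not the layer used later in this tutorial; the sizes, the random weights, and the names attention_context, W_a, U_a, and v_a are assumptions chosen for the example.

```python
import numpy as np

def attention_context(enc_outputs, dec_state, W_a, U_a, v_a):
    # score each encoder time step against the current decoder state
    scores = np.tanh(dec_state @ W_a + enc_outputs @ U_a) @ v_a
    # softmax turns the scores into attention weights that sum to 1
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # the context vector is the weighted sum of the encoder outputs
    context = weights @ enc_outputs
    return context, weights

rng = np.random.default_rng(1)
n_steps, enc_dim, units = 5, 8, 4                  # hypothetical sizes
enc_outputs = rng.normal(size=(n_steps, enc_dim))  # one vector per input time step
dec_state = rng.normal(size=units)                 # previous decoder hidden state
W_a = rng.normal(size=(units, units))
U_a = rng.normal(size=(enc_dim, units))
v_a = rng.normal(size=units)

context, weights = attention_context(enc_outputs, dec_state, W_a, U_a, v_a)
print(weights.shape, context.shape)
```

The sketch yields one weight per input time step and a context vector the same size as a single encoder output. A new context vector is computed for every output time step, which is what frees the decoder from relying on one fixed-length summary.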

For more on attention in the encoder-decoder architecture, see the posts:

Test Problem for Attention

Before we develop models with attention, we will first define a contrived scalable test problem that we can use to determine whether attention is providing any benefit.

In this problem, we will generate sequences of random integers as input and matching output sequences comprised of a subset of the integers in the input sequence.

For example, an input sequence might be [1, 6, 2, 7, 3] and the expected output sequence might be the first two random integers in the sequence [1, 6].

We will define the problem such that the input and output sequences are the same length and pad the output sequences with “0” values as needed.

First, we need a function to generate sequences of random integers. We will use the Python randint() function to generate random integers between 0 and a maximum value and use this range as the cardinality for the problem (e.g. the number of features or an axis of difficulty).

The function generate_sequence() below will generate a random sequence of integers to a fixed length and with the specified cardinality.

from random import randint

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-1) for _ in range(length)]

# generate random sequence
sequence = generate_sequence(5, 50)
print(sequence)

Running this example generates a sequence of 5 time steps where each value in the sequence is a random integer between 0 and 49.

[43, 3, 28, 34, 33]
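
Note that randint() includes both endpoints, which is why the function passes n_unique-1 as the upper bound. The quick check below is a sketch that draws many values and confirms they stay within 0 and n_unique-1; the fixed seed and the sample size are assumptions added here for repeatability.

```python
from random import randint, seed

# generate a sequence of random integers
def generate_sequence(length, n_unique):
    return [randint(0, n_unique-1) for _ in range(length)]

seed(1)  # fixed seed (an assumption) so repeated runs draw the same values
values = generate_sequence(10000, 50)
# randint(a, b) includes both a and b, so every value is a valid index 0..49
print(min(values) >= 0, max(values) <= 49)
```

Keeping the values inside this range matters for the next step, where each integer is used as an index into a vector of length n_unique.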

Next, we need a function to one hot encode the discrete integer values into binary vectors.

If a cardinality of 50 is used, then each integer will be represented by a 50-element vector of 0 values and 1 in the index of the specified integer value.

The one_hot_encode() function below will one hot encode a given sequence of integers.

from numpy import array

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)
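
As an aside, the same encoding can be produced without an explicit loop by indexing the rows of an identity matrix. This is only an alternative sketch; the name one_hot_encode_eye is hypothetical, and the rest of the tutorial keeps using the loop-based version.

```python
from numpy import array, eye, array_equal

# loop-based version from above
def one_hot_encode(sequence, n_unique):
    encoding = list()
    for value in sequence:
        vector = [0 for _ in range(n_unique)]
        vector[value] = 1
        encoding.append(vector)
    return array(encoding)

# alternative: row i of the identity matrix is the one hot vector for integer i
def one_hot_encode_eye(sequence, n_unique):
    return eye(n_unique, dtype=int)[sequence]

sequence = [3, 18, 32, 11, 36]
# compare the two encodings element by element
print(array_equal(one_hot_encode(sequence, 50), one_hot_encode_eye(sequence, 50)))
```

Both functions return an array with one row per time step and n_unique columns.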

We also need to be able to decode an encoded sequence. This will be needed to turn a prediction from the model or an encoded expected sequence back into a sequence of integers we can read and evaluate.

The one_hot_decode() function below will decode a one hot encoded sequence back into a sequence of integers.

from numpy import argmax

# decode a one hot encoded sequence
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]
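
Because argmax() simply picks the position of the largest value in each vector, the same function also works on the probability vectors that a softmax output layer produces, not only on exact 0/1 encodings. A small sketch with made-up probabilities follows (the values in probs are illustrative, not model output).

```python
from numpy import array, argmax

# decode a one hot encoded sequence
def one_hot_decode(encoded_seq):
    return [argmax(vector) for vector in encoded_seq]

# made-up probabilities for 3 time steps over 4 classes
probs = array([
    [0.1, 0.7, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.2, 0.2, 0.1, 0.5]])

# convert to plain Python ints for tidy printing
decoded = [int(i) for i in one_hot_decode(probs)]
print(decoded)
```

Each time step is decoded to the class with the highest probability, here classes 1, 0, and 3.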

We can test out these operations in the example below.

from random import randint
from numpy import array
from numpy import argmax

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-1) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded sequence
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# generate random sequence
sequence = generate_sequence(5, 50)
print(sequence)
# one hot encode
encoded = one_hot_encode(sequence, 50)
print(encoded)
# decode
decoded = one_hot_decode(encoded)
print(decoded)

Running the example first prints a randomly generated sequence, then the one hot encoded version, then finally the decoded sequence again.

[3, 18, 32, 11, 36]
[[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]]
[3, 18, 32, 11, 36]

Finally, we need a function that can create input and output pairs of sequences to train and evaluate a model.

The function below named get_pair() will return one input and output sequence pair given a specified input length, output length, and cardinality. Both sequences have the same length, the length of the input sequence, but the output sequence is taken as the first n_out integers of the input sequence and padded with zero values to the required length.

The sequences of integers are then one hot encoded and reshaped into the 3D format required by the recurrent neural network, with the dimensions: samples, time steps, and features. In this case, samples is always 1 as we are only generating one input-output pair, time steps is the input sequence length, and features is the cardinality of each time step.

# prepare data for the LSTM
def get_pair(n_in, n_out, n_unique):
	# generate random sequence
	sequence_in = generate_sequence(n_in, n_unique)
	sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
	# one hot encode
	X = one_hot_encode(sequence_in, n_unique)
	y = one_hot_encode(sequence_out, n_unique)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y

We can put this all together and demonstrate the data preparation code.

from random import randint
from numpy import array
from numpy import argmax

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-1) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded sequence
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# prepare data for the LSTM
def get_pair(n_in, n_out, n_unique):
	# generate random sequence
	sequence_in = generate_sequence(n_in, n_unique)
	sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
	# one hot encode
	X = one_hot_encode(sequence_in, n_unique)
	y = one_hot_encode(sequence_out, n_unique)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y

# generate random sequence
X, y = get_pair(5, 2, 50)
print(X.shape, y.shape)
print('X=%s, y=%s' % (one_hot_decode(X[0]), one_hot_decode(y[0])))

Running the example generates a single input-output pair and prints the shape of both arrays.

The generated pair is then printed in a decoded form where we can see that the first two integers of the sequence are reproduced in the output sequence followed by a padding of zero values.

(1, 5, 50) (1, 5, 50)
X=[12, 20, 36, 40, 12], y=[12, 20, 0, 0, 0]
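
get_pair() returns a single sample, so the first dimension is always 1. If a batch of several samples is wanted, the pairs can be stacked along that first axis. The helper below is a hypothetical sketch (get_dataset is not part of the original tutorial, which trains on one pair at a time); it reuses the functions defined above.

```python
from random import randint
from numpy import array, vstack

# generate a sequence of random integers
def generate_sequence(length, n_unique):
    return [randint(0, n_unique-1) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
    encoding = list()
    for value in sequence:
        vector = [0 for _ in range(n_unique)]
        vector[value] = 1
        encoding.append(vector)
    return array(encoding)

# prepare one input-output pair shaped [1, time steps, features]
def get_pair(n_in, n_out, n_unique):
    sequence_in = generate_sequence(n_in, n_unique)
    sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
    X = one_hot_encode(sequence_in, n_unique)
    y = one_hot_encode(sequence_out, n_unique)
    X = X.reshape((1, X.shape[0], X.shape[1]))
    y = y.reshape((1, y.shape[0], y.shape[1]))
    return X, y

# hypothetical helper: stack n_samples pairs into [samples, time steps, features]
def get_dataset(n_samples, n_in, n_out, n_unique):
    pairs = [get_pair(n_in, n_out, n_unique) for _ in range(n_samples)]
    X = vstack([p[0] for p in pairs])
    y = vstack([p[1] for p in pairs])
    return X, y

X, y = get_dataset(10, 5, 2, 50)
print(X.shape, y.shape)
```

With 10 samples, both arrays have the shape (10, 5, 50), which is the same 3D layout with a larger samples dimension.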

Encoder-Decoder Without Attention

In this section, we will develop a baseline in performance on the problem with an encoder-decoder model without attention.

We will fix the problem definition at input and output sequences of 5 time steps, where the output sequence holds the first 2 elements of the input sequence, and a cardinality of 50.

# configure problem
n_features = 50
n_timesteps_in = 5
n_timesteps_out = 2

We can develop a simple encoder-decoder model in Keras by taking the output from an encoder LSTM model, repeating it n times for the number of timesteps in the output sequence, then using a decoder to predict the output sequence.
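
The repeat step is what turns the encoder's single output vector into a sequence the decoder can step through. The NumPy sketch below imitates that reshaping for illustration only; it does not call Keras, and the random vector stands in for a real encoder output.

```python
import numpy as np

n_units = 150    # size of the encoder output vector
n_repeats = 5    # number of time steps the decoder must produce

# stand-in for the encoder output: one vector per sample, shape (samples, units)
encoded = np.random.rand(1, n_units)

# copy the vector once per output time step, shape (samples, time steps, units)
repeated = np.repeat(encoded[:, np.newaxis, :], n_repeats, axis=1)

print(encoded.shape, repeated.shape)
# every time step holds the same encoded vector
print(all(np.array_equal(repeated[0, t], encoded[0]) for t in range(n_repeats)))
```

The shapes go from (1, 150) to (1, 5, 150). The decoder receives the identical fixed-length vector at every output time step, which is the limitation that attention is meant to relax.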

For more detail on how to define an encoder-decoder architecture in Keras, see the post:

Encoder-Decoder Long Short-Term Memory Networks

We will configure the encoder and decoder with the same number of units, in this case 150. We will use the efficient Adam implementation of gradient descent and optimize the categorical cross entropy loss function, given that the problem is technically a multi-class classification problem.
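
For intuition about the loss, the sketch below computes categorical cross entropy by hand for a single time step: it is the negative log of the probability the model assigned to the correct class. The probabilities are made up for illustration, and a tiny problem with 4 classes is assumed instead of 50.

```python
from math import log

# one hot target: the correct class is index 1
y_true = [0, 1, 0, 0]
# made-up softmax output for the same time step
y_pred = [0.1, 0.7, 0.1, 0.1]

# cross entropy: -sum(target * log(predicted)) over the classes
loss = -sum(t * log(p) for t, p in zip(y_true, y_pred))
print('%.3f' % loss)
```

This prints 0.357, which is -log(0.7). The loss falls toward zero as the probability assigned to the correct class approaches 1.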

The configuration for the model was found after a little trial and error and is by no means optimized.

The code for an encoder-decoder architecture in Keras is listed below.

# define model
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))
model.add(LSTM(150, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

We will train the model on 5,000 random input-output pairs of integer sequences.

# train LSTM
for epoch in range(5000):
	# generate new random sequence
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	# fit model for one epoch on this sequence
	model.fit(X, y, epochs=1, verbose=2)

Once trained, we will evaluate the model on 100 new randomly generated integer sequences and only mark a prediction correct when the entire output sequence matches the expected value.

# evaluate LSTM
total, correct = 100, 0
for _ in range(total):
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	yhat = model.predict(X, verbose=0)
	if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
		correct += 1
print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))

Finally, we will print 10 examples of expected output sequences and sequences predicted by the model.

Putting all of this together, the complete example is listed below.

from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import TimeDistributed
from keras.layers import RepeatVector

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-1) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded sequence
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# prepare data for the LSTM
def get_pair(n_in, n_out, cardinality):
	# generate random sequence
	sequence_in = generate_sequence(n_in, cardinality)
	sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
	# one hot encode
	X = one_hot_encode(sequence_in, cardinality)
	y = one_hot_encode(sequence_out, cardinality)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y

# configure problem
n_features = 50
n_timesteps_in = 5
n_timesteps_out = 2
# define model
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))
model.add(LSTM(150, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# train LSTM
for epoch in range(5000):
	# generate new random sequence
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	# fit model for one epoch on this sequence
	model.fit(X, y, epochs=1, verbose=2)
# evaluate LSTM
total, correct = 100, 0
for _ in range(total):
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	yhat = model.predict(X, verbose=0)
	if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
		correct += 1
print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))
# spot check some examples
for _ in range(10):
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	yhat = model.predict(X, verbose=0)
	print('Expected:', one_hot_decode(y[0]), 'Predicted', one_hot_decode(yhat[0]))

Running this example will not take long, perhaps a few minutes on the CPU; no GPU is required.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

The accuracy of the model was reported at just under 20%.

Accuracy: 19.00%

We can see from the sample outputs that the model does get one number in the output sequence correct for most or all cases, and only struggles with the second number. All zero padding values are predicted correctly.

Expected: [47, 0, 0, 0, 0] Predicted [47, 47, 0, 0, 0]
Expected: [43, 31, 0, 0, 0] Predicted [43, 31, 0, 0, 0]
Expected: [14, 22, 0, 0, 0] Predicted [14, 14, 0, 0, 0]
Expected: [39, 31, 0, 0, 0] Predicted [39, 39, 0, 0, 0]
Expected: [6, 4, 0, 0, 0] Predicted [6, 4, 0, 0, 0]
Expected: [47, 0, 0, 0, 0] Predicted [47, 47, 0, 0, 0]
Expected: [39, 33, 0, 0, 0] Predicted [39, 39, 0, 0, 0]
Expected: [23, 2, 0, 0, 0] Predicted [23, 23, 0, 0, 0]
Expected: [19, 28, 0, 0, 0] Predicted [19, 3, 0, 0, 0]
Expected: [32, 33, 0, 0, 0] Predicted [32, 32, 0, 0, 0]
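
Counting a prediction as correct only when the whole sequence matches is a stricter measure than counting individual time steps. Scoring the ten samples above both ways makes the difference visible. The sketch below copies the expected and predicted sequences from that output; the variable names are chosen for this example.

```python
# (expected, predicted) pairs copied from the sample output above
samples = [
    ([47, 0, 0, 0, 0], [47, 47, 0, 0, 0]),
    ([43, 31, 0, 0, 0], [43, 31, 0, 0, 0]),
    ([14, 22, 0, 0, 0], [14, 14, 0, 0, 0]),
    ([39, 31, 0, 0, 0], [39, 39, 0, 0, 0]),
    ([6, 4, 0, 0, 0], [6, 4, 0, 0, 0]),
    ([47, 0, 0, 0, 0], [47, 47, 0, 0, 0]),
    ([39, 33, 0, 0, 0], [39, 39, 0, 0, 0]),
    ([23, 2, 0, 0, 0], [23, 23, 0, 0, 0]),
    ([19, 28, 0, 0, 0], [19, 3, 0, 0, 0]),
    ([32, 33, 0, 0, 0], [32, 32, 0, 0, 0]),
]

# whole-sequence accuracy: every time step must match
exact = sum(e == p for e, p in samples) / len(samples)

# per-time-step accuracy: each position is scored on its own
steps = [ev == pv for e, p in samples for ev, pv in zip(e, p)]
per_step = sum(steps) / len(steps)

print('Whole-sequence accuracy: %.2f%%' % (exact * 100.0))
print('Per-time-step accuracy: %.2f%%' % (per_step * 100.0))
```

On these ten samples the whole-sequence accuracy is 20.00% while the per-time-step accuracy is 84.00%, because the correctly predicted first value and padding count toward the second figure.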

Custom Keras Attention Layer

Now we need to add attention to the encoder-decoder model.

At the time of writing, Keras does not have the capability of attention built into the library, but it is coming soon.

Until attention is officially available in Keras, we can either develop our own implementation or use an existing third-party implementation.

To speed things up, let’s use an existing third-party implementation.

Zafarali Ahmed, an intern at Datalogue, developed a custom layer for Keras that provides support for attention. It was presented in 2017 in a post titled “How to Visualize Your Recurrent Neural Network with Attention in Keras” and in a GitHub project called “keras-attention”.

The custom attention layer is called AttentionDecoder and is available in the custom_recurrents.py file in the GitHub project. We can reuse this code under the GNU Affero General Public License v3.0 license of the project.

A copy of the custom layer is listed below for completeness. Copy it and paste it into a new and separate file in your current working directory called ‘attention_decoder.py’.

import tensorflow as tf
from keras import backend as K
from keras import regularizers, constraints, initializers, activations
from keras.layers.recurrent import Recurrent, _time_distributed_dense
from keras.engine import InputSpec

tfPrint = lambda d, T: tf.Print(input_=T, data=[T, tf.shape(T)], message=d)

class AttentionDecoder(Recurrent):

    def __init__(self, units, output_dim,
                 activation='tanh',
                 return_probabilities=False,
                 name='AttentionDecoder',
                 kernel_initializer='glorot_uniform',
                 recurrent_initializer='orthogonal',
                 bias_initializer='zeros',
                 kernel_regularizer=None,
                 bias_regularizer=None,
                 activity_regularizer=None,
                 kernel_constraint=None,
                 bias_constraint=None,
                 **kwargs):
        """
        Implements an AttentionDecoder that takes in a sequence encoded by an
        encoder and outputs the decoded states
        :param units: dimension of the hidden state and the attention matrices
        :param output_dim: the number of labels in the output space

        references:
            Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio.
            "Neural machine translation by jointly learning to align and translate."
            arXiv preprint arXiv:1409.0473 (2014).
        """
        self.units = units
        self.output_dim = output_dim
        self.return_probabilities = return_probabilities
        self.activation = activations.get(activation)
        self.kernel_initializer = initializers.get(kernel_initializer)
        self.recurrent_initializer = initializers.get(recurrent_initializer)
        self.bias_initializer = initializers.get(bias_initializer)

        self.kernel_regularizer = regularizers.get(kernel_regularizer)
        self.recurrent_regularizer = regularizers.get(kernel_regularizer)
        self.bias_regularizer = regularizers.get(bias_regularizer)
        self.activity_regularizer = regularizers.get(activity_regularizer)

        self.kernel_constraint = constraints.get(kernel_constraint)
        self.recurrent_constraint = constraints.get(kernel_constraint)
        self.bias_constraint = constraints.get(bias_constraint)

        super(AttentionDecoder, self).__init__(**kwargs)
        self.name = name
        self.return_sequences = True  # must return sequences

    def build(self, input_shape):
        """
          See Appendix 2 of Bahdanau 2014, arXiv:1409.0473
          for model details that correspond to the matrices here.
        """

        self.batch_size, self.timesteps, self.input_dim = input_shape

        if self.stateful:
            super(AttentionDecoder, self).reset_states()

        self.states = [None, None]  # y, s

        """
            Matrices for creating the context vector
        """

        self.V_a = self.add_weight(shape=(self.units,),
                                   name='V_a',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.W_a = self.add_weight(shape=(self.units, self.units),
                                   name='W_a',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.U_a = self.add_weight(shape=(self.input_dim, self.units),
                                   name='U_a',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.b_a = self.add_weight(shape=(self.units,),
                                   name='b_a',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
        """
            Matrices for the r (reset) gate
        """
        self.C_r = self.add_weight(shape=(self.input_dim, self.units),
                                   name='C_r',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_r = self.add_weight(shape=(self.units, self.units),
                                   name='U_r',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_r = self.add_weight(shape=(self.output_dim, self.units),
                                   name='W_r',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_r = self.add_weight(shape=(self.units, ),
                                   name='b_r',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)

        """
            Matrices for the z (update) gate
        """
        self.C_z = self.add_weight(shape=(self.input_dim, self.units),
                                   name='C_z',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_z = self.add_weight(shape=(self.units, self.units),
                                   name='U_z',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_z = self.add_weight(shape=(self.output_dim, self.units),
                                   name='W_z',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_z = self.add_weight(shape=(self.units, ),
                                   name='b_z',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
        """
            Matrices for the proposal
        """
        self.C_p = self.add_weight(shape=(self.input_dim, self.units),
                                   name='C_p',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_p = self.add_weight(shape=(self.units, self.units),
                                   name='U_p',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_p = self.add_weight(shape=(self.output_dim, self.units),
                                   name='W_p',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_p = self.add_weight(shape=(self.units, ),
                                   name='b_p',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
        """
            Matrices for making the final prediction vector
        """
        self.C_o = self.add_weight(shape=(self.input_dim, self.output_dim),
                                   name='C_o',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_o = self.add_weight(shape=(self.units, self.output_dim),
                                   name='U_o',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_o = self.add_weight(shape=(self.output_dim, self.output_dim),
                                   name='W_o',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_o = self.add_weight(shape=(self.output_dim, ),
                                   name='b_o',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)

        # For creating the initial state:
        self.W_s = self.add_weight(shape=(self.input_dim, self.units),
                                   name='W_s',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)

        self.input_spec = [
            InputSpec(shape=(self.batch_size, self.timesteps, self.input_dim))]
        self.built = True

    def call(self, x):
        # store the whole sequence so we can "attend" to it at each timestep
        self.x_seq = x

        # apply a dense layer over the time dimension of the sequence
        # do it here because it doesn't depend on any previous steps,
        # therefore we can save computation time:
        self._uxpb = _time_distributed_dense(self.x_seq, self.U_a, b=self.b_a,
                                             input_dim=self.input_dim,
                                             timesteps=self.timesteps,
                                             output_dim=self.units)

        return super(AttentionDecoder, self).call(x)

    def get_initial_state(self, inputs):
        # apply the matrix on the first time step to get the initial s0.
        s0 = activations.tanh(K.dot(inputs[:, 0], self.W_s))

        # from keras.layers.recurrent to initialize a vector of (batchsize,
        # output_dim)
        y0 = K.zeros_like(inputs)  # (samples, timesteps, input_dims)
        y0 = K.sum(y0, axis=(1, 2))  # (samples, )
        y0 = K.expand_dims(y0)  # (samples, 1)
        y0 = K.tile(y0, [1, self.output_dim])

        return [y0, s0]

    def step(self, x, states):

        ytm, stm = states

        # repeat the hidden state to the length of the sequence
        _stm = K.repeat(stm, self.timesteps)

        # now multiply the weight matrix with the repeated hidden state
        _Wxstm = K.dot(_stm, self.W_a)

        # calculate the attention probabilities
        # this relates how much other timesteps contributed to this one.
        et = K.dot(activations.tanh(_Wxstm + self._uxpb),
                   K.expand_dims(self.V_a))
        at = K.exp(et)
        at_sum = K.sum(at, axis=1)
        at_sum_repeated = K.repeat(at_sum, self.timesteps)
        at /= at_sum_repeated  # vector of size (batchsize, timesteps, 1)

        # calculate the context vector
        context = K.squeeze(K.batch_dot(at, self.x_seq, axes=1), axis=1)
        # ~~~> calculate new hidden state
        # first calculate the "r" gate:

        rt = activations.sigmoid(
            K.dot(ytm, self.W_r)
            + K.dot(stm, self.U_r)
            + K.dot(context, self.C_r)
            + self.b_r)

        # now calculate the "z" gate
        zt = activations.sigmoid(
            K.dot(ytm, self.W_z)
            + K.dot(stm, self.U_z)
            + K.dot(context, self.C_z)
            + self.b_z)

        # calculate the proposal hidden state:
        s_tp = activations.tanh(
            K.dot(ytm, self.W_p)
            + K.dot((rt * stm), self.U_p)
            + K.dot(context, self.C_p)
            + self.b_p)

        # new hidden state:
        st = (1-zt)*stm + zt * s_tp

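        # calculate the output probabilities from the previous output,
        # the previous hidden state and the context vector
        yt = activations.softmax(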
            K.dot(ytm, self.W_o)
            + K.dot(stm, self.U_o)
            + K.dot(context, self.C_o)
            + self.b_o)

        if self.return_probabilities:
            return at, [yt, st]
        else:
            return yt, [yt, st]

    def compute_output_shape(self, input_shape):
        """
            For Keras internal compatibility checking
        """
        if self.return_probabilities:
            return (None, self.timesteps, self.timesteps)
        else:
            return (None, self.timesteps, self.output_dim)

    def get_config(self):
        """
            For rebuilding models on load time.
        """
        config = {
            'output_dim': self.output_dim,
            'units': self.units,
            'return_probabilities': self.return_probabilities
        }
        base_config = super(AttentionDecoder, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

We can make use of this custom layer in our projects by importing it as follows:

from attention_decoder import AttentionDecoder

The layer implements attention as described by Bahdanau, et al. in their paper “Neural Machine Translation by Jointly Learning to Align and Translate.”

The code is explained well in the original post and linked to both the LSTM and attention equations.
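The heart of the layer's step() function is the attention calculation: score each encoder output against the previous decoder state, normalize the scores with a softmax, and take the weighted sum of the encoder outputs as the context vector. The listing below is a minimal pure-Python sketch of that calculation for a single decoder step, not the layer's actual code. The function name additive_attention() and the helpers dot() and matvec() are hypothetical; the weight arguments are named after the layer's W_a, U_a, b_a and V_a; matrices are stored as lists of rows; and the softmax subtracts the maximum score for numerical stability, which the layer itself does not do.

```python
from math import exp, tanh

# dot product of two equal-length vectors
def dot(u, v):
	return sum(a * b for a, b in zip(u, v))

# multiply a matrix (list of rows) by a vector
def matvec(M, v):
	return [dot(row, v) for row in M]

# additive (Bahdanau-style) attention for one decoder step
def additive_attention(prev_state, encoder_outputs, W_a, U_a, b_a, v_a):
	# project the previous decoder state once, it is shared by all time steps
	projected_state = matvec(W_a, prev_state)
	# alignment score for each encoder time step
	scores = list()
	for h in encoder_outputs:
		projected_h = [p + b for p, b in zip(matvec(U_a, h), b_a)]
		hidden = [tanh(s + p) for s, p in zip(projected_state, projected_h)]
		scores.append(dot(v_a, hidden))
	# softmax over the time steps (shifted by the max for stability)
	max_score = max(scores)
	exps = [exp(s - max_score) for s in scores]
	total = sum(exps)
	weights = [e / total for e in exps]
	# context vector: attention-weighted sum of the encoder outputs
	n_dims = len(encoder_outputs[0])
	context = [sum(w * h[i] for w, h in zip(weights, encoder_outputs)) for i in range(n_dims)]
	return weights, context

# toy values, chosen only for illustration
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prev_state = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = additive_attention(prev_state, encoder_outputs, identity, identity, [0.0, 0.0], [1.0, 1.0])
print('Attention weights:', weights)
print('Context vector:', context)
```

The attention weights sum to one, so the context vector is a weighted average of the encoder outputs that is recomputed for every output time step.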

A limitation of this implementation is that it must output sequences that are the same length as the input sequences, which is the very limitation that the encoder-decoder architecture was designed to overcome.
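In this tutorial's test problem, the same-length requirement is met by padding the target sequence with zeros up to the input length. A minimal sketch of that idea is shown below; it mirrors the slicing used in get_pair(), but the function name make_target() and its pad_value argument are hypothetical and used only for illustration.

```python
# echo the first n_out items of the input, then pad to the input length
def make_target(sequence_in, n_out, pad_value=0):
	if not 0 <= n_out <= len(sequence_in):
		raise ValueError('n_out must be between 0 and the input length')
	return sequence_in[:n_out] + [pad_value for _ in range(len(sequence_in) - n_out)]

source = [48, 47, 3, 11, 29]
target = make_target(source, 2)
print(target)  # [48, 47, 0, 0, 0]
```

The padded target has the same number of time steps as the input, which is what the attention layer requires.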

Importantly, the new layer manages both the repeating of the decoding, as performed by the second LSTM, and the softmax output for the model, as performed by the Dense output layer in the encoder-decoder model without attention. This greatly simplifies the code for the model.

It is important to note that the custom layer is built upon the Recurrent layer in Keras, which, at the time of writing, is marked as legacy code, and presumably will be removed from the project at some point.
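Because the layer depends on this legacy class, it can be useful to check whether your installed version of Keras still provides it before running the examples. The snippet below is one possible check, offered as a sketch: the function name legacy_recurrent_available() is hypothetical, and it only tests that the module path imported by the custom layer (keras.layers.recurrent) can be loaded and exposes a Recurrent class.

```python
import importlib

# return True if the legacy Recurrent base class can be imported
def legacy_recurrent_available():
	try:
		module = importlib.import_module('keras.layers.recurrent')
	except Exception:
		# Keras is missing, or this module path no longer exists
		return False
	return hasattr(module, 'Recurrent')

if legacy_recurrent_available():
	print('Legacy Recurrent class found.')
else:
	print('Legacy Recurrent class not found; an older version of Keras may be required.')
```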

Encoder-Decoder With Attention

Now that we have an implementation of attention that we can use, we can develop an encoder-decoder model with attention for our contrived sequence prediction problem.

The model with the attention layer is defined below. We can see that the layer handles some of the machinery of the encoder-decoder model itself, which makes the model simpler to define.

# define model
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
model.add(AttentionDecoder(150, n_features))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

That’s it. The rest of the example is the same.

The complete example is listed below.

from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from keras.models import Sequential
from keras.layers import LSTM
from attention_decoder import AttentionDecoder

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-1) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# prepare data for the LSTM
def get_pair(n_in, n_out, cardinality):
	# generate random sequence
	sequence_in = generate_sequence(n_in, cardinality)
	sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
	# one hot encode
	X = one_hot_encode(sequence_in, cardinality)
	y = one_hot_encode(sequence_out, cardinality)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y

# configure problem
n_features = 50
n_timesteps_in = 5
n_timesteps_out = 2

# define model
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
model.add(AttentionDecoder(150, n_features))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# train LSTM
for epoch in range(5000):
	# generate new random sequence
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	# fit model for one epoch on this sequence
	model.fit(X, y, epochs=1, verbose=2)
# evaluate LSTM
total, correct = 100, 0
for _ in range(total):
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	yhat = model.predict(X, verbose=0)
	if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
		correct += 1
print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))
# spot check some examples
for _ in range(10):
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
	yhat = model.predict(X, verbose=0)
	print('Expected:', one_hot_decode(y[0]), 'Predicted', one_hot_decode(yhat[0]))

Running the example prints the skill of the model on 100 randomly generated input-output pairs.

With the same resources and same amount of training, the model with attention performs much better.

Accuracy: 95.00%

Spot-checking some expected and predicted sequences, we can see very few errors (one of the ten examples below has a wrong second element), even in cases when there is a zero value in the first two elements.

Expected: [48, 47, 0, 0, 0] Predicted [48, 47, 0, 0, 0]
Expected: [7, 46, 0, 0, 0] Predicted [7, 46, 0, 0, 0]
Expected: [32, 30, 0, 0, 0] Predicted [32, 2, 0, 0, 0]
Expected: [3, 25, 0, 0, 0] Predicted [3, 25, 0, 0, 0]
Expected: [45, 4, 0, 0, 0] Predicted [45, 4, 0, 0, 0]
Expected: [49, 9, 0, 0, 0] Predicted [49, 9, 0, 0, 0]
Expected: [22, 23, 0, 0, 0] Predicted [22, 23, 0, 0, 0]
Expected: [29, 36, 0, 0, 0] Predicted [29, 36, 0, 0, 0]
Expected: [0, 29, 0, 0, 0] Predicted [0, 29, 0, 0, 0]
Expected: [11, 26, 0, 0, 0] Predicted [11, 26, 0, 0, 0]
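Note that the accuracy reported above counts a prediction as correct only when the whole sequence matches, which is stricter than counting correct elements. The sketch below makes the difference concrete using the ten spot-check pairs printed above; the function names exact_match_accuracy() and per_position_accuracy() are hypothetical. These ten pairs are a different random sample from the 100 pairs used for the reported accuracy, so the numbers are not expected to agree with it.

```python
# percentage of sequences that match in every position
def exact_match_accuracy(expected, predicted):
	matches = sum(1 for e, p in zip(expected, predicted) if e == p)
	return 100.0 * matches / len(expected)

# percentage of individual positions that match
def per_position_accuracy(expected, predicted):
	hits, total = 0, 0
	for e, p in zip(expected, predicted):
		for a, b in zip(e, p):
			hits += int(a == b)
			total += 1
	return 100.0 * hits / total

# the ten spot-check pairs printed above
expected = [[48, 47, 0, 0, 0], [7, 46, 0, 0, 0], [32, 30, 0, 0, 0], [3, 25, 0, 0, 0], [45, 4, 0, 0, 0],
	[49, 9, 0, 0, 0], [22, 23, 0, 0, 0], [29, 36, 0, 0, 0], [0, 29, 0, 0, 0], [11, 26, 0, 0, 0]]
predicted = [list(seq) for seq in expected]
predicted[2] = [32, 2, 0, 0, 0]  # the single wrong prediction

print('Exact match: %.2f%%' % exact_match_accuracy(expected, predicted))   # Exact match: 90.00%
print('Per position: %.2f%%' % per_position_accuracy(expected, predicted))  # Per position: 98.00%
```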

Comparison of Models

Although we are getting better results from the model with attention, the results were reported from a single run of each model.

In this case, we seek a more robust finding by repeating the evaluation of each model multiple times and reporting the average performance over those runs. For more information on this robust approach to evaluating neural network models, see the post:

How to Evaluate the Skill of Deep Learning Models

We can define a function to create each type of model, as follows.

# define the encoder-decoder model
def baseline_model(n_timesteps_in, n_features):
	model = Sequential()
	model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
	model.add(RepeatVector(n_timesteps_in))
	model.add(LSTM(150, return_sequences=True))
	model.add(TimeDistributed(Dense(n_features, activation='softmax')))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model

# define the encoder-decoder with attention model
def attention_model(n_timesteps_in, n_features):
	model = Sequential()
	model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
	model.add(AttentionDecoder(150, n_features))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model

We can then define a function to fit and evaluate the accuracy of a fit model and return the accuracy score.

# train and evaluate a model, return accuracy
def train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features):
	# train LSTM
	for epoch in range(5000):
		# generate new random sequence
		X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
		# fit model for one epoch on this sequence
		model.fit(X, y, epochs=1, verbose=0)
	# evaluate LSTM
	total, correct = 100, 0
	for _ in range(total):
		X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
		yhat = model.predict(X, verbose=0)
		if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
			correct += 1
	return float(correct)/float(total)*100.0

Putting this together, we can repeat the process of creating, training, and evaluating each type of model multiple times and report the mean accuracy over the repeats. To keep running times down, we will repeat each model evaluation 10 times, although if you have the resources, you could increase this to 30 or 100 times.
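When comparing models over repeated runs, it can also help to report the spread of the scores alongside the mean, as a large spread suggests that more repeats are needed. The snippet below is a small sketch of such a summary using the Python standard library. The function name summarize() is hypothetical, and the scores are placeholder values for illustration only, not results from this tutorial.

```python
from statistics import mean, stdev

# return the mean and sample standard deviation of a list of accuracy scores
def summarize(scores):
	if len(scores) < 2:
		raise ValueError('at least two scores are needed to estimate the spread')
	return mean(scores), stdev(scores)

# placeholder scores, not real results
placeholder_scores = [20.0, 25.0, 30.0, 25.0]
avg, spread = summarize(placeholder_scores)
print('Mean Accuracy: %.2f%% (+/- %.2f)' % (avg, spread))
```

In the complete example, the same function could be applied to the results list collected for each model.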

The complete example is listed below.

from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import TimeDistributed
from keras.layers import RepeatVector
from attention_decoder import AttentionDecoder

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-1) for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# prepare data for the LSTM
def get_pair(n_in, n_out, cardinality):
	# generate random sequence
	sequence_in = generate_sequence(n_in, cardinality)
	sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
	# one hot encode
	X = one_hot_encode(sequence_in, cardinality)
	y = one_hot_encode(sequence_out, cardinality)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y

# define the encoder-decoder model
def baseline_model(n_timesteps_in, n_features):
	model = Sequential()
	model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
	model.add(RepeatVector(n_timesteps_in))
	model.add(LSTM(150, return_sequences=True))
	model.add(TimeDistributed(Dense(n_features, activation='softmax')))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model

# define the encoder-decoder with attention model
def attention_model(n_timesteps_in, n_features):
	model = Sequential()
	model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
	model.add(AttentionDecoder(150, n_features))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model

# train and evaluate a model, return accuracy
def train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features):
	# train LSTM
	for epoch in range(5000):
		# generate new random sequence
		X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
		# fit model for one epoch on this sequence
		model.fit(X, y, epochs=1, verbose=0)
	# evaluate LSTM
	total, correct = 100, 0
	for _ in range(total):
		X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
		yhat = model.predict(X, verbose=0)
		if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
			correct += 1
	return float(correct)/float(total)*100.0

# configure problem
n_features = 50
n_timesteps_in = 5
n_timesteps_out = 2
n_repeats = 10
# evaluate encoder-decoder model
print('Encoder-Decoder Model')
results = list()
for _ in range(n_repeats):
	model = baseline_model(n_timesteps_in, n_features)
	accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features)
	results.append(accuracy)
	print(accuracy)
print('Mean Accuracy: %.2f%%' % (sum(results)/float(n_repeats)))
# evaluate encoder-decoder with attention model
print('Encoder-Decoder With Attention Model')
results = list()
for _ in range(n_repeats):
	model = attention_model(n_timesteps_in, n_features)
	accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features)
	results.append(accuracy)
	print(accuracy)
print('Mean Accuracy: %.2f%%' % (sum(results)/float(n_repeats)))

Running this example prints the accuracy for each model repeat to give you an idea of the progress of the run.

Encoder-Decoder Model
20.0
23.0
23.0
18.0
28.000000000000004
28.999999999999996
23.0
26.0
21.0
20.0
Mean Accuracy: 23.10%

Encoder-Decoder With Attention Model
98.0
91.0
94.0
93.0
96.0
99.0
97.0
94.0
99.0
96.0
Mean Accuracy: 95.70%

We can see that even averaged over 10 runs, the attention model still shows better performance than the encoder-decoder model without attention, 95.70% vs 23.10%.
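Reporting the spread of the scores alongside the mean makes the comparison a little more robust. The snippet below is a minimal sketch using only the Python standard library. The scores are copied from the run above (rounded to whole numbers), and the summarize() helper is a hypothetical name introduced here for illustration.

```python
from statistics import mean, stdev

# accuracy scores (%) copied from the example output above
baseline_scores = [20.0, 23.0, 23.0, 18.0, 28.0, 29.0, 23.0, 26.0, 21.0, 20.0]
attention_scores = [98.0, 91.0, 94.0, 93.0, 96.0, 99.0, 97.0, 94.0, 99.0, 96.0]

# hypothetical helper: mean and sample standard deviation of a list of scores
def summarize(scores):
	return mean(scores), stdev(scores)

# report mean, spread and range for each model
for name, scores in [('Encoder-Decoder', baseline_scores), ('With Attention', attention_scores)]:
	avg, spread = summarize(scores)
	print('%s: mean=%.2f%%, stdev=%.2f, min=%.1f, max=%.1f' % (name, avg, spread, min(scores), max(scores)))
```

In this run the two sets of scores do not overlap at all: the worst attention result (91%) is well above the best result without attention (29%).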

A good extension to this evaluation would be to capture the model loss each epoch for each model, take the average, and compare how the loss changes over time for the architecture with and without attention.

I expect that this trace would show the attention model achieving better skill much sooner than the model without attention, further highlighting the benefit of the approach.
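One possible way to set up this extension is sketched below. In Keras, each call to model.fit() returns a History object whose history['loss'] list holds the loss for each epoch, so the training loop could append that value to a list on every iteration. The average_traces() helper and the loss values are made up for illustration and are not results from this tutorial.

```python
# hypothetical helper: average several loss traces position by position
def average_traces(traces):
	# only average over the steps that every trace has
	n_steps = min(len(trace) for trace in traces)
	return [sum(trace[i] for trace in traces) / float(len(traces)) for i in range(n_steps)]

# inside train_evaluate_model() a trace could be collected like this (assumption):
#   history = model.fit(X, y, epochs=1, verbose=0)
#   trace.append(history.history['loss'][0])

# made-up loss traces for two repeats of one model, for illustration only
run1 = [4.0, 3.0, 2.0]
run2 = [2.0, 1.0, 1.0]
print(average_traces([run1, run2]))
```

The averaged traces for the two architectures could then be plotted on the same axes to compare how quickly each one learns.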

Summary

In this tutorial, you discovered how to develop an encoder-decoder recurrent neural network with attention in Python with Keras.

Specifically, you learned:

How to design a small and configurable problem to evaluate encoder-decoder recurrent neural networks with and without attention.
How to design and evaluate an encoder-decoder network with and without attention for the sequence prediction problem.
How to robustly compare the performance of encoder-decoder networks with and without attention.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

358 Responses to How to Develop an Encoder-Decoder Model with Attention in Keras

Chetan October 17, 2017 at 6:11 am #

The timing of this post couldn’t have been more accurate. I’ve spent hours and days on google looking for a reliable Keras implementation of attention. Can’t wait to test this on my specific problem definition. Thanks a ton Jason!

Reply
- Jason Brownlee October 17, 2017 at 4:03 pm #
  
  I’m glad to hear that Chetan!
  
  Let me know how you go.
  
  Reply
ChrisJew October 17, 2017 at 10:35 pm #

test soft

Reply
- Jason Brownlee October 18, 2017 at 5:36 am #
  
  Your test worked.
  
  Reply
Mateo October 18, 2017 at 11:30 pm #

Thank you for this post!

Unfortunately the kernel crashes on my laptop! I don’t know why (no RAM issues)
I use Keras==2.0.8 and TF==1.3.0

Reply
- Jason Brownlee October 19, 2017 at 5:37 am #
  
  Ouch. Perhaps there is something up with your environment.
  
  This post might help if you need to set things up from scratch:
  https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/
  
  Reply
Ravi Annaswamy October 20, 2017 at 7:09 pm #

Jason, very nice tutorial on probably the most important and most powerful neural application architecture (seq2seq with attention – since it is equivalent to a self programming turing machine – it sees an input stream of symbols, then can move back and forth using attention and write out a stream of symbols).

In fact theoretically it is super-turing, because it works with continuous (real) representation instead of Turing symbolic notation. google ‘recurrent networks super turing’ for proofs.

I am looking forward to attention being integrated into Keras and your revised code later, but no one can match your ability to setup the problem, generate data, explain step by step.. Keep up the great work.

Ravi Annaswamy

Reply
- Jason Brownlee October 21, 2017 at 5:29 am #
  
  Thanks Ravi, I really appreciate your support! You made my day 🙂
  
  Reply
Ravi Annaswamy October 20, 2017 at 8:12 pm #

Jason, I think in order to show the power of sequence mapping, we need to try two things:
1. The input sequence should be of variable length (not always 5). For example you can make it a length of 10 max, but it should generate sequences of any length between say 4 to 10 (remaining zeros).
2. The output should not be just zeroing of values, but more complex output for example, the first and last non zero value of the sequence…

Reply
- Ravi Annaswamy October 20, 2017 at 8:15 pm #
  
  something like the example built here:
  https://talbaumel.github.io/attention/
  
  Reply
  - Ravi Annaswamy October 20, 2017 at 9:49 pm #
    
    I am working on a modification of your excellent code, to illustrate this extended task, will post shortly.
    
    Reply
- Jason Brownlee October 21, 2017 at 5:34 am #
  
  Yes, you could easily modify the above example to achieve these requirements.
  
  Reply
Ravi Annaswamy October 20, 2017 at 10:25 pm #

Dr.Jason,

You have done an excellent application and framework code.

I wanted to expose the great value of this architecture and modularity of
this code by attempting a harder problem. Harder in two ways:

First we want to make the input sequence variable length from example to example.

Second, we want the output to be one that requires attention and long term memory,
across the length!

So we come up with this task:

Given a input sequence which is variable length with zero padding…
[6, 8, 7, 2, 2, 6, 6, 4, 0, 0]
I wanted the network to pick out and output the first and last non-zero of the series
[6, 4, 0, 0, 0, 0, 0, 0, 0, 0]

To make it even more interesting task for memory, we want it to output
the two numbers in reverse order:

input:
[6, 8, 7, 2, 2, 6, 6, 4, 0, 0]
output
[4, 6, 0, 0, 0, 0, 0, 0, 0, 0]

This would require that the algorithm figure out that we are selecting the first and last of the sequence,
and then writing out them in reverse order! It really needs some kind of a turing machine that can
go back and forth on the sequence and decide when to write what! Can the seq2seq with attention LSTM do this?
Let us try out.

Here are few more training cases created:
[5, 5, 3, 3, 2, 0, 0, 0, 0, 0] [2, 5, 0, 0, 0, 0, 0, 0, 0, 0]
[4, 7, 7, 4, 3, 9, 0, 0, 0, 0] [9, 4, 0, 0, 0, 0, 0, 0, 0, 0]
[2, 6, 7, 6, 5, 0, 0, 0, 0, 0] [5, 2, 0, 0, 0, 0, 0, 0, 0, 0]
[9, 8, 2, 8, 8, 7, 9, 1, 5, 0] [5, 9, 0, 0, 0, 0, 0, 0, 0, 0]

I made the following changes to your excellent code to make this possible:

1. In order to use 0 as the padding character, we make the unique letters from 1 to n_unique.

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-2)+1 for _ in range(length)]

I think in your original code also you should adopt the above mechanism so that 0 is reserved as padding
symbol and generated sequence only contains 1 to n_unique. I think this will increase accuracy to 100% in your tests too.

2. In order to simplify the domain, for faster training, I restricted the range of values:

n_features = 8
n_timesteps_in = 10
n_timesteps_out = 2

That is the input has a max of 10 positions but anywhere between 4 to 9 of these could be nonzero sequence, as shown below.
The input only uses an alphabet of 8 numbers instead of the 50 you used.

3. Correspondingly the get_pair was modified to generate the series above:

# prepare data for the LSTM
def get_pair(n_in, n_out, cardinality, verbose=False): # edited this to add verbose flag
	# generate random sequence
	sequence_in = generate_sequence(n_in, cardinality)
	real_length = randint(4,n_in-1) # i added this
	sequence_in = sequence_in[:real_length] + [0 for _ in range(n_in-real_length)] # i added this
	sequence_out = [sequence_in[real_length-1]]+[sequence_in[0]] + [0 for _ in range(n_in-2)] # i edited this
	if verbose: # added this for testing
		print(sequence_in,sequence_out) # added this
	# one hot encode
	X = one_hot_encode(sequence_in, cardinality)
	y = one_hot_encode(sequence_out, cardinality)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y

4. With these changes:
for _ in range(5):
	a=get_pair(10,2,10,verbose=True)

generates:

[6, 8, 7, 2, 2, 6, 6, 4, 0, 0] [4, 6, 0, 0, 0, 0, 0, 0, 0, 0]
[5, 5, 3, 3, 2, 0, 0, 0, 0, 0] [2, 5, 0, 0, 0, 0, 0, 0, 0, 0]
[4, 7, 7, 4, 3, 9, 0, 0, 0, 0] [9, 4, 0, 0, 0, 0, 0, 0, 0, 0]
[2, 6, 7, 6, 5, 0, 0, 0, 0, 0] [5, 2, 0, 0, 0, 0, 0, 0, 0, 0]
[9, 8, 2, 8, 8, 7, 9, 1, 5, 0] [5, 9, 0, 0, 0, 0, 0, 0, 0, 0]

5. Result of training on this dataset:
Encoder-Decoder Model
20.0
12.0
18.0
19.0
9.0
10.0
16.0
12.0
12.0
11.0

Encoder-Decoder With Attention Model
100.0
100.0
100.0
100.0
100.0
100.0
100.0
100.0
100.0
100.0

Yes!

This shows the capacity of recurrent neural models to learn arbitrary programs from example input and output pairs!
Of course, one can increase length of the sequence and also the n_unique to make the task harder, but I do not expect
dramatic failure as we gradually increase to reasonable values.

I am really very happy that you put together this excellent example. Please feel free to add this extension application to your excellent article/books if it will add value. Also please review the changes to make sure I have not made any errors.

The only complaint I have is that the keras implementation of attention is very slow. (I think the pytorch implementation will be
far faster because of avoiding a few layers of abstraction..but I may be wrong, will try it..)

Ravi

Attached the complete code for reproducibility:

from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import TimeDistributed
from keras.layers import RepeatVector
from attention_decoder import AttentionDecoder

# generate a sequence of random integers
def generate_sequence(length, n_unique):
	return [randint(0, n_unique-2)+1 for _ in range(length)]

# one hot encode sequence
def one_hot_encode(sequence, n_unique):
	encoding = list()
	for value in sequence:
		vector = [0 for _ in range(n_unique)]
		vector[value] = 1
		encoding.append(vector)
	return array(encoding)

# decode a one hot encoded string
def one_hot_decode(encoded_seq):
	return [argmax(vector) for vector in encoded_seq]

# prepare data for the LSTM
def get_pair(n_in, n_out, cardinality, verbose=False):
	# generate random sequence
	sequence_in = generate_sequence(n_in, cardinality)
	real_length = randint(4,n_in-1)
	sequence_in = sequence_in[:real_length] + [0 for _ in range(n_in-real_length)]
	sequence_out = [sequence_in[real_length-1]]+[sequence_in[0]] + [0 for _ in range(n_in-2)]
	if verbose:
		print(sequence_in,sequence_out)
	# one hot encode
	X = one_hot_encode(sequence_in, cardinality)
	y = one_hot_encode(sequence_out, cardinality)
	# reshape as 3D
	X = X.reshape((1, X.shape[0], X.shape[1]))
	y = y.reshape((1, y.shape[0], y.shape[1]))
	return X,y

# define the encoder-decoder model
def baseline_model(n_timesteps_in, n_features):
	model = Sequential()
	model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
	model.add(RepeatVector(n_timesteps_in))
	model.add(LSTM(150, return_sequences=True))
	model.add(TimeDistributed(Dense(n_features, activation='softmax')))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
	return model

# define the encoder-decoder with attention model
def attention_model(n_timesteps_in, n_features):
	model = Sequential()
	model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
	model.add(AttentionDecoder(150, n_features))
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
	return model

# train and evaluate a model, return accuracy
def train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features):
	# train LSTM
	for epoch in range(5000):
		# generate new random sequence
		X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
		# fit model for one epoch on this sequence
		model.fit(X, y, epochs=1, verbose=0)
	# evaluate LSTM
	total, correct = 100, 0
	for _ in range(total):
		X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
		yhat = model.predict(X, verbose=0)
		if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
			correct += 1
	return float(correct)/float(total)*100.0

# configure problem
n_features = 8
n_timesteps_in = 10
n_timesteps_out = 2
n_repeats = 10

# evaluate encoder-decoder model
print('Encoder-Decoder Model')
results = list()
for _ in range(n_repeats):
	model = baseline_model(n_timesteps_in, n_features)
	accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features)
	results.append(accuracy)
	print(accuracy)
print('Mean Accuracy: %.2f%%' % (sum(results)/float(n_repeats)))
# evaluate encoder-decoder with attention model
print('Encoder-Decoder With Attention Model')
results = list()
for _ in range(n_repeats):
	model = attention_model(n_timesteps_in, n_features)
	accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features)
	results.append(accuracy)
	print(accuracy)
print('Mean Accuracy: %.2f%%' % (sum(results)/float(n_repeats)))

Reply
- Jason Brownlee October 21, 2017 at 5:38 am #
  
  Great work!
  
  Reply
  - Bhuvana June 27, 2019 at 12:11 am #
    
    I get the following error while importing the .py file
    
    from attention_decoder import AttentionDecoder
    
    ImportError Traceback (most recent call last)
    in ()
    ----> 1 from attention_decoder import AttentionDecoder as ad
    
    ImportError: cannot import name 'AttentionDecoder'
    
    Reply
    - Jason Brownlee June 27, 2019 at 7:52 am #
      
      Sorry to hear that, I have some suggestions here:
      https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
      
      Reply
      - Bilal Chandio October 11, 2020 at 6:48 pm #
        
        Dimensions must be equal, but are 150 and 50 for ‘AttentionDecoder_1/MatMul_4’ (op: ‘MatMul’) with input shapes: [?,150], [50,150].
        
        I am using Keras==2.0
        and tensorflow==1.13.1
        
        Please specify for which version code is fully functional.
        
        Thanks in advance.
      - Jason Brownlee October 12, 2020 at 6:40 am #
        
        I recommend using the new attention layers in TensorFlow 2.
- Ashima February 16, 2019 at 2:35 am #
  
  Hi @Ravi, @Jason,
  
  Thanks for the great post. Is it possible to give variable timesteps as the input for RepeatVector for variable input length ?
  
  For instance, instead of defining a fixed size of n_timesteps_in as 10, I want to read the entire input sequence as a whole.
  
  model.add(RepeatVector(n_timesteps_in))
  
  Reply
  - Jason Brownlee February 16, 2019 at 6:20 am #
    
    Not really.
    
    Reply
    - kuldeep January 3, 2020 at 8:30 pm #
      
      I have a problem: I am building an NMT model for English to Hindi, but its predictions are very bad. How can I improve them?
      
      Reply
      - Jason Brownlee January 4, 2020 at 8:30 am #
        
        Here are some suggestions:
        https://machinelearningmastery.com/improve-deep-learning-performance/
ravi annaswamy October 20, 2017 at 10:44 pm #

here is verbose evaluation

for _ in range(5):
	X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features,verbose=True)
	yhat = model.predict(X, verbose=0)
	print(one_hot_decode(yhat[0]))

[5, 5, 1, 6, 1, 4, 5, 0, 0, 0] [5, 5, 0, 0, 0, 0, 0, 0, 0, 0]
[5, 5, 0, 0, 0, 0, 0, 0, 0, 0]
[5, 5, 4, 7, 2, 1, 3, 0, 0, 0] [3, 5, 0, 0, 0, 0, 0, 0, 0, 0]
[3, 5, 0, 0, 0, 0, 0, 0, 0, 0]
[3, 4, 7, 6, 3, 1, 3, 1, 1, 0] [1, 3, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 3, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 4, 1, 4, 7, 2, 2, 3, 4, 0] [4, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[4, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 5, 1, 4, 7, 6, 3, 7, 7, 0] [7, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[7, 1, 0, 0, 0, 0, 0, 0, 0, 0]

Reply
- Meeklai October 24, 2017 at 8:07 pm #
  
  Awesome work Ravi!
  
  Would you please upload these codes into https://gist.github.com?
  
  Reply
Meeklai October 21, 2017 at 2:57 am #

First of all, thank you so much for this article, it was well worth reading. It clarified a lot for me about how to implement an autoencoder model in Keras.

There is just one point that confuses me, and I hope you can explain it. Why do you need to transform an original vector of integers into a 2D matrix containing a one hot vector of each integer? Can’t you just send that original vector of integers into the encoder as an input?

Thank you again for this worthwhile article, Dr. Brownlee

Reply
- Jason Brownlee October 21, 2017 at 5:43 am #
  
  You can, but the one hot encoding is richer and often results in better model skill.
  
  Reply
  - Meeklai October 23, 2017 at 2:21 am #
    
    Thank you Dr. Brownlee. Would one hot encoding still be better when the cardinality is much greater than in this example? For instance, fitting an encoder on lots of text documents would result in a huge number of encoder keys.
    
    Reply
    - Jason Brownlee October 23, 2017 at 5:49 am #
      
      In that case, it might be better to use a distributed representation like a word embedding:
      https://machinelearningmastery.com/develop-word-embeddings-python-gensim/
      
      Reply
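To illustrate the difference raised in the exchange above: an embedding can be thought of as a table with one dense row per integer, so looking up a row gives the same result as multiplying a one hot vector by the table, without ever building the large one hot vector. The sketch below uses plain Python lists; the vocabulary size, the weights and the one_hot() and project() helper names are made up for illustration.

```python
# hypothetical 4-word vocabulary with 3-dimensional embeddings (made-up weights)
embedding_matrix = [
	[0.0, 0.0, 0.0],
	[0.1, 0.3, 0.5],
	[0.2, 0.4, 0.6],
	[0.9, 0.8, 0.7],
]

# one hot encode a single integer
def one_hot(value, n_unique):
	vector = [0 for _ in range(n_unique)]
	vector[value] = 1
	return vector

# multiply a one hot vector by the embedding matrix
def project(vector, matrix):
	return [sum(vector[r] * matrix[r][c] for r in range(len(matrix))) for c in range(len(matrix[0]))]

word_id = 2
# the matrix product and the direct row lookup give the same dense vector
print(project(one_hot(word_id, 4), embedding_matrix))
print(embedding_matrix[word_id])
```

With a large vocabulary the one hot vector grows with the number of words, whereas the dense vector keeps a fixed, much smaller size.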
Hendrik October 24, 2017 at 7:00 pm #

In the case of multiple LSTM layers, is the AttentionDecoder layer supposed to be placed only once, after all of the LSTMs, or must it be inserted after each LSTM layer?

Reply
- Jason Brownlee October 25, 2017 at 6:44 am #
  
  The attention is only used directly after the encoder.
  
  Reply
  - AP May 18, 2018 at 7:31 pm #
    
    Hi Jason, following up on Hendrik’s question, how can I stack multiple LSTM layers with attention? Do I initialise the first decoder layer as AttentionDecoder and follow it up with Keras’s LSTM layers? Thanks for the super informative post!
    
    Reply
    - Jason Brownlee May 19, 2018 at 7:37 am #
      
      Attention would only be required on the first level of the decoder. LSTM layers may then be added after that.
      
      Reply
Trialcritic October 25, 2017 at 8:21 am #

Usually, when people have 5 input and 2 output steps, we use

model.add(LSTM(size, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_out)) # this is different from input steps
model.add(LSTM(size, return_sequences=True))

This makes sense, as suggested

“we need to repeat the single vector outputted from the encoder network to obtain a sequence which has the same length with the output sequences”.

Wonder if this must be changed.

Reply
- Jason Brownlee October 25, 2017 at 3:57 pm #
  
  Yes, the RepeatVector approach is not a pure encoder-decoder as defined in the first papers, but often performs as well or better in my experience.
  
  Reply
Aayushee November 3, 2017 at 8:17 pm #

Hi Jason,

Thanks for such a well explained post on this topic. You mention the limitation that output sequences are the same length as the input sequences in case of the attention encoder decoder model used.
Could you please give an idea what should be done in an attention based model when output and input lengths are not same? I was wondering if we can use a RepeatVector(output_timesteps) in the current attention model on the encoder output and then feed it to the AttentionDecoder?

Reply
- Jason Brownlee November 4, 2017 at 5:29 am #
  
  This implementation of attention cannot handle input and output sequences with different lengths, sorry.
  
  Reply
  - Sravan Malla May 27, 2019 at 7:42 pm #
    
    Hi Jason, if this implementation of attention cannot handle input and output sequences with different lengths, then it can’t be used for a language translation task, right? Please advise.
    
    Reply
    - Jason Brownlee May 28, 2019 at 8:13 am #
      
      Probably not.
      
      Reply
caichao November 4, 2017 at 11:43 pm #

By running your example (the “with attention” part), I’ve gotten the following error:
ValueError: Dimensions must be equal, but are 150 and 50 for ‘AttentionDecoder/MatMul_4’ (op: ‘MatMul’) with input shapes: [?,150], [50,150].

Reply
- Jason Brownlee November 5, 2017 at 5:16 am #
  
  Ensure you have the latest version of Keras.
  
  Reply
  - caichao November 5, 2017 at 11:51 am #
    
    My keras version is 2.0.2
    
    Reply
    - Jason Brownlee November 6, 2017 at 4:48 am #
      
      Perhaps try 2.0.8 or higher?
      
      Reply
      - caichao November 6, 2017 at 11:47 pm #
        
        also when I upgrade keras to 2.0.9
        I got the following problem
        
        from keras.layers.recurrent import Recurrent, _time_distributed_dense
        “unresolved reference _time_distributed_dense”
      - Jason Brownlee November 7, 2017 at 9:50 am #
        
        Interesting, perhaps the example requires Keras 2.0.8. This was the version I used when developing the example.
      - j May 4, 2021 at 9:12 pm #
        
        Recurrent is not found in tensorflow 2, got error when import it
        
        ImportError: cannot import name ‘Recurrent
        
        The line itself is “from tensorflow.keras.layers import Recurrent ”
        
        How do you import that layer? any Idea
      - Jason Brownlee May 5, 2021 at 6:10 am #
        
        I believe the above tutorial is not compatible with the latest version of the APIs.
      - Woo March 2, 2022 at 11:33 pm #
        
        Then how can I use this tutorial? I tried to find some ways, but failed.
      - James Carmichael March 3, 2022 at 1:42 pm #
        
        Hi Woo…Please provide more detail regarding what exactly failed in your implementation of the code listings so that I can better assist you.
  - caichao November 5, 2017 at 12:08 pm #
    
    also when I upgrade keras to 2.0.9
    I got the following problem
    
    from keras.layers.recurrent import Recurrent, _time_distributed_dense
    “unresolved reference _time_distributed_dense”
    
    Reply
kamal November 6, 2017 at 12:54 am #

Hi Jason. thank you for your great tutorials. I have 2 questions:

1) is there any Dense layer after Decoder in Attention code?
2)should features input be equal to features output or not ( their length should be equal as you mentioned)?

thank you, again

Reply
- Jason Brownlee November 6, 2017 at 4:53 am #
  
  Yes, there is normally a dense output after the decoder (or a part of the decoder).
  
  Features can vary. Normally/often you would have more input features than output features.
  
  Reply
Nandini November 29, 2017 at 5:33 pm #

Hi Jason,

from keras.models import Model,

How does this Model() layer work in Keras?

Reply
- Jason Brownlee November 30, 2017 at 8:07 am #
  
  Great question, you can learn more in this post:
  https://machinelearningmastery.com/keras-functional-api-deep-learning/
  
  Reply
Basma November 30, 2017 at 9:57 pm #

Hi Jason,

thank you so much for this great tutorial, I’m actually trying to build an encoder with attention, so the attention should be in the encoder part, can you explain please how this can be adapted ?

Many thanks 🙂

Reply
- Jason Brownlee December 1, 2017 at 7:32 am #
  
  Generally, attention is in the decoder, not the encoder. Sorry, I don’t have an example of an encoder with attention.
  
  Reply
  - Basma December 6, 2017 at 7:52 pm #
    
    Hi Jason,
    
    i’m trying to use this great implementation for seq2seq to encode text. I have a dialogue turn from user A that I’ll decode to get dialogue turn from user B. I am using the following code
    
    seq2seq = Sequential()
    seq2seq.add(Embedding(output_dim=args.emb_dim,
        input_dim=MAX_NB_WORDS,
        input_length=MAX_SEQUENCE_LENGTH,
        weights=[embedding_matrix],
        mask_zero=True,
        trainable=True))
    
    seq2seq.add(LSTM(units=args.hidden_size, return_sequences=True))
    seq2seq.add(AttentionDecoder(args.hidden_size, args.emb_dim))
    seq2seq.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
    
    But actually I don’t know how I can compare the decoded vector to the turn vector that I already have.
    
    My dialogue vectors are already encoded using keras preprocessing text_to_sequence and padded.
    
    Many thanks !
    
    Reply
    - Jason Brownlee December 7, 2017 at 7:53 am #
      
      I assume you are outputting a sequence of integers. These integers must be mapped back to words using whatever scheme you used to encode your training data.
      
      Reply
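The mapping back to words described in the reply above could look something like the following sketch. It assumes the model outputs one probability vector over the vocabulary per time step and that 0 is the padding value. The word_index dictionary here is a hypothetical stand-in for whatever mapping was used to encode the training data (for example, the word_index attribute of a fitted Keras Tokenizer), and the predicted values are made up.

```python
# hypothetical word-to-integer mapping, standing in for the one used to encode the training data
word_index = {'hello': 1, 'how': 2, 'are': 3, 'you': 4}
# invert the mapping: integer -> word
index_word = dict((i, w) for w, i in word_index.items())

# index of the largest value in a vector
def argmax(vector):
	return max(range(len(vector)), key=lambda i: vector[i])

# map a sequence of predicted probability vectors back to words, skipping padding (0)
def decode_words(predicted, index_word):
	ids = [argmax(vector) for vector in predicted]
	return [index_word[i] for i in ids if i != 0 and i in index_word]

# made-up prediction: one probability vector per time step, over 5 classes
yhat = [
	[0.1, 0.1, 0.6, 0.1, 0.1],
	[0.0, 0.1, 0.1, 0.7, 0.1],
	[0.1, 0.1, 0.1, 0.1, 0.6],
	[0.9, 0.0, 0.1, 0.0, 0.0],
]
print(decode_words(yhat, index_word))
```

The decoded word list can then be compared directly with the words of the expected dialogue turn.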
Leo January 2, 2018 at 8:58 pm #

Hi, Jason

Thanks for this tutorial. I’m trying a word embedding seq2seq model. But I’m stuck with how to build the model.
I use tokenizer and pad_sequences to encode Xtrain and ytrain, and then processing ytrain through to_categorical.
The format of input fed into the model is just like the ones in this tutorial: 1 input of Xtrain and ytrain for each epoch.
And it seems there’s something wrong with the embedding layer. But I can’t figure out why.

model = Sequential()
model.add(Embedding(vocab_size, 150, input_length=max_length))
model.add(Bidirectional(LSTM(150, return_sequences=True)))
model.add(AttentionDecoder(150, n_features))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

ValueError: Error when checking input: expected embedding_8_input to have shape (None, 148) but got array with shape (148, 1)

Reply
- Leo January 2, 2018 at 9:35 pm #
  
  Sorry, I have another question. Can I just fit the model directly instead of using for loop to train the model for each epoch?
  If I just fit the model directly, I got another error message:
  Error when checking target: expected AttentionDecoder to have 3 dimensions, but got array with shape (200, 1321)
  
  Thank you very much.
  
  Reply
  - Jason Brownlee January 3, 2018 at 5:36 am #
    
    You can, but you must change the shape of the data you are feeding to the network.
    
    Reply
- Jason Brownlee January 3, 2018 at 5:34 am #
  
  This post might help to get you started with embedding layers:
  https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
  
  Remember to one hot encode your input data.
  
  Reply
  - Leo January 4, 2018 at 3:34 pm #
    
    Hi, Jason
    
    Thanks for your suggestion. and it works. However, I change the problem in this tutorial a little bit, and get stuck again.
    
    In this tutorial, the definition of the problem is given Xtrain, for example, [3, 5, 12, 10, 18, 20], and then, we echo the first two element, so the ytrain looks like [3, 5, 0, 0, 0, 0].
    
    Now, I want to find the specific continuous two numbers in a sequence but those two continuous numbers are located at different location within each sequence.
    
    For example, what I want is [16, Z] where Z is any number, and [16, Z] is within an sequence, Xtrain.
    
    So, Xtrain and ytrain look like:
    Xtrain ytrain
    [3, 1, 10, 14, 8, 20, 16, 7, 9, 19] [16, 7, 0, 0, 0, 0, 0, 0, 0, 0]
    [6, 1, 23, 16, 9, 12, 22, 8, 0, 17] [16, 9, 0, 0, 0, 0, 0, 0, 0, 0]
    [9, 13, 15, 12, 16, 2, 5, 1, 10, 8] [16, 2, 0, 0, 0, 0, 0, 0, 0, 0]
    
    I think the key point is to transform the format of Xtrain and ytrain. One-hot encoding Xtrain remains the same just like this tutorial. But right now I have no idea how to fit ytrain into the model. I tried several ways to transform the format of ytrain, such as,
    1. One-hot encoding ytrain, but it doesn’t work.
    2. One-hot encoding the location of [16, Z], but it seems nonsense.
    3. Changing the format of ytrain to, for example, [0, 0, 0, 0, 0, 0, 16, 7, 0, 0], and then one-hot encoding this sequence, but it still doesn’t work.
    
    Do you have any suggestion or idea on such problem? Thank you very much.
    
    Reply
    - Jason Brownlee January 5, 2018 at 5:17 am #
      
      You could model it as a summarization task with the full sequence in and 2 elements out.
      
      For that you can use an encoder-decoder without attention.
      
      Reply
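Following the suggestion above to treat this as "full sequence in, two elements out", the targets can be built by locating the marker value and keeping it together with its successor. This is a minimal sketch in plain Python rather than code from the tutorial: `make_target` and `one_hot_encode` are hypothetical helper names, and the vocabulary size of 24 is an assumption based on the largest value (23) in the example rows.

```python
def make_target(sequence, marker=16):
    """Return [marker, next_value] for the first occurrence of marker."""
    for i in range(len(sequence) - 1):
        if sequence[i] == marker:
            return [sequence[i], sequence[i + 1]]
    raise ValueError("marker is not followed by another value in the sequence")


def one_hot_encode(sequence, n_unique):
    """Encode each integer as a one-hot row of length n_unique."""
    return [[1 if value == j else 0 for j in range(n_unique)] for value in sequence]


# the three example rows from the comment above
x_train = [
    [3, 1, 10, 14, 8, 20, 16, 7, 9, 19],
    [6, 1, 23, 16, 9, 12, 22, 8, 0, 17],
    [9, 13, 15, 12, 16, 2, 5, 1, 10, 8],
]
y_train = [make_target(row) for row in x_train]
print(y_train)

n_unique = 24  # assumed vocabulary size: integer values 0..23
X = [one_hot_encode(row, n_unique) for row in x_train]  # 3 samples x 10 steps x 24
y = [one_hot_encode(row, n_unique) for row in y_train]  # 3 samples x 2 steps x 24
```

With targets of this shape, an encoder-decoder without attention like the baseline in this tutorial would presumably use `RepeatVector(2)` so that the decoder emits two time steps.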
Paul January 7, 2018 at 2:27 am #

Is there an updated version of this example that uses TimeDistributed in Keras instead of _time_distributed_dense ?

Reply
- Jason Brownlee January 7, 2018 at 5:11 am #
  
  Not at this stage, I am waiting for attention to be officially supported by Keras.
  
  Reply
  - Denis January 12, 2018 at 3:50 am #
    
    About the _time_distributed_dense problem: I copied the code of the _time_distributed_dense function from the keras/python/keras/layers/recurrent.py file into the attention_decoder.py file (before the AttentionDecoder class) and Jason’s code worked for me.
    
    Reply
    - Jason Brownlee January 12, 2018 at 5:54 am #
      
      Nice!
      
      Reply
    - Jacob February 21, 2018 at 9:39 am #
      
      Can you send me the code to copy into the AttentionDecoder class? ninjajake@gmail.com
      
      Reply
    - Alimur Razi Rana August 9, 2018 at 3:02 pm #
      
      This trick saved my day. For anyone who wants the code: https://github.com/keras-team/keras/blob/b587aeee1c1be3633a56b945af3e7c2c303369ca/keras/layers/recurrent.py
      
      Reply
  - Abdi December 10, 2022 at 4:34 am #
    
    Keras.io now has an attention layer: https://keras.io/api/layers/attention_layers/attention/
    Is that the one you referred to? I didn’t completely understand which arguments to use between the encoder and decoder. Is any example available?
    
    Reply
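For the built-in layer linked above, one way to wire it between an encoder and a decoder is sketched here with the functional API. This is a rough sketch and not the tutorial's `AttentionDecoder`: it assumes TensorFlow 2.x with `tensorflow.keras`, the function name `build_attention_seq2seq` is hypothetical, and `Attention` is dot-product (Luong-style) attention rather than the additive attention used in this post. The layer is called with a list `[query, value]`; here the decoder outputs act as the query and the encoder outputs as the value.

```python
def build_attention_seq2seq(n_timesteps_in, n_timesteps_out, n_features, n_units=150):
    """Encoder-decoder with the built-in Keras Attention layer (sketch)."""
    # imported inside the function so the helper can be defined without TensorFlow installed
    from tensorflow.keras.layers import (Input, LSTM, RepeatVector, Attention,
                                         Concatenate, Dense, TimeDistributed)
    from tensorflow.keras.models import Model

    # encoder: keep the output at every input step so attention can look back at them
    encoder_inputs = Input(shape=(n_timesteps_in, n_features))
    encoder_seq, state_h, state_c = LSTM(
        n_units, return_sequences=True, return_state=True)(encoder_inputs)

    # decoder: fed the repeated final encoder state, initialised with the encoder states
    decoder_inputs = RepeatVector(n_timesteps_out)(state_h)
    decoder_seq = LSTM(n_units, return_sequences=True)(
        decoder_inputs, initial_state=[state_h, state_c])

    # attention: query = decoder outputs, value = encoder outputs
    context = Attention()([decoder_seq, encoder_seq])

    # combine each decoder output with its context vector before the softmax
    merged = Concatenate(axis=-1)([decoder_seq, context])
    outputs = TimeDistributed(Dense(n_features, activation="softmax"))(merged)

    model = Model(inputs=encoder_inputs, outputs=outputs)
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model
```

For example, `build_attention_seq2seq(5, 5, 50)` would build a model for 5 input steps, 5 output steps and 50 one hot encoded features. `AdditiveAttention` can be swapped in for `Attention` if a Bahdanau-style score is preferred.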
Nipun Batra January 13, 2018 at 8:28 am #

Hi Jason,
Many thanks for the excellent post (as always!)

I was wondering: could we have learnt a model for the continuous data? That is, instead of one hot encoding the input and the output, if we feed in the raw sequence? I wondered as I have not yet seen a Seq2Seq with attention for continuous data. I was thinking of writing a simple model based on your post, to denoise a sine signal. It should be a case of Seq2seq on same length sequences.

Reply
- Jason Brownlee January 14, 2018 at 6:33 am #
  
  Sure, but it may be a harder problem for the model to learn.
  
  Reply
  - Nipun Batra January 14, 2018 at 11:50 pm #
    
    Thanks! When I just used the code in this blog post as it is (with attention), I didn’t see any reduction in loss function when using continuous data. That’s when I wondered if the attention implementation shared here is only for discrete data?
    
    Reply
    - Jason Brownlee January 15, 2018 at 6:59 am #
      
      Nice.
      
      No, it is independent of the data.
      
      Reply
- Shreya Bhatia April 2, 2020 at 2:49 am #
  
  Hey Nipun Batra… I am doing my final year project based on continuous data. I was wondering if you can share with me the encoder-decoder code you have been able to make work for your purpose. Thank You Shreya
  
  Reply
moses January 30, 2018 at 5:40 pm #

How should the decoder distinguish between 0 and an empty row
(zero is a one hot encoded vector that has a 1 in the first entry, and empty is a row of zeros)?

Reply
- Jason Brownlee January 31, 2018 at 9:39 am #
  
  Sorry, not sure I follow. Are you able to give more context?
  
  Reply
moses January 31, 2018 at 12:55 am #

One more

If the number of features of the input differs from the number of features of the output, we have to change this line:
model.add(TimeDistributed(Dense(n_features, activation='softmax')))

Should we do more?

Reply
- Jason Brownlee January 31, 2018 at 9:46 am #
  
  Features or time steps in the output?
  
  Reply
  - moses February 1, 2018 at 10:31 pm #
    
    My question was about n_features (making it differ between input and output), but I believe that the length of the sequence matters too. The first one I resolved as I wrote above, though I am not sure it is enough.
    
    Reply
Nathan D. February 2, 2018 at 5:13 am #

Hi Jason,

It’s a great demonstration and thank you very much for that.

I am wondering if you are aware of any way to get back the attention vector *at*? Since it is not one of the model’s parameters, accessing it via keras.backend.get_value() does not seem to work. Thank you.

Reply
Amy February 15, 2018 at 4:16 pm #

Hi Jason,

Great tutorial! This really helped my understanding a lot. May I ask how to modify this attention seq2seq to a batched version? Thanks!

Reply
- Jason Brownlee February 16, 2018 at 8:32 am #
  
  How would the batch approach to updates impact attention?
  
  Reply
haya March 13, 2018 at 6:27 am #

Hi Jason,
in the step function in AttentionDecoder, can we use the Keras LSTM layer instead of building it from scratch?

Reply
- Jason Brownlee March 13, 2018 at 6:33 am #
  
  Not sure I follow. Perhaps try it.
  
  Reply
Paul March 20, 2018 at 2:21 am #

It seems that the poor score without attention is mostly due to an optimization problem. I am able to achieve >95% accuracy without attention by just using a reasonable batch size (32).

Reply
- Jason Brownlee March 20, 2018 at 6:26 am #
  
  Nice, thanks for the note Paul. What config did you use?
  
  Reply
Dan March 26, 2018 at 8:37 am #

I just wanted to let you know your hard work on this site is appreciated. It’s been incredibly helpful for me learning something so complex 😀

Thank you so much!

Reply
- Jason Brownlee March 26, 2018 at 10:05 am #
  
  Thanks, I’m glad to hear that.
  
  Reply
Eduardo April 23, 2018 at 5:44 pm #

Hi, thanks for the website. It is really saving me on my Bachelor thesis.

Can we train an SVM using the context vector?

Reply
- Jason Brownlee April 24, 2018 at 6:23 am #
  
  You’re welcome.
  
  Sure. To what end?
  
  Reply
jimbung April 23, 2018 at 8:08 pm #

hi Jason,
i met an issue when running the code.

Traceback (most recent call last):
File “gutils.py”, line 50, in
model.add(AttentionDecoder(150, n_features))
……
# calculate the attention probabilities
# this relates how much other timesteps contributed to this one.
et = K.dot(activations.tanh(_Wxstm + self._uxpb),
K.expand_dims(self.V_a))
……
File “/home/wanjb/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py”, line 421, in make_tensor_proto
raise ValueError(“None values not supported.”)
ValueError: None values not supported.

my environment:
Anaconda: conda 4.4.10
Python 3.6.4 :: Anaconda, Inc.

could u have a look on this? thanks!

Reply
- Jason Brownlee April 24, 2018 at 6:32 am #
  
  Sorry to hear that:
  
  – Are you able to confirm that TensorFlow and Keras are up to date?
  – Are you able to confirm that you copied all of the code?
  – Are you able to confirm that you are running the code from the command line?
  
  Reply

jimbung April 24, 2018 at 1:22 pm #

hi Jason,

the problem has been solved as ‘Denis January 12, 2018 at 3:50 am’ described.
thank you Denis!

from keras import backend as K

def _time_distributed_dense(x, w, b=None, dropout=None,
                           input_dim=None, output_dim=None, timesteps=None):
    '''Apply y.w + b for every temporal slice y of x.
    '''
    if not input_dim:
        # won't work with TensorFlow
        input_dim = K.shape(x)[2]
    if not timesteps:
        # won't work with TensorFlow
        timesteps = K.shape(x)[1]
    if not output_dim:
        # won't work with TensorFlow
        output_dim = K.shape(w)[1]

    if dropout:
        # apply the same dropout pattern at every timestep
        ones = K.ones_like(K.reshape(x[:, 0, :], (-1, input_dim)))
        dropout_matrix = K.dropout(ones, dropout)
        expanded_dropout_matrix = K.repeat(dropout_matrix, timesteps)
        x *= expanded_dropout_matrix

    # collapse time dimension and batch dimension together
    x = K.reshape(x, (-1, input_dim))

    x = K.dot(x, w)
    if b is not None:
        x = x + b
    # reshape to 3D tensor
    x = K.reshape(x, (-1, timesteps, output_dim))
    return x

Jason Brownlee April 24, 2018 at 2:50 pm #

Great!

Reply
Jack October 17, 2022 at 7:46 pm #

Hi
name ‘K’ is not defined

Reply

Jorn May 2, 2018 at 2:53 am #

Thank you very much for yet another great post! Would this architecture be suitable for time series forecasting where you have sequences of multiple features to forecast a sequence of a single target? The sequence lengths of the features are longer than the length of the target sequence to be forecast.
All the examples I have seen so far are showing one feature sequence as input to one output target sequence.

Reply
- Jason Brownlee May 2, 2018 at 5:46 am #
  
  Maybe, try it and see.
  
  Reply
Ahmad Aldali May 5, 2018 at 5:39 am #

Hi Jason,
Thank you for this information.
I have one question:
Can I use this implementation in my translation model?
I use an encoder-decoder as follows:

“””””
embedded_output = embedding_layer(embedding_input)

# ================================ Encoder ================================
encoder = LSTM(lstm_units, return_sequences=True, return_state=True, name='encoder')
encoder_outputs, state_h, state_c = encoder(embedded_output)
encoder_states = [state_h, state_c]

#….
embedding_Ar_input = Input(shape=(MAX_Ar_SEQUENCE_LENGTH,))
embedded_Ar_output = embedding_Ar_layer(embedding_Ar_input)

# ================================ Decoder ================================
# We set up our decoder to return full output sequences,
decoder_lstm = LSTM(lstm_units, return_sequences=True, return_state=True, name='decoder')

decoder_outputs, _, _ = decoder_lstm(embedded_Ar_output, initial_state=encoder_states)

# SoftMax
decoder_dense = Dense(output_vector_length, activation='softmax', name='softmax')
outputs_model = decoder_dense(attention)

“””””
And what is n_features? What does it represent?

Reply
- Ahmad Aldali May 5, 2018 at 5:42 am #
  
  Does n_features mean the maximum decoder sequence length?
  
  Reply
fatime May 7, 2018 at 8:10 pm #

Hi Jason, can you tell me what verbose actually does?

Reply
- Jason Brownlee May 8, 2018 at 6:11 am #
  
  It turns on output during training so that you can see what the model is doing during training (e.g. skill and progress).
  
  Reply
  - fatime May 8, 2018 at 8:14 pm #
    
    So, what is the difference between verbose=1, 2 or none, and which attention mechanism is the best for machine translation?
    
    Reply
    - Jason Brownlee May 9, 2018 at 6:20 am #
      
      Verbose 0 turns off verbose output, verbose 1 gives a progress bar, verbose 2 gives one line per epoch.
      
      See this post on good NMT architectures:
      https://machinelearningmastery.com/configure-encoder-decoder-model-neural-machine-translation/
      
      Reply
radhika May 8, 2018 at 8:49 pm #

hi, can we use this model for the translation of one language to another ?

Reply
- Jason Brownlee May 9, 2018 at 6:23 am #
  
  Here is an example:
  https://machinelearningmastery.com/develop-neural-machine-translation-system-keras/
  
  Reply
YoonseokHeo May 17, 2018 at 5:59 pm #

Thanks for the wonderful tutorial.
In the custom Keras attention layer (the AttentionDecoder class), I am wondering why the predicted word (yt) at time step t is computed by summing the previously generated word (ytm), the previous hidden state (stm), and the calculated context vector (context), each multiplied by its own weight matrix.

What you implemented is as follows:
yt = activations.softmax(
    K.dot(ytm, self.W_o)
    + K.dot(stm, self.U_o)
    + K.dot(context, self.C_o)
    + self.b_o)

I couldn’t find any mention of this except the very first definition for calculating the next word: p(y_i | y_1, …, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)

I am not sure whether this equation corresponds to the way you calculate yt.

Reply
- Jason Brownlee May 18, 2018 at 6:20 am #
  
  As mentioned, I did not implement the custom attention layer. I am not the best person to answer questions about it.
  
  Reply
chris May 20, 2018 at 1:49 am #

Hi Jason, I am trying to understand LSTMs and I am very new to them. Could you please explain the following code in simpler terms:
# define model
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))
model.add(LSTM(150, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='softmax')))

I understand the RepeatVector and TimeDistributed parts. What I do not understand are the 150 hidden units: do they have to be the same in both layers? And what happens if they are 1? It would be nice if you have a source with a visual explanation of the structure. Thank you in advance.

Reply
- Jason Brownlee May 20, 2018 at 6:39 am #
  
  You can change the number of units to anything you wish.
  
  Reply
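To get a feel for what the number of units controls, it can help to count trainable weights. The sketch below uses the standard parameter count for an LSTM layer, 4 * ((inputs + units) * units + units), with one set of weights for each of the four internal transforms (three gates plus the cell candidate). `lstm_param_count` is a hypothetical helper, and 50 input features is an assumed value.

```python
def lstm_param_count(n_inputs, n_units):
    """Trainable weights in one LSTM layer."""
    # input kernel + recurrent kernel + bias, repeated for the 4 internal transforms
    return 4 * ((n_inputs + n_units) * n_units + n_units)


n_features = 50  # assumed number of one hot encoded input features

encoder_150 = lstm_param_count(n_features, 150)  # first LSTM reads the raw features
decoder_150 = lstm_param_count(150, 150)         # second LSTM reads the repeated 150-dim encoding
encoder_1 = lstm_param_count(n_features, 1)      # the same encoder with a single unit

print(encoder_150, decoder_150, encoder_1)
```

The two layers do not have to use the same number of units. With a single unit, the encoder would have to squeeze the whole input sequence into one number, which is unlikely to be enough capacity for this problem.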
Rui May 28, 2018 at 1:04 am #

How could we apply this with multivariable time series ?

Reply
- Jason Brownlee May 28, 2018 at 6:02 am #
  
  Sure.
  
  Reply
Santy May 30, 2018 at 4:47 pm #

Hi Jason!

I am trying to understand image captioning with an attention model. I have seen your tutorial on image captioning. Can you please suggest some resources so that I can implement it using an attention model in Keras?

Thanks You!

Reply
- Jason Brownlee May 31, 2018 at 6:13 am #
  
  I am waiting for Keras to get official support for attention.
  
  Reply
Santy June 14, 2018 at 1:58 am #

Hi Jason !

I have gone through your tutorial on image captioning as given on following link.

https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/

Can we use the attention model you have given here for image captioning, where a CNN is used as the encoder and an RNN is used as the decoder?

Please suggest me.

Thank You.

Reply
- Jason Brownlee June 14, 2018 at 6:10 am #
  
  Perhaps.
  
  I hope to give more examples of attention when Keras officially supports it.
  
  Reply
Gang July 24, 2018 at 3:36 am #

Thanks for wonderful tutorials. I learned a lot from your site.

I tried your code. It seems that simply removing the first LSTM from the baseline model gives perfect predictions for this example. I am not sure the attention layer is necessary here.

model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
model.add(Dense(n_features, activation='softmax'))

Reply
- Jason Brownlee July 24, 2018 at 6:22 am #
  
  It may not be, it’s just a demonstration.
  
  Reply
Atilla August 2, 2018 at 11:00 pm #

Hi Jason, I want to build a model with a Bidirectional LSTM encoder and a Bidirectional LSTM decoder. Is it theoretically possible to use Bi-LSTMs in your proposed models?

Reply
- Jason Brownlee August 3, 2018 at 6:02 am #
  
  Sure.
  
  Reply
Elias Lousseief August 11, 2018 at 1:21 am #

Hi J! Thanks for a great hands-on tutorial … It works as intended and results are indeed improved with attention… however, when examining the at vector from attention_decoder, it does not show the desired activations…

Example:

Input: [29, 9, 47, 0, 12], output: [29, 9, 0, 0, 0] (correct)

at vector at first output (rounded): [6.2*10^(-12), 5.6*10^(-7), 1.5, 90.0, 8.4]

I would have expected the first of these numbers to be greatest as it should influence the output more than the remaining four… What do you think about this? Could you inspect the at vector and see if you get the same results?

Reply
- Jason Brownlee August 11, 2018 at 6:13 am #
  
  Nice observation, it may need further investigation.
  
  Reply
- Niranjan September 13, 2018 at 4:10 am #
  
  Even I am seeing the same thing. Most of the probabilities are on the last 3 digits and it is never on the first 2 digits.
  
  Thanks Jason for the great tutorial! Very helpful.
  
  Reply
  - Jason Brownlee September 13, 2018 at 8:06 am #
    
    You’re welcome.
    
    Reply
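When inspecting attention values like those above, it may help to recall how they are produced: the raw alignment scores for one output step are passed through a softmax so that they are positive and sum to 1, and the context vector is the weighted sum of the encoder outputs. A minimal sketch in plain Python, with made-up scores and encoder outputs used purely for illustration:

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(weights, encoder_outputs):
    """Weighted sum of encoder outputs, one weight per input time step."""
    n_dims = len(encoder_outputs[0])
    return [sum(w * h[d] for w, h in zip(weights, encoder_outputs))
            for d in range(n_dims)]


# hypothetical alignment scores for 5 input steps at one output step
scores = [2.0, 0.5, -1.0, 0.0, -0.5]
# hypothetical 2-dimensional encoder outputs, one per input step
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]

weights = softmax(scores)
context = context_vector(weights, encoder_outputs)
most_attended = max(range(len(weights)), key=lambda i: weights[i])

print([round(w, 3) for w in weights], most_attended)
```

If the weights from a trained model peak on a different step than expected, as reported above, that is worth investigating, although encoder states at later steps can also carry information about earlier inputs.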
Ling August 11, 2018 at 4:31 am #

Great work, Jason!

Reply
- Jason Brownlee August 11, 2018 at 6:13 am #
  
  You’re welcome.
  
  Reply
Nilanjan August 14, 2018 at 7:06 pm #

Hi Jason,

Thanks for the wonderful post. I have a small query. To give you context, I am working on text data, and in my case the input and output lengths are quite different. So I wanted to check whether we can tweak this code so that attention can be applied where the encoder and decoder have different lengths. It would be helpful if you could direct me to a resource where this has been implemented, or guide me on how to change the Attention class to incorporate this.

Thanks,
Nilanjan

Reply
- Jason Brownlee August 15, 2018 at 5:57 am #
  
  You might have to use a different implementation of attention.
  
  Reply
Md. Zakir Hossain August 14, 2018 at 11:37 pm #

Hi Jason,

Many thanks for very helpful posts. I have also gone through your image captioning code:

def define_model(vocab_size, max_length):
    # feature extractor model
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # decoder model
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # summarize model
    print(model.summary())
    plot_model(model, to_file='model.png', show_shapes=True)
    return model

How can we use this attention layer here? I would be grateful for your help.

Reply
- Jason Brownlee August 15, 2018 at 6:04 am #
  
  Sorry, I don’t have an example of attention for the photo captioning. I don’t want to give off the cuff advice without developing the code to back it up.
  
  Reply
Raza August 28, 2018 at 4:38 pm #

The aforementioned attention decoder doesn’t seem to work with Keras version 2.1.6 or versions above 2.0.0. Why is that?

Reply
- Jason Brownlee August 29, 2018 at 8:07 am #
  
  Sorry to hear that. Perhaps contact the developer of the new layer?
  
  Reply
  - Raza August 29, 2018 at 8:32 pm #
    
    Can the above method be used to correct contextual mistakes in sentences?
    e.g
    Input: fishing is suffering from fever
    Expected output: patient is suffering from fever.
    
    If not, what would you propose for such a problem statement?
    
    Reply
    - Jason Brownlee August 30, 2018 at 6:28 am #
      
      Maybe. Try it and see?
      
      Reply
Alice August 29, 2018 at 3:26 pm #

Hi Jason,

You used a list of random numbers in this post.

I have a list of numbers (not random) in sequence. How can I use my own numbers as input data to predict the next number as output? Could you give an example?

Thank you!

regards,
Alice

Reply
- Jason Brownlee August 30, 2018 at 6:26 am #
  
  What problem are you having exactly?
  
  Reply
Yasir September 5, 2018 at 8:48 am #

Hi, I want to find out if you have an example for the Copying Mechanism. Thanks.

Reply
- Jason Brownlee September 5, 2018 at 2:41 pm #
  
  What do you mean by “Copying Mechanism”?
  
  Reply
- jackzy September 7, 2018 at 7:25 pm #
  
  Do you mean pointer-network ?
  
  Reply
victor eloy September 13, 2018 at 11:04 pm #

A quick suggestion: if you replace your LSTM cells with GRU cells after you add the attention layer, you will be able to get 100% accuracy (which is pretty amazing).

Reply
- Jason Brownlee September 14, 2018 at 6:36 am #
  
  Cool! Nice tip.
  
  Reply
Gary September 20, 2018 at 1:34 am #

Hi, two questions for you:

1) If I have a CNN followed by LSTM and then I just add an Attention layer, is the architecture still an encoder-decoder?

2) can encoder-decoder model be used in sequence labeling (outputs are IOB labels only), and if yes, why are they not used very often in tasks like named entity recognition where LSTM-CRFs are more popular?

Reply
- Jason Brownlee September 20, 2018 at 8:05 am #
  
  Sure. I think about a CNN-LSTM as an encoder-decoder.
  
  Yes, CNNs and hybrids do very well in sequence classification. I have examples in my new book for human activity recognition.
  
  Also, they do very well with sequences of text, e.g. state of the art for sentiment analysis (a sequence classification task).
  
  Reply
Dave October 11, 2018 at 11:34 am #

Hi Jason,
Enjoyed the post on attention and was able to get your example running great, then modified it to use some real-world data with interesting preliminary results. Ran into a problem when I saved the model then tried to reload it. Seems it didn’t recognize the AttentionDecoder. Has anyone else run into this? Are you aware of any fix?
Thanks,
Dave

Reply
- Jason Brownlee October 11, 2018 at 4:14 pm #
  
  I have not tried to save the model, perhaps it requires special handling to save the custom layer.
  
  Reply
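Keras generally needs to be told about custom layer classes when loading a saved model, via the `custom_objects` argument of `load_model`. A sketch, assuming the layer is importable from the `attention_decoder.py` file used in this tutorial and that the layer implements `get_config()` (not verified here); `load_attention_model` and the file name `model.h5` are hypothetical.

```python
def load_attention_model(path, attention_layer_cls):
    """Load a saved Keras model that contains a custom attention layer."""
    # imported inside the function so the helper can be defined without Keras installed
    from keras.models import load_model

    # map the class name stored in the saved file to the Python class
    custom_objects = {attention_layer_cls.__name__: attention_layer_cls}
    return load_model(path, custom_objects=custom_objects)


# usage, assuming attention_decoder.py is on the path and model.save('model.h5') was called:
# from attention_decoder import AttentionDecoder
# model = load_attention_model('model.h5', AttentionDecoder)
```

If loading still fails, saving only the weights with `model.save_weights()` and rebuilding the architecture in code before calling `model.load_weights()` is a common fallback.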
Judd Brau October 29, 2018 at 12:16 pm #

Hi Jason,

In this article you use the AttentionDecoder model for Seq2Seq learning, but could this model be used to get a context vector for text classification? For example, could this be used to turn a variable length LSTM output into an input for a Feed-forward NN?

Reply
- Jason Brownlee October 29, 2018 at 2:13 pm #
  
  I’m not sure I follow, sorry. Perhaps you could elaborate?
  
  Reply
  - Judd brau October 30, 2018 at 8:47 am #
    
    Sure. Could this model be used to get a vector whose value represents the meaning of the entire text? I’m no expert, but when I was researching text classification I saw lots of papers talking about attention mechanisms, in particular this one: http://univagora.ro/jour/index.php/ijccc/article/view/3142/pdf
    
    Could the model that you show in this article be used for that purpose too?
    
    Reply
    - Jason Brownlee October 30, 2018 at 2:10 pm #
      
      Perhaps. Sorry, I don’t have a tutorial on LSTMs with attention for text classification.
      
      Reply
Jairo November 17, 2018 at 1:25 am #

Thanks for your help, Jason. Do you think it’s practical to try to implement Attention using the functional API instead of using a pre-built layer and Sequential?

Reply
- Jason Brownlee November 17, 2018 at 5:47 am #
  
  Sure.
  
  Reply
Zh LM November 18, 2018 at 10:59 pm #

Hi Jason,

Why do we train on one sample per epoch and not more?

# train LSTM
for epoch in range(5000):
    # generate new random sequence
    X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
    # fit model for one epoch on this sequence
    model.fit(X, y, epochs=1, verbose=2)

Reply
- Jason Brownlee November 19, 2018 at 6:46 am #
  
  We are manually controlling the training epochs.
  
  Reply
  - Zh LM November 19, 2018 at 6:50 pm #
    
    But when I modified the Seq2Seq without attention code as:
    
    batch_size = 10
    epochs = 10
    
    # train LSTM
    X_data = []
    y_data = []
    for sample in range(5000):
        # generate new random sequence
        X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
        X_data.append(X)
        y_data.append(y)
    
    X_data = array(X_data).reshape(5000, X.shape[1], X.shape[2])
    y_data = array(y_data).reshape(5000, X.shape[1], X.shape[2])
    
    model.fit(X_data, y_data, batch_size = batch_size, epochs = epochs, verbose = 2)
    
    I got a much higher test accuracy (91.00%) than yours (19.00%).
    Does this mean that your networks are not trained well when you train on one sample per epoch?
    
    Reply
    - Jason Brownlee November 20, 2018 at 6:34 am #
      
      Perhaps.
      
      Reply
Malik December 14, 2018 at 12:12 am #

I have a question related to attention. You have already shared “Multivariate Time Series Forecasting with LSTMs in Keras”:
https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/

So my question is: is attention better than a plain LSTM for the same example? Do I need to modify it to use attention?

I just try to get a better understanding of ATTENTION

Reply
- Jason Brownlee December 14, 2018 at 5:32 am #
  
  Perhaps try it and compare results.
  
  Reply
Koon Wai Choong December 16, 2018 at 10:03 pm #

Hi Jason,

As always excellent tutorial !

Could you please explain how to pad my own time series instead of using generate_sequence()

X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)

I try to use my own time series for X,y

Thank you sir

Regards,

Joe

Reply
- Jason Brownlee December 17, 2018 at 6:21 am #
  
  Perhaps this will help:
  https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/
  
  Reply
  - Koon Wai Choong December 17, 2018 at 7:05 pm #
    
    Got it. Thank you
    
    Reply
Jenna Ma December 21, 2018 at 1:28 pm #

Do we have Attention in Keras library up to now?
You said it was coming soon in this post. 🙂
Thank you in advance.

Reply
- Jason Brownlee December 21, 2018 at 3:18 pm #
  
  It does not look like it is there yet (!!!), perhaps when TensorFlow 2.0 comes out.
  
  Reply
xiaoxx December 25, 2018 at 7:52 am #

Hi Jason,

I got this problem when I was trying exactly the same code:

ValueError: Dimensions must be equal, but are 150 and 50 for ‘AttentionDecoder/MatMul_4’ (op: ‘MatMul’) with input shapes: [?,150], [50,150].

Could you tell me what happened?

Reply
- Jason Brownlee December 26, 2018 at 6:40 am #
  
  Perhaps the API has changed and the code does not work with the latest version of Keras?
  
  What version of Keras are you using?
  
  Reply
xiaoxx December 27, 2018 at 4:43 am #

keras version ： 2.2.0

tensorflow version ： 1.12.0

Reply
Zalman January 2, 2019 at 5:14 am #

Hi Jason,
First, thank you for this article!

I have a straightforward network and i’m trying to use this AttentionDecoder and I get:
“Input 0 is incompatible with layer AttentionDecoder”

My network:
model = Sequential()

model.add(LSTM(500, input_shape=(None, 145), init="he_normal", return_sequences=True))
model.add(Dropout(0.2))

model.add(LSTM(500, input_shape=(None, 145), init="he_normal", return_sequences=False))
model.add(Dropout(0.2))

model.add(AttentionDecoder(500, 145, 'softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Any idea?

Reply
- Jason Brownlee January 2, 2019 at 6:42 am #
  
  The layer may no longer work with the latest version of Keras.
  
  Reply
  - Zalman January 2, 2019 at 6:43 am #
    
    I’m working with keras 2.1.2, is it compatible?
    
    Reply
    - Jason Brownlee January 2, 2019 at 6:44 am #
      
      It should be. Perhaps try another version in the 2.xx range.
      
      Reply
Zalman January 2, 2019 at 6:56 am #

So any idea what the “Input 0 is incompatible with layer AttentionDecoder” could be?

Reply
- Jason Brownlee January 2, 2019 at 7:48 am #
  
  Not off hand, perhaps try a different Keras?
  
  Reply
  - Zalman January 2, 2019 at 8:15 am #
    
    What is the version you worked on? I’ll try the same
    
    Thanks!
    
    Reply
    - Jason Brownlee January 2, 2019 at 12:00 pm #
      
      2.0.8 or around there.
      
      Reply
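One possible cause, offered as a guess rather than a confirmed diagnosis: in the network above the second LSTM uses `return_sequences=False`, so it outputs a 2D tensor, whereas in this tutorial the attention layer is placed after an LSTM with `return_sequences=True` and therefore receives a 3D sequence. The sketch below only traces the shapes in plain Python; `lstm_output_shape` is a hypothetical helper, not a Keras function.

```python
def lstm_output_shape(input_shape, units, return_sequences):
    """Output shape of an LSTM layer for a (batch, timesteps, features) input."""
    batch, timesteps, _ = input_shape
    if return_sequences:
        # one output vector per time step: 3D
        return (batch, timesteps, units)
    # only the output of the last time step: 2D
    return (batch, units)


input_shape = (None, None, 145)  # batch size and sequence length left unspecified
after_first = lstm_output_shape(input_shape, 500, return_sequences=True)
after_second = lstm_output_shape(after_first, 500, return_sequences=False)

print(after_first)   # shape passed to the second LSTM
print(after_second)  # shape passed to the attention layer
```

If this is the cause, setting `return_sequences=True` on the LSTM directly before `AttentionDecoder` may resolve the incompatibility.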
Kushal Davendra January 5, 2019 at 9:24 pm #

Hi Jason,

I am trying to use your attention network to learn seq2seq machine translation with attention. My source lang output vocab is of size 32,000 and target vocab size 34,000. The following step blows up the RAM usage while making the model (understandably, as its trying to manage a 34K x 34K float matrix): It fails as it goes above the 2G protobuf limit.

self.W_o = self.add_weight(shape=(self.output_dim, self.output_dim),
                           name='W_o',
                           initializer=self.recurrent_initializer,
                           regularizer=self.recurrent_regularizer,
                           constraint=self.recurrent_constraint)

Here is my model:
n_units:128, src_vocab_size:32000,tar_vocab_size:34000,src_max_length:11, tar_max_length:11

def define_model(n_units, src_vocab_size, tar_vocab_size, src_max_length, tar_max_length):
    model = Sequential()
    model.add(Embedding(src_vocab_size, n_units, input_length=src_max_length, mask_zero=True))
    model.add(LSTM(n_units, return_sequences=True))
    model.add(AttentionDecoder(n_units, tar_vocab_size))
    return model

Is there any way around the add_weight step, which adds a variable of size output_dim * output_dim to the network?

Reply
- Jason Brownlee January 6, 2019 at 10:17 am #
  
  Perhaps use a smaller sample of data or try progressive loading?
  
  Reply
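The size of that square weight matrix can be estimated directly. A back-of-the-envelope sketch, assuming 32-bit floats (4 bytes each), for the `W_o` weight of shape `(output_dim, output_dim)` quoted above; `square_weight_bytes` is a hypothetical helper.

```python
import math


def square_weight_bytes(output_dim, bytes_per_float=4):
    """Memory needed for an (output_dim x output_dim) float matrix."""
    return output_dim * output_dim * bytes_per_float


limit = 2 * 1024 ** 3              # the 2 GB protobuf limit mentioned above
size = square_weight_bytes(34000)  # target vocabulary of 34,000

# largest output_dim whose square float32 matrix alone stays under the limit
max_dim = math.isqrt(limit // 4)

print(round(size / 1e9, 2), "GB")
print(max_dim)
```

So this single weight is more than twice the limit on its own, before counting any other weights, and a smaller sample of training data would not shrink it. Reducing the target vocabulary, for example by mapping rare words to an unknown token, is one common way to make it smaller.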
Jenna Ma January 8, 2019 at 6:49 pm #

GREAT tutorial!
It is brilliant to add 0 to make sure the n_timestep equal for input and output. It helps me a lot! Thank you!
Since Keras eliminated _time_distributed_dense, the author who developed AttentionDecoder has updated his code with a tdd.py. You might be interested in updating this post so that the tutorial works under newer versions of Keras. 🙂

Reply
- Jason Brownlee January 9, 2019 at 8:42 am #
  
  Thanks for the tip.
  
  I note the built-in implementation of attention in Keras is nearly ready for release. Perhaps in the next version of Keras!
  
  Reply
  - NISHANK GARG February 10, 2019 at 9:28 pm #
    
    Thanks for the amazing post.
    
    Please update this post for keras with attention. I need it urgently.
    
    Reply
    - Jason Brownlee February 11, 2019 at 7:58 am #
      
      Thanks. I am waiting for Keras to officially support attention.
      
      Reply
Kartik Sharma January 12, 2019 at 7:41 am #

__init__() takes 2 positional arguments but 3 were given

plz help,
thanks

Reply
- Jason Brownlee January 13, 2019 at 5:37 am #
  
  The most recent Keras API may not support this attention layer.
  
  Reply
Victor Calle January 15, 2019 at 7:06 am #

Hi Jason! Could you explain how to get the output of the encoder part of the encoder-decoder model with attention?

Reply
Navneet Singh January 22, 2019 at 5:31 am #

Hi Jason,
I am getting an error on line 158 in the 'attention_decoder.py' file

Line:
self.b_p = self.add_weight(shape=(self.units, ),
                           name='b_p',
                           initializer=self.bias_initializer,
                           regularizer=self.bias_regularizer,
                           constraint=self.bias_constraint)

Error is as below :
Dimensions must be equal, but are 150 and 50 for ‘AttentionDecoder/MatMul_4’ (op: ‘MatMul’) with input shapes: [?,150], [50,150].

Can you please help me resolve this error it would be of great help.
Thanks in advance.

Reply
nandini January 25, 2019 at 6:26 pm #

I have a requirement for a chatbot application using an RNN: after training on a large amount of data, the chatbot needs to remember the previous conversation, at least 3 or so sentences before the current one.

Is this possible? If yes, please suggest how to go about meeting this requirement.

Kindly provide any links or articles related to this requirement.

thanks in advance

Reply
- Jason Brownlee January 26, 2019 at 6:11 am #
  
  Sorry, I don’t have any tutorials on chat bots. I cannot give you good advice.
  
  Reply
Maddy February 19, 2019 at 7:27 am #

Hi Jason, thanks a lot for the post! it is amazing.

When I try to use the same attention model to predict a univariate multi-step time series, for example, use [1, 2, 4, 2, 3] to predict [2, 4, 2, 3, 6], the predicted outputs are all 1s [1,1,1,1,1]. Do you know how I should fix the model? (Because it is a time-series problem, I didn’t do one-hot encoding as listed in your post during data preparation.) Thanks!

model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
model.add(AttentionDecoder(150, n_features))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

Reply
- Jason Brownlee February 19, 2019 at 7:29 am #
  
  Perhaps the attention layer is not supported in the latest version of Keras?
  
  Reply
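One possible cause of constant predictions of 1, offered as a hedged sketch rather than a confirmed diagnosis: if the decoder's output step applies a softmax across the feature axis (which the custom AttentionDecoder used in this tutorial appears to do), then with a single output feature the softmax has only one value to normalize and always returns 1.0. The plain-Python snippet below only demonstrates that property of softmax; the `softmax` helper is written here for illustration and is not part of Keras.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# With one output feature, every raw score maps to probability 1.0.
single_feature_outputs = [softmax([score])[0] for score in (-3.2, 0.0, 7.5)]
print(single_feature_outputs)

# With several features (e.g. a one-hot vocabulary), the outputs vary.
multi_feature_output = softmax([1.0, 2.0, 3.0])
print(multi_feature_output)
```

If that assumption holds for your version of the layer, a real-valued forecast would need a linear output and a regression loss such as mean squared error rather than categorical cross-entropy.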
Max Power March 2, 2019 at 12:57 pm #

Hi Jason,

Your page is truly great, thanks for all the work you put into this. This article especially is very helpful.
Still, I’m struggling with something: how can I change the TimeDistributed layer so that the order of the sequence within the output is treated as arbitrary? I would like to use the Jaccard distance and therefore don’t care about the order of the output elements as long as they are all present.

I would like to do something like the following but cannot get it right:
model.add(Dense(2D(n_timeSteps_out ,n_Features), activation=’relu, axis=0′))
model.add(Dense(n_Features , activation=’softmax’))

Thanks for all your work here!

Best,
Max

Reply
- Jason Brownlee March 3, 2019 at 7:57 am #
  
  When using this implementation of attention, the number of inputs and outputs must match.
  
  Perhaps try using an encoder-decoder directly, then change the value in the RepeatVector to change the number of output steps?
  
  Reply
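The RepeatVector suggestion can be pictured with a small sketch. Keras's `RepeatVector(n)` turns a 2D input of shape (samples, features) into a 3D output of shape (samples, n, features) by copying each vector n times, so n sets the number of decoder time steps independently of the input length. The `repeat_vector` function below is a hypothetical plain-Python stand-in written only to show the shapes; it is not the Keras layer.

```python
def repeat_vector(batch, n_steps):
    """Copy each encoded vector n_steps times.

    (samples, features) -> (samples, n_steps, features)
    """
    return [[list(vector) for _ in range(n_steps)] for vector in batch]

# Two samples, each encoded to a 3-value vector by a hypothetical encoder.
encoded = [[0.1, 0.2, 0.3],
           [0.4, 0.5, 0.6]]

# Ask for 4 output steps, regardless of how long the input sequence was.
decoder_input = repeat_vector(encoded, n_steps=4)

shape = (len(decoder_input), len(decoder_input[0]), len(decoder_input[0][0]))
print(shape)  # (2, 4, 3)
```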
zied March 21, 2019 at 3:19 am #

When I try to run the code I get this error:
TypeError: The added layer must be an instance of class Layer. Found:
and I can’t find a solution.
Thank you for your help.

Reply
- Jason Brownlee March 21, 2019 at 8:20 am #
  
  Sorry to hear that, I have not seen this error.
  
  Perhaps confirm that your Keras library is up to date and that you copied all of the code exactly?
  
  Reply
  - zied March 21, 2019 at 7:29 pm #
    
    Thank you very much for your response. I have the old version of Keras, 2.0.8; the new one generates an error because I cannot import _time_distributed_dense.
    I added the attention layer to your machine translation code:
    
    def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
        model = Sequential()
        model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
        model.add(LSTM(n_units))
        model.add(AttentionDecoder(n_units, tar_vocab))
        return model
    
    Reply
    - saria March 29, 2019 at 11:25 am #
      
      This is because the new version of Keras no longer supports _time_distributed_dense.
      
      You can fix it with this StackOverflow answer:
      
      https://stackoverflow.com/questions/45631235/importerror-cannot-import-name-timedistributeddense-in-keras
      
      Reply
      - Olatunji Omisore November 25, 2020 at 3:54 pm #
        
        Hi,
        
        Please, how did you fix this issue? I have the same problem and have been unable to move on with a project for a while.
        
        Thanks
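For readers hitting the same import error, one possible workaround is to define the missing helper yourself in the attention layer file. The sketch below does not reproduce the Keras function; it only illustrates, in plain Python, the computation a time-distributed dense step performs, namely applying the same weight matrix and bias to every time step of every sample. The name `time_distributed_dense` and the toy weights are assumptions for illustration.

```python
def time_distributed_dense(x, w, b):
    """Apply y_t = x_t . w + b to every time step t of every sample.

    x: nested list shaped (samples, timesteps, input_dim)
    w: nested list shaped (input_dim, output_dim)
    b: list of length output_dim
    """
    output_dim = len(b)
    return [
        [
            [sum(step[i] * w[i][j] for i in range(len(step))) + b[j]
             for j in range(output_dim)]
            for step in sample
        ]
        for sample in x
    ]

# One sample, two time steps, input_dim=2 projected to output_dim=3.
x = [[[1.0, 2.0], [3.0, 4.0]]]
w = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
b = [0.5, 0.5, 0.5]

y = time_distributed_dense(x, w, b)
print(y)  # [[[1.5, 2.5, 3.5], [3.5, 4.5, 7.5]]]
```

A backend version would typically do the same thing by reshaping the 3D tensor to 2D, taking one matrix product, and reshaping back.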
saria March 29, 2019 at 11:23 am #

Thank you, Jason, for the great post.
Can you please explain based on what logic we may choose the n_timesteps_out?
Please give a real-world example where this number can come from.

Thanks!

Reply
- Jason Brownlee March 29, 2019 at 2:02 pm #
  
  Yes, I have tens of examples, you can get started here:
  https://machinelearningmastery.com/start-here/#deep_learning_time_series
  
  Reply
saria March 29, 2019 at 11:27 am #

I mean a different number for n_timesteps_out. Here you chose 2; in which situations might we choose a different number?

Reply
- Jason Brownlee March 29, 2019 at 2:03 pm #
  
  You can perform a sensitivity analysis to discover what works best for your specific model and dataset.
  
  Reply
saria March 29, 2019 at 11:43 am #

It would be great if you could give me another real-world application other than translation and text generation. As I understand it, in translation, for each sentence in one language we build seq_out from the corresponding sentence in the other language.
In text generation, I think we should give the same seq_in to seq_out.

But you chose timestamp_out=2. I would like to know why you used this, and in which real-world cases we might choose timestamp_out=1, timestamp_out=2, timestamp_out=3, or whatever.

Reply
saria April 11, 2019 at 2:11 pm #

Hi Jason,
I have a question here. my model looks like this without employing attention layer:

inputs = Input(shape=(SEQUENCE_LEN, EMBED_SIZE), name="input")
encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(inputs)
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = Bidirectional(LSTM(EMBED_SIZE, return_sequences=True), merge_mode="sum", name="decoder_lstm")(decoded)
autoencoder = Model(inputs, decoded)

Does it make sense to change it to the code below to embed the attention layer in my model?

inputs = Input(shape=(SEQUENCE_LEN, EMBED_SIZE), name="input")
encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(inputs)
attention = AttentionDecoder(LATENT_SIZE, n_features)
autoencoder = Model(inputs, attention)

It complains that one argument is missing!

Thanks for your help

Reply
- Jason Brownlee April 11, 2019 at 2:22 pm #
  
  Sorry, I cannot debug your code, this may help:
  https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code
  
  Reply
ben kubi April 17, 2019 at 1:41 am #

hello
I tried running this code but I always run into this:
ImportError: cannot import name ‘_time_distributed_dense’

Reply
- Jason Brownlee April 17, 2019 at 7:03 am #
  
  I believe it is no longer supported with the latest version of Keras.
  
  Reply
NookLook April 17, 2019 at 10:33 pm #

Hi Jason,
When dealing with inputs of variable timesteps, I can modify the input like this:
input = Input(shape=(None, n_features)), followed by
encoded = LSTM(….)(input)

but what should I do about the repeat in the next line?
decode = RepeatVector(???)(encoded)

I tried setting None and shape[1], but neither worked.

Reply
- Jason Brownlee April 18, 2019 at 8:45 am #
  
  Perhaps try this tutorial:
  https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
  
  Reply
Antonio May 1, 2019 at 12:44 am #

Dear Dr. Jason,

Do you think that the LSTM encoder-decoder with attention might have potential for a time series forecasting problem (aircraft fuel consumption forecasting) with multivariate input (8 variables of sensor data) and univariate multi-step output (future fuel consumption for the next X time steps)?

Best

Reply
- Jason Brownlee May 1, 2019 at 7:07 am #
  
  Perhaps start with a vanilla LSTM, or even a linear method and go from there.
  
  Reply
Alexey May 2, 2019 at 2:05 am #

Is using Attention (paper 2014) in Autoencoder a cheat? Because the decoder will know about all encoder states and can decode with 100% accuracy and bottleneck will be useless. Am I wrong?

Reply
- Jason Brownlee May 2, 2019 at 8:06 am #
  
  I don’t think so.
  
  Reply
Asha May 10, 2019 at 12:14 am #

Hi Jason,
Thanks for a wonderful tutorial! I am trying to use this to see if I can solve the problem of text correction, where the input is an incorrect sentence and the output is the correct one.
In a typical encoder-decoder architecture, I understand the encoder’s cell states have to be passed to the decoder. I wasn’t sure if this was happening in this one.
Can you please confirm?

Reply
- Jason Brownlee May 10, 2019 at 8:18 am #
  
  Sounds like a fun problem.
  
  Perhaps compare your approach to what others have described in the literature?
  
  Reply
Adam Oudad May 11, 2019 at 2:17 am #

Hi, thanks for this tutorial,

Why is the Encoder-Decoder without attention unable to correctly predict the second integer in the sequence ? Is it a problem of vanishing gradient ? I would not have thought vanilla LSTM to perform so badly (20% accuracy…) in this simple autoencoder application.

Thanks for any suggestion.

Reply
- Jason Brownlee May 11, 2019 at 6:18 am #
  
  The problem was designed to be hard for an encoder-decoder and easy for the same model with attention.
  
  It could be any one of many reasons, e.g. insufficient capacity.
  
  Reply
Mayra May 11, 2019 at 3:28 am #

Hi Jason,

Thank you very much for the blog. It has been of great assistance. Could you give me your opinion on the following matter please? Is it possible to develop an LSTM Autoencoder Model with Attention for the reconstruction of the input in Keras? Any hints about how I should adapt the method demonstrated in the example?

Thanks in advance,

Reply
- Jason Brownlee May 11, 2019 at 6:19 am #
  
  Yes, see this post:
  https://machinelearningmastery.com/lstm-autoencoders/
  
  Reply
Alexander May 14, 2019 at 10:29 pm #

Hi Jason,
thank you very much for your great tutorial.
I want to implement an Encoder-Decoder model with attention and teacher forcing. Francois Chollet implemented a seq2seq model with 1 encoder input, 1 decoder input and 1 decoder target sequence, where the two sequences for the decoder differed by one time-step (teacher forcing).
As I understand, GRU uses teacher forcing by default (Bahdanau et al (2015), p.13, A.2.2).
I am puzzled about whether the custom layer uses the true y values to predict y conditional on x.
In your model with attention, the AttentionDecoder receives for each unit in the LSTM layer for each time step an encoded value, so 150 x 5, since return_sequence=True is hard coded:
Line 47: model.add(AttentionDecoder(150, n_features))
In the custom layer code by Zafarali Ahmed, I suppose these encoded sequences are saved to the cell in the call(self, x) definition in Line 200 with
self.x_seq = x
Correct?
In the step function definition I found the first hint to y.
Line 227 ytm, stm = states
I noticed how the y values are imported, but these are constructed within the very same Recurrent cell (Line 67 self.states = [None, None] # y, s )
So, I can’t find that the ground truth values are imported at any point. Only the forecasted values of the cell itself are used for the step function (Line 278). Is that correct?
My approach would be, to replace x in the call function by a list of the encoded sequence and the ground truth (but offset by one time step). What do you think?

Reply
- Jason Brownlee May 15, 2019 at 8:16 am #
  
  You must implement teacher forcing, it does not come for free.
  
  You can learn more about teacher forcing here:
  https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/
  
  I often recommend using an autoencoder based encoder-decoder for LSTMs, you can use the dynamic RNN approach here:
  https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/
  
  Reply
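As a rough illustration of what implementing teacher forcing involves on the data side: the decoder is fed the target sequence shifted right by one step, with a start-of-sequence marker in front, and is trained to predict the unshifted target. The helper name `make_teacher_forcing_pair` and the start token value 0 are assumptions chosen for this sketch.

```python
def make_teacher_forcing_pair(target, start_token=0):
    """Return (decoder_input, decoder_target) for one target sequence.

    decoder_input is the target shifted right by one step, so that at
    step t the decoder sees the true value from step t-1.
    """
    decoder_input = [start_token] + list(target[:-1])
    decoder_target = list(target)
    return decoder_input, decoder_target

target_sequence = [7, 3, 9, 4]
decoder_input, decoder_target = make_teacher_forcing_pair(target_sequence)

print(decoder_input)   # [0, 7, 3, 9]
print(decoder_target)  # [7, 3, 9, 4]
```

At inference time the true values are not available, so the decoder's own previous prediction takes the place of the shifted target.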
Bernardo May 28, 2019 at 8:08 pm #

Hi Jason, great tutorial!

I want to ask for your advice. I’m studying for my thesis in Machine Learning.
Basically I have words, and each character of a word is represented by an integer.
For instance, the word ‘develop’ is represented by the sequence [4 5 22 5 12 15 16].
I’m training a recurrent neural network that takes the sequence ‘develo’ as input and should predict the following character, ‘p’. I have tried to use your attention layer, giving the subsequence [4 5 22 5 12 15] as X and [16 0 0 0 0] as y. In this case the accuracy is very low, 25%, depending on how big the dataset is; I have never obtained a high result. Maybe I’m not using the attention layer properly.
So now I’m training the RNN with the sequence [4 5 22 5 12 15 16] as X and the sequence [0 0 0 0 0 0 16] as y. Now the accuracy is very high, but I think that is because I’m overfitting.
Do you think the attention layer could be used properly in my case? How?
Thank you!

Reply
- Jason Brownlee May 29, 2019 at 8:41 am #
  
  No attention is needed, I think this post will help:
  https://machinelearningmastery.com/develop-character-based-neural-language-model-keras/
  
  Reply
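One way to frame the next-character task described above is as many-to-one prediction: each training example is a prefix of the word together with the single character that follows it, rather than a padded output sequence. The sketch below uses the integer coding from the question (a=1, ..., z=26); the helper names are illustrative and not taken from the linked post.

```python
def encode_word(word):
    """Map lowercase letters to integers with a=1, ..., z=26."""
    return [ord(ch) - ord("a") + 1 for ch in word]

def make_next_char_pairs(word, min_prefix=1):
    """Return (prefix, next_char) integer pairs for one word."""
    codes = encode_word(word)
    return [(codes[:i], codes[i]) for i in range(min_prefix, len(codes))]

pairs = make_next_char_pairs("develop")
for prefix, next_char in pairs:
    print(prefix, "->", next_char)

# The last pair is the example from the question:
# [4, 5, 22, 5, 12, 15] -> 16
```

In practice the prefixes would still need padding to a common length and the target would usually be one-hot encoded before training.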
Jack June 4, 2019 at 1:19 pm #

Hi Jason,
Thank you for bringing such a concise tutorial. I have some questions: I want to know if AttentionDecoder can be used in a CNN-LSTM encoder-decoder model, as in your examples in this tutorial (https://machinelearningmastery.com/how-to-develop-lstm-models-for-multi-step-time-series-forecasting-of-household-power-consumption/). I want to know how to use attention to improve the CNN-LSTM model. Please give me some detailed instructions, thank you.

Reply
- Jason Brownlee June 4, 2019 at 2:26 pm #
  
  Perhaps. Sorry, I cannot prepare an example for you.
  
  Reply
Alex June 21, 2019 at 7:59 am #

Hi Jason! Great tutorial, thanks!

As Zh LM said, a properly chosen batch size or choice of parameters generally leads to increased accuracy. So in this simple case the difference is in speed of convergence rather than in final accuracy.

Encoder-Decoder Model
Train on 4500 samples, validate on 500 samples
…….
Epoch 150/150
4500/4500 [==============================] – 3s 563us/step – loss: 9.9794e-05 – acc: 1.0000 – val_loss: 0.0067 – val_acc: 0.9976
100.0
Mean Accuracy: 100.00%

Encoder-Decoder With Attention Model
Train on 4500 samples, validate on 500 samples
….
Epoch 150/150
4500/4500 [==============================] – 3s 742us/step – loss: 1.8149e-05 – acc: 1.0000 – val_loss: 0.0021 – val_acc: 0.9992
100.0
Mean Accuracy: 100.00%

Reply
- Jason Brownlee June 21, 2019 at 2:01 pm #
  
  Thanks.
  
  Reply
Saichand July 17, 2019 at 4:14 pm #

Hi Jason,

It was a great tutorial. I have seen that keras.layers.recurrent no longer works. What is the new solution? I am getting this error —– TypeError: __init__() missing 1 required positional argument: ‘cell’
when I am Using AttentionDecoder(256, 300) after an lstm layer.

Reply
- Jason Brownlee July 18, 2019 at 8:20 am #
  
  You may want to use this code with an older version of Keras.
  
  Reply
  - Saichand July 19, 2019 at 8:33 pm #
    
    Can we use an attention layer to identify duplicate/non-duplicate sentences? If yes, how?
    I am currently using an LSTM layer for each sentence, then concatenating them and passing the concatenated layer through dense layers to give a prediction. I now want to use an attention layer to improve my predictions. Where should I use the attention layer, and how?
    Please help me with this.
    
    Reply
    - Jason Brownlee July 20, 2019 at 10:52 am #
      
      You could use a normal python program with an if statement to detect duplicate sentences.
      
      Reply
  - jorge September 20, 2021 at 11:52 pm #
    
    Sorry, you can not use RNN instead of recurrent
    
    Reply
joyce July 22, 2019 at 12:01 pm #

hello Jason.
For a time-series prediction problem, I want to use a multi-attention encoder-decoder model.
With your implementation above, can I use an attention layer before my encoder model?
First I want to check which of my characteristics is more important, and then I will use an attention layer after my encoder model and before my decoder model.

So here is my question: I don’t know which is right. Please help me. Thanks.

Reply
- Jason Brownlee July 22, 2019 at 2:07 pm #
  
  Perhaps start with a simpler model described here:
  https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
  
  Reply
Pranjal July 26, 2019 at 4:13 pm #

Jason, this tutorial, although very helpful, is now very old (2017). Could you please, if you get the time, make an updated tutorial using the tensorflow.keras layers where you use TensorFlow’s implementation of attention? I am unable to find any tutorial for beginners anywhere. Also, since my application is for production use, relying on outdated packages that may have vulnerabilities won’t really help. Thank you.

Reply
- Jason Brownlee July 27, 2019 at 6:06 am #
  
  I hope to write new tutorials when Keras officially supports attention:
  https://github.com/keras-team/keras/pull/11421
  
  Reply
joyce August 5, 2019 at 11:43 pm #

https://github.com/andhus/keras/pull/6/files#diff-b4e22ccac72c2e1c47c8ea1ad67cf592
This is the latest.

Reply
- Jason Brownlee August 6, 2019 at 6:39 am #
  
  I’m following along here:
  https://github.com/keras-team/keras/pull/11421
  
  Reply
Kevin August 24, 2019 at 5:40 am #

Excellent article, thank you!

Reply
- Jason Brownlee August 24, 2019 at 8:02 am #
  
  Thanks Kevin.
  
  Reply
Anurag September 29, 2019 at 12:25 am #

Hi Jason Brownlee your article is very useful but I am getting this error while executing the same code (while I am using the keras 2.0.8)

model.add(AttentionDecoder(150, n_features))
Traceback (most recent call last):

File “”, line 1, in
model.add(AttentionDecoder(150, n_features))

TypeError: __init__() missing 1 required positional argument: ‘cell’

Reply
- Jason Brownlee September 29, 2019 at 6:13 am #
  
  Sorry, I don’t know the cause of the fault.
  
  Perhaps try posting to stackoverflow?
  
  Reply
Edgar November 8, 2019 at 10:20 am #

Hi Jason, great post

I have a doubt about attention and I haven’t found the answer anywhere. Maybe I am not understanding it correctly.

I have read many posts using LSTM/GRU with attention, and all of them considers the input as a sequence x1,x2,x3, etc. In that, x2 comes after x1, x3 after x2 and so on.

My doubt is, how attention would work if the input is not a sequence, but a set of values, and where their order does not matter?

For example, the set {1,5,10,15} represent the same thing as {10,1,15,5}, obviously for both cases the output (y) is the same.

For the first set, suppose the attention says the most important element is 5 in the second position, will this result be the same for the second set? (the most important element is 5 in the last position)

Could attention handle this?

Thank you for your time.

Reply
- Jason Brownlee November 8, 2019 at 1:49 pm #
  
  Attention assumes input is a sequence. It does not really make sense otherwise.
  
  If order is not important, you get attention-like behavior as part of the weighted inputs of a normal neural net.
  
  Reply
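To make the point about ordering concrete, here is a small sketch under a stated assumption: the attention scores depend only on each element's value, not on its position. In that case permuting the inputs permutes the weights in the same way, so the weighted sum (the context) is unchanged. In this tutorial's model the attended values are LSTM outputs, which do depend on order, so the result would not carry over directly. The scoring function `score` is hypothetical.

```python
import math

def score(value):
    """Hypothetical content-only score: larger values matter more."""
    return 0.1 * value

def attend(values):
    """Softmax the scores, then return (weights, weighted sum of values)."""
    scores = [score(v) for v in values]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = sum(w * v for w, v in zip(weights, values))
    return weights, context

weights_a, context_a = attend([1, 5, 10, 15])
weights_b, context_b = attend([10, 1, 15, 5])

# The weight attached to the element 5 is the same in both orderings,
# and the context is the same up to floating-point rounding.
print(weights_a[1], weights_b[3])
print(context_a, context_b)
```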
jorge November 21, 2019 at 2:21 am #

Hi Jason

Thanks for another good work, is it possible to use attention on another time series dataset like air pollution, for example predicting PM2.5

Reply
- Jason Brownlee November 21, 2019 at 6:09 am #
  
  I don’t see why not.
  
  Reply
Xu Zhang December 3, 2019 at 11:39 am #

Thank you so much for your great post.

It would be a great help if you could post a tutorial about self-attention, not only for sequential data but also for image classification and other applications. Many thanks.

Reply
- Jason Brownlee December 3, 2019 at 1:34 pm #
  
  Thanks for the suggestion!
  
  Reply
Andrew December 3, 2019 at 1:26 pm #

Hi Jason, I’m a newcomer to Keras. I found on the internet that in a custom layer (AttentionDecoder), call() should include all of the tensor calculations, but in your example, step() takes over the function of call(). Could you please give me some explanation? I appreciate it very much.

Reply
- Jason Brownlee December 3, 2019 at 1:37 pm #
  
  Yes, I think this code example is a little outdated now.
  
  Reply
  - Andrew December 3, 2019 at 2:14 pm #
    
    thx a lot, having been working for 2 years, your blogs inspire me all the time.
    
    Reply
    - Jason Brownlee December 4, 2019 at 5:28 am #
      
      Thanks.
      
      Reply
juntay December 13, 2019 at 9:24 pm #

Hello, I am a student and am happy to see your post. My problem scenario is to predict the current photovoltaic power generation value (float type) based on a series of weather categories (one-hot coded). Can your model be modified to suit my problem? If so, how can it be changed?

Reply
- Jason Brownlee December 14, 2019 at 6:19 am #
  
  Perhaps start with the models here:
  https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
  
  Reply
  - juntay December 16, 2019 at 1:30 pm #
    
    thanks,Jason. ★
    
    Reply
    - Jason Brownlee December 16, 2019 at 1:39 pm #
      
      You’re welcome.
      
      Reply
Divesh December 13, 2019 at 10:12 pm #

Hi jason,
Excellent article. Can you write an article on the copy mechanism (CopyNet) used in some of the recent encoder-decoder architectures?

Reply
- Jason Brownlee December 14, 2019 at 6:19 am #
  
  Thanks for the suggestion.
  
  Reply
mohammadreza December 15, 2019 at 9:16 am #

Hi, Jason,
I want to add attention to the decoder in my code, but I don’t know how. Could you help me?
My code is here:
from __future__ import print_function

from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np

batch_size = 64 # Batch size for training.
epochs = 100 # Number of epochs to train for.
latent_dim = 256 # Latent dimensionality of the encoding space.
num_samples = 10000 # Number of samples to train on.
# Path to the data txt file on disk.
data_path = 'fra-eng/fra.txt'

# Vectorize the data.
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')
for line in lines[: min(num_samples, len(lines) - 1)]:
    input_text, target_text = line.split('\t')
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

input_token_index = dict(
    [(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict(
    [(char, i) for i, char in enumerate(target_characters)])

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    encoder_input_data[i, t + 1:, input_token_index[' ']] = 1.
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
    decoder_input_data[i, t + 1:, target_token_index[' ']] = 1.
    decoder_target_data[i, t:, target_token_index[' ']] = 1.
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard encoder_outputs and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using encoder_states as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# encoder_input_data & decoder_input_data into decoder_target_data
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
# Save model
model.save('s2s.h5')

# Next: inference mode (sampling).
# Here's the drill:
# 1) encode input and retrieve initial decoder state
# 2) run one step of decoder with this initial state
# and a "start of sequence" token as target.
# Output will be the next target token
# 3) Repeat with the current target token and current states

# Define sampling models
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
                len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence

for seq_index in range(100):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)

Thank you

Reply
- Jason Brownlee December 16, 2019 at 6:05 am #
  
  This is a common question that I answer here:
  https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code
  
  Reply
Mina December 17, 2019 at 2:32 am #

Hi everybody, I am new to machine learning and I am confused about using epochs. What is the difference between these 2 methods?
for epoch in range(5000):
    # generate new random sequence
    X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
    # fit model for one epoch on this sequence
    model.fit(X, y, epochs=1, verbose=2)
or
model.fit(X,y,epochs=5000,verbose=2)

Reply
- Jason Brownlee December 17, 2019 at 6:38 am #
  
  One pass on the training data vs 5K passes on the training data.
  
  More on what an epoch is here:
  https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
  
  Reply
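The difference between the two methods can be seen by counting how many distinct training examples the model is exposed to in each case. The sketch below uses a stand-in `get_pair` that returns a random integer sequence (an assumption; the tutorial's version also one-hot encodes the data and builds the target) and replaces `model.fit` with simple bookkeeping.

```python
import random

def get_pair(n_timesteps, n_features, rng):
    """Stand-in for the tutorial's generator: one random input sequence."""
    return tuple(rng.randrange(n_features) for _ in range(n_timesteps))

rng = random.Random(1)
n_epochs = 1000

# Method 1: draw a fresh random sequence for every one-epoch call to fit().
seen_fresh = set()
for _ in range(n_epochs):
    seen_fresh.add(get_pair(5, 50, rng))

# Method 2: draw one sequence, then make n_epochs passes over it.
fixed = get_pair(5, 50, rng)
seen_fixed = set()
for _ in range(n_epochs):
    seen_fixed.add(fixed)

print(len(seen_fresh))  # close to n_epochs: almost every sequence is new
print(len(seen_fixed))  # 1: the same sequence is repeated
```

With fresh data on every iteration the model keeps seeing new examples of the problem, whereas many epochs on one fixed sample mainly teach it that sample.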
Jenna January 8, 2020 at 8:32 pm #

Hi Jason. Sorry to bother you again.
I tested this attention code with my seq2seq prediction project, whose input and output are float data, and it performed badly. I thought maybe this attention code is specifically designed for one-hot encoded data?
Do you think Attention can theoretically improve the performance of LSTM Encoder-Decoder model for multi-step time series forecasting? Such as the LSTM Encoder-Decoder model shown in this blog: https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/

Thank you!

Reply
- Jason Brownlee January 9, 2020 at 7:25 am #
  
  I have not tried, sorry.
  
  Reply
  - Jenna January 10, 2020 at 6:17 pm #
    
    Dear Dr. Jason,
    Thank you for replying.
    I have read some other articles in terms of Attention for time series forecasting, and I think the success of attention for NLP inspires researchers to try it for time series forecasting problems. Looking forward to seeing your blog related to this field.
    Thank you for your great posts!
    
    Reply
    - Jason Brownlee January 11, 2020 at 7:21 am #
      
      Thanks.
      
      Reply
Jay January 11, 2020 at 9:53 am #

Hi Jason,

I get an error like “The graph couldn’t be sorted in topological order” in the first two iterations, but the training keeps going. I think this might corrupt the whole thing. Do you know what is happening?

Reply
- Jason Brownlee January 12, 2020 at 7:58 am #
  
  Sorry, I don’t have a good idea. I believe the attention layer is no longer appropriate for the latest versions of the code libraries.
  
  Reply
Erin January 22, 2020 at 10:10 pm #

Hey Jason,

thanks for this awesome tutorial. 🙂
However I have a question: am I correctly assuming that we do not need to pass an activation function into the model with attention as we already have “tanh” included in the AttentionDecoder?

Thanks and have a lovely day!
Erin

Reply
- Jason Brownlee January 23, 2020 at 6:33 am #
  
  I believe attention is using a model with activation internally.
  
  Reply
Alireza March 31, 2020 at 6:00 am #

HI Jason,

First of all thank u for ur comprehensive useful tutorials. plz continue this route.

And I have a question: Can we use the attention layer for regression problems? (time series prediction)

If yes, how? and it should be in the format of encoder-decoder only?
Especially in the case we have now tf.keras.attention. can we use this built-in layer also?

thank you and stay safe

Reply
- Jason Brownlee March 31, 2020 at 8:19 am #
  
  You’re welcome.
  
  Sure, try it and see. I don’t have examples at this stage.
  
  Reply
najeh April 6, 2020 at 11:14 am #

Hi Jason,
Thanks for the wonderful tutorial! I would like to know the difference between attention mechanism in LSTM and attention mechanism with seq2seq model?
Thank you.

Reply
- Jason Brownlee April 6, 2020 at 1:32 pm #
  
  The same attention methods can be used in different model architectures.
  
  Reply
Natalko April 17, 2020 at 5:38 am #

I have some questions. I tried to run your code for time series forecasting.
I am trying to predict words popularity/occurrences based on its previous occurrences (just time series forecasting).
Model should be kind of universal. One model for all words. But even though i tried to run this for every word separately model still returns ‘1’ for every prediction.

I have removed trend and applied supervised learning (x-1 , x ).
I have also split time series into samples, each of sample has length of 5.

This is my code for model. Maybe i should add some other parameters?

Thanks in advance!

i = Input(shape=(samples_train.shape[1],samples_train.shape[2]), dtype='float32')
enc = Bidirectional(GRU(150, return_sequences=True), merge = 'concat')(i)
dec = AttentionDecoder(150,samples_train.shape[2])(enc)
model = Model( inputs = i , outputs = dec )
model.compile(loss='mse', optimizer='adam')

Reply
- Jason Brownlee April 17, 2020 at 6:28 am #
  
  Perhaps try alternate problem framings, data preparations, models, model configurations and training configurations?
  
  Reply
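One possible cause of the constant ‘1’ predictions above, offered as a hedged reading rather than a confirmed diagnosis: assuming the custom AttentionDecoder applies a softmax across its output dimension at every time step (which is how I read attention_decoder.py), an output dimension of 1 leaves the softmax with a single value to normalise, and that value is always 1 regardless of the input. The small NumPy sketch below only demonstrates that property of softmax; the `softmax` helper is illustrative and is not taken from the tutorial.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# three time steps, ONE output feature each: every row normalises to 1
single_feature = np.array([[-3.2], [0.0], [41.7]])
print(softmax(single_feature).ravel())  # [1. 1. 1.]

# with several output features the values differ and sum to 1 per row
multi_feature = softmax(np.array([[1.0, 2.0, 3.0]]))
```

If that reading is right, a regression target with one feature would need a linear output rather than a softmax, which means a different output layer than the one in the custom decoder.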
gloria April 21, 2020 at 8:49 pm #

Hi Jason,
Thanks for the tutorial! I would like to know whether this is meant for TensorFlow 2.0 or TensorFlow 1.0.
Thank you.

Reply
- Jason Brownlee April 22, 2020 at 5:54 am #
  
  All examples work with TF 2.
  
  Reply
Vinay Kumar May 3, 2020 at 4:33 pm #

Hi Jason,

Great piece, thanks for that.

I am very new to RNNs. I am building a binary classifier and was wondering what changes are needed to achieve that. I believe one hot encoding won’t be required in my case. Essentially, instead of a random sequence generator, I would need to feed in my own data. I started building an RNN classifier by going through https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
and updating the loss function to binary cross-entropy, referring to https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/

The plain RNN model works, but when I tried to add the attention layer I ran into dimension mismatch errors:
ValueError: (‘Error when checking target: expected AttentionDecoder to have 3 dimensions, but got array with shape (70, 1)’, ‘occurred at index 0’)
I have a feeling that it has something to do with the encoding, but not sure
Could you help me figure it out?

Thanks

Reply
- Jason Brownlee May 3, 2020 at 5:11 pm #
  
  Happy it helped.
  
  You can see examples of LSTMs for time series classification here that you can adapt for your needs:
  https://machinelearningmastery.com/start-here/#deep_learning_time_series
  
  Reply
AR May 5, 2020 at 4:59 am #

Hi Jason

Now attention is available in tensorflow 2.1 . Could you prepare a tutorial on how to use it?

Reply
- Jason Brownlee May 5, 2020 at 6:35 am #
  
  Great suggestion!
  
  Reply
Khushbu May 7, 2020 at 10:54 pm #

Hello Jason,

Thank you for the great tutorial.
I want to implement Attention-based encoder-decoder model using tf.keras. Tensorflow has AdditiveAttention layer (https://www.tensorflow.org/api_docs/python/tf/keras/layers/AdditiveAttention?version=nightly).

I have implemented the encoder and decoder following the tutorial (https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html).

Now, I want to add AdditiveAttention between the encoder states and decoder states.
According to the documentation, I have to pass decoder states as the query and encoder states as the value, i.e. AdditiveAttention()([decoder_states, encoder_States]), and it will return a context vector according to the weights distributed over encoder_states.

So how do I connect these 3 steps together?

Could you please provide a small tutorial based on AdditiveAttention?

Thank you.

Reply
- Jason Brownlee May 8, 2020 at 6:35 am #
  
  You’re welcome.
  
  Nice work!
  
  I will write a tutorial on this topic and figure it out.
  
  Reply
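One way the three pieces described above might be connected is sketched below with the tf.keras functional API. This is a minimal sketch under stated assumptions, not the method used in this tutorial: the sizes are hypothetical, the decoder is trained with teacher forcing (so it takes the shifted target sequence as a second input), and the context vectors are simply concatenated with the decoder states before a softmax output layer.

```python
import numpy as np
from tensorflow.keras import layers, Model

# hypothetical sizes, loosely following the tutorial's test problem
n_steps_in, n_steps_out, n_features, n_units = 5, 2, 50, 64

# encoder: keep every hidden state (the attention "value") plus the final state
encoder_inputs = layers.Input(shape=(n_steps_in, n_features))
encoder_seq, state_h, state_c = layers.LSTM(
    n_units, return_sequences=True, return_state=True)(encoder_inputs)

# decoder: teacher forcing, so it is fed the target sequence shifted by one step
decoder_inputs = layers.Input(shape=(n_steps_out, n_features))
decoder_seq = layers.LSTM(n_units, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c])

# attention: decoder states are the query, encoder states are the value
context = layers.AdditiveAttention()([decoder_seq, encoder_seq])

# combine each decoder state with its context vector, then predict a class
merged = layers.Concatenate(axis=-1)([decoder_seq, context])
outputs = layers.TimeDistributed(
    layers.Dense(n_features, activation="softmax"))(merged)

model = Model([encoder_inputs, decoder_inputs], outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam")
print(model.output_shape)  # (None, 2, 50)

# forward pass on random data, only to check that the shapes line up
probs = model.predict(
    [np.random.rand(1, n_steps_in, n_features),
     np.random.rand(1, n_steps_out, n_features)], verbose=0)
```

Because the decoder here sees the whole shifted target at once, this graph is only suitable for training; prediction without teacher forcing would need a step-by-step decoding loop, which is not shown.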
Alejandro Oñate May 14, 2020 at 4:58 am #

It looks great and easy to use.

But this example illustrates an autoencoder-style model. How would it be applied to a classical sequence-to-sequence encoder-decoder model?

I use a model similar to this one (but with multiple layers):
https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/

Thank you!

Reply
- Jason Brownlee May 14, 2020 at 5:58 am #
  
  I hope to cover it in the future.
  
  Reply
ARJUN June 3, 2020 at 9:23 pm #

I am getting this error while importing the attention decoder:

from keras.layers.recurrent import Recurrent, _time_distributed_dense
ImportError: cannot import name ‘_time_distributed_dense’ from ‘keras.layers.recurrent’ (

please help!

Reply
- Jason Brownlee June 4, 2020 at 6:20 am #
  
  I believe an old version of tensorflow is required for this tutorial.
  
  Reply
Rakesh June 10, 2020 at 12:05 am #

Hello Jason ,

I have changed your implementation a bit in order to handle different vocabulary sizes for the input and output. Would you suggest adding a RepeatVector layer to handle problems with different input and output sizes?

inputs=Input(shape=(inputSequenceLength,))
embedding=Embedding(input_dim=inputVocabSize,output_dim=embedding_dim,embeddings_initializer=Constant(embeddings_initializer),input_length=input_length,trainable=trainable)(inputs)
x=Bidirectional(LSTM(units=128,return_sequences=True))(embedding)
x=LSTM(units=128)(x)
x=RepeatVector(outputSequenceLength)(x)
outputs=AttentionDecoder(128, outputVocabSize)(x)
model = Model(inputs=inputs, outputs=outputs)

Thanks Rakesh

Reply
- Jason Brownlee June 10, 2020 at 6:17 am #
  
  Nice work.
  
  I find the autoencoder approach to encoder-decoder to be easier to understand/implement and just as effective.
  
  Reply
Arindam Mondal June 12, 2020 at 4:00 pm #

Very nice description. I have a question: does your LSTM ebook cover the encoder-decoder with an attention mechanism?

Reply
- Jason Brownlee June 13, 2020 at 5:47 am #
  
  No, attention is not covered in the book at this stage.
  
  Reply
nutan June 12, 2020 at 5:03 pm #

Hi Jason,

I am running this example in colab. So copied everything in same notebook.
I get an error at this line
model.add(AttentionDecoder(150, n_features))

—————————————————————————
OperatorNotAllowedInGraphError Traceback (most recent call last)
in ()
47 model = Sequential()
48 model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
—> 49 model.add(AttentionDecoder(150, n_features))
50
51 model.compile(loss=’categorical_crossentropy’, optimizer=’adam’, metrics=[‘accuracy’])

9 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in _disallow_in_graph_mode(self, task)
535 raise errors.OperatorNotAllowedInGraphError(
536 “{} is not allowed in Graph execution. Use Eager execution or decorate”
–> 537 ” this function with @tf.function.”.format(task))
538
539 def _disallow_bool_casting(self):

OperatorNotAllowedInGraphError: using a tf.Tensor as a Python bool is not allowed in Graph execution. Use Eager execution or decorate this function with @tf.function.

There are no comparison operations. Wonder why

Let me know

Reply
- nutan June 12, 2020 at 5:53 pm #
  
  Hi Jason,
  
  Resolved —
  
  I copied this from one of the above links –
  
  def time_distributed_dense(x, w, b=None, dropout=None,
                             input_dim=None, output_dim=None, timesteps=None):
      '''Apply y.w + b for every temporal slice y of x.
      '''
      if not input_dim:
          # won't work with TensorFlow
          input_dim = K.shape(x)[2]
      if not timesteps:
          # won't work with TensorFlow
          timesteps = K.shape(x)[1]
      if not output_dim:
          # won't work with TensorFlow
          output_dim = K.shape(w)[1]
  
      if dropout:
          # apply the same dropout pattern at every timestep
          ones = K.ones_like(K.reshape(x[:, 0, :], (-1, input_dim)))
          dropout_matrix = K.dropout(ones, dropout)
          expanded_dropout_matrix = K.repeat(dropout_matrix, timesteps)
          x *= expanded_dropout_matrix
  
      # collapse time dimension and batch dimension together
      x = K.reshape(x, (-1, input_dim))
  
      x = K.dot(x, w)
      print("in time_distributed_dense… 3")
      print("b shape ", b.shape)
      print("Type of b ", type(b))
      print("Type of x ", type(x))
      print("x shape", x.shape)
  
      #if b:
      x = x + b
  
      # reshape to 3D tensor
      print("in time_distributed_dense… 4")
      x = K.reshape(x, (-1, timesteps, output_dim))
      print("in time_distributed_dense…last")
      return x
  
  _————————————————————————–
  The “if b” check was throwing the above error. Commenting it out lets the code run.
  
  Thanks
  
  Reply
  - Jason Brownlee June 13, 2020 at 5:53 am #
    
    Well done.
    
    Reply
- Jason Brownlee June 13, 2020 at 5:50 am #
  
  I would not expect the code to work with modern versions of the libraries.
  
  Reply
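An alternative to commenting out the check in the thread above is to test for presence explicitly with `if b is not None:`, because `if b:` asks for the truth value of a tensor, which graph mode refuses to provide. The NumPy sketch below is an illustration of that idea, not the Keras backend code: it performs the same "apply y.w + b to every temporal slice" computation with the explicit check, and the array sizes are made up.

```python
import numpy as np

def time_distributed_dense(x, w, b=None):
    """Apply y.w + b to every temporal slice y of x (NumPy sketch)."""
    n_samples, timesteps, input_dim = x.shape
    output_dim = w.shape[1]
    # collapse batch and time so one matrix product covers every timestep
    flat = x.reshape(-1, input_dim) @ w
    # explicit presence test; `if b:` would ask for an array's truth value
    if b is not None:
        flat = flat + b
    # restore the [samples, timesteps, features] layout
    return flat.reshape(n_samples, timesteps, output_dim)

x = np.ones((2, 3, 4))      # 2 samples, 3 timesteps, 4 features
w = np.full((4, 5), 0.5)    # maps 4 features to 5
b = np.arange(5.0)
out = time_distributed_dense(x, w, b)
print(out.shape)   # (2, 3, 5)
print(out[0, 0])   # [2. 3. 4. 5. 6.]
```

Unlike the commented-out version, this form still works when no bias is passed.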
Arindam Mondal June 12, 2020 at 11:00 pm #

Hi Jason,
Really very nice explanation. But “_time_distributed_dense” cannot be imported, even with TensorFlow version 2.0.0. Can you please help?

Reply
- Jason Brownlee June 13, 2020 at 6:03 am #
  
  Yes, this tutorial is no longer current.
  
  Reply
David June 14, 2020 at 10:30 am #

For anybody having problems when using attention using tensorflow >= 2.2 check if this tutorial helps https://medium.com/@dmunozc/using-keras-attention-with-tensorflow-2-2-69da8f8ae7db

Reply
- Jason Brownlee June 15, 2020 at 5:59 am #
  
  Thanks for sharing!
  
  Reply
Alejandro Oñate July 2, 2020 at 8:04 pm #

Hello, I would like to understand how the model connects the LSTM encoder and decoder layers.

model = Sequential ()
model.add (……

Does this know how to connect the hidden states? The classic model is much more complex (https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/), and I would like to understand whether this option makes the same connection automatically or does it another (also valid) way.

Thank you!

Reply
- Jason Brownlee July 3, 2020 at 6:14 am #
  
  Perhaps start here:
  https://machinelearningmastery.com/start-here/#lstm
  
  Reply
Kingsley Udeh July 4, 2020 at 1:13 am #

Hi Jason,

Can I use the custom Keras implementation of the attention layer in a time series problem where I’m expected to predict the next hour or current time step? Currently, the attention concept seems to fit well in a seq2seq model, but I want the output sequence to be just one time step.

If the foregoing is possible, can I have CNN models as the encoder, followed by attention, recurrent, and dense layers in my network architecture?

Thanks in advance.

Reply
- Jason Brownlee July 4, 2020 at 6:02 am #
  
  It might be better now to use the attention layers provided by tensorflow 2.
  
  Reply
Sk July 7, 2020 at 6:08 pm #

Hi,
Can we use TensorFlow Addons (https://www.tensorflow.org/addons/api_docs/python/tfa/seq2seq/) instead of the custom attention layer? If yes, which function should we use?

Reply
- Sk July 7, 2020 at 6:09 pm #
  
  https://www.tensorflow.org/addons/api_docs/python/tfa/seq2seq/
  
  Reply
- Jason Brownlee July 8, 2020 at 6:28 am #
  
  Sorry, I don’t know about tensorflow addons.
  
  Reply
Robert July 9, 2020 at 5:17 pm #

Hi, Jason, Everybody

I’m working on a small project which involves ecg signals.

The inputs are an ECG signal and a picture, and the analysis and learning use an attention mechanism and a CNN built with TensorFlow and Keras.

Previously I tried to use miguel-lozano-220’s code from one of the PhysioNet 2017 challenges to get an insight into how it works, but I had dimension problems when training was about to finish and validation started (I used the PhysioNet 2017 database).

Then I found this code, which is really good, and I was thinking of modifying it. Can this code be modified to handle those inputs, or is there a good guide that walks through an example of this type of application?

Reply
- Jason Brownlee July 10, 2020 at 5:51 am #
  
  Perhaps you can try using the attention layers provided by Keras:
  https://keras.io/api/layers/attention_layers/
  
  Reply
Murat Karakaya July 20, 2020 at 4:32 am #

Can you update the post with Keras Attention layer please?

Reply
- Jason Brownlee July 20, 2020 at 6:18 am #
  
  Thanks for the suggestion. I hope to write new tutorials on the topic soon.
  
  Reply
Chung-Hao Ku July 24, 2020 at 6:33 pm #

Hello Jason, a question I would like to ask is, in this attention implementation framework, there is a method called ‘step,’ where it calculates the attention scores and context vector. However, I don’t see where it is used throughout the implementation. When I looked through the keras subclassing framework, I did not see this method either as a python built in function. Can you give me a clue where that method is used throughout the code? Many Thanks.

Reply
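On where `step` gets used: as far as I can tell from the Keras 2.0-era source that the custom layer was written against, `step` is never called directly in attention_decoder.py. The layer subclasses `Recurrent`, and the base class's `call` hands `self.step` to the backend loop (`K.rnn`), which invokes it once per time step. The toy classes below are hypothetical stand-ins that only illustrate that pattern; they are not Keras code.

```python
class ToyRecurrent:
    """Stand-in for a recurrent base class: it owns the loop over timesteps."""

    def call(self, inputs, initial_state):
        state, outputs = initial_state, []
        for x_t in inputs:
            # the subclass never calls step() itself; the base class does
            y_t, state = self.step(x_t, state)
            outputs.append(y_t)
        return outputs


class RunningSum(ToyRecurrent):
    """Subclass that only defines what happens at a single timestep."""

    def step(self, x_t, state):
        new_state = state + x_t
        return new_state, new_state


result = RunningSum().call([1, 2, 3], initial_state=0)
print(result)  # [1, 3, 6]
```

In the same way, the attention scores and context vector computed inside `step` are produced once per output time step when the layer is called.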
Sameer kumar August 20, 2020 at 7:29 pm #

Can I build a decoder with an attention mechanism using a Bi-LSTM?

Reply
- Jason Brownlee August 21, 2020 at 6:26 am #
  
  I don’t see why not.
  
  Reply
Sameer kumar August 20, 2020 at 7:46 pm #

How do I apply a Bi-LSTM instead of an LSTM in the attention decoder?
I am working on an image captioning project.
Please help.

Reply
- Jason Brownlee August 21, 2020 at 6:27 am #
  
  Perhaps this will help you to get started:
  https://machinelearningmastery.com/develop-bidirectional-lstm-sequence-classification-python-keras/
  
  Reply
Prisha August 25, 2020 at 12:51 am #

Hi

I tried to use it but when I am loading the model I get a memory error. Could you please tell me why? And do you have any other example for encoder decoder with attention layer as the time distributed function is not found

Reply
- Jason Brownlee August 25, 2020 at 6:43 am #
  
  Sorry to hear that.
  
  I hope to write more about this topic very soon.
  
  Reply
A_P September 10, 2020 at 6:54 pm #

Hi Jason,
In your example the output (y) is part of the sequence (X): the first two digits of X.
How should y be handled when it is a binary classification problem and y is not contained in the sequence X?

Thank you!

Reply
- Jason Brownlee September 11, 2020 at 5:52 am #
  
  Good question, see this tutorial on time series classification:
  https://machinelearningmastery.com/how-to-develop-rnn-models-for-human-activity-recognition-time-series-classification/
  
  Reply
AS September 12, 2020 at 4:29 am #

I want to thank you so much for all your tutorials, you are an amazing teacher. I have learnt about LSTMs, attention, CNN from your posts. Thank you for giving us this excellent resource!

Reply
- Jason Brownlee September 12, 2020 at 6:20 am #
  
  Thanks!
  
  Reply
Kingsley Udeh September 25, 2020 at 9:25 pm #

Hi Dr. Jason,

Thanks for your great work in deep learning practice and research.

I just want to know whether you have implemented the encoder-decoder architecture with the Keras attention layer. Could you also consider adding a self-attention layer for a time series regression problem?

Again, thank you

Reply
- Jason Brownlee September 26, 2020 at 6:19 am #
  
  Sure.
  
  Reply
Bilal CHandio October 11, 2020 at 7:37 pm #

Thanks for such a great attention mechanism implementation. Would you please explain how to get the validation accuracy for this model? Will it work to pass validation data to model.fit()? I hope this will work with my text classification problem.

Reply
Cheng October 14, 2020 at 11:31 pm #

Hi Jason, your blog is very helpful to me. In my research topic, the sequences are given as high-precision decimals, whereas in your blog the sequences are integers that are easy to one hot encode, so the model can treat the prediction problem as a classification problem. How should my data be encoded?

x=[113.654,112.1120,110.2354,108.3314………99.1014]
y=[12.3251,13.5564,15.6312,16.3544,………20.3314]

The above is a set of data samples, how should I process it, then use the x sequence to predict the y sequence?

Looking forward to your reply

Best regards

Cheng

Reply
- Jason Brownlee October 15, 2020 at 6:09 am #
  
  Thanks.
  
  You can use any precision you like, start here:
  https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
  
  Reply
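For real-valued sequences like these, one common preparation is to skip one hot encoding, scale the values, and reshape them into the [samples, timesteps, features] layout that recurrent layers expect. The sketch below is a minimal illustration under assumptions: the numbers are made up (they are not the values from the question), the helper names are hypothetical, and in practice the scaling bounds would be fitted on training data only.

```python
import numpy as np

# made-up values standing in for one input/output sequence pair
x = np.array([113.7, 112.1, 110.2, 108.3, 106.5])
y = np.array([12.3, 13.6, 15.6, 16.4, 17.9])

def fit_minmax(values):
    """Return the bounds used for scaling."""
    return values.min(), values.max()

def scale(values, lo, hi):
    """Map values into the range [0, 1]."""
    return (values - lo) / (hi - lo)

def unscale(values, lo, hi):
    """Invert scale() to get predictions back in the original units."""
    return values * (hi - lo) + lo

x_lo, x_hi = fit_minmax(x)
y_lo, y_hi = fit_minmax(y)

# [samples, timesteps, features]: one sample, five steps, one real-valued feature
X = scale(x, x_lo, x_hi).reshape(1, len(x), 1)
Y = scale(y, y_lo, y_hi).reshape(1, len(y), 1)
print(X.shape, Y.shape)  # (1, 5, 1) (1, 5, 1)
```

With targets prepared this way, the usual assumption is a model with a linear output and a regression loss such as mse, rather than a softmax output with cross-entropy.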
Shahad October 21, 2020 at 6:38 pm #

Hi Jason,

Thanks for your high-class work; I have learned a lot from your tutorials.

I am wondering where the attention layer should go in a stacked encoder-decoder with more than one hidden layer.

For example, if I had 3 hidden layers in both the encoder and the decoder, would it go after the 3rd layer in the decoder or before the first layer? Is there a standard way to deal with this?

Reply
- Jason Brownlee October 22, 2020 at 6:38 am #
  
  You’re welcome!
  
  Good question, attention is used at the top of the decoder.
  
  Reply
  - Shahad October 22, 2020 at 4:12 pm #
    
    Thank you very much!
    
    Just one final thought. If I want to use the autoencoder for time series dimensionality reduction, does it make sense to use the attention layer to get a richer/better latent space? Perhaps self-attention would be more applicable in this case.
    
    I am really looking forward to hearing your thoughts on this.
    
    Reply
    - Jason Brownlee October 23, 2020 at 6:03 am #
      
      Perhaps try with and without, also perhaps start here:
      https://machinelearningmastery.com/lstm-autoencoders/
      
      Reply
Hoda October 30, 2020 at 1:49 am #

Hi Dr. Jason
Thank you very much for this great article.
Could you please teach us how to add a self-attention layer to the encoder-decoder model?

Reply
- Jason Brownlee October 30, 2020 at 6:55 am #
  
  Thanks for the suggestion. I hope to write about that topic soon.
  
  Reply
Alvin October 31, 2020 at 10:57 am #

Hi Jason,

Thank you very much for all these very nicely illustrated tutorials. I personally learned a lot from your explanations of lots of complex ideas.

I was wondering if you have any plans to update this tutorial to make it work with the recent tensorflow 2? And also I noticed that in TF2, there are implementations of Attention layers (e.g., MultiHeadAttention). It would be really great if you could provide a tutorial on how to use these existing package-internal Attention layers for the task. That would help a lot for non-professionals like me 🙂

Reply
- Jason Brownlee October 31, 2020 at 1:55 pm #
  
  You’re welcome!
  
  Yes, I hope to write an updated version of the tutorial soon.
  
  Reply
Olatunji Omisore November 25, 2020 at 4:04 pm #

Hi Jason,

Thank you very much for your tutorials. I am implementing a CNN-LSTM project, and while my training accuracy is usually over 95%, my testing accuracy is unfortunately not impressive (less than 70%). I thought of adding an attention layer to the network and have tried a lot. I found your code and tutorials useful, but the deprecation of _time_distributed_dense in TensorFlow 2 stopped me from adapting your code to my implementation.

Could you please provide an alternative way of adding this?

Many thanks

Reply
Laith March 25, 2021 at 5:46 pm #

Hello

Is there any Keras code that helps with implementing an encoder-decoder model using transformers?

best wishes

Reply
- Jason Brownlee March 26, 2021 at 6:20 am #
  
  I don’t have an example at this stage.
  
  Reply
Minh March 28, 2021 at 12:09 pm #

Hi, with the current version of Keras you can’t import Recurrent from keras.layers.recurrent anymore. Do you have any way to resolve this, aside from downgrading Keras?

Reply
- Jason Brownlee March 29, 2021 at 6:15 am #
  
  No, sorry.
  
  Reply
Masud May 29, 2021 at 7:24 am #

Thanks for the nice tutorial. Do you plan to update the code with standard attention layer now?

Reply
- Jason Brownlee May 30, 2021 at 5:44 am #
  
  I hope to write a suite of new tutorials using the standard keras attention layers.
  
  Reply
Wang Hui May 30, 2021 at 3:15 am #

I am dealing with the problem of diabetes classification, and the shape of my data is (125000,219). Can I use your method to do the classification? If so, how, and if not, why. Thank you very much!

Reply
- Jason Brownlee May 30, 2021 at 5:50 am #
  
  I don’t think it would be appropriate, try an MLP model directly.
  
  Reply
MS June 28, 2021 at 5:40 pm #

Hi Jason.
Bahdanau et al.’s approach to calculating the attention weights is (a = v tanh(w1 ht + w2 hs)), where ht is the query and hs the value. I have created a custom Keras attention layer using the __init__, build and call methods that takes the query and value and returns the context vector as well as the attention weights. I use teacher forcing for this seq2seq problem. The query is the decoder’s last hidden state and the value is all of the encoder’s hidden states. My question is: does the encoder output serve as the value, i.e. all of the encoder’s hidden states? And what should the query be? Should it be encoder_h, i.e. the encoder’s last hidden state?

Reply
- Jason Brownlee June 29, 2021 at 4:46 am #
  
  Sorry, I don’t recall offhand, perhaps check with the attention layers built into tf.keras.
  
  Reply
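For reference on the roles asked about above: in Bahdanau-style attention the values are all of the encoder's hidden states (the encoder output with return_sequences=True), and the query is the decoder's previous hidden state, not the encoder's last state. The encoder's last state is more typically used to initialise the decoder. The NumPy sketch below is illustrative only: the sizes are hypothetical and W1, W2 and v are random stand-ins for parameters that a real layer would learn.

```python
import numpy as np

rng = np.random.default_rng(1)
n_enc_steps, n_units = 5, 8   # hypothetical sizes

enc_states = rng.normal(size=(n_enc_steps, n_units))  # hs: all encoder hidden states
dec_prev = rng.normal(size=(n_units,))                # ht: previous decoder hidden state

# randomly initialised stand-ins for the learned parameters
W1 = rng.normal(size=(n_units, n_units))
W2 = rng.normal(size=(n_units, n_units))
v = rng.normal(size=(n_units,))

def bahdanau_attention(query, values):
    """Return the context vector and attention weights for one decoder step."""
    # one score per encoder step: v . tanh(W1 ht + W2 hs)
    scores = np.tanh(query @ W1 + values @ W2) @ v
    # softmax over the encoder steps
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # context: attention-weighted sum of the encoder states
    context = weights @ values
    return context, weights

context, weights = bahdanau_attention(dec_prev, enc_states)
print(context.shape, weights.shape)  # (8,) (5,)
```

At the next decoder step the query changes to the new decoder state while the values stay the same, which is what lets each output step focus on different input steps.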
frozenade November 24, 2021 at 11:14 pm #

Hi Jason

I got error
TypeError: __init__() missing 1 required positional argument: ‘cell’

on
super(AttentionDecoder, self).__init__(**kwargs)

while I’m using this code:
model = define_model(all_vocab_size, all_length, 256, encoder, decoder, attention)

# define NMT model
def define_model(vocab, timesteps, n_units, encoder, decoder, attention):
    model = Sequential()
    model.add(Embedding(vocab, n_units, input_length=timesteps, mask_zero=True))
    # model.add(Embedding(vocab, n_units, weights=[embedding_vectors], input_length=timesteps, trainable=False))
    if(encoder == "LSTM"):
        model.add(LSTM(n_units, return_sequences=False, dropout=0.5, recurrent_dropout=0.5))
    elif(encoder == "GRU"):
        model.add(GRU(n_units, return_sequences=False, dropout=0.5, recurrent_dropout=0.5))

    model.add(RepeatVector(timesteps))
    if(decoder == "LSTM"):
        model.add(LSTM(n_units, return_sequences=True, dropout=0.5, recurrent_dropout=0.5))
    elif(decoder == "GRU"):
        model.add(GRU(n_units, return_sequences=True, dropout=0.5, recurrent_dropout=0.5))

    model.add(BatchNormalization())
    if(attention == "ATTNDECODER"):
        model.add(AttentionDecoder(n_units, vocab))
    else:
        model.add(TimeDistributed(Dense(vocab, activation='softmax',
            # kernel_regularizer=regularizers.l2(0.01),
            # activity_regularizer=regularizers.l2(0.01)
        )))
    return model

What have I missed?

Reply
- Adrian Tam November 25, 2021 at 2:31 pm #
  
  As far as Keras 2.0, this should still work. But later, the Recurrent class became an alias to “RNN” class and the syntax got changed. That’s why you see the error. Unfortunately, it is not so trivial to rewrite the code. Maybe you should downgrade your keras to make this run.
  
  Reply
Quentin July 12, 2022 at 9:13 am #

Hi Jason,
This tutorial has been really helpful, am I allowed to use this code for a project of mine?

Reply
- James Carmichael July 13, 2022 at 7:43 am #
  
  Hi Quentin…Yes, but understand that all code and material on my site and in my books was developed and provided for educational purposes only.
  
  I take no responsibility for the code, what it might do, or how you might use it.
  
  If you use my code or material in your own project, please reference the source, including:
  
  The Name of the author, e.g. “Jason Brownlee”.
  The Title of the tutorial or book.
  The Name of the website, e.g. “Machine Learning Mastery”.
  The URL of the tutorial or book.
  The Date you accessed or copied the code.
  For example:
  
  Jason Brownlee, Machine Learning Algorithms in Python, Machine Learning Mastery, Available from https://machinelearningmastery.com/machine-learning-with-python/, accessed April 15th, 2018.
  Also, if your work is public, contact me, I’d love to see it out of general interest.
  
  Reply
Ram July 14, 2022 at 1:05 pm #

Hi Jason,
How we can use this code or totally how we can use attention for image classification using CNN2D?

Reply
- James Carmichael July 15, 2022 at 8:33 am #
  
  Hi Ram…the following resource may be of interest to you:
  
  https://blog.paperspace.com/image-classification-with-attention/
  
  Reply
Kostas September 3, 2022 at 3:07 am #

Hi Jason,
great post, I can’t get the code to run. This is probably because of the version of python and tensorflow I’m using.
Can you let me know which is the right version of python and tensorflow to use?

Thanks for your time.
Kostas

Reply
- James Carmichael September 4, 2022 at 10:02 am #
  
  Hi Kostas…What error messages are you experiencing? That will better enable us to assist you.
  
  Reply
  - Kostas September 11, 2022 at 6:56 pm #
    
    Hi James,
    running the code from the command line, I get this error message: “ImportError: cannot import name ‘Recurrent’ from ‘keras.layers.recurrent’ (C:\Users\papav\AppData\Local\Programs\Python\Python37\lib\site-packages\keras\layers\recurrent.py)”
    
    A further internet search found replies reporting that this error means the wrong Keras and TensorFlow versions are installed, since Recurrent has been deprecated in the latest versions of Keras and TensorFlow. That is why I am asking which versions of all the software involved (Python, Keras, TensorFlow) are appropriate for running the code.
    
    Reply
    - Jack October 17, 2022 at 6:35 pm #
      
      Dear Dr James Carmichael and Kostas
      
      I also have the same error. The error message shows: cannot import name ‘Recurrent’ from ‘keras.layers.recurrent’ (D:\Users\72771\Anaconda3\lib\site-packages\keras\layers\recurrent.py). How can I fix it?
      
      Reply
Abdi November 30, 2022 at 2:51 am #

Dear Jason,

A great code example. But we know that SDPA (Scaled Dot Product Attention) is now available in Keras. My question is: how can k, v, and q be defined in order to use the attention layer for the decoder, if that is possible? Or can the SDPA module only be used for self-attention in transformers?

If the answer is negative, as you indicated earlier, what is the available Keras function for an “attention decoder”?

My second question: if I have a CNN as the encoder, does this “attention decoder” function still work?

Reply
- James Carmichael November 30, 2022 at 8:58 am #
  
  Hi Abdi…The following resource may add clarity:
  
  https://machinelearningmastery.com/how-to-implement-scaled-dot-product-attention-from-scratch-in-tensorflow-and-keras/
  
  Reply
  - Abdi December 10, 2022 at 4:58 am #
    
    Thank you, dear James
    I studied this tutorial, but I couldn’t use it in a model in the form of
    add.mode.attentionlayer(…)
    
    My purpose is to use it in an encoder-decoder LSTM in a Sequential model like:
    
    model = Sequential()
    model.add(LSTM(200, activation='relu', input_shape=(n_timesteps, n_features)))
    model.add(RepeatVector(n_outputs))
    model.add(Attention_layer(**kwargs))  # <-- if it is possible
    model.add(LSTM(200, activation='relu', return_sequences=True))
    model.add(TimeDistributed(Dense(100, activation='relu')))
    model.add(TimeDistributed(Dense(1)))
    model.compile(loss='mse', optimizer='adam')
    
    Or, if no such model exists (I searched a lot), could I use the attention_decoder code correctly with Keras 2.9?
    
    Reply
Abdi December 2, 2022 at 2:08 am #

Thank you for that. I studied it before, and now I want to use it in place of Zafar-Ali’s attention_decoder function, if possible. Have you done that before?

Reply
Abdi December 2, 2022 at 2:24 am #

Some questions dear Jason

1. If we run the code in Colab, is saving attention_decoder.py to the mounted drive enough to use it?

2. The instruction “from keras.layers.recurrent import Recurrent, _time_distributed_dense” mentioned above didn’t work. If we install Keras 2.0.8 in Colab, will that cause problems for other code that needs a higher or the latest version of Keras?

Reply

Navigation

How to Develop an Encoder-Decoder Model with Attention in Keras

Tutorial Overview

Python Environment

Encoder-Decoder with Attention

Test Problem for Attention

Encoder-Decoder Without Attention

Custom Keras Attention Layer

Encoder-Decoder With Attention

Comparison of Models

Further Reading

Summary

Develop LSTMs for Sequence Prediction Today!

Develop Your Own LSTM models in Minutes

Finally Bring LSTM Recurrent Neural Networks to
Your Sequence Predictions Projects

More On This Topic

358 Responses to How to Develop an Encoder-Decoder Model with Attention in Keras

Leave a Reply Click here to cancel reply.

Navigation

Tutorial Overview

Python Environment

Encoder-Decoder with Attention

Test Problem for Attention

Encoder-Decoder Without Attention

Custom Keras Attention Layer

Encoder-Decoder With Attention

Comparison of Models

Further Reading

Summary

Develop LSTMs for Sequence Prediction Today!

Develop Your Own LSTM models in Minutes

Finally Bring LSTM Recurrent Neural Networks to Your Sequence Predictions Projects

More On This Topic

358 Responses to How to Develop an Encoder-Decoder Model with Attention in Keras

Leave a Reply Click here to cancel reply.

Finally Bring LSTM Recurrent Neural Networks to
Your Sequence Predictions Projects