The post A Gentle Introduction to LSTM Autoencoders appeared first on MachineLearningMastery.com.

Once fit, the encoder part of the model can be used to encode or compress sequence data that in turn may be used in data visualizations or as a feature vector input to a supervised learning model.

In this post, you will discover the LSTM Autoencoder model and how to implement it in Python using Keras.

After reading this post, you will know:

- Autoencoders are a type of self-supervised learning model that can learn a compressed representation of input data.
- LSTM Autoencoders can learn a compressed representation of sequence data and have been used on video, text, audio, and time series sequence data.
- How to develop LSTM Autoencoder models in Python using the Keras deep learning library.


Let’s get started.

This post is divided into six sections; they are:

- What Are Autoencoders?
- A Problem with Sequences
- Encoder-Decoder LSTM Models
- What Is an LSTM Autoencoder?
- Early Application of LSTM Autoencoder
- How to Create LSTM Autoencoders in Keras

An autoencoder is a neural network model that seeks to learn a compressed representation of an input.

They are an unsupervised learning method, although technically they are trained using supervised learning methods, an approach referred to as self-supervised learning. They are typically trained as part of a broader model that attempts to recreate the input.

For example:

X = model.predict(X)

The design of the autoencoder model purposefully makes this challenging by restricting the architecture to a bottleneck at the midpoint of the model, from which the reconstruction of the input data is performed.

There are many types of autoencoders, and their use varies, but perhaps the most common use is as a learned or automatic feature extraction model.

In this case, once the model is fit, the reconstruction aspect of the model can be discarded and the model up to the point of the bottleneck can be used. The output of the model at the bottleneck is a fixed length vector that provides a compressed representation of the input data.

Input data from the domain can then be provided to the model and the output of the model at the bottleneck can be used as a feature vector in a supervised learning model, for visualization, or more generally for dimensionality reduction.

Sequence prediction problems are challenging, not least because the length of the input sequence can vary.

This is challenging because machine learning algorithms, and neural networks in particular, are designed to work with fixed length inputs.

Another challenge with sequence data is that the temporal ordering of the observations can make it challenging to extract features suitable for use as input to supervised learning models, often requiring deep expertise in the domain or in the field of signal processing.

Finally, many predictive modeling problems involving sequences require a prediction that itself is also a sequence. These are called sequence-to-sequence, or seq2seq, prediction problems.

You can learn more about sequence prediction problems in the post Making Predictions with Sequences.

Recurrent neural networks, such as the Long Short-Term Memory, or LSTM, network are specifically designed to support sequences of input data.

They are capable of learning the complex dynamics within the temporal ordering of input sequences, as well as using an internal memory to remember or use information across long input sequences.

The LSTM network can be organized into an architecture called the Encoder-Decoder LSTM that allows the model to be used to both support variable length input sequences and to predict or output variable length output sequences.

This architecture is the basis for many advances in complex sequence prediction problems such as speech recognition and text translation.

In this architecture, an encoder LSTM model reads the input sequence step-by-step. After reading in the entire input sequence, the hidden state or output of this model represents an internal learned representation of the entire input sequence as a fixed-length vector. This vector is then provided as an input to the decoder model that interprets it as each step in the output sequence is generated.

You can learn more about the encoder-decoder architecture in the post Encoder-Decoder Long Short-Term Memory Networks.

An LSTM Autoencoder is an implementation of an autoencoder for sequence data using an Encoder-Decoder LSTM architecture.

For a given dataset of sequences, an encoder-decoder LSTM is configured to read the input sequence, encode it, decode it, and recreate it. The performance of the model is evaluated based on the model’s ability to recreate the input sequence.

Once the model achieves a desired level of performance recreating the sequence, the decoder part of the model may be removed, leaving just the encoder model. This model can then be used to encode input sequences to a fixed-length vector.

The resulting vectors can then be used in a variety of applications, not least as a compressed representation of the sequence as an input to another supervised learning model.

One of the early and widely cited applications of the LSTM Autoencoder was in the 2015 paper titled “Unsupervised Learning of Video Representations using LSTMs.”

In the paper, Nitish Srivastava, et al. describe the LSTM Autoencoder as an extension or application of the Encoder-Decoder LSTM.

They use the model with video input data both to reconstruct sequences of video frames and to predict future video frames, both of which are described as unsupervised learning tasks.

The input to the model is a sequence of vectors (image patches or features). The encoder LSTM reads in this sequence. After the last input has been read, the decoder LSTM takes over and outputs a prediction for the target sequence.

— Unsupervised Learning of Video Representations using LSTMs, 2015.

More than simply using the model directly, the authors explore some interesting architecture choices that may help inform future applications of the model.

They designed the model in such a way as to recreate the target sequence of video frames in reverse order, claiming that it makes the optimization problem solved by the model more tractable.

The target sequence is same as the input sequence, but in reverse order. Reversing the target sequence makes the optimization easier because the model can get off the ground by looking at low range correlations.

— Unsupervised Learning of Video Representations using LSTMs, 2015.
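As a minimal sketch of this idea, applied to a small contrived numeric sequence rather than to video frames, a reversed-order target can be built by flipping the time axis of the input. The variable names `seq_in` and `seq_out` are illustrative and are not taken from the paper.

```python
# build a reversed-order reconstruction target (illustrative sketch)
from numpy import array

# define input sequence
seq_in = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
# reshape input into [samples, timesteps, features]
n_in = len(seq_in)
seq_in = seq_in.reshape((1, n_in, 1))
# reverse along the time axis only, leaving samples and features untouched
seq_out = seq_in[:, ::-1, :]
print(seq_out[0, :, 0])
```

An autoencoder could then be fit on the pair `(seq_in, seq_out)` instead of using the same-order sequence as the target.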

They also explore two approaches to training the decoder model, specifically a version conditioned on the previous output generated by the decoder, and another without any such conditioning.

The decoder can be of two kinds – conditional or unconditioned. A conditional decoder receives the last generated output frame as input […]. An unconditioned decoder does not receive that input.

— Unsupervised Learning of Video Representations using LSTMs, 2015.

A more elaborate autoencoder model was also explored where two decoder models were used for the one encoder: one to predict the next frame in the sequence and one to reconstruct frames in the sequence, referred to as a composite model.

… reconstructing the input and predicting the future can be combined to create a composite […]. Here the encoder LSTM is asked to come up with a state from which we can both predict the next few frames as well as reconstruct the input.

— Unsupervised Learning of Video Representations using LSTMs, 2015.

The models were evaluated in many ways, including using the encoder to seed a classifier. It appears that rather than using the output of the encoder as an input for classification, they chose to seed a standalone LSTM classifier with the weights of the encoder model directly. This is surprising given how much it complicates the implementation.

We initialize an LSTM classifier with the weights learned by the encoder LSTM from this model.

— Unsupervised Learning of Video Representations using LSTMs, 2015.

The composite model without conditioning on the decoder was found to perform the best in their experiments.

The best performing model was the Composite Model that combined an autoencoder and a future predictor. The conditional variants did not give any significant improvements in terms of classification accuracy after fine-tuning, however they did give slightly lower prediction errors.

— Unsupervised Learning of Video Representations using LSTMs, 2015.

Many other applications of the LSTM Autoencoder have been demonstrated, not least with sequences of text, audio data and time series.

Creating an LSTM Autoencoder in Keras can be achieved by implementing an Encoder-Decoder LSTM architecture and configuring the model to recreate the input sequence.

Let’s look at a few examples to make this concrete.

The simplest LSTM autoencoder is one that learns to reconstruct each input sequence.

For these demonstrations, we will use a dataset of one sample of nine time steps and one feature:

[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

We can start off by defining the sequence and reshaping it into the preferred shape of [*samples, timesteps, features*].

    # define input sequence
    sequence = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
    # reshape input into [samples, timesteps, features]
    n_in = len(sequence)
    sequence = sequence.reshape((1, n_in, 1))

Next, we can define the encoder-decoder LSTM architecture that expects input sequences with nine time steps and one feature and outputs a sequence with nine time steps and one feature.

    # define model
    model = Sequential()
    model.add(LSTM(100, activation='relu', input_shape=(n_in,1)))
    model.add(RepeatVector(n_in))
    model.add(LSTM(100, activation='relu', return_sequences=True))
    model.add(TimeDistributed(Dense(1)))
    model.compile(optimizer='adam', loss='mse')

Next, we can fit the model on our contrived dataset.

    # fit model
    model.fit(sequence, sequence, epochs=300, verbose=0)

The complete example is listed below.

The configuration of the model, such as the number of units and training epochs, was completely arbitrary.

    # lstm autoencoder recreate sequence
    from numpy import array
    from keras.models import Sequential
    from keras.layers import LSTM
    from keras.layers import Dense
    from keras.layers import RepeatVector
    from keras.layers import TimeDistributed
    from keras.utils import plot_model
    # define input sequence
    sequence = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
    # reshape input into [samples, timesteps, features]
    n_in = len(sequence)
    sequence = sequence.reshape((1, n_in, 1))
    # define model
    model = Sequential()
    model.add(LSTM(100, activation='relu', input_shape=(n_in,1)))
    model.add(RepeatVector(n_in))
    model.add(LSTM(100, activation='relu', return_sequences=True))
    model.add(TimeDistributed(Dense(1)))
    model.compile(optimizer='adam', loss='mse')
    # fit model
    model.fit(sequence, sequence, epochs=300, verbose=0)
    plot_model(model, show_shapes=True, to_file='reconstruct_lstm_autoencoder.png')
    # demonstrate recreation
    yhat = model.predict(sequence, verbose=0)
    print(yhat[0,:,0])

Running the example fits the autoencoder and prints the reconstructed input sequence.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The results are close enough, with only very minor reconstruction errors.

[0.10398503 0.20047213 0.29905337 0.3989646 0.4994707 0.60005534 0.70039135 0.80031013 0.8997728 ]

A plot of the architecture is created for reference.

We can modify the reconstruction LSTM Autoencoder to instead predict the next step in the sequence.

In the case of our small contrived problem, we expect the output to be the sequence:

[0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

This means that the model will expect each input sequence to have nine time steps and the output sequence to have eight time steps.

    # reshape input into [samples, timesteps, features]
    n_in = len(seq_in)
    seq_in = seq_in.reshape((1, n_in, 1))
    # prepare output sequence
    seq_out = seq_in[:, 1:, :]
    n_out = n_in - 1

The complete example is listed below.

    # lstm autoencoder predict sequence
    from numpy import array
    from keras.models import Sequential
    from keras.layers import LSTM
    from keras.layers import Dense
    from keras.layers import RepeatVector
    from keras.layers import TimeDistributed
    from keras.utils import plot_model
    # define input sequence
    seq_in = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
    # reshape input into [samples, timesteps, features]
    n_in = len(seq_in)
    seq_in = seq_in.reshape((1, n_in, 1))
    # prepare output sequence
    seq_out = seq_in[:, 1:, :]
    n_out = n_in - 1
    # define model
    model = Sequential()
    model.add(LSTM(100, activation='relu', input_shape=(n_in,1)))
    model.add(RepeatVector(n_out))
    model.add(LSTM(100, activation='relu', return_sequences=True))
    model.add(TimeDistributed(Dense(1)))
    model.compile(optimizer='adam', loss='mse')
    plot_model(model, show_shapes=True, to_file='predict_lstm_autoencoder.png')
    # fit model
    model.fit(seq_in, seq_out, epochs=300, verbose=0)
    # demonstrate prediction
    yhat = model.predict(seq_in, verbose=0)
    print(yhat[0,:,0])

Running the example prints the output sequence that predicts the next time step for each input time step.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the model is reasonably accurate, with small errors that are largest at the start of the sequence.

[0.1657285 0.28903174 0.40304852 0.5096578 0.6104322 0.70671254 0.7997272 0.8904342 ]

A plot of the architecture is created for reference.

Finally, we can create a composite LSTM Autoencoder that has a single encoder and two decoders, one for reconstruction and one for prediction.

We can implement this multi-output model in Keras using the functional API. You can learn more about the functional API in the post How to Use the Keras Functional API for Deep Learning.

First, the encoder is defined.

    # define encoder
    visible = Input(shape=(n_in,1))
    encoder = LSTM(100, activation='relu')(visible)

Then the first decoder that is used for reconstruction.

    # define reconstruct decoder
    decoder1 = RepeatVector(n_in)(encoder)
    decoder1 = LSTM(100, activation='relu', return_sequences=True)(decoder1)
    decoder1 = TimeDistributed(Dense(1))(decoder1)

Then the second decoder that is used for prediction.

    # define predict decoder
    decoder2 = RepeatVector(n_out)(encoder)
    decoder2 = LSTM(100, activation='relu', return_sequences=True)(decoder2)
    decoder2 = TimeDistributed(Dense(1))(decoder2)

We then tie the whole model together.

    # tie it together
    model = Model(inputs=visible, outputs=[decoder1, decoder2])

The complete example is listed below.

    # lstm autoencoder reconstruct and predict sequence
    from numpy import array
    from keras.models import Model
    from keras.layers import Input
    from keras.layers import LSTM
    from keras.layers import Dense
    from keras.layers import RepeatVector
    from keras.layers import TimeDistributed
    from keras.utils import plot_model
    # define input sequence
    seq_in = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
    # reshape input into [samples, timesteps, features]
    n_in = len(seq_in)
    seq_in = seq_in.reshape((1, n_in, 1))
    # prepare output sequence
    seq_out = seq_in[:, 1:, :]
    n_out = n_in - 1
    # define encoder
    visible = Input(shape=(n_in,1))
    encoder = LSTM(100, activation='relu')(visible)
    # define reconstruct decoder
    decoder1 = RepeatVector(n_in)(encoder)
    decoder1 = LSTM(100, activation='relu', return_sequences=True)(decoder1)
    decoder1 = TimeDistributed(Dense(1))(decoder1)
    # define predict decoder
    decoder2 = RepeatVector(n_out)(encoder)
    decoder2 = LSTM(100, activation='relu', return_sequences=True)(decoder2)
    decoder2 = TimeDistributed(Dense(1))(decoder2)
    # tie it together
    model = Model(inputs=visible, outputs=[decoder1, decoder2])
    model.compile(optimizer='adam', loss='mse')
    plot_model(model, show_shapes=True, to_file='composite_lstm_autoencoder.png')
    # fit model
    model.fit(seq_in, [seq_in,seq_out], epochs=300, verbose=0)
    # demonstrate prediction
    yhat = model.predict(seq_in, verbose=0)
    print(yhat)

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example both reconstructs and predicts the output sequence, using both decoders.

[array([[[0.10736275], [0.20335874], [0.30020815], [0.3983948 ], [0.4985725 ], [0.5998295 ], [0.700336 ], [0.8001949 ], [0.89984304]]], dtype=float32), array([[[0.16298929], [0.28785267], [0.4030449 ], [0.5104638 ], [0.61162543], [0.70776784], [0.79992455], [0.8889787 ]]], dtype=float32)]

A plot of the architecture is created for reference.

Regardless of the method chosen (reconstruction, prediction, or composite), once the autoencoder has been fit, the decoder can be removed and the encoder can be kept as a standalone model.

The encoder can then be used to transform input sequences to a fixed length encoded vector.

We can do this by creating a new model that has the same inputs as our original model and takes its output directly from the end of the encoder model, before the *RepeatVector* layer.

    # connect the encoder LSTM as the output layer
    model = Model(inputs=model.inputs, outputs=model.layers[0].output)

A complete example of doing this with the reconstruction LSTM autoencoder is listed below.

    # lstm autoencoder recreate sequence
    from numpy import array
    from keras.models import Sequential
    from keras.models import Model
    from keras.layers import LSTM
    from keras.layers import Dense
    from keras.layers import RepeatVector
    from keras.layers import TimeDistributed
    from keras.utils import plot_model
    # define input sequence
    sequence = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
    # reshape input into [samples, timesteps, features]
    n_in = len(sequence)
    sequence = sequence.reshape((1, n_in, 1))
    # define model
    model = Sequential()
    model.add(LSTM(100, activation='relu', input_shape=(n_in,1)))
    model.add(RepeatVector(n_in))
    model.add(LSTM(100, activation='relu', return_sequences=True))
    model.add(TimeDistributed(Dense(1)))
    model.compile(optimizer='adam', loss='mse')
    # fit model
    model.fit(sequence, sequence, epochs=300, verbose=0)
    # connect the encoder LSTM as the output layer
    model = Model(inputs=model.inputs, outputs=model.layers[0].output)
    plot_model(model, show_shapes=True, to_file='lstm_encoder.png')
    # get the feature vector for the input sequence
    yhat = model.predict(sequence)
    print(yhat.shape)
    print(yhat)

Running the example creates a standalone encoder model that could be used or saved for later use.

We demonstrate the encoder by predicting the sequence and getting back the 100 element output of the encoder.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Obviously, this is overkill for our tiny nine-step input sequence.

[[0.03625513 0.04107533 0.10737951 0.02468692 0.06771207 0. 0.0696108 0. 0. 0.0688471 0. 0. 0. 0. 0. 0. 0. 0.03871286 0. 0. 0.05252134 0. 0.07473809 0.02688836 0. 0. 0. 0. 0. 0.0460703 0. 0. 0.05190025 0. 0. 0.11807001 0. 0. 0. 0. 0. 0. 0. 0.14514188 0. 0. 0. 0. 0.02029926 0.02952124 0. 0. 0. 0. 0. 0.08357017 0.08418129 0. 0. 0. 0. 0. 0.09802645 0.07694854 0. 0.03605933 0. 0.06378153 0. 0.05267526 0.02744672 0. 0.06623861 0. 0. 0. 0.08133873 0.09208347 0.03379713 0. 0. 0. 0.07517676 0.08870222 0. 0. 0. 0. 0.03976351 0.09128518 0.08123557 0. 0.08983088 0.0886112 0. 0.03840019 0.00616016 0.0620428 0. 0. ]]

A plot of the architecture is created for reference.

This section provides more resources on the topic if you are looking to go deeper.

- Making Predictions with Sequences
- Encoder-Decoder Long Short-Term Memory Networks
- Autoencoder, Wikipedia
- Unsupervised Learning of Video Representations using LSTMs, ArXiv 2015.
- Unsupervised Learning of Video Representations using LSTMs, PMLR, PDF, 2015.
- Unsupervised Learning of Video Representations using LSTMs, GitHub Repository.
- Building Autoencoders in Keras, 2016.
- How to Use the Keras Functional API for Deep Learning

In this post, you discovered the LSTM Autoencoder model and how to implement it in Python using Keras.

Specifically, you learned:

- Autoencoders are a type of self-supervised learning model that can learn a compressed representation of input data.
- LSTM Autoencoders can learn a compressed representation of sequence data and have been used on video, text, audio, and time series sequence data.
- How to develop LSTM Autoencoder models in Python using the Keras deep learning library.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post A Gentle Introduction to Exploding Gradients in Neural Networks appeared first on MachineLearningMastery.com.

This has the effect of your model being unstable and unable to learn from your training data.

In this post, you will discover the problem of exploding gradients with deep artificial neural networks.

After completing this post, you will know:

- What exploding gradients are and the problems they cause during training.
- How to know whether you may have exploding gradients with your network model.
- How you can fix the exploding gradient problem with your network.


Let’s get started.

**Update Oct/2018**: Removed mention of ReLU as a solution.

An error gradient is the direction and magnitude calculated during the training of a neural network that is used to update the network weights in the right direction and by the right amount.

In deep networks or recurrent neural networks, error gradients can accumulate during an update and result in very large gradients. These in turn result in large updates to the network weights, and in turn, an unstable network. At an extreme, the values of weights can become so large as to overflow and result in NaN values.

The explosion occurs through exponential growth, as gradients with values larger than 1.0 are repeatedly multiplied together through the network layers.
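The effect of repeated multiplication can be illustrated with plain arithmetic. The sketch below assumes, as a simplification, that every layer scales the gradient by the same scalar factor; real networks multiply by per-layer matrices, and the helper name `backpropagated_scale` is hypothetical.

```python
# illustrate how repeated multiplication makes a gradient grow or shrink
def backpropagated_scale(factor, n_layers):
    # start from a gradient of 1.0 and multiply by the same factor once per layer
    gradient = 1.0
    for _ in range(n_layers):
        gradient *= factor
    return gradient

# compare factors below, equal to, and above 1.0 across 50 layers
for factor in [0.9, 1.0, 1.1, 1.5]:
    print(factor, backpropagated_scale(factor, 50))
```

Factors only slightly above 1.0 already produce very large values after 50 multiplications, while factors below 1.0 shrink toward zero.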

In deep multilayer Perceptron networks, exploding gradients can result in an unstable network that at best cannot learn from the training data and at worst results in NaN weight values that can no longer be updated.

… exploding gradients can make learning unstable.

— Page 282, Deep Learning, 2016.

In recurrent neural networks, exploding gradients can result in an unstable network that at worst is unable to learn from the training data and at best cannot learn over long input sequences of data.

… the exploding gradients problem refers to the large increase in the norm of the gradient during training. Such events are due to the explosion of the long term components

— On the difficulty of training recurrent neural networks, 2013.

There are some subtle signs that you may be suffering from exploding gradients during the training of your network, such as:

- The model is unable to get traction on your training data (e.g. poor loss).
- The model is unstable, resulting in large changes in loss from update to update.
- The model loss goes to NaN during training.

If you have these types of problems, you can dig deeper to see if you have a problem with exploding gradients.

There are some less subtle signs that you can use to confirm that you have exploding gradients.

- The model weights quickly become very large during training.
- The model weights go to NaN values during training.
- The error gradient values are consistently above 1.0 for each node and layer during training.
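One way to check the first two of these signs is to inspect the weight arrays between training updates. The sketch below is an assumption-laden illustration: `check_weights` is a hypothetical helper, the `limit` of 1e3 is an arbitrary threshold, and the example arrays are made up. In Keras, a list of NumPy weight arrays like this can be obtained from `model.get_weights()`.

```python
# flag weight arrays that contain NaN values or very large magnitudes
from numpy import array, isnan, nan

def check_weights(weights, limit=1e3):
    # weights: list of numpy arrays; limit: arbitrary magnitude threshold (assumption)
    report = []
    for i, w in enumerate(weights):
        has_nan = bool(isnan(w).any())
        # the largest absolute value is only meaningful when no NaN is present
        largest = float('nan') if has_nan else float(abs(w).max())
        suspicious = has_nan or largest > limit
        report.append((i, has_nan, largest, suspicious))
    return report

# made-up weight arrays: one healthy, one very large, one containing NaN
weights = [array([[0.2, -0.5], [0.1, 0.3]]), array([1.0e6, -2.0]), array([nan, 0.4])]
report = check_weights(weights)
for row in report:
    print(row)
```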

There are many approaches to addressing exploding gradients; this section lists some best practice approaches that you can use.

In deep neural networks, exploding gradients may be addressed by redesigning the network to have fewer layers.

There may also be some benefit in using a smaller batch size while training the network.

In recurrent neural networks, updating across fewer prior time steps during training, called truncated backpropagation through time, may reduce the exploding gradient problem.
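One simple way to limit the number of time steps used in each update is to split long sequences into shorter subsequences before training, since the gradient is then only propagated back across the time steps within each sample. The sketch below assumes this data-preparation approach; the helper `split_sequence` and the lengths of 1,000 and 200 are hypothetical choices.

```python
# split one long sequence into shorter fixed-length subsequences
def split_sequence(sequence, n_steps):
    # take consecutive, non-overlapping windows of n_steps values
    return [sequence[i:i + n_steps] for i in range(0, len(sequence), n_steps)]

# a contrived sequence of 1,000 time steps
long_sequence = [i / 100.0 for i in range(1000)]
# five samples of 200 time steps each
subsequences = split_sequence(long_sequence, 200)
print(len(subsequences), len(subsequences[0]))
```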

In recurrent neural networks, exploding gradients can occur given the inherent instability in the training of this type of network, e.g. via backpropagation through time, which essentially transforms the recurrent network into a deep multilayer Perceptron neural network.

Exploding gradients can be reduced by using the Long Short-Term Memory (LSTM) memory units and perhaps related gated-type neuron structures.

Adopting LSTM memory units is a new best practice for recurrent neural networks for sequence prediction.

Exploding gradients can still occur in very deep Multilayer Perceptron networks with a large batch size and LSTMs with very long input sequence lengths.

If exploding gradients are still occurring, you can check for and limit the size of gradients during the training of your network.

This is called gradient clipping.

Dealing with the exploding gradients has a simple but very effective solution: clipping gradients if their norm exceeds a given threshold.

— Section 5.2.4, Vanishing and Exploding Gradients, Neural Network Methods in Natural Language Processing, 2017.

Specifically, the values of the error gradient are checked against a threshold value and clipped or set to that threshold value if the error gradient exceeds the threshold.

To some extent, the exploding gradient problem can be mitigated by gradient clipping (thresholding the values of the gradients before performing a gradient descent step).

— Page 294, Deep Learning, 2016.
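The two common forms of clipping can be sketched in a few lines of plain Python. This is an illustration of the idea rather than any library's implementation; the function names and the example gradient values are made up.

```python
# two ways to clip a list of gradient values (illustrative sketch)
from math import sqrt

def clip_by_value(gradients, threshold):
    # force every individual value into the range [-threshold, threshold]
    return [max(-threshold, min(threshold, g)) for g in gradients]

def clip_by_norm(gradients, max_norm):
    # rescale the whole vector if its L2 norm exceeds max_norm
    norm = sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]

gradients = [3.0, -4.0, 0.25]
clipped_value = clip_by_value(gradients, 0.5)
clipped_norm = clip_by_norm(gradients, 1.0)
print(clipped_value)
print(clipped_norm)
```

Clipping by value changes each element independently, which can change the direction of the gradient vector, whereas clipping by norm preserves the direction and only shrinks the length.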

In the Keras deep learning library, you can use gradient clipping by setting the *clipnorm* or *clipvalue* arguments on your optimizer before training.

Good default values are *clipnorm=1.0* and *clipvalue=0.5*.
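As a minimal sketch, assuming the standalone `keras` package is installed, the arguments can be passed when constructing an optimizer. The choice of `SGD` here is arbitrary; the same arguments are accepted by the other Keras optimizers.

```python
# configure gradient clipping on a Keras optimizer
from keras.optimizers import SGD

# rescale gradients whose L2 norm exceeds 1.0
opt_norm = SGD(clipnorm=1.0)
# alternatively, clip each gradient value into the range [-0.5, 0.5]
opt_value = SGD(clipvalue=0.5)
```

Either optimizer would then be passed to the model, for example via `model.compile(optimizer=opt_norm, loss='mse')`.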

Another approach, if exploding gradients are still occurring, is to check the size of network weights and apply a penalty to the network's loss function for large weight values.

This is called weight regularization and often an L1 (absolute weights) or an L2 (squared weights) penalty can be used.

Using an L1 or L2 penalty on the recurrent weights can help with exploding gradients

— On the difficulty of training recurrent neural networks, 2013.

In the Keras deep learning library, you can use weight regularization by setting the *kernel_regularizer* argument on your layer and using an *L1* or *L2* regularizer.
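As a minimal sketch, assuming the standalone `keras` package is installed, an L2 penalty can be attached to an LSTM layer as shown below. The coefficient of 0.01 and the 100 units are arbitrary illustrative values. The `recurrent_regularizer` argument, which penalizes the recurrent weights specifically, is included as an additional option beyond the `kernel_regularizer` argument named above.

```python
# add an L2 weight penalty to an LSTM layer
from keras.layers import LSTM
from keras.regularizers import l2

layer = LSTM(
    100,
    kernel_regularizer=l2(0.01),     # penalty on the input weights
    recurrent_regularizer=l2(0.01),  # penalty on the recurrent weights
)
```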

This section provides more resources on the topic if you are looking to go deeper.

- On the difficulty of training recurrent neural networks, 2013.
- Learning long-term dependencies with gradient descent is difficult, 1994.
- Understanding the exploding gradient problem, 2012.

- Why is it a problem to have exploding gradients in a neural net (especially in an RNN)?
- How does LSTM help prevent the vanishing (and exploding) gradient problem in a recurrent neural network?
- Rectifier (neural networks)

In this post, you discovered the problem of exploding gradients when training deep neural network models.

Specifically, you learned:

- What exploding gradients are and the problems they cause during training.
- How to know whether you may have exploding gradients with your network model.
- How you can fix the exploding gradient problem with your network.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post What is Teacher Forcing for Recurrent Neural Networks? appeared first on MachineLearningMastery.com.

It is a network training method critical to the development of deep learning language models used in machine translation, text summarization, and image captioning, among many other applications.

In this post, you will discover teacher forcing as a method for training recurrent neural networks.

After reading this post, you will know:

- The problem with training recurrent neural networks that use output from prior time steps as input.
- The teacher forcing method for addressing slow convergence and instability when training these types of recurrent networks.
- Extensions to teacher forcing that allow trained models to better handle open loop applications of this type of network.


Let’s get started.

There are sequence prediction models that use the output from the last time step y(t-1) as input for the model at the current time step X(t).

This type of model is common in language models that output one word at a time and use the output word as input for generating the next word in the sequence.

For example, this type of language model is used in an Encoder-Decoder recurrent neural network architecture for sequence-to-sequence generation problems such as:

- Machine Translation
- Caption Generation
- Text Summarization

After the model is trained, a “start-of-sequence” token can be used to start the process and the generated word in the output sequence is used as input on the subsequent time step, perhaps along with other input like an image or a source text.

This same recursive output-as-input process can be used when training the model, but it can result in problems such as:

- Slow convergence.
- Model instability.
- Poor skill.

Teacher forcing is an approach to improve model skill and stability when training these types of models.

Teacher forcing is a strategy for training recurrent neural networks that uses **ground truth as input**, instead of model output from a prior time step as an input.

Models that have recurrent connections from their outputs leading back into the model may be trained with teacher forcing.

— Page 372, Deep Learning, 2016.

The approach was originally described and developed as an alternative technique to backpropagation through time for training a recurrent neural network.

An interesting technique that is frequently used in dynamical supervised learning tasks is to replace the actual output y(t) of a unit by the teacher signal d(t) in subsequent computation of the behavior of the network, whenever such a value exists. We call this technique teacher forcing.

— A Learning Algorithm for Continually Running Fully Recurrent Neural Networks, 1989.

Teacher forcing works by using the actual or expected output from the training dataset at the current time step y(t) as input in the next time step X(t+1), rather than the output generated by the network.

Teacher forcing is a procedure […] in which during training the model receives the ground truth output y(t) as input at time t + 1.

— Page 372, Deep Learning, 2016.
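To illustrate the difference numerically, the sketch below compares feeding a model its own outputs with feeding it the ground truth. The `toy_model` function is a hypothetical stand-in for a partly trained network, not a real recurrent model: the true rule for the sequence is "add 0.1", but the toy model adds 0.15.

```python
# compare free-running inputs with teacher-forced inputs (illustrative sketch)
def toy_model(x):
    # hypothetical imperfect model: overshoots the true step of 0.1 by 0.05
    return x + 0.15

truth = [0.1, 0.2, 0.3, 0.4, 0.5]

# without teacher forcing: each prediction is fed back in as the next input
free_preds = []
x = truth[0]
for _ in range(len(truth) - 1):
    x = toy_model(x)
    free_preds.append(x)

# with teacher forcing: the ground truth y(t) is the input at time t + 1
forced_preds = [toy_model(y) for y in truth[:-1]]

# absolute error of each prediction against the expected next value
free_errors = [abs(p - y) for p, y in zip(free_preds, truth[1:])]
forced_errors = [abs(p - y) for p, y in zip(forced_preds, truth[1:])]
print(free_errors)
print(forced_errors)
```

In the free-running case the error grows at every step because each mistake becomes part of the next input, whereas with teacher forcing the error at each step stays the same size.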

Let’s make teacher forcing concrete with a short worked example.

Given the following input sequence:

Mary had a little lamb whose fleece was white as snow

Imagine we want to train a model to generate the next word in the sequence given the previous sequence of words.

First, we must add a token to signal the start of the sequence and another to signal the end of the sequence. We will use “*[START]*” and “*[END]*” respectively.

[START] Mary had a little lamb whose fleece was white as snow [END]

Next, we feed the model “*[START]*” and let the model generate the next word.

Imagine the model generates the word “*a*”, but of course, we expected “*Mary*”.

    X, yhat
    [START], a

Naively, we could feed in “*a*” as part of the input to generate the subsequent word in the sequence.

    X, yhat
    [START], a, ?

You can see that the model is off track and is going to get punished for every subsequent word it generates. This makes learning slower and the model unstable.
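To see why a single early mistake is so costly, the sketch below compares feeding the model its own output against feeding it the ground truth. It is only an illustration: the “model” is a hypothetical lookup table (`next_word`) rather than a neural network, and the one wrong entry is injected by hand.

```python
# Toy comparison of free-running generation and ground-truth inputs.
sentence = "Mary had a little lamb whose fleece was white as snow".split()
sequence = ["[START]"] + sentence + ["[END]"]
targets = sequence[1:]

# Hypothetical "model": a lookup table from the previous word to the next word.
next_word = {prev: nxt for prev, nxt in zip(sequence, sequence[1:])}
# Inject the single mistake from the text: after [START] the model says "a".
next_word["[START]"] = "a"


def free_running(model, n_steps):
    """Feed each generated word back in as the next input."""
    outputs, token = [], "[START]"
    for _ in range(n_steps):
        if token == "[END]":
            outputs.append(None)  # nothing more to generate
            continue
        token = model[token]
        outputs.append(token)
    return outputs


def ground_truth_inputs(model, inputs):
    """Feed the expected previous word in at every step."""
    return [model[token] for token in inputs]


free_errors = sum(o != t for o, t in zip(free_running(next_word, len(targets)), targets))
forced_errors = sum(o != t for o, t in zip(ground_truth_inputs(next_word, sequence[:-1]), targets))
print('free-running errors: %d, ground-truth-input errors: %d' % (free_errors, forced_errors))
```

Under these assumptions, the free-running output is misaligned with the target at every position, whereas with ground-truth inputs the error is confined to the first step.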

Instead, we can use teacher forcing.

In the first example when the model generated “*a*” as output, we can discard this output after calculating error and feed in “*Mary*” as part of the input on the subsequent time step.

    X, yhat
    [START], Mary, ?

We can then repeat this process for each input-output pair of words.

    X, yhat
    [START], ?
    [START], Mary, ?
    [START], Mary, had, ?
    [START], Mary, had, a, ?
    ...
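The input-output pairs above can be generated mechanically. A minimal sketch, assuming simple whitespace tokenization; the helper name `teacher_forcing_pairs()` is hypothetical and used only for illustration:

```python
def teacher_forcing_pairs(words, start="[START]", end="[END]"):
    """Return (ground-truth prefix, expected next token) pairs for one sequence."""
    sequence = [start] + list(words) + [end]
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


words = "Mary had a little lamb whose fleece was white as snow".split()
pairs = teacher_forcing_pairs(words)
for prefix, expected in pairs[:4]:
    print(', '.join(prefix), '->', expected)
```

Each prefix is built from the ground truth, never from what the model generated, which is the essence of the technique.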

The model will learn the correct sequence, or correct statistical properties for the sequence, quickly.

Teacher forcing is a fast and effective way to train a recurrent neural network that uses output from prior time steps as input to the model.

However, the approach can also produce models that are fragile or limited in practice, when the generated sequences vary from what the model saw during training.

This is common in most applications of this type of model as the outputs are probabilistic in nature. This type of application of the model is often called open loop.

Unfortunately, this procedure can result in problems in generation as small prediction error compound in the conditioning context. This can lead to poor prediction performance as the RNN’s conditioning context (the sequence of previously generated samples) diverge from sequences seen during training.

– Professor Forcing: A New Algorithm for Training Recurrent Networks, 2016.

There are a number of approaches to address this limitation, for example:

One approach commonly used for models that predict a discrete value output, such as a word, is to perform a search across the predicted probabilities for each word to generate a number of likely candidate output sequences.

This approach is used on problems like machine translation to refine the translated output sequence.

A common search procedure for this post-hoc operation is the beam search.

This discrepancy can be mitigated by the use of a beam search heuristic maintaining several generated target sequences

— Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, 2015.

The beam search approach is only suitable for prediction problems with discrete output values and cannot be used for real-valued outputs.
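A beam search can be sketched in a few lines of plain Python. This is a simplified illustration, not the procedure from any of the cited papers: it assumes the probabilities for each time step are given up front in a hypothetical `probs` table, whereas a real decoder would recompute them for each candidate prefix.

```python
from math import log


def beam_search(probs, k):
    """Keep the k most likely sequences over a table of per-step probabilities."""
    beams = [([], 0.0)]  # (sequence of class indices, log-probability)
    for row in probs:
        candidates = []
        for seq, score in beams:
            for index, p in enumerate(row):
                if p > 0.0:
                    candidates.append((seq + [index], score + log(p)))
        # sort by log-probability, best first, and prune to the beam width
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    return beams


# hypothetical probabilities: 3 time steps, 3 classes
probs = [[0.1, 0.6, 0.3],
         [0.5, 0.2, 0.3],
         [0.2, 0.2, 0.6]]
beams = beam_search(probs, 2)
for seq, score in beams:
    print(seq, score)
```

Log-probabilities are summed rather than multiplying raw probabilities, which avoids numerical underflow on longer sequences.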

A variation of forced learning is to introduce outputs generated from prior time steps during training to encourage the model to learn how to correct its own mistakes.

We propose to change the training process in order to gradually force the model to deal with its own mistakes, as it would have to during inference.

— Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, 2015.

The approach is called curriculum learning and involves randomly choosing to use the ground truth output or the generated output from the previous time step as input for the current time step.

The curriculum changes over time in what is called scheduled sampling where the procedure starts at forced learning and slowly decreases the probability of a forced input over the training epochs.
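The idea can be sketched as follows. This is a rough illustration only: a linear decay is assumed for simplicity, and the function names `teacher_forcing_probability()` and `choose_input()` are hypothetical.

```python
import random


def teacher_forcing_probability(epoch, n_epochs, final_prob=0.0):
    """Linearly decay the probability of a forced input from 1.0 to final_prob."""
    fraction = epoch / max(1, n_epochs - 1)
    return 1.0 - (1.0 - final_prob) * fraction


def choose_input(ground_truth, generated, prob, rng):
    """Pick the ground truth with probability prob, else the model's own output."""
    return ground_truth if rng.random() < prob else generated


rng = random.Random(1)
n_epochs = 5
for epoch in range(n_epochs):
    prob = teacher_forcing_probability(epoch, n_epochs)
    chosen = choose_input('Mary', 'a', prob, rng)
    print('epoch=%d, p(forced)=%.2f, input=%s' % (epoch, prob, chosen))
```

Early epochs behave like plain teacher forcing, while later epochs expose the model to more of its own, possibly wrong, outputs.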

There are also other extensions and variations of teacher forcing and I encourage you to explore them if you are interested.

This section provides more resources on the topic if you are looking to go deeper.

- A Learning Algorithm for Continually Running Fully Recurrent Neural Networks, 1989.
- Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, 2015.
- Professor Forcing: A New Algorithm for Training Recurrent Networks, 2016.

- Section 10.2.1, Teacher Forcing and Networks with Output Recurrence, Deep Learning, 2016.

In this post, you discovered teacher forcing as a method for training recurrent neural networks that use output from a previous time step as input.

Specifically, you learned:

- The problem with training recurrent neural networks that use output from prior time steps as input.
- The teacher forcing method for addressing slow convergence and instability when training these types of recurrent networks.
- Extensions to teacher forcing that allow trained models to better handle open loop applications of this type of network.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post What is Teacher Forcing for Recurrent Neural Networks? appeared first on MachineLearningMastery.com.

The post How to Develop an Encoder-Decoder Model for Sequence-to-Sequence Prediction in Keras appeared first on MachineLearningMastery.com.

Encoder-decoder models can be developed in the Keras Python deep learning library and an example of a neural machine translation system developed with this model has been described on the Keras blog, with sample code distributed with the Keras project.

This example can provide the basis for developing encoder-decoder LSTM models for your own sequence-to-sequence prediction problems.

In this tutorial, you will discover how to develop a sophisticated encoder-decoder recurrent neural network for sequence-to-sequence prediction problems with Keras.

After completing this tutorial, you will know:

- How to correctly define a sophisticated encoder-decoder model in Keras for sequence-to-sequence prediction.
- How to define a contrived yet scalable sequence-to-sequence prediction problem that you can use to evaluate the encoder-decoder LSTM model.
- How to apply the encoder-decoder LSTM model in Keras to address the scalable integer sequence-to-sequence prediction problem.

**Kick-start your project** with my new book Long Short-Term Memory Networks With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Update Jan/2020**: Updated API for Keras 2.3 and TensorFlow 2.0.

This tutorial is divided into 3 parts; they are:

- Encoder-Decoder Model in Keras
- Scalable Sequence-to-Sequence Problem
- Encoder-Decoder LSTM for Sequence Prediction

This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this tutorial.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this post:

- How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda

The encoder-decoder model is a way of organizing recurrent neural networks for sequence-to-sequence prediction problems.

It was originally developed for machine translation problems, although it has proven successful at related sequence-to-sequence prediction problems such as text summarization and question answering.

The approach involves two recurrent neural networks, one to encode the source sequence, called the encoder, and a second to decode the encoded source sequence into the target sequence, called the decoder.

The Keras deep learning Python library provides an example of how to implement the encoder-decoder model for machine translation (lstm_seq2seq.py), described by the library’s creator in the post: “A ten-minute introduction to sequence-to-sequence learning in Keras.”

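For a detailed breakdown of this model see the post:

- How to Define an Encoder-Decoder Sequence-to-Sequence Model for Neural Machine Translation in Keras

For more information on the use of return_state, which might be new to you, see the post:

- Understand the Difference Between Return Sequences and Return States for LSTMs in Keras

For more help getting started with the Keras Functional API, see the post:

- How to Use the Keras Functional API for Deep Learning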

Using the code in that example as a starting point, we can develop a generic function to define an encoder-decoder recurrent neural network. Below is this function named *define_models()*.

    # returns train, inference_encoder and inference_decoder models
    def define_models(n_input, n_output, n_units):
        # define training encoder
        encoder_inputs = Input(shape=(None, n_input))
        encoder = LSTM(n_units, return_state=True)
        encoder_outputs, state_h, state_c = encoder(encoder_inputs)
        encoder_states = [state_h, state_c]
        # define training decoder
        decoder_inputs = Input(shape=(None, n_output))
        decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)
        decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
        decoder_dense = Dense(n_output, activation='softmax')
        decoder_outputs = decoder_dense(decoder_outputs)
        model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
        # define inference encoder
        encoder_model = Model(encoder_inputs, encoder_states)
        # define inference decoder
        decoder_state_input_h = Input(shape=(n_units,))
        decoder_state_input_c = Input(shape=(n_units,))
        decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
        decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
        decoder_states = [state_h, state_c]
        decoder_outputs = decoder_dense(decoder_outputs)
        decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)
        # return all models
        return model, encoder_model, decoder_model

The function takes 3 arguments, as follows:

- **n_input**: The cardinality of the input sequence, e.g. number of features, words, or characters for each time step.
- **n_output**: The cardinality of the output sequence, e.g. number of features, words, or characters for each time step.
- **n_units**: The number of cells to create in the encoder and decoder models, e.g. 128 or 256.

The function then creates and returns 3 models, as follows:

- **train**: Model that can be trained given source, target, and shifted target sequences.
- **inference_encoder**: Encoder model used when making a prediction for a new source sequence.
- **inference_decoder**: Decoder model used when making a prediction for a new source sequence.

The model is trained given source and target sequences where the model takes both the source and a shifted version of the target sequence as input and predicts the whole target sequence.

For example, one source sequence may be [1,2,3] and the target sequence [4,5,6]. The inputs and outputs to the model during training would be:

    Input1: ['1', '2', '3']
    Input2: ['_', '4', '5']
    Output: ['4', '5', '6']

The model is intended to be called recursively when generating target sequences for new source sequences.

The source sequence is encoded and the target sequence is generated one element at a time, using a “start of sequence” character such as ‘_’ to start the process. Therefore, in the above case, the following input-output pairs would occur during training:

    t, Input1, Input2, Output
    1, ['1', '2', '3'], '_', '4'
    2, ['1', '2', '3'], '4', '5'
    3, ['1', '2', '3'], '5', '6'

Here you can see how the recursive use of the model can be used to build up output sequences.

During prediction, the *inference_encoder* model is used to encode the input sequence once which returns states that are used to initialize the *inference_decoder* model. From that point, the *inference_decoder* model is used to generate predictions step by step.

The function below named *predict_sequence()* can be used after the model is trained to generate a target sequence given a source sequence.

    # generate target given source sequence
    def predict_sequence(infenc, infdec, source, n_steps, cardinality):
        # encode
        state = infenc.predict(source)
        # start of sequence input
        target_seq = array([0.0 for _ in range(cardinality)]).reshape(1, 1, cardinality)
        # collect predictions
        output = list()
        for t in range(n_steps):
            # predict next char
            yhat, h, c = infdec.predict([target_seq] + state)
            # store prediction
            output.append(yhat[0,0,:])
            # update state
            state = [h, c]
            # update target sequence
            target_seq = yhat
        return array(output)

This function takes 5 arguments as follows:

- **infenc**: Encoder model used when making a prediction for a new source sequence.
- **infdec**: Decoder model used when making a prediction for a new source sequence.
- **source**: Encoded source sequence.
- **n_steps**: Number of time steps in the target sequence.
- **cardinality**: The cardinality of the output sequence, e.g. the number of features, words, or characters for each time step.

The function then returns a NumPy array containing the predicted target sequence, with one vector of predicted probabilities per time step.
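The recursive calling pattern can be seen in isolation with plain Python stand-ins. In this sketch, `toy_encoder()` and `toy_decoder_step()` are hypothetical replacements for the trained inference models and involve no learning; only the loop structure (encode once, then feed each output and state back in) mirrors *predict_sequence()*.

```python
def toy_encoder(source, n_out):
    """Stand-in for the inference encoder: the 'state' is the first n_out source items."""
    return list(source[:n_out])


def toy_decoder_step(prev_token, state):
    """Stand-in for one decoder call: emit the last remembered item and shrink the state.

    A real decoder would also condition on prev_token; this toy version ignores it.
    """
    next_token = state[-1]
    return next_token, state[:-1]


def greedy_decode(source, n_steps, start_token=0):
    # encode the source sequence once
    state = toy_encoder(source, n_steps)
    # begin with the start-of-sequence value
    token = start_token
    output = []
    for _ in range(n_steps):
        # predict the next element, then carry the output and state forward
        token, state = toy_decoder_step(token, state)
        output.append(token)
    return output


print(greedy_decode([13, 28, 18, 7, 9, 5], 3))
```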

In this section, we define a contrived and scalable sequence-to-sequence prediction problem.

The source sequence is a series of randomly generated integer values, such as [20, 36, 40, 10, 34, 28], and the target sequence is a reversed pre-defined subset of the input sequence, such as the first 3 elements in reverse order [40, 36, 20].

The length of the source sequence is configurable; so is the cardinality of the input and output sequence and the length of the target sequence.

We will use source sequences of 6 elements, a cardinality of 50, and target sequences of 3 elements.

Below are some more examples to make this concrete.

    Source, Target
    [13, 28, 18, 7, 9, 5] [18, 28, 13]
    [29, 44, 38, 15, 26, 22] [38, 44, 29]
    [27, 40, 31, 29, 32, 1] [31, 40, 27]
    ...

You are encouraged to explore larger and more complex variations. Post your findings in the comments below.

Let’s start off by defining a function to generate a sequence of random integers.

We will use the value of 0 as the padding or start of sequence character, therefore it is reserved and we cannot use it in our source sequences. To achieve this, we will add 1 to our configured cardinality to ensure the one-hot encoding is large enough (e.g. a value of 1 maps to a ‘1’ value in index 1).

For example:

n_features = 50 + 1

We can use the *randint()* Python function to generate random integers in a range between 1 and one less than the problem’s cardinality. The *generate_sequence()* function below generates a sequence of random integers.

    # generate a sequence of random integers
    def generate_sequence(length, n_unique):
        return [randint(1, n_unique-1) for _ in range(length)]

Next, we need to create the corresponding output sequence given the source sequence.

To keep things simple, we will select the first n elements of the source sequence as the target sequence and reverse them.

    # define target sequence
    target = source[:n_out]
    target.reverse()

We also need a version of the output sequence shifted forward by one time step that we can use as the mock target generated so far, including the start of sequence value in the first time step. We can create this from the target sequence directly.

    # create padded input target sequence
    target_in = [0] + target[:-1]

Now that all of the sequences have been defined, we can one-hot encode them, i.e. transform them into sequences of binary vectors. We can use the Keras built-in *to_categorical()* function to achieve this.
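To make the encoding itself concrete, here is a plain-Python sketch of what a one-hot encoding of an integer sequence looks like. It is not the Keras implementation; the helper names `encode_one_hot()` and `decode_one_hot()` are hypothetical and used only for illustration.

```python
def encode_one_hot(sequence, n_classes):
    """Map each integer to a binary vector with a single 1 at that integer's index."""
    encoded = []
    for value in sequence:
        vector = [0] * n_classes
        vector[value] = 1
        encoded.append(vector)
    return encoded


def decode_one_hot(encoded):
    """Recover each integer as the index of the largest value in its vector."""
    return [vector.index(max(vector)) for vector in encoded]


encoded = encode_one_hot([1, 3], 5)
print(encoded)
print(decode_one_hot(encoded))
```

With the start-of-sequence value reserved as 0, index 0 of each vector is only ever set for the padding position in the shifted target.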

We can put all of this into a function named *get_dataset()* that will generate a specific number of sequences that we can use to train a model.

    # prepare data for the LSTM
    def get_dataset(n_in, n_out, cardinality, n_samples):
        X1, X2, y = list(), list(), list()
        for _ in range(n_samples):
            # generate source sequence
            source = generate_sequence(n_in, cardinality)
            # define target sequence
            target = source[:n_out]
            target.reverse()
            # create padded input target sequence
            target_in = [0] + target[:-1]
            # encode
            src_encoded = to_categorical([source], num_classes=cardinality)
            tar_encoded = to_categorical([target], num_classes=cardinality)
            tar2_encoded = to_categorical([target_in], num_classes=cardinality)
            # store
            X1.append(src_encoded)
            X2.append(tar2_encoded)
            y.append(tar_encoded)
        return array(X1), array(X2), array(y)

Finally, we need to be able to decode a one-hot encoded sequence to make it readable again.

This is needed both for printing the generated target sequences and for easily comparing whether the full predicted target sequence matches the expected target sequence. The *one_hot_decode()* function will decode an encoded sequence.

    # decode a one hot encoded string
    def one_hot_decode(encoded_seq):
        return [argmax(vector) for vector in encoded_seq]

We can tie all of this together and test these functions.

A complete worked example is listed below.

    from random import randint
    from numpy import array
    from numpy import argmax
    from keras.utils import to_categorical

    # generate a sequence of random integers
    def generate_sequence(length, n_unique):
        return [randint(1, n_unique-1) for _ in range(length)]

    # prepare data for the LSTM
    def get_dataset(n_in, n_out, cardinality, n_samples):
        X1, X2, y = list(), list(), list()
        for _ in range(n_samples):
            # generate source sequence
            source = generate_sequence(n_in, cardinality)
            # define target sequence
            target = source[:n_out]
            target.reverse()
            # create padded input target sequence
            target_in = [0] + target[:-1]
            # encode
            src_encoded = to_categorical([source], num_classes=cardinality)
            tar_encoded = to_categorical([target], num_classes=cardinality)
            tar2_encoded = to_categorical([target_in], num_classes=cardinality)
            # store
            X1.append(src_encoded)
            X2.append(tar2_encoded)
            y.append(tar_encoded)
        return array(X1), array(X2), array(y)

    # decode a one hot encoded string
    def one_hot_decode(encoded_seq):
        return [argmax(vector) for vector in encoded_seq]

    # configure problem
    n_features = 50 + 1
    n_steps_in = 6
    n_steps_out = 3
    # generate a single source and target sequence
    X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 1)
    print(X1.shape, X2.shape, y.shape)
    print('X1=%s, X2=%s, y=%s' % (one_hot_decode(X1[0]), one_hot_decode(X2[0]), one_hot_decode(y[0])))

Running the example first prints the shape of the generated dataset, ensuring the 3D shape required to train the model matches our expectations.

The generated sequence is then decoded and printed to screen demonstrating both that the preparation of source and target sequences matches our intention and that the decode operation is working.

    (1, 6, 51) (1, 3, 51) (1, 3, 51)
    X1=[32, 16, 12, 34, 25, 24], X2=[0, 12, 16], y=[12, 16, 32]

We are now ready to develop a model for this sequence-to-sequence prediction problem.

In this section, we will apply the encoder-decoder LSTM model developed in the first section to the sequence-to-sequence prediction problem developed in the second section.

The first step is to configure the problem.

    # configure problem
    n_features = 50 + 1
    n_steps_in = 6
    n_steps_out = 3

Next, we must define the models and compile the training model.

    # define model
    train, infenc, infdec = define_models(n_features, n_features, 128)
    train.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Next, we can generate a training dataset of 100,000 examples and train the model.

    # generate training dataset
    X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 100000)
    print(X1.shape,X2.shape,y.shape)
    # train model
    train.fit([X1, X2], y, epochs=1)

Once the model is trained, we can evaluate it. We will do this by making predictions for 100 source sequences and counting the number of target sequences that were predicted correctly. We will use the numpy *array_equal()* function on the decoded sequences to check for equality.

    # evaluate LSTM
    total, correct = 100, 0
    for _ in range(total):
        X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 1)
        target = predict_sequence(infenc, infdec, X1, n_steps_out, n_features)
        if array_equal(one_hot_decode(y[0]), one_hot_decode(target)):
            correct += 1
    print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))
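Note that this exact-match count is a stricter measure than the per-time-step `accuracy` metric passed to *compile()*, because a sequence only counts as correct when every element matches. The plain-Python sketch below illustrates the difference on made-up predictions; the function names and the example values are hypothetical.

```python
def sequence_accuracy(expected, predicted):
    """Fraction of sequences that match exactly, element for element."""
    matches = sum(1 for e, p in zip(expected, predicted) if e == p)
    return matches / len(expected)


def step_accuracy(expected, predicted):
    """Fraction of individual time steps that match."""
    total = sum(len(e) for e in expected)
    matches = sum(1 for e, p in zip(expected, predicted)
                  for a, b in zip(e, p) if a == b)
    return matches / total


# made-up decoded sequences: the second prediction has one wrong element
expected = [[18, 28, 13], [38, 44, 29]]
predicted = [[18, 28, 13], [38, 44, 30]]
print('Sequence accuracy: %.2f' % sequence_accuracy(expected, predicted))
print('Step accuracy: %.2f' % step_accuracy(expected, predicted))
```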

Finally, we will generate some predictions and print the decoded source, target, and predicted target sequences to get an idea of whether the model is working as expected.

Putting all of these elements together, the complete code example is listed below.

    from random import randint
    from numpy import array
    from numpy import argmax
    from numpy import array_equal
    from keras.utils import to_categorical
    from keras.models import Model
    from keras.layers import Input
    from keras.layers import LSTM
    from keras.layers import Dense

    # generate a sequence of random integers
    def generate_sequence(length, n_unique):
        return [randint(1, n_unique-1) for _ in range(length)]

    # prepare data for the LSTM
    def get_dataset(n_in, n_out, cardinality, n_samples):
        X1, X2, y = list(), list(), list()
        for _ in range(n_samples):
            # generate source sequence
            source = generate_sequence(n_in, cardinality)
            # define target sequence
            target = source[:n_out]
            target.reverse()
            # create padded input target sequence
            target_in = [0] + target[:-1]
            # encode
            src_encoded = to_categorical([source], num_classes=cardinality)
            tar_encoded = to_categorical([target], num_classes=cardinality)
            tar2_encoded = to_categorical([target_in], num_classes=cardinality)
            # store
            X1.append(src_encoded)
            X2.append(tar2_encoded)
            y.append(tar_encoded)
        return array(X1), array(X2), array(y)

    # returns train, inference_encoder and inference_decoder models
    def define_models(n_input, n_output, n_units):
        # define training encoder
        encoder_inputs = Input(shape=(None, n_input))
        encoder = LSTM(n_units, return_state=True)
        encoder_outputs, state_h, state_c = encoder(encoder_inputs)
        encoder_states = [state_h, state_c]
        # define training decoder
        decoder_inputs = Input(shape=(None, n_output))
        decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)
        decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
        decoder_dense = Dense(n_output, activation='softmax')
        decoder_outputs = decoder_dense(decoder_outputs)
        model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
        # define inference encoder
        encoder_model = Model(encoder_inputs, encoder_states)
        # define inference decoder
        decoder_state_input_h = Input(shape=(n_units,))
        decoder_state_input_c = Input(shape=(n_units,))
        decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
        decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
        decoder_states = [state_h, state_c]
        decoder_outputs = decoder_dense(decoder_outputs)
        decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)
        # return all models
        return model, encoder_model, decoder_model

    # generate target given source sequence
    def predict_sequence(infenc, infdec, source, n_steps, cardinality):
        # encode
        state = infenc.predict(source)
        # start of sequence input
        target_seq = array([0.0 for _ in range(cardinality)]).reshape(1, 1, cardinality)
        # collect predictions
        output = list()
        for t in range(n_steps):
            # predict next char
            yhat, h, c = infdec.predict([target_seq] + state)
            # store prediction
            output.append(yhat[0,0,:])
            # update state
            state = [h, c]
            # update target sequence
            target_seq = yhat
        return array(output)

    # decode a one hot encoded string
    def one_hot_decode(encoded_seq):
        return [argmax(vector) for vector in encoded_seq]

    # configure problem
    n_features = 50 + 1
    n_steps_in = 6
    n_steps_out = 3
    # define model
    train, infenc, infdec = define_models(n_features, n_features, 128)
    train.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    # generate training dataset
    X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 100000)
    print(X1.shape,X2.shape,y.shape)
    # train model
    train.fit([X1, X2], y, epochs=1)
    # evaluate LSTM
    total, correct = 100, 0
    for _ in range(total):
        X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 1)
        target = predict_sequence(infenc, infdec, X1, n_steps_out, n_features)
        if array_equal(one_hot_decode(y[0]), one_hot_decode(target)):
            correct += 1
    print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))
    # spot check some examples
    for _ in range(10):
        X1, X2, y = get_dataset(n_steps_in, n_steps_out, n_features, 1)
        target = predict_sequence(infenc, infdec, X1, n_steps_out, n_features)
        print('X=%s y=%s, yhat=%s' % (one_hot_decode(X1[0]), one_hot_decode(y[0]), one_hot_decode(target)))

Running the example first prints the shape of the prepared dataset.

(100000, 6, 51) (100000, 3, 51) (100000, 3, 51)

Next, the model is fit. You should see a progress bar and the run should take less than one minute on a modern multi-core CPU.

100000/100000 [==============================] - 50s - loss: 0.6344 - acc: 0.7968

Next, the model is evaluated and the accuracy printed.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

We can see that the model achieves 100% accuracy on new randomly generated examples.

Accuracy: 100.00%

Finally, 10 new examples are generated and target sequences are predicted. Again, we can see that the model correctly predicts the output sequence in each case and the expected value matches the reversed first 3 elements of the source sequences.

    X=[22, 17, 23, 5, 29, 11] y=[23, 17, 22], yhat=[23, 17, 22]
    X=[28, 2, 46, 12, 21, 6] y=[46, 2, 28], yhat=[46, 2, 28]
    X=[12, 20, 45, 28, 18, 42] y=[45, 20, 12], yhat=[45, 20, 12]
    X=[3, 43, 45, 4, 33, 27] y=[45, 43, 3], yhat=[45, 43, 3]
    X=[34, 50, 21, 20, 11, 6] y=[21, 50, 34], yhat=[21, 50, 34]
    X=[47, 42, 14, 2, 31, 6] y=[14, 42, 47], yhat=[14, 42, 47]
    X=[20, 24, 34, 31, 37, 25] y=[34, 24, 20], yhat=[34, 24, 20]
    X=[4, 35, 15, 14, 47, 33] y=[15, 35, 4], yhat=[15, 35, 4]
    X=[20, 28, 21, 39, 5, 25] y=[21, 28, 20], yhat=[21, 28, 20]
    X=[50, 38, 17, 25, 31, 48] y=[17, 38, 50], yhat=[17, 38, 50]

You now have a template for an encoder-decoder LSTM model that you can apply to your own sequence-to-sequence prediction problems.

This section provides more resources on the topic if you are looking to go deeper.

- How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda
- How to Define an Encoder-Decoder Sequence-to-Sequence Model for Neural Machine Translation in Keras
- Understand the Difference Between Return Sequences and Return States for LSTMs in Keras
- How to Use the Keras Functional API for Deep Learning

- A ten-minute introduction to sequence-to-sequence learning in Keras
- Keras seq2seq Code Example (lstm_seq2seq)
- Keras Functional API
- LSTM API in Keras

In this tutorial, you discovered how to develop an encoder-decoder recurrent neural network for sequence-to-sequence prediction problems with Keras.

Specifically, you learned:

- How to correctly define a sophisticated encoder-decoder model in Keras for sequence-to-sequence prediction.
- How to define a contrived yet scalable sequence-to-sequence prediction problem that you can use to evaluate the encoder-decoder LSTM model.
- How to apply the encoder-decoder LSTM model in Keras to address the scalable integer sequence-to-sequence prediction problem.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Gentle Introduction to Global Attention for Encoder-Decoder Recurrent Neural Networks appeared first on MachineLearningMastery.com.

Attention is an extension to the encoder-decoder model that improves the performance of the approach on longer sequences. Global attention is a simplification of attention that may be easier to implement in declarative deep learning libraries like Keras and may achieve better results than the classic attention mechanism.

In this post, you will discover the global attention mechanism for encoder-decoder recurrent neural network models.

After reading this post, you will know:

- The encoder-decoder model for sequence-to-sequence prediction problems such as machine translation.
- The attention mechanism that improves the performance of encoder-decoder models on long sequences.
- The global attention mechanism that simplifies the attention mechanism and may achieve better results.

**Kick-start your project** with my new book Long Short-Term Memory Networks With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into 4 parts; they are:

- Encoder-Decoder Model
- Attention
- Global Attention
- Global Attention in More Detail

The encoder-decoder model is a way of organizing recurrent neural networks to tackle sequence-to-sequence prediction problems where the number of input and output time steps differ.

The model was developed for the problem of machine translation, such as translating sentences in French to English.

The model involves two sub-models, as follows:

- **Encoder**: An RNN model that reads the entire source sequence to a fixed-length encoding.
- **Decoder**: An RNN model that uses the encoded input sequence and decodes it to output the target sequence.

The image below shows the relationship between the encoder and the decoder models.

The Long Short-Term Memory recurrent neural network is commonly used for the encoder and decoder. The encoder output that describes the source sequence is used to start the decoding process, conditioned on the words already generated as output so far. Specifically, the hidden state of the encoder for the last time step of the input is used to initialize the state of the decoder.

The LSTM computes this conditional probability by first obtaining the fixed-dimensional representation v of the input sequence (x1, …, xT) given by the last hidden state of the LSTM, and then computing the probability of y1, …, yT’ with a standard LSTM-LM formulation whose initial hidden state is set to the representation v of x1, …, xT

— Sequence to Sequence Learning with Neural Networks, 2014.

The image below shows the explicit encoding of the source sequence to a context vector c which is used along with the words generated so far to output the next word in the target sequence.

However, […], both yt and h(t) are also conditioned on yt−1 and on the summary c of the input sequence.

— Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, 2014.

The encoder-decoder model was shown to be an end-to-end model that performed well on challenging sequence-to-sequence prediction problems such as machine translation.

The model appeared to be limited on very long sequences. The reason for this was believed to be the fixed-length encoding of the source sequence.

A potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus.

— Neural Machine Translation by Jointly Learning to Align and Translate, 2015.

In their 2015 paper titled “*Neural Machine Translation by Jointly Learning to Align and Translate*,” Bahdanau, et al. describe an attention mechanism to address this issue.

Attention is a mechanism that provides a richer encoding of the source sequence from which to construct a context vector that can then be used by the decoder.

Attention allows the model to learn what encoded words in the source sequence to pay attention to and to what degree during the prediction of each word in the target sequence.

The hidden state for each input time step is gathered from the encoder, instead of the hidden state for the final time step of the source sequence.

A context vector is constructed specifically for each output word in the target sequence. First, each hidden state from the encoder is scored using a neural network, then the scores are normalized to be a probability over the encoder’s hidden states. Finally, the probabilities are used to calculate a weighted sum of the encoder hidden states to provide a context vector to be used in the decoder.
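The three steps (score, normalize, weighted sum) can be sketched in plain Python. This is a rough illustration rather than the implementation from the paper: the scoring network is reduced to a single tanh layer, and the weights `W_enc`, `W_dec`, and `v`, along with all state values, are hypothetical numbers chosen by hand instead of being learned.

```python
from math import exp, tanh


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Normalize scores so that they are positive and sum to 1."""
    peak = max(scores)
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_score(enc_state, dec_state, W_enc, W_dec, v):
    """Score one encoder hidden state against the prior decoder state."""
    hidden = [tanh(a + b) for a, b in zip(matvec(W_enc, enc_state), matvec(W_dec, dec_state))]
    return sum(vi * hi for vi, hi in zip(v, hidden))


def attention_context(enc_states, dec_state, W_enc, W_dec, v):
    # 1. score each encoder hidden state
    scores = [additive_score(h, dec_state, W_enc, W_dec, v) for h in enc_states]
    # 2. normalize the scores to probabilities
    weights = softmax(scores)
    # 3. weighted sum of the encoder hidden states
    size = len(enc_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, enc_states)) for i in range(size)]
    return weights, context


# hypothetical values: three encoder time steps with two units each
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec_state = [1.0, 0.0]
W_enc = [[1.0, 0.0], [0.0, 1.0]]
W_dec = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 0.0]
weights, context = attention_context(enc_states, dec_state, W_enc, W_dec, v)
print(weights)
print(context)
```

A new set of weights, and therefore a new context vector, would be computed in this way for every output time step.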

For a fuller explanation for how Bahdanau attention works with a worked example, see the post:

In their paper “Effective Approaches to Attention-based Neural Machine Translation,” Stanford NLP researchers Minh-Thang Luong, et al. propose an attention mechanism for the encoder-decoder model for machine translation called “global attention.”

It is proposed as a simplification of the attention mechanism proposed by Bahdanau, et al. in their paper “Neural Machine Translation by Jointly Learning to Align and Translate.” In Bahdanau attention, the attention calculation requires the output of the decoder from the prior time step.

Global attention, on the other hand, makes use of the output from the encoder and decoder for the current time step only. This makes it attractive to implement in vectorized libraries such as Keras.

… our computation path is simpler; we go from ht -> at -> ct -> ~ht then make a prediction […] On the other hand, at any time t, Bahdanau et al. (2015) build from the previous hidden state ht−1 -> at -> ct -> ht, which, in turn, goes through a deep-output and a maxout layer before making predictions.

— Effective Approaches to Attention-based Neural Machine Translation, 2015.

The model evaluated in the Luong et al. paper differs from the one presented by Bahdanau, et al. (e.g. reversed input sequence instead of bidirectional inputs, LSTM instead of GRU elements, and the use of dropout). Nevertheless, the model with global attention achieves better results on a standard machine translation task.

… the global attention approach gives a significant boost of +2.8 BLEU, making our model slightly better than the base attentional system of Bahdanau et al.

— Effective Approaches to Attention-based Neural Machine Translation, 2015.

Next, let’s take a closer look at how global attention is calculated.

Global attention is an extension of the attentional encoder-decoder model for recurrent neural networks.

Although developed for machine translation, it is relevant for other language generation tasks, such as caption generation and text summarization, and even sequence prediction tasks in general.

We can divide the calculation of global attention into the following computation steps for an encoder-decoder network that predicts one time step given an input sequence. See the paper for the relevant equations.

- **Problem**. The input sequence is provided as input to the encoder (X).
- **Encoding**. The encoder RNN encodes the input sequence and outputs a sequence of the same length (hs).
- **Decoding**. The decoder interprets the encoding and outputs a target decoding (ht).
- **Alignment**. Each encoded time step is scored using the target decoding, then the scores are normalized using a softmax function. Four different scoring functions are proposed:
    - **dot**: the dot product between target decoding and source encoding.
    - **general**: the dot product between target decoding and the weighted source encoding.
    - **concat**: a neural network processing of the concatenated source encoding and target decoding.
    - **location**: a softmax of the weighted target decoding.
- **Context Vector**. The alignment weights are applied to the source encoding by calculating the weighted sum to result in the context vector.
- **Final Decoding**. The context vector and the target decoding are concatenated, weighted, and transferred using a tanh function.

The final decoding is passed through a softmax to predict the probability of the next word in the sequence over the output vocabulary.
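The steps above can be sketched in a few lines of NumPy for a single output time step. This is a hedged illustration, not the authors' code: the function name *global_attention()*, the weight names (*W_a*, *W_c*, *W_s*), the sizes, and the random weights are assumptions for the example, and only the *dot* and *general* scoring functions are shown.

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

def global_attention(hs, ht, W_a, W_c, score="dot"):
    # alignment: score each encoded time step against the target decoding
    if score == "dot":
        scores = hs @ ht
    elif score == "general":
        scores = (ht @ W_a) @ hs.T
    else:
        raise ValueError("unsupported scoring function: %s" % score)
    at = softmax(scores)
    # context vector: weighted sum of the source encoding
    ct = at @ hs
    # final decoding: concatenate, weight, and transfer with tanh
    ht_tilde = np.tanh(W_c @ np.concatenate([ct, ht]))
    return ht_tilde, at, ct

# illustrative sizes and random (untrained) weights, assumed for this sketch
rng = np.random.default_rng(7)
n_steps, n_units, n_vocab = 6, 4, 10
hs = rng.normal(size=(n_steps, n_units))   # encoder outputs, one per input time step
ht = rng.normal(size=(n_units,))           # decoder output for the current time step
W_a = rng.normal(size=(n_units, n_units))
W_c = rng.normal(size=(n_units, 2 * n_units))
W_s = rng.normal(size=(n_vocab, n_units))

ht_tilde, at, ct = global_attention(hs, ht, W_a, W_c, score="general")
# softmax over the output vocabulary to predict the next word
probs = softmax(W_s @ ht_tilde)
print(at.round(3))
print(probs.argmax())
```

Note that every quantity is computed from the encoder outputs and the decoder output for the current time step only, which is what makes the calculation easy to express with whole-array operations.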

The graphic below provides a high-level idea of the data flow when calculating global attention.

The authors evaluated all of the scoring functions and found generally that the simple dot scoring function appeared to perform well.

It is interesting to observe that dot works well for the global attention…

— Effective Approaches to Attention-based Neural Machine Translation, 2015.

Because of the simpler data flow, global attention may be a good candidate for implementing in declarative deep learning libraries such as TensorFlow, Theano, and wrappers like Keras.

This section provides more resources on the topic if you are looking to go deeper.

- Sequence to Sequence Learning with Neural Networks, 2014.
- Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, 2014.
- Encoder-Decoder Long Short-Term Memory Networks

- Neural Machine Translation by Jointly Learning to Align and Translate, 2014.
- How Does Attention Work in Encoder-Decoder Recurrent Neural Networks

In this post, you discovered the global attention mechanism for encoder-decoder recurrent neural network models.

Specifically, you learned:

- The encoder-decoder model for sequence-to-sequence prediction problems such as machine translation.
- The attention mechanism that improves the performance of encoder-decoder models on long sequences.
- The global attention mechanism that simplifies the attention mechanism and may achieve better results.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Gentle Introduction to Global Attention for Encoder-Decoder Recurrent Neural Networks appeared first on MachineLearningMastery.com.

The post Difference Between Return Sequences and Return States for LSTMs in Keras appeared first on MachineLearningMastery.com.

As part of this implementation, the Keras API provides access to both return sequences and return state. The use and difference between these data can be confusing when designing sophisticated recurrent neural network models, such as the encoder-decoder model.

In this tutorial, you will discover the difference and result of return sequences and return states for LSTM layers in the Keras deep learning library.

After completing this tutorial, you will know:

- That return sequences return the hidden state output for each input time step.
- That return state returns the hidden state output and cell state for the last input time step.
- That return sequences and return state can be used at the same time.

**Kick-start your project** with my new book Long Short-Term Memory Networks With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into 4 parts; they are:

- Long Short-Term Memory
- Return Sequences
- Return States
- Return States and Sequences

The Long Short-Term Memory, or LSTM, is a recurrent neural network that is comprised of internal gates.

Unlike other recurrent neural networks, the network’s internal gates allow the model to be trained successfully using backpropagation through time, or BPTT, and avoid the vanishing gradients problem.

In the Keras deep learning library, LSTM layers can be created using the LSTM() class.

Creating a layer of LSTM memory units allows you to specify the number of memory units within the layer.

Each unit or cell within the layer has an internal cell state, often abbreviated as “*c*”, and outputs a hidden state, often abbreviated as “*h*”.

The Keras API allows you to access these data, which can be useful or even required when developing sophisticated recurrent neural network architectures, such as the encoder-decoder model.

For the rest of this tutorial, we will look at the API for accessing these data.

Each LSTM cell will output one hidden state *h* for each input.

h = LSTM(X)

We can demonstrate this in Keras with a very small model with a single LSTM layer that itself contains a single LSTM cell.

In this example, we will have one input sample with 3 time steps and one feature observed at each time step:

    t1 = 0.1
    t2 = 0.2
    t3 = 0.3

The complete example is listed below.

Note: all examples in this post use the Keras functional API.

    from keras.models import Model
    from keras.layers import Input
    from keras.layers import LSTM
    from numpy import array
    # define model
    inputs1 = Input(shape=(3, 1))
    lstm1 = LSTM(1)(inputs1)
    model = Model(inputs=inputs1, outputs=lstm1)
    # define input data
    data = array([0.1, 0.2, 0.3]).reshape((1,3,1))
    # make and show prediction
    print(model.predict(data))

Running the example outputs a single hidden state for the input sequence with 3 time steps.

Your specific output value will differ given the random initialization of the LSTM weights.

[[-0.0953151]]

It is possible to access the hidden state output for each input time step.

This can be done by setting the *return_sequences* argument to *True* when defining the LSTM layer, as follows:

LSTM(1, return_sequences=True)

We can update the previous example with this change.

The full code listing is provided below.

    from keras.models import Model
    from keras.layers import Input
    from keras.layers import LSTM
    from numpy import array
    # define model
    inputs1 = Input(shape=(3, 1))
    lstm1 = LSTM(1, return_sequences=True)(inputs1)
    model = Model(inputs=inputs1, outputs=lstm1)
    # define input data
    data = array([0.1, 0.2, 0.3]).reshape((1,3,1))
    # make and show prediction
    print(model.predict(data))

Running the example returns a sequence of 3 values, one hidden state output for each input time step for the single LSTM cell in the layer.

    [[[-0.02243521]
      [-0.06210149]
      [-0.11457888]]]

You must set *return_sequences=True* when stacking LSTM layers so that the second LSTM layer has a three-dimensional sequence input. For more details, see the post:

You may also need to access the sequence of hidden state outputs when predicting a sequence of outputs with a *Dense* output layer wrapped in a TimeDistributed layer. See this post for more details:

The output of an LSTM cell or layer of cells is called the hidden state.

This is confusing, because each LSTM cell retains an internal state that is not output, called the cell state, or *c*.

Generally, we do not need to access the cell state unless we are developing sophisticated models where subsequent layers may need to have their cell state initialized with the final cell state of another layer, such as in an encoder-decoder model.

Keras provides the return_state argument to the LSTM layer that will provide access to the hidden state output (*state_h*) and the cell state (*state_c*). For example:

lstm1, state_h, state_c = LSTM(1, return_state=True)(inputs1)

This may look confusing because both lstm1 and *state_h* refer to the same hidden state output. The reason for these two tensors being separate will become clear in the next section.

We can demonstrate access to the hidden and cell states of the cells in the LSTM layer with a worked example listed below.

    from keras.models import Model
    from keras.layers import Input
    from keras.layers import LSTM
    from numpy import array
    # define model
    inputs1 = Input(shape=(3, 1))
    lstm1, state_h, state_c = LSTM(1, return_state=True)(inputs1)
    model = Model(inputs=inputs1, outputs=[lstm1, state_h, state_c])
    # define input data
    data = array([0.1, 0.2, 0.3]).reshape((1,3,1))
    # make and show prediction
    print(model.predict(data))

Running the example returns 3 arrays:

- The LSTM hidden state output for the last time step.
- The LSTM hidden state output for the last time step (again).
- The LSTM cell state for the last time step.

[array([[ 0.10951342]], dtype=float32), array([[ 0.10951342]], dtype=float32), array([[ 0.24143776]], dtype=float32)]

The hidden state and the cell state could in turn be used to initialize the states of another LSTM layer with the same number of cells.

We can access both the sequence of hidden states and the final cell state at the same time.

This can be done by configuring the LSTM layer to both return sequences and return states.

lstm1, state_h, state_c = LSTM(1, return_sequences=True, return_state=True)(inputs1)

The complete example is listed below.

    from keras.models import Model
    from keras.layers import Input
    from keras.layers import LSTM
    from numpy import array
    # define model
    inputs1 = Input(shape=(3, 1))
    lstm1, state_h, state_c = LSTM(1, return_sequences=True, return_state=True)(inputs1)
    model = Model(inputs=inputs1, outputs=[lstm1, state_h, state_c])
    # define input data
    data = array([0.1, 0.2, 0.3]).reshape((1,3,1))
    # make and show prediction
    print(model.predict(data))

Running the example, we can now see why the LSTM output tensor and hidden state output tensor are declared separately.

The layer returns the hidden state for each input time step, then separately, the hidden state output for the last time step and the cell state for the last input time step.

This can be confirmed by seeing that the last value in the returned sequences (first array) matches the value in the hidden state (second array).

    [array([[[-0.02145359],
            [-0.0540871 ],
            [-0.09228823]]], dtype=float32),
     array([[-0.09228823]], dtype=float32),
     array([[-0.19803026]], dtype=float32)]
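To see where these three outputs come from, it can help to step through an LSTM by hand. The sketch below is a simplified NumPy forward pass, not the Keras implementation: the function name *lstm_forward()*, the weight layout, the gate ordering, and the random weights are assumptions made for illustration, so the values will not match the Keras output above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(X, W, U, b, n_units):
    # X: (time steps, features), W: (features, 4*units), U: (units, 4*units), b: (4*units,)
    h = np.zeros(n_units)  # hidden state, started at zero
    c = np.zeros(n_units)  # cell state, started at zero
    sequence = []
    for x_t in X:
        z = x_t @ W + h @ U + b
        i = sigmoid(z[:n_units])                 # input gate
        f = sigmoid(z[n_units:2 * n_units])      # forget gate
        g = np.tanh(z[2 * n_units:3 * n_units])  # candidate cell values
        o = sigmoid(z[3 * n_units:])             # output gate
        c = f * c + i * g                        # update the internal cell state
        h = o * np.tanh(c)                       # hidden state output for this time step
        sequence.append(h)
    # hidden state per time step, final hidden state, final cell state
    return np.array(sequence), h, c

# one sample with 3 time steps and 1 feature, and random (untrained) weights
rng = np.random.default_rng(3)
n_units, n_features = 1, 1
W = rng.normal(size=(n_features, 4 * n_units))
U = rng.normal(size=(n_units, 4 * n_units))
b = np.zeros(4 * n_units)
X = np.array([0.1, 0.2, 0.3]).reshape((3, 1))

sequence, state_h, state_c = lstm_forward(X, W, U, b, n_units)
print(sequence)
print(state_h, state_c)
```

The array of per-step hidden states plays the role of the returned sequences, and its last row is the same value as the final hidden state, while the cell state is tracked separately and only its final value is returned.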

This section provides more resources on the topic if you are looking to go deeper.

- Keras Functional API
- LSTM API in Keras
- Long Short-Term Memory, 1997.
- Understanding LSTM Networks, 2015.
- A ten-minute introduction to sequence-to-sequence learning in Keras

In this tutorial, you discovered the difference and result of return sequences and return states for LSTM layers in the Keras deep learning library.

Specifically, you learned:

- That return sequences return the hidden state output for each input time step.
- That return state returns the hidden state output and cell state for the last input time step.
- That return sequences and return state can be used at the same time.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post Implementation Patterns for the Encoder-Decoder RNN Architecture with Attention appeared first on MachineLearningMastery.com.

Attention is a mechanism that addresses a limitation of the encoder-decoder architecture on long sequences, and that in general speeds up the learning and lifts the skill of the model on sequence-to-sequence prediction problems.

In this post, you will discover patterns for implementing the encoder-decoder model with and without attention.

After reading this post, you will know:

- The direct versus the recursive implementation pattern for the encoder-decoder recurrent neural network.
- How attention fits into the direct implementation pattern for the encoder-decoder model.
- How attention can be implemented with the recursive implementation pattern for the encoder-decoder model.

**Kick-start your project** with my new book Long Short-Term Memory Networks With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

The encoder-decoder model for recurrent neural networks is an architecture for sequence-to-sequence prediction problems where the length of input sequences may differ from the length of output sequences.

It is comprised of two sub-models, as its name suggests:

- **Encoder**: The encoder is responsible for stepping through the input time steps and encoding the entire sequence into a fixed length vector called a context vector.
- **Decoder**: The decoder is responsible for stepping through the output time steps while reading from the context vector.

A problem with the architecture is that performance is poor on long input or output sequences. This is believed to be due to the fixed-size internal representation used by the encoder.

Attention is an extension to the architecture that addresses this limitation. It works by providing a richer context from the encoder to the decoder, together with a learning mechanism by which the decoder learns where to pay attention in the richer encoding when predicting each time step in the output sequence.

For more on the encoder-decoder architecture, see the post:

There are multiple ways to implement the encoder-decoder architecture as a system.

One approach is to have the output generated in entirety from the decoder given the input to the encoder. This is how the model is often described.

… we propose a novel neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence.

— Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, 2014.

We will call this model the direct encoder-decoder implementation, for lack of a better name.

To make this clear, let’s work through a vignette for French-to-English neural machine translation.

- A sentence of French is provided to the model as input.
- The encoder reads the sentence one word at a time and encodes the sequence as a fixed-length vector.
- The decoder reads the encoded input and outputs each word in English.

Below is a depiction of this implementation.

Another implementation is to frame the model such that it generates only one word and the model is called recursively to generate the entire output sequence.

We will call this the recursive implementation (for lack of a better name) to distinguish it from the above description.

In their paper on caption generation models titled “*Where to put the Image in an Image Caption Generator*,” Marc Tanti, et al. refer to the direct approach as the “*continuous view*”:

Traditionally, neural language models are depicted […] where strings are thought of as being continuously generated. A new word is generated after each time step, with the RNN’s state being combined with the last generated word in order to generate the next word. We refer to this as the ‘continuous view’.

— Where to put the Image in an Image Caption Generator, 2017.

They refer to the recursive implementation as the “*discontinuous view*”:

We propose to think of the RNN in terms of a series of discontinuous snapshots over time, with each word being generated from the entire prefix of previous words and with the RNN’s state being reinitialised each time. We refer to this as the ‘discontinuous view’

— Where to put the Image in an Image Caption Generator, 2017.

We can step through this approach for the same French-to-English neural machine translation example using the recursive implementation.

- A sentence of French is provided to the model as input.
- The encoder reads the sentence one word at a time and encodes the sequence as a fixed-length vector.
- The decoder reads the encoded input and outputs one English word.
- The output is taken as input along with the encoded French sentence, and the previous step is repeated until the full English sentence has been generated.

Below is a depiction of this implementation.

To start the process, a “*start-of-sequence*” token may need to be provided to the model as input for the output sequence generated so far.

The entire output sequence generated so far may be replayed as input to the decoder with or without the encoded input sequence to allow the decoder to arrive at the same internal state prior to predicting the next word as would have been achieved if the model generated the entire output sequence at once, as in the previous section.

The recursive implementation can imitate outputting the entire sequence at once as in the first model.
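The recursive calling pattern can be sketched in plain Python. This is only an illustration of the control flow: *predict_next()* is a hypothetical stand-in for a trained model (here a toy word-for-word lookup), and the token names “&lt;start&gt;” and “&lt;end&gt;” and the *max_len* parameter are assumptions made for the example.

```python
# hypothetical word-for-word lookup standing in for a trained encoder-decoder model
TOY_LOOKUP = {"le": "the", "chat": "cat", "noir": "black"}

def predict_next(source, generated):
    # a real model would encode `source` and replay `generated` through the decoder;
    # this toy version simply translates the next source word
    position = len(generated) - 1  # words produced so far, excluding the start token
    if position >= len(source):
        return "<end>"
    return TOY_LOOKUP[source[position]]

def decode_recursively(source, max_len=10):
    # seed the output generated so far with a start-of-sequence token
    generated = ["<start>"]
    while len(generated) - 1 < max_len:
        # the model is called once per output word with everything generated so far
        word = predict_next(source, generated)
        if word == "<end>":
            break
        generated.append(word)
    # drop the start token before returning the output sequence
    return generated[1:]

print(decode_recursively(["le", "chat", "noir"]))
```

Each pass through the loop makes a single one-word prediction, and the growing output sequence is fed back in on the next call, which is the essence of the recursive implementation.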

The recursive implementation also allows you to vary the model and seek perhaps a simpler or more skillful model.

One example is to also encode the input sequence and use the decoder model to learn how to best combine the encoded input sequence and output sequence generated so far. Marc Tanti, et al. in their paper “*What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?*” call this the “merge model.”

… at a given time step, the merge architecture predicts what to generate next by combining the RNN-encoded prefix of the string generated so far (the ‘past’ of the generation process) with non-linguistic information (the guide of the generation process).

— What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?, 2017

The model is still called recursively, only the internal structure of the model is varied. We can make this clear with a depiction.

We can now consider the attention mechanism in the context of these different implementations for the Encoder-Decoder recurrent neural network architecture.

Canonical attention, as described by Bahdanau et al. in their paper “Neural Machine Translation by Jointly Learning to Align and Translate,” involves a few elements as follows:

- **Richer encoding**. The output from the encoder is expanded to provide information across all words in the input sequence, not just the final output from the last word in the sequence.
- **Alignment model**. A new small neural network model is used to align or relate the expanded encoding using the attended output from the decoder from the previous time step.
- **Weighted encoding**. A weighting for the alignment that can be used as a probability distribution over the encoded input sequence.
- **Weighted context vector**. The weighting applied to the encoded input sequence that can then be used to decode the next word.

Note, in all of these encoder-decoder models there is a difference between the output of the model (next predicted word) and the output of the decoder (internal representation). The decoder does not output a word directly; often a fully connected layer connected to the decoder outputs a probability distribution over the vocabulary of words, which is then searched using a heuristic like a beam search.
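As a rough illustration of searching the predicted probability distributions, below is a small beam search sketch in plain Python. It makes a strong simplifying assumption that should be stated loudly: the per-step probabilities are given up front as a fixed table, whereas in a real encoder-decoder each distribution would depend on the words chosen so far. The function name *beam_search()* and the example probabilities are made up for the sketch.

```python
from math import log

def beam_search(step_probs, k):
    # each candidate is a (sequence of word indices, negative log-likelihood) pair
    beams = [([], 0.0)]
    for probs in step_probs:
        candidates = []
        for seq, score in beams:
            # expand every kept sequence with every word in the vocabulary
            for index, p in enumerate(probs):
                candidates.append((seq + [index], score - log(p)))
        # keep the k candidates with the lowest negative log-likelihood
        beams = sorted(candidates, key=lambda c: c[1])[:k]
    return beams

# made-up distributions over a 3-word vocabulary for 3 output time steps
step_probs = [
    [0.1, 0.5, 0.4],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
beams = beam_search(step_probs, 2)
for seq, score in beams:
    print(seq, round(score, 4))
```

With a beam width of 1 this reduces to a greedy search that takes the most likely word at each step; larger widths keep more alternative sequences alive at the cost of more computation.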

For more detail on how to calculate attention in the encoder-decoder model, see the post:

We can make a cartoon of the direct encoder-decoder model with attention, as below.

Attention can be challenging to implement in a direct encoder-decoder model. This is because efficient neural network libraries use vectorized equations that require all information to be available prior to the computation.

This need is disrupted by the need for the model to access the attended output from the decoder for each prediction made.

Attention lends itself to a recursive description and implementation.

A recursive implementation of attention requires that, in addition to making the output sequence generated so far available to the decoder, the outputs of the decoder from the previous time step be provided to the attention mechanism for predicting the next word.

We can make this clearer with a cartoon.

The recursive approach also introduces additional flexibility to try out new designs.

For example, Luong, et al. in their paper “*Effective Approaches to Attention-based Neural Machine Translation*” take this a step further and propose that the output of the decoder from the previous time step (h(t-1)) can also be fed as inputs to the decoder, instead of being used in the attention calculation. They call this an “input-feeding” model.

The effects of having such connections are twofold: (a) we hope to make the model fully aware of previous alignment choices, and (b) we create a very deep network spanning both horizontally and vertically

— Effective Approaches to Attention-based Neural Machine Translation, 2015.

Interestingly, this input feeding coupled with their local attention resulted in state-of-the-art performance (at their time of writing) on a standard machine translation task.

The input-feeding approach is related to the merge model. Instead of providing the decoded output from the last time step alone, the merge model provides an encoding of all previously generated time steps.

One could imagine attention in the decoder harnessing this encoding to help decode the encoded input sequence, or perhaps employing attention on both encodings.

This section provides more resources on the topic if you are looking to go deeper.

- Encoder-Decoder Long Short-Term Memory Networks
- Attention in Long Short-Term Memory Recurrent Neural Networks
- How Does Attention Work in Encoder-Decoder Recurrent Neural Networks

- Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, 2014.
- Neural Machine Translation by Jointly Learning to Align and Translate, 2015.
- Where to put the Image in an Image Caption Generator, 2017.
- What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?, 2017
- Effective Approaches to Attention-based Neural Machine Translation, 2015.

In this post, you discovered patterns for implementing the encoder-decoder model with and without attention.

Specifically, you learned:

- The direct versus the recursive implementation pattern for the encoder-decoder recurrent neural network.
- How attention fits into the direct implementation pattern for the encoder-decoder model.
- How attention can be implemented with the recursive implementation pattern for the encoder-decoder model.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post How to Develop an Encoder-Decoder Model with Attention in Keras appeared first on MachineLearningMastery.com.

Attention is a mechanism that addresses a limitation of the encoder-decoder architecture on long sequences, and that in general speeds up the learning and lifts the skill of the model on sequence-to-sequence prediction problems.

In this tutorial, you will discover how to develop an encoder-decoder recurrent neural network with attention in Python with Keras.

After completing this tutorial, you will know:

- How to design a small and configurable problem to evaluate encoder-decoder recurrent neural networks with and without attention.
- How to design and evaluate an encoder-decoder network with and without attention for the sequence prediction problem.
- How to robustly compare the performance of encoder-decoder networks with and without attention.

**Kick-start your project** with my new book Long Short-Term Memory Networks With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Note May/2020**: The underlying APIs have changed and this tutorial may no longer be current. You may require older versions of Keras and TensorFlow, e.g. Keras 2 and TF 1.

This tutorial is divided into 6 parts; they are:

- Encoder-Decoder with Attention
- Test Problem for Attention
- Encoder-Decoder without Attention
- Custom Keras Attention Layer
- Encoder-Decoder with Attention
- Comparison of Models

This tutorial assumes you have a Python 3 SciPy environment installed.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this post:

The encoder-decoder model for recurrent neural networks is an architecture for sequence-to-sequence prediction problems.

It is comprised of two sub-models, as its name suggests:

- **Encoder**: The encoder is responsible for stepping through the input time steps and encoding the entire sequence into a fixed length vector called a context vector.
- **Decoder**: The decoder is responsible for stepping through the output time steps while reading from the context vector.

A problem with the architecture is that performance is poor on long input or output sequences. This is believed to be due to the fixed-size internal representation used by the encoder.

Attention is an extension to the architecture that addresses this limitation. It works by providing a richer context from the encoder to the decoder, together with a learning mechanism by which the decoder learns where to pay attention in the richer encoding when predicting each time step in the output sequence.

For more on attention in the encoder-decoder architecture, see the posts:

- Attention in Long Short-Term Memory Recurrent Neural Networks
- How Does Attention Work in Encoder-Decoder Recurrent Neural Networks

Before we develop models with attention, we will first define a contrived scalable test problem that we can use to determine whether attention is providing any benefit.

In this problem, we will generate sequences of random integers as input and matching output sequences comprised of a subset of the integers in the input sequence.

For example, an input sequence might be [1, 6, 2, 7, 3] and the expected output sequence might be the first two random integers in the sequence [1, 6].

We will define the problem such that the input and output sequences are the same length and pad the output sequences with “0” values as needed.

First, we need a function to generate sequences of random integers. We will use the Python randint() function to generate random integers between 0 and a maximum value and use this range as the cardinality for the problem (e.g. the number of features or an axis of difficulty).

The function *generate_sequence()* below will generate a random sequence of integers to a fixed length and with the specified cardinality.

    from random import randint

    # generate a sequence of random integers
    def generate_sequence(length, n_unique):
        return [randint(0, n_unique-1) for _ in range(length)]

    # generate random sequence
    sequence = generate_sequence(5, 50)
    print(sequence)

Running this example generates a sequence of 5 time steps where each value in the sequence is a random integer between 0 and 49.

[43, 3, 28, 34, 33]

Next, we need a function to one hot encode the discrete integer values into binary vectors.

If a cardinality of 50 is used, then each integer will be represented by a 50-element vector of 0 values and 1 in the index of the specified integer value.

The *one_hot_encode()* function below will one hot encode a given sequence of integers.

    from numpy import array

    # one hot encode sequence
    def one_hot_encode(sequence, n_unique):
        encoding = list()
        for value in sequence:
            vector = [0 for _ in range(n_unique)]
            vector[value] = 1
            encoding.append(vector)
        return array(encoding)

We also need to be able to decode an encoded sequence. This will be needed to turn a prediction from the model or an encoded expected sequence back into a sequence of integers we can read and evaluate.

The *one_hot_decode()* function below will decode a one hot encoded sequence back into a sequence of integers.

    from numpy import argmax

    # decode a one hot encoded string
    def one_hot_decode(encoded_seq):
        return [argmax(vector) for vector in encoded_seq]

We can test out these operations in the example below.

    from random import randint
    from numpy import array
    from numpy import argmax

    # generate a sequence of random integers
    def generate_sequence(length, n_unique):
        return [randint(0, n_unique-1) for _ in range(length)]

    # one hot encode sequence
    def one_hot_encode(sequence, n_unique):
        encoding = list()
        for value in sequence:
            vector = [0 for _ in range(n_unique)]
            vector[value] = 1
            encoding.append(vector)
        return array(encoding)

    # decode a one hot encoded string
    def one_hot_decode(encoded_seq):
        return [argmax(vector) for vector in encoded_seq]

    # generate random sequence
    sequence = generate_sequence(5, 50)
    print(sequence)
    # one hot encode
    encoded = one_hot_encode(sequence, 50)
    print(encoded)
    # decode
    decoded = one_hot_decode(encoded)
    print(decoded)

Running the example first prints a randomly generated sequence, then the one hot encoded version, then finally the decoded sequence again.

[3, 18, 32, 11, 36] [[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]] [3, 18, 32, 11, 36]
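As an aside, the same encoding and decoding can be written without explicit Python loops by indexing into an identity matrix. This is an optional alternative sketch, not part of the tutorial's code: the function names *one_hot_encode_vectorized()* and *one_hot_decode_vectorized()* are made up for the example.

```python
import numpy as np

def one_hot_encode_vectorized(sequence, n_unique):
    # row i of the identity matrix is the one hot vector for the integer i
    return np.eye(n_unique, dtype=int)[sequence]

def one_hot_decode_vectorized(encoded_seq):
    # the position of the largest value in each row is the encoded integer
    return np.argmax(encoded_seq, axis=-1).tolist()

sequence = [3, 18, 32, 11, 36]
encoded = one_hot_encode_vectorized(sequence, 50)
print(encoded.shape)
print(one_hot_decode_vectorized(encoded))
```

The loop-based functions are easier to read when learning; the indexed version mainly becomes useful when encoding many sequences at once.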

Finally, we need a function that can create input and output pairs of sequences to train and evaluate a model.

The function below named *get_pair()* will return one input and output sequence pair given a specified input length, output length, and cardinality. Both input and output sequences have the same length (the length of the input sequence), but the output sequence is taken as the first *n_out* elements of the input sequence and padded with zero values to the required length.

The sequences of integers are then encoded then reshaped into a 3D format required for the recurrent neural network, with the dimensions: *samples*, *time steps*, and *features*. In this case, samples is always 1 as we are only generating one input-output pair, the time steps is the input sequence length and features is the cardinality of each time step.

    # prepare data for the LSTM
    def get_pair(n_in, n_out, n_unique):
        # generate random sequence
        sequence_in = generate_sequence(n_in, n_unique)
        sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
        # one hot encode
        X = one_hot_encode(sequence_in, n_unique)
        y = one_hot_encode(sequence_out, n_unique)
        # reshape as 3D
        X = X.reshape((1, X.shape[0], X.shape[1]))
        y = y.reshape((1, y.shape[0], y.shape[1]))
        return X,y

We can put this all together and demonstrate the data preparation code.

    from random import randint
    from numpy import array
    from numpy import argmax

    # generate a sequence of random integers
    def generate_sequence(length, n_unique):
        return [randint(0, n_unique-1) for _ in range(length)]

    # one hot encode sequence
    def one_hot_encode(sequence, n_unique):
        encoding = list()
        for value in sequence:
            vector = [0 for _ in range(n_unique)]
            vector[value] = 1
            encoding.append(vector)
        return array(encoding)

    # decode a one hot encoded string
    def one_hot_decode(encoded_seq):
        return [argmax(vector) for vector in encoded_seq]

    # prepare data for the LSTM
    def get_pair(n_in, n_out, n_unique):
        # generate random sequence
        sequence_in = generate_sequence(n_in, n_unique)
        sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
        # one hot encode
        X = one_hot_encode(sequence_in, n_unique)
        y = one_hot_encode(sequence_out, n_unique)
        # reshape as 3D
        X = X.reshape((1, X.shape[0], X.shape[1]))
        y = y.reshape((1, y.shape[0], y.shape[1]))
        return X,y

    # generate random sequence
    X, y = get_pair(5, 2, 50)
    print(X.shape, y.shape)
    print('X=%s, y=%s' % (one_hot_decode(X[0]), one_hot_decode(y[0])))

Running the example generates a single input-output pair and prints the shape of both arrays.

The generated pair is then printed in a decoded form where we can see that the first two integers of the sequence are reproduced in the output sequence followed by a padding of zero values.

    (1, 5, 50) (1, 5, 50)
    X=[12, 20, 36, 40, 12], y=[12, 20, 0, 0, 0]

In this section, we will establish a performance baseline on the problem with an encoder-decoder model without attention.

We will fix the problem definition at input and output sequences of 5 time steps, with the first 2 elements of the input sequence reproduced in the output sequence, and a cardinality of 50.

    # configure problem
    n_features = 50
    n_timesteps_in = 5
    n_timesteps_out = 2

We can develop a simple encoder-decoder model in Keras by taking the output from an encoder LSTM model, repeating it n times for the number of timesteps in the output sequence, then using a decoder to predict the output sequence.
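The repeat step is easy to picture with plain Python lists. The sketch below is not Keras code: `repeat_vector` is a hypothetical helper that mimics what the `RepeatVector` layer does to a single sample, and the made-up 3-element encoding stands in for the 150-unit encoder output.

```python
# Illustration only: a plain-Python stand-in for what RepeatVector does to
# one sample. `repeat_vector` is a hypothetical helper, not a Keras function.

def repeat_vector(encoding, n):
    """Copy one fixed-length encoding n times, giving n identical time steps."""
    return [list(encoding) for _ in range(n)]

# a made-up 3-element encoding standing in for the 150-unit encoder output
encoding = [0.2, -0.5, 0.9]
decoder_input = repeat_vector(encoding, 5)

# shape is (time steps, features) for this single sample
print(len(decoder_input), len(decoder_input[0]))
for step in decoder_input:
    print(step)
```

Every decoder time step receives the same vector, so the decoder has to rely on its own internal state to tell the output positions apart.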

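For more detail on how to define an encoder-decoder architecture in Keras, see the post Encoder-Decoder Long Short-Term Memory Networks, listed with the other resources near the end of this tutorial.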

We will configure the encoder and decoder with the same number of units, in this case 150. We will use the efficient Adam implementation of gradient descent and optimize the categorical cross entropy loss function, given that the problem is technically a multi-class classification problem.
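To make the loss concrete, here is a minimal sketch of categorical cross entropy for a single time step, written in plain Python rather than taken from Keras. The 4-class toy vectors are made up (the tutorial uses 50 classes), and the small `eps` guard against `log(0)` is an assumption of this sketch.

```python
from math import log

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross entropy between a one hot target and a predicted distribution."""
    # eps avoids log(0); the exact clipping used by a library may differ
    return -sum(t * log(max(p, eps)) for t, p in zip(y_true, y_pred))

# toy example with 4 classes instead of 50; the true class is index 2
y_true = [0, 0, 1, 0]
confident = [0.05, 0.05, 0.85, 0.05]
unsure = [0.25, 0.25, 0.25, 0.25]

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print('confident: %.3f, unsure: %.3f' % (loss_confident, loss_unsure))
```

Because the target is one hot, only the probability assigned to the true class contributes, so the loss shrinks as the softmax output concentrates on the correct integer.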

The configuration for the model was found after a little trial and error and is by no means optimized.

The code for an encoder-decoder architecture in Keras is listed below.

    # define model
    model = Sequential()
    model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
    model.add(RepeatVector(n_timesteps_in))
    model.add(LSTM(150, return_sequences=True))
    model.add(TimeDistributed(Dense(n_features, activation='softmax')))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

We will train the model on 5,000 random input-output pairs of integer sequences.

    # train LSTM
    for epoch in range(5000):
        # generate new random sequence
        X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
        # fit model for one epoch on this sequence
        model.fit(X, y, epochs=1, verbose=2)

Once trained, we will evaluate the model on 100 new randomly generated integer sequences and only mark a prediction correct when the entire output sequence matches the expected value.

    # evaluate LSTM
    total, correct = 100, 0
    for _ in range(total):
        X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
        yhat = model.predict(X, verbose=0)
        if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
            correct += 1
    print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))

Finally, we will print 10 examples of expected output sequences and sequences predicted by the model.

Putting all of this together, the complete example is listed below.

    from random import randint
    from numpy import array
    from numpy import argmax
    from numpy import array_equal
    from keras.models import Sequential
    from keras.layers import LSTM
    from keras.layers import Dense
    from keras.layers import TimeDistributed
    from keras.layers import RepeatVector

    # generate a sequence of random integers
    def generate_sequence(length, n_unique):
        return [randint(0, n_unique-1) for _ in range(length)]

    # one hot encode sequence
    def one_hot_encode(sequence, n_unique):
        encoding = list()
        for value in sequence:
            vector = [0 for _ in range(n_unique)]
            vector[value] = 1
            encoding.append(vector)
        return array(encoding)

    # decode a one hot encoded string
    def one_hot_decode(encoded_seq):
        return [argmax(vector) for vector in encoded_seq]

    # prepare data for the LSTM
    def get_pair(n_in, n_out, cardinality):
        # generate random sequence
        sequence_in = generate_sequence(n_in, cardinality)
        sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
        # one hot encode
        X = one_hot_encode(sequence_in, cardinality)
        y = one_hot_encode(sequence_out, cardinality)
        # reshape as 3D
        X = X.reshape((1, X.shape[0], X.shape[1]))
        y = y.reshape((1, y.shape[0], y.shape[1]))
        return X,y

    # configure problem
    n_features = 50
    n_timesteps_in = 5
    n_timesteps_out = 2
    # define model
    model = Sequential()
    model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
    model.add(RepeatVector(n_timesteps_in))
    model.add(LSTM(150, return_sequences=True))
    model.add(TimeDistributed(Dense(n_features, activation='softmax')))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # train LSTM
    for epoch in range(5000):
        # generate new random sequence
        X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
        # fit model for one epoch on this sequence
        model.fit(X, y, epochs=1, verbose=2)
    # evaluate LSTM
    total, correct = 100, 0
    for _ in range(total):
        X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
        yhat = model.predict(X, verbose=0)
        if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
            correct += 1
    print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))
    # spot check some examples
    for _ in range(10):
        X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
        yhat = model.predict(X, verbose=0)
        print('Expected:', one_hot_decode(y[0]), 'Predicted', one_hot_decode(yhat[0]))

Running this example will not take long, perhaps a few minutes on the CPU; no GPU is required.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

The accuracy of the model was reported at just under 20%.

Accuracy: 19.00%

We can see from the sample outputs that the model does get one number in the output sequence correct for most or all cases, and only struggles with the second number. All zero padding values are predicted correctly.

    Expected: [47, 0, 0, 0, 0] Predicted [47, 47, 0, 0, 0]
    Expected: [43, 31, 0, 0, 0] Predicted [43, 31, 0, 0, 0]
    Expected: [14, 22, 0, 0, 0] Predicted [14, 14, 0, 0, 0]
    Expected: [39, 31, 0, 0, 0] Predicted [39, 39, 0, 0, 0]
    Expected: [6, 4, 0, 0, 0] Predicted [6, 4, 0, 0, 0]
    Expected: [47, 0, 0, 0, 0] Predicted [47, 47, 0, 0, 0]
    Expected: [39, 33, 0, 0, 0] Predicted [39, 39, 0, 0, 0]
    Expected: [23, 2, 0, 0, 0] Predicted [23, 23, 0, 0, 0]
    Expected: [19, 28, 0, 0, 0] Predicted [19, 3, 0, 0, 0]
    Expected: [32, 33, 0, 0, 0] Predicted [32, 32, 0, 0, 0]
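The strictness of the whole-sequence accuracy measure can be checked against these ten samples. The short sketch below, in plain Python, copies the expected and predicted sequences from the output above and compares whole-sequence accuracy with a per-time-step accuracy; the per-step measure is added here for illustration only and is not part of the tutorial's evaluation.

```python
# (expected, predicted) pairs copied from the sample output above
samples = [
    ([47, 0, 0, 0, 0], [47, 47, 0, 0, 0]),
    ([43, 31, 0, 0, 0], [43, 31, 0, 0, 0]),
    ([14, 22, 0, 0, 0], [14, 14, 0, 0, 0]),
    ([39, 31, 0, 0, 0], [39, 39, 0, 0, 0]),
    ([6, 4, 0, 0, 0], [6, 4, 0, 0, 0]),
    ([47, 0, 0, 0, 0], [47, 47, 0, 0, 0]),
    ([39, 33, 0, 0, 0], [39, 39, 0, 0, 0]),
    ([23, 2, 0, 0, 0], [23, 23, 0, 0, 0]),
    ([19, 28, 0, 0, 0], [19, 3, 0, 0, 0]),
    ([32, 33, 0, 0, 0], [32, 32, 0, 0, 0]),
]

# whole-sequence accuracy: every time step must match
exact = sum(1 for expected, predicted in samples if expected == predicted)
sequence_accuracy = 100.0 * exact / len(samples)

# per-step accuracy: count matching positions individually
steps = [(e, p) for expected, predicted in samples
         for e, p in zip(expected, predicted)]
step_accuracy = 100.0 * sum(1 for e, p in steps if e == p) / len(steps)

print('Sequence accuracy: %.1f%%' % sequence_accuracy)
print('Per-step accuracy: %.1f%%' % step_accuracy)
```

On these samples, 2 of 10 sequences match exactly (20%), while 42 of 50 individual time steps match (84%). Every error falls on the second time step.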

Now we need to add attention to the encoder-decoder model.

At the time of writing, Keras does not have the capability of attention built into the library, but it is coming soon.

Until attention is officially available in Keras, we can either develop our own implementation or use an existing third-party implementation.

To speed things up, let’s use an existing third-party implementation.

Zafarali Ahmed, an intern at Datalogue, developed a custom layer for Keras that provides support for attention, presented in a 2017 post titled “How to Visualize Your Recurrent Neural Network with Attention in Keras” and a GitHub project called “keras-attention”.

The custom attention layer is called *AttentionDecoder* and is available in the custom_recurrents.py file in the GitHub project. We can reuse this code under the project’s GNU Affero General Public License v3.0.

A copy of the custom layer is listed below for completeness. Copy it and paste it into a new and separate file in your current working directory called ‘*attention_decoder.py*’.

    import tensorflow as tf
    from keras import backend as K
    from keras import regularizers, constraints, initializers, activations
    from keras.layers.recurrent import Recurrent, _time_distributed_dense
    from keras.engine import InputSpec

    tfPrint = lambda d, T: tf.Print(input_=T, data=[T, tf.shape(T)], message=d)

    class AttentionDecoder(Recurrent):

        def __init__(self, units, output_dim,
                     activation='tanh',
                     return_probabilities=False,
                     name='AttentionDecoder',
                     kernel_initializer='glorot_uniform',
                     recurrent_initializer='orthogonal',
                     bias_initializer='zeros',
                     kernel_regularizer=None,
                     bias_regularizer=None,
                     activity_regularizer=None,
                     kernel_constraint=None,
                     bias_constraint=None,
                     **kwargs):
            """
            Implements an AttentionDecoder that takes in a sequence encoded by an
            encoder and outputs the decoded states
            :param units: dimension of the hidden state and the attention matrices
            :param output_dim: the number of labels in the output space

            references:
                Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio.
                "Neural machine translation by jointly learning to align and translate."
                arXiv preprint arXiv:1409.0473 (2014).
            """
            self.units = units
            self.output_dim = output_dim
            self.return_probabilities = return_probabilities
            self.activation = activations.get(activation)
            self.kernel_initializer = initializers.get(kernel_initializer)
            self.recurrent_initializer = initializers.get(recurrent_initializer)
            self.bias_initializer = initializers.get(bias_initializer)

            self.kernel_regularizer = regularizers.get(kernel_regularizer)
            self.recurrent_regularizer = regularizers.get(kernel_regularizer)
            self.bias_regularizer = regularizers.get(bias_regularizer)
            self.activity_regularizer = regularizers.get(activity_regularizer)

            self.kernel_constraint = constraints.get(kernel_constraint)
            self.recurrent_constraint = constraints.get(kernel_constraint)
            self.bias_constraint = constraints.get(bias_constraint)

            super(AttentionDecoder, self).__init__(**kwargs)
            self.name = name
            self.return_sequences = True  # must return sequences

        def build(self, input_shape):
            """
            See Appendix 2 of Bahdanau 2014, arXiv:1409.0473
            for model details that correspond to the matrices here.
            """

            self.batch_size, self.timesteps, self.input_dim = input_shape

            if self.stateful:
                super(AttentionDecoder, self).reset_states()

            self.states = [None, None]  # y, s

            """
            Matrices for creating the context vector
            """

            self.V_a = self.add_weight(shape=(self.units,),
                                       name='V_a',
                                       initializer=self.kernel_initializer,
                                       regularizer=self.kernel_regularizer,
                                       constraint=self.kernel_constraint)
            self.W_a = self.add_weight(shape=(self.units, self.units),
                                       name='W_a',
                                       initializer=self.kernel_initializer,
                                       regularizer=self.kernel_regularizer,
                                       constraint=self.kernel_constraint)
            self.U_a = self.add_weight(shape=(self.input_dim, self.units),
                                       name='U_a',
                                       initializer=self.kernel_initializer,
                                       regularizer=self.kernel_regularizer,
                                       constraint=self.kernel_constraint)
            self.b_a = self.add_weight(shape=(self.units,),
                                       name='b_a',
                                       initializer=self.bias_initializer,
                                       regularizer=self.bias_regularizer,
                                       constraint=self.bias_constraint)
            """
            Matrices for the r (reset) gate
            """
            self.C_r = self.add_weight(shape=(self.input_dim, self.units),
                                       name='C_r',
                                       initializer=self.recurrent_initializer,
                                       regularizer=self.recurrent_regularizer,
                                       constraint=self.recurrent_constraint)
            self.U_r = self.add_weight(shape=(self.units, self.units),
                                       name='U_r',
                                       initializer=self.recurrent_initializer,
                                       regularizer=self.recurrent_regularizer,
                                       constraint=self.recurrent_constraint)
            self.W_r = self.add_weight(shape=(self.output_dim, self.units),
                                       name='W_r',
                                       initializer=self.recurrent_initializer,
                                       regularizer=self.recurrent_regularizer,
                                       constraint=self.recurrent_constraint)
            self.b_r = self.add_weight(shape=(self.units, ),
                                       name='b_r',
                                       initializer=self.bias_initializer,
                                       regularizer=self.bias_regularizer,
                                       constraint=self.bias_constraint)

            """
            Matrices for the z (update) gate
            """
            self.C_z = self.add_weight(shape=(self.input_dim, self.units),
                                       name='C_z',
                                       initializer=self.recurrent_initializer,
                                       regularizer=self.recurrent_regularizer,
                                       constraint=self.recurrent_constraint)
            self.U_z = self.add_weight(shape=(self.units, self.units),
                                       name='U_z',
                                       initializer=self.recurrent_initializer,
                                       regularizer=self.recurrent_regularizer,
                                       constraint=self.recurrent_constraint)
            self.W_z = self.add_weight(shape=(self.output_dim, self.units),
                                       name='W_z',
                                       initializer=self.recurrent_initializer,
                                       regularizer=self.recurrent_regularizer,
                                       constraint=self.recurrent_constraint)
            self.b_z = self.add_weight(shape=(self.units, ),
                                       name='b_z',
                                       initializer=self.bias_initializer,
                                       regularizer=self.bias_regularizer,
                                       constraint=self.bias_constraint)
            """
            Matrices for the proposal
            """
            self.C_p = self.add_weight(shape=(self.input_dim, self.units),
                                       name='C_p',
                                       initializer=self.recurrent_initializer,
                                       regularizer=self.recurrent_regularizer,
                                       constraint=self.recurrent_constraint)
            self.U_p = self.add_weight(shape=(self.units, self.units),
                                       name='U_p',
                                       initializer=self.recurrent_initializer,
                                       regularizer=self.recurrent_regularizer,
                                       constraint=self.recurrent_constraint)
            self.W_p = self.add_weight(shape=(self.output_dim, self.units),
                                       name='W_p',
                                       initializer=self.recurrent_initializer,
                                       regularizer=self.recurrent_regularizer,
                                       constraint=self.recurrent_constraint)
            self.b_p = self.add_weight(shape=(self.units, ),
                                       name='b_p',
                                       initializer=self.bias_initializer,
                                       regularizer=self.bias_regularizer,
                                       constraint=self.bias_constraint)
            """
            Matrices for making the final prediction vector
            """
            self.C_o = self.add_weight(shape=(self.input_dim, self.output_dim),
                                       name='C_o',
                                       initializer=self.recurrent_initializer,
                                       regularizer=self.recurrent_regularizer,
                                       constraint=self.recurrent_constraint)
            self.U_o = self.add_weight(shape=(self.units, self.output_dim),
                                       name='U_o',
                                       initializer=self.recurrent_initializer,
                                       regularizer=self.recurrent_regularizer,
                                       constraint=self.recurrent_constraint)
            self.W_o = self.add_weight(shape=(self.output_dim, self.output_dim),
                                       name='W_o',
                                       initializer=self.recurrent_initializer,
                                       regularizer=self.recurrent_regularizer,
                                       constraint=self.recurrent_constraint)
            self.b_o = self.add_weight(shape=(self.output_dim, ),
                                       name='b_o',
                                       initializer=self.bias_initializer,
                                       regularizer=self.bias_regularizer,
                                       constraint=self.bias_constraint)

            # For creating the initial state:
            self.W_s = self.add_weight(shape=(self.input_dim, self.units),
                                       name='W_s',
                                       initializer=self.recurrent_initializer,
                                       regularizer=self.recurrent_regularizer,
                                       constraint=self.recurrent_constraint)

            self.input_spec = [
                InputSpec(shape=(self.batch_size, self.timesteps, self.input_dim))]
            self.built = True

        def call(self, x):
            # store the whole sequence so we can "attend" to it at each timestep
            self.x_seq = x

            # apply a dense layer over the time dimension of the sequence
            # do it here because it doesn't depend on any previous steps
            # therefore we can save computation time:
            self._uxpb = _time_distributed_dense(self.x_seq, self.U_a, b=self.b_a,
                                                 input_dim=self.input_dim,
                                                 timesteps=self.timesteps,
                                                 output_dim=self.units)

            return super(AttentionDecoder, self).call(x)

        def get_initial_state(self, inputs):
            # apply the matrix on the first time step to get the initial s0.
            s0 = activations.tanh(K.dot(inputs[:, 0], self.W_s))

            # from keras.layers.recurrent to initialize a vector of (batchsize,
            # output_dim)
            y0 = K.zeros_like(inputs)  # (samples, timesteps, input_dims)
            y0 = K.sum(y0, axis=(1, 2))  # (samples, )
            y0 = K.expand_dims(y0)  # (samples, 1)
            y0 = K.tile(y0, [1, self.output_dim])

            return [y0, s0]

        def step(self, x, states):

            ytm, stm = states

            # repeat the hidden state to the length of the sequence
            _stm = K.repeat(stm, self.timesteps)

            # now multiply the weight matrix with the repeated hidden state
            _Wxstm = K.dot(_stm, self.W_a)

            # calculate the attention probabilities
            # this relates how much other timesteps contributed to this one.
            et = K.dot(activations.tanh(_Wxstm + self._uxpb),
                       K.expand_dims(self.V_a))
            at = K.exp(et)
            at_sum = K.sum(at, axis=1)
            at_sum_repeated = K.repeat(at_sum, self.timesteps)
            at /= at_sum_repeated  # vector of size (batchsize, timesteps, 1)

            # calculate the context vector
            context = K.squeeze(K.batch_dot(at, self.x_seq, axes=1), axis=1)
            # ~~~> calculate new hidden state
            # first calculate the "r" gate:

            rt = activations.sigmoid(
                K.dot(ytm, self.W_r)
                + K.dot(stm, self.U_r)
                + K.dot(context, self.C_r)
                + self.b_r)

            # now calculate the "z" gate
            zt = activations.sigmoid(
                K.dot(ytm, self.W_z)
                + K.dot(stm, self.U_z)
                + K.dot(context, self.C_z)
                + self.b_z)

            # calculate the proposal hidden state:
            s_tp = activations.tanh(
                K.dot(ytm, self.W_p)
                + K.dot((rt * stm), self.U_p)
                + K.dot(context, self.C_p)
                + self.b_p)

            # new hidden state:
            st = (1-zt)*stm + zt * s_tp

            yt = activations.softmax(
                K.dot(ytm, self.W_o)
                + K.dot(stm, self.U_o)
                + K.dot(context, self.C_o)
                + self.b_o)

            if self.return_probabilities:
                return at, [yt, st]
            else:
                return yt, [yt, st]

        def compute_output_shape(self, input_shape):
            """
            For Keras internal compatibility checking
            """
            if self.return_probabilities:
                return (None, self.timesteps, self.timesteps)
            else:
                return (None, self.timesteps, self.output_dim)

        def get_config(self):
            """
            For rebuilding models on load time.
            """
            config = {
                'output_dim': self.output_dim,
                'units': self.units,
                'return_probabilities': self.return_probabilities
            }
            base_config = super(AttentionDecoder, self).get_config()
            return dict(list(base_config.items()) + list(config.items()))

We can make use of this custom layer in our projects by importing it as follows:

from attention_decoder import AttentionDecoder

The layer implements attention as described by Bahdanau, et al. in their paper “Neural Machine Translation by Jointly Learning to Align and Translate.”
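The core of that attention calculation can be sketched in a few lines of plain Python. This is a simplified illustration and not the *AttentionDecoder* code: it scores each encoder output against the previous decoder state with `v_a . tanh(W_a s + U_a h)`, normalizes the scores with a softmax, and forms the context vector as the weighted sum. The helper names, the toy sizes, and all numeric values are made up, and the bias term used by the layer is omitted.

```python
from math import exp, tanh

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def additive_attention(s_prev, encoder_outputs, W_a, U_a, v_a):
    """Return attention weights and the context vector for one decoder step."""
    Ws = matvec(W_a, s_prev)
    scores = []
    for h in encoder_outputs:
        Uh = matvec(U_a, h)
        # alignment score: v_a . tanh(W_a s_prev + U_a h)
        scores.append(sum(v * tanh(a + b) for v, a, b in zip(v_a, Ws, Uh)))
    # softmax over the input time steps (shifted by the max for stability)
    top = max(scores)
    exps = [exp(e - top) for e in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # context: attention-weighted sum of the encoder outputs
    n_dims = len(encoder_outputs[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_outputs))
               for i in range(n_dims)]
    return weights, context

# made-up toy values: 3 input time steps, 2-dim encoder outputs and state
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
s_prev = [0.1, -0.2]
W_a = [[0.5, -0.3], [0.2, 0.8]]
U_a = [[1.0, 0.4], [-0.6, 0.9]]
v_a = [0.7, -0.5]

weights, context = additive_attention(s_prev, encoder_outputs, W_a, U_a, v_a)
print('weights:', [round(w, 3) for w in weights])
print('context:', [round(c, 3) for c in context])
```

The weights sum to one across the input time steps, so the context vector is always a blend of the encoder outputs rather than a single fixed encoding.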

The code is explained well in the original post and linked to both the LSTM and attention equations.

A limitation of this implementation is that it must output sequences that are the same length as the input sequences, the specific limitation that the encoder-decoder architecture was designed to overcome.

Importantly, the new layer manages both the repeating of the decoding as performed by the second LSTM, as well as the softmax output for the model as was performed by the Dense output layer in the encoder-decoder model without attention. This greatly simplifies the code for the model.

It is important to note that the custom layer is built upon the Recurrent layer in Keras, which, at the time of writing, is marked as legacy code, and presumably will be removed from the project at some point.

Now that we have an implementation of attention that we can use, we can develop an encoder-decoder model with attention for our contrived sequence prediction problem.

The model with the attention layer is defined below. We can see that the layer handles some of the machinery of the encoder-decoder model itself, making defining the model simpler.

    # define model
    model = Sequential()
    model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
    model.add(AttentionDecoder(150, n_features))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

That’s it. The rest of the example is the same.

The complete example is listed below.

    from random import randint
    from numpy import array
    from numpy import argmax
    from numpy import array_equal
    from keras.models import Sequential
    from keras.layers import LSTM
    from attention_decoder import AttentionDecoder

    # generate a sequence of random integers
    def generate_sequence(length, n_unique):
        return [randint(0, n_unique-1) for _ in range(length)]

    # one hot encode sequence
    def one_hot_encode(sequence, n_unique):
        encoding = list()
        for value in sequence:
            vector = [0 for _ in range(n_unique)]
            vector[value] = 1
            encoding.append(vector)
        return array(encoding)

    # decode a one hot encoded string
    def one_hot_decode(encoded_seq):
        return [argmax(vector) for vector in encoded_seq]

    # prepare data for the LSTM
    def get_pair(n_in, n_out, cardinality):
        # generate random sequence
        sequence_in = generate_sequence(n_in, cardinality)
        sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
        # one hot encode
        X = one_hot_encode(sequence_in, cardinality)
        y = one_hot_encode(sequence_out, cardinality)
        # reshape as 3D
        X = X.reshape((1, X.shape[0], X.shape[1]))
        y = y.reshape((1, y.shape[0], y.shape[1]))
        return X,y

    # configure problem
    n_features = 50
    n_timesteps_in = 5
    n_timesteps_out = 2
    # define model
    model = Sequential()
    model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
    model.add(AttentionDecoder(150, n_features))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # train LSTM
    for epoch in range(5000):
        # generate new random sequence
        X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
        # fit model for one epoch on this sequence
        model.fit(X, y, epochs=1, verbose=2)
    # evaluate LSTM
    total, correct = 100, 0
    for _ in range(total):
        X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
        yhat = model.predict(X, verbose=0)
        if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
            correct += 1
    print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))
    # spot check some examples
    for _ in range(10):
        X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
        yhat = model.predict(X, verbose=0)
        print('Expected:', one_hot_decode(y[0]), 'Predicted', one_hot_decode(yhat[0]))

Running the example prints the skill of the model on 100 randomly generated input-output pairs.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

With the same resources and same amount of training, the model with attention performs much better.

Accuracy: 95.00%

Spot-checking some sample outputs and predicted sequences, we can see very few errors, even in cases when there is a zero value in the first two elements.

    Expected: [48, 47, 0, 0, 0] Predicted [48, 47, 0, 0, 0]
    Expected: [7, 46, 0, 0, 0] Predicted [7, 46, 0, 0, 0]
    Expected: [32, 30, 0, 0, 0] Predicted [32, 2, 0, 0, 0]
    Expected: [3, 25, 0, 0, 0] Predicted [3, 25, 0, 0, 0]
    Expected: [45, 4, 0, 0, 0] Predicted [45, 4, 0, 0, 0]
    Expected: [49, 9, 0, 0, 0] Predicted [49, 9, 0, 0, 0]
    Expected: [22, 23, 0, 0, 0] Predicted [22, 23, 0, 0, 0]
    Expected: [29, 36, 0, 0, 0] Predicted [29, 36, 0, 0, 0]
    Expected: [0, 29, 0, 0, 0] Predicted [0, 29, 0, 0, 0]
    Expected: [11, 26, 0, 0, 0] Predicted [11, 26, 0, 0, 0]

Although we are getting better results from the model with attention, the results were reported from a single run of each model.

In this case, we seek a more robust finding by repeating the evaluation of each model multiple times and reporting the average performance over those runs. For more information on this robust approach to evaluating neural network models, see the post How to Evaluate the Skill of Deep Learning Models, listed with the other resources near the end of this tutorial.

We can define a function to create each type of model, as follows.

    # define the encoder-decoder model
    def baseline_model(n_timesteps_in, n_features):
        model = Sequential()
        model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
        model.add(RepeatVector(n_timesteps_in))
        model.add(LSTM(150, return_sequences=True))
        model.add(TimeDistributed(Dense(n_features, activation='softmax')))
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model

    # define the encoder-decoder with attention model
    def attention_model(n_timesteps_in, n_features):
        model = Sequential()
        model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
        model.add(AttentionDecoder(150, n_features))
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model

We can then define a function to fit and evaluate the accuracy of a fit model and return the accuracy score.

    # train and evaluate a model, return accuracy
    def train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features):
        # train LSTM
        for epoch in range(5000):
            # generate new random sequence
            X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
            # fit model for one epoch on this sequence
            model.fit(X, y, epochs=1, verbose=0)
        # evaluate LSTM
        total, correct = 100, 0
        for _ in range(total):
            X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
            yhat = model.predict(X, verbose=0)
            if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
                correct += 1
        return float(correct)/float(total)*100.0

Putting this together, we can repeat the process of creating, training, and evaluating each type of model multiple times and report the mean accuracy over the repeats. To keep running times down, we will repeat each model evaluation 10 times, although if you have the resources, you could increase this to 30 or 100 times.

The complete example is listed below.

    from random import randint
    from numpy import array
    from numpy import argmax
    from numpy import array_equal
    from keras.models import Sequential
    from keras.layers import LSTM
    from keras.layers import Dense
    from keras.layers import TimeDistributed
    from keras.layers import RepeatVector
    from attention_decoder import AttentionDecoder

    # generate a sequence of random integers
    def generate_sequence(length, n_unique):
        return [randint(0, n_unique-1) for _ in range(length)]

    # one hot encode sequence
    def one_hot_encode(sequence, n_unique):
        encoding = list()
        for value in sequence:
            vector = [0 for _ in range(n_unique)]
            vector[value] = 1
            encoding.append(vector)
        return array(encoding)

    # decode a one hot encoded string
    def one_hot_decode(encoded_seq):
        return [argmax(vector) for vector in encoded_seq]

    # prepare data for the LSTM
    def get_pair(n_in, n_out, cardinality):
        # generate random sequence
        sequence_in = generate_sequence(n_in, cardinality)
        sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
        # one hot encode
        X = one_hot_encode(sequence_in, cardinality)
        y = one_hot_encode(sequence_out, cardinality)
        # reshape as 3D
        X = X.reshape((1, X.shape[0], X.shape[1]))
        y = y.reshape((1, y.shape[0], y.shape[1]))
        return X,y

    # define the encoder-decoder model
    def baseline_model(n_timesteps_in, n_features):
        model = Sequential()
        model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
        model.add(RepeatVector(n_timesteps_in))
        model.add(LSTM(150, return_sequences=True))
        model.add(TimeDistributed(Dense(n_features, activation='softmax')))
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model

    # define the encoder-decoder with attention model
    def attention_model(n_timesteps_in, n_features):
        model = Sequential()
        model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
        model.add(AttentionDecoder(150, n_features))
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model

    # train and evaluate a model, return accuracy
    def train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features):
        # train LSTM
        for epoch in range(5000):
            # generate new random sequence
            X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
            # fit model for one epoch on this sequence
            model.fit(X, y, epochs=1, verbose=0)
        # evaluate LSTM
        total, correct = 100, 0
        for _ in range(total):
            X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
            yhat = model.predict(X, verbose=0)
            if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
                correct += 1
        return float(correct)/float(total)*100.0

    # configure problem
    n_features = 50
    n_timesteps_in = 5
    n_timesteps_out = 2
    n_repeats = 10
    # evaluate encoder-decoder model
    print('Encoder-Decoder Model')
    results = list()
    for _ in range(n_repeats):
        model = baseline_model(n_timesteps_in, n_features)
        accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features)
        results.append(accuracy)
        print(accuracy)
    print('Mean Accuracy: %.2f%%' % (sum(results)/float(n_repeats)))
    # evaluate encoder-decoder with attention model
    print('Encoder-Decoder With Attention Model')
    results = list()
    for _ in range(n_repeats):
        model = attention_model(n_timesteps_in, n_features)
        accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features)
        results.append(accuracy)
        print(accuracy)
    print('Mean Accuracy: %.2f%%' % (sum(results)/float(n_repeats)))

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running this example prints the accuracy for each model repeat to give you an idea of the progress of the run.

    Encoder-Decoder Model
    20.0
    23.0
    23.0
    18.0
    28.000000000000004
    28.999999999999996
    23.0
    26.0
    21.0
    20.0
    Mean Accuracy: 23.10%
    Encoder-Decoder With Attention Model
    98.0
    91.0
    94.0
    93.0
    96.0
    99.0
    97.0
    94.0
    99.0
    96.0
    Mean Accuracy: 95.70%

We can see that even averaged over 10 runs, the attention model still shows far better performance than the encoder-decoder model without attention: 95.70% vs 23.10%.
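The mean alone hides how much the runs vary. As a small sketch using only the Python standard library, the snippet below summarizes the ten scores per model reported above; the two values that carried floating-point noise are rounded to 28.0 and 29.0, and reporting the spread is an addition for illustration rather than part of the original evaluation.

```python
from statistics import mean, stdev

# accuracy scores copied from the run reported above (noise rounded away)
baseline = [20.0, 23.0, 23.0, 18.0, 28.0, 29.0, 23.0, 26.0, 21.0, 20.0]
attention = [98.0, 91.0, 94.0, 93.0, 96.0, 99.0, 97.0, 94.0, 99.0, 96.0]

for name, scores in (('Encoder-Decoder', baseline), ('With Attention', attention)):
    print('%s: mean=%.2f%% stdev=%.2f min=%.1f max=%.1f' % (
        name, mean(scores), stdev(scores), min(scores), max(scores)))
```

The worst attention run (91%) is far above the best baseline run (29%), so the ranges do not overlap in this set of repeats.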

A good extension to this evaluation would be to capture the model loss each epoch for each model, take the average, and compare how the loss changes over time for the architecture with and without attention.
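One way to sketch the averaging step of that extension is shown below. The helper `average_traces` is hypothetical and the loss values are made up for illustration; in practice each trace might be collected by appending `model.fit(...).history['loss'][0]` after every call to `fit()` in the training loop.

```python
def average_traces(traces):
    """Average several equal-length loss traces position by position."""
    n_runs = len(traces)
    return [sum(values) / n_runs for values in zip(*traces)]

# made-up loss values: three repeats, four training steps each
traces = [
    [3.9, 3.1, 2.4, 1.8],
    [4.0, 3.3, 2.2, 1.6],
    [3.8, 2.9, 2.6, 2.0],
]
mean_trace = average_traces(traces)
print([round(v, 2) for v in mean_trace])
```

Computing one averaged trace per architecture would allow the two learning curves to be compared on the same axes.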

I expect that this trace would show the attention model achieving better skill much sooner than the model without attention, further highlighting the benefit of the approach.

This section provides more resources on the topic if you are looking to go deeper.

- Attention in Long Short-Term Memory Recurrent Neural Networks
- How Does Attention Work in Encoder-Decoder Recurrent Neural Networks
- Encoder-Decoder Long Short-Term Memory Networks
- How to Evaluate the Skill of Deep Learning Models
- How to Visualize Your Recurrent Neural Network with Attention in Keras, 2017.
- keras-attention GitHub Project
- Neural Machine Translation by Jointly Learning to Align and Translate, 2015.

In this tutorial, you discovered how to develop an encoder-decoder recurrent neural network with attention in Python with Keras.

Specifically, you learned:

- How to design a small and configurable problem to evaluate encoder-decoder recurrent neural networks with and without attention.
- How to design and evaluate an encoder-decoder network with and without attention for the sequence prediction problem.
- How to robustly compare the performance of encoder-decoder networks with and without attention.

The post How to Develop an Encoder-Decoder Model with Attention in Keras appeared first on MachineLearningMastery.com.

The post A Gentle Introduction to RNN Unrolling appeared first on MachineLearningMastery.com.

This creates a network graph or circuit diagram with cycles, which can make it difficult to understand how information moves through the network.

In this post, you will discover the concept of unrolling or unfolding recurrent neural networks.

After reading this post, you will know:

- The standard conception of recurrent neural networks with cyclic connections.
- The concept of unrolling of the forward pass when the network is copied for each input time step.
- The concept of unrolling of the backward pass for updating network weights during training.

**Kick-start your project** with my new book Long Short-Term Memory Networks With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

Recurrent neural networks are a type of neural network where outputs from previous time steps are taken as inputs for the current time step.

We can demonstrate this with a picture.

Below we can see that the network takes both the output of the network from the previous time step as input and uses the internal state from the previous time step as a starting point for the current time step.

RNNs are fit and make predictions over many time steps. We can simplify the model by unfolding or unrolling the RNN graph over the input sequence.

A useful way to visualise RNNs is to consider the update graph formed by ‘unfolding’ the network along the input sequence.

— Supervised Sequence Labelling with Recurrent Neural Networks, 2008.

Consider the case where we have multiple time steps of input (X(t), X(t+1), …), multiple time steps of internal state (u(t), u(t+1), …), and multiple time steps of outputs (y(t), y(t+1), …).

We can unfold the above network schematic into a graph without any cycles.

We can see that the cycle is removed and that the output (y(t)) and internal state (u(t)) from the previous time step are passed on to the network as inputs for processing the next time step.

Key in this conceptualization is that the network (RNN) does not change between the unfolded time steps. Specifically, the same weights are used for each time step and it is only the outputs and the internal states that differ.

In this way, it is as though the whole network (topology and weights) is copied for each time step in the input sequence.

Further, each copy of the network may be thought of as an additional layer of the same feed forward neural network.

RNNs, once unfolded in time, can be seen as very deep feedforward networks in which all the layers share the same weights.

— Deep learning, Nature, 2015

This is a useful conceptual tool and visualization to help in understanding what is going on in the network during the forward pass. It may or may not also be the way that the network is implemented by the deep learning library.
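We can make the idea concrete with a small sketch. It assumes a toy one-unit RNN with a tanh activation in which the output is simply the internal state, and the weight values are arbitrary placeholders. The function names are hypothetical and the code is not how any particular library implements the forward pass.

```python
from math import tanh

# Assumed toy weights for a one-unit RNN: input weight, recurrent weight, bias.
W_X, W_U, B = 0.5, 0.8, 0.1

def step(x, u_prev):
    """One time step: new state from the current input and the previous state."""
    return tanh(W_X * x + W_U * u_prev + B)

def forward_loop(xs, u0=0.0):
    """Cyclic view: one cell applied repeatedly, with the state fed back in."""
    u, states = u0, []
    for x in xs:
        u = step(x, u)
        states.append(u)
    return states

def forward_unrolled(x1, x2, x3, u0=0.0):
    """Unrolled view for three time steps: three copies of the cell, no cycle."""
    u1 = step(x1, u0)  # copy 1
    u2 = step(x2, u1)  # copy 2, same weights
    u3 = step(x3, u2)  # copy 3, same weights
    return [u1, u2, u3]

xs = [0.1, 0.2, 0.3]
print(forward_loop(xs))
print(forward_unrolled(*xs))
```

Both functions compute the same states; the unrolled version simply writes out each copy of the network explicitly, which is why it reads like a three-layer feed-forward network with shared weights.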

The idea of network unfolding plays a bigger part in the way recurrent neural networks are implemented for the backward pass.

As is standard with [backpropagation through time], the network is unfolded over time, so that connections arriving at layers are viewed as coming from the previous timestep.

— Framewise phoneme classification with bidirectional LSTM and other neural network architectures, 2005

Importantly, the backpropagation of error for a given time step depends on the activation of the network at the prior time step.

In this way, the backward pass requires the conceptualization of unfolding the network.

Error is propagated back to the first input time step of the sequence so that the error gradient can be calculated and the weights of the network can be updated.

Like standard backpropagation, [backpropagation through time] consists of a repeated application of the chain rule. The subtlety is that, for recurrent networks, the loss function depends on the activation of the hidden layer not only through its influence on the output layer, but also through its influence on the hidden layer at the next timestep.

— Supervised Sequence Labelling with Recurrent Neural Networks, 2008
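The sketch below illustrates this dependence for a toy one-unit RNN. It assumes a tanh activation, no bias, an output equal to the internal state, and a squared error loss summed over the time steps; all function names and values are hypothetical. Note how the error at each time step combines the direct loss term with the error arriving from the next time step.

```python
from math import tanh

def forward(xs, w_x, w_u):
    """Return the states u(0..T) of a one-unit RNN, with u(0) = 0."""
    us = [0.0]
    for x in xs:
        us.append(tanh(w_x * x + w_u * us[-1]))
    return us

def loss(xs, targets, w_x, w_u):
    """Squared error summed over all time steps."""
    us = forward(xs, w_x, w_u)
    return sum(0.5 * (u - t) ** 2 for u, t in zip(us[1:], targets))

def bptt(xs, targets, w_x, w_u):
    """Gradients of the loss with respect to w_x and w_u."""
    us = forward(xs, w_x, w_u)
    grad_w_x = grad_w_u = 0.0
    d_next = 0.0  # error arriving from the later time step
    # walk the unrolled network from the last time step back to the first
    for t in range(len(xs), 0, -1):
        # error at u(t): direct loss term plus error from time step t+1
        d_u = (us[t] - targets[t - 1]) + d_next
        d_pre = d_u * (1.0 - us[t] ** 2)  # chain rule back through tanh
        grad_w_x += d_pre * xs[t - 1]     # the same weight is used at every step
        grad_w_u += d_pre * us[t - 1]
        d_next = d_pre * w_u              # pass the error back to u(t-1)
    return grad_w_x, grad_w_u

# arbitrary placeholder inputs, targets and weights
xs, targets = [0.5, -0.1, 0.3], [0.2, 0.1, 0.4]
w_x, w_u = 0.7, -0.4
print(bptt(xs, targets, w_x, w_u))
```

Because the weights are shared, the gradient for each weight is the sum of its contributions from every unrolled time step.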

Unfolding the recurrent network graph also introduces additional concerns. Each time step requires a new copy of the network, which in turn takes up memory, especially for larger networks with thousands or millions of weights. The memory requirements of training large recurrent networks can quickly balloon as the number of time steps climbs into the hundreds.

… it is required to unroll the RNNs by the length of the input sequence. By unrolling an RNN N times, every activations of the neurons inside the network are replicated N times, which consumes a huge amount of memory especially when the sequence is very long. This hinders a small footprint implementation of online learning or adaptation. Also, this “full unrolling” makes a parallel training with multiple sequences inefficient on shared memory models such as graphics processing units (GPUs)

— Online Sequence Training of Recurrent Neural Networks with Connectionist Temporal Classification, 2015
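A rough back-of-the-envelope calculation shows how the stored activations grow with the number of time steps. The sketch assumes one 4-byte floating point value is kept per unit, per sample, per time step; real LSTM layers store several values per unit (gates and cell state), so this likely underestimates actual usage. The function name and the layer sizes are hypothetical.

```python
def activation_memory_bytes(time_steps, batch_size, units, bytes_per_value=4):
    """Rough count of stored activations: one value per unit, sample and time step."""
    return time_steps * batch_size * units * bytes_per_value

# assumed sizes: a batch of 32 sequences and a layer with 512 units
for n_steps in [10, 100, 1000]:
    n_bytes = activation_memory_bytes(n_steps, batch_size=32, units=512)
    print('%4d time steps: %.1f MB' % (n_steps, n_bytes / 1e6))
```

Under these assumptions the memory grows linearly with the number of unrolled time steps.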

This section provides more resources on the topic if you are looking to go deeper.

- Online Sequence Training of Recurrent Neural Networks with Connectionist Temporal Classification, 2015
- Framewise phoneme classification with bidirectional LSTM and other neural network architectures, 2005
- Supervised Sequence Labelling with Recurrent Neural Networks, 2008
- Deep learning, Nature, 2015

- A Gentle Introduction to Backpropagation Through Time
- Understanding LSTM Networks, 2015
- Rolling and Unrolling RNNs, 2016
- Unfolding RNNs, 2017

In this tutorial, you discovered the visualization and conceptual tool of unrolling recurrent neural networks.

Specifically, you learned:

- The standard conception of recurrent neural networks with cyclic connections.
- The concept of unrolling of the forward pass when the network is copied for each input time step.
- The concept of unrolling of the backward pass for updating network weights during training.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to RNN Unrolling appeared first on MachineLearningMastery.com.

]]>The post Making Predictions with Sequences appeared first on MachineLearningMastery.com.

]]>The sequence imposes an order on the observations that must be preserved when training models and making predictions.

Generally, prediction problems that involve sequence data are referred to as sequence prediction problems, although there are a suite of problems that differ based on the input and output sequences.

In this tutorial, you will discover the different types of sequence prediction problems.

After completing this tutorial, you will know:

- The 4 types of sequence prediction problems.
- Definitions for each type of sequence prediction problem by the experts.
- Real-world examples of each type of sequence prediction problem.

**Kick-start your project** with my new book Long Short-Term Memory Networks With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into 5 parts; they are:

- Sequence
- Sequence Prediction
- Sequence Classification
- Sequence Generation
- Sequence to Sequence Prediction

Often we deal with sets in applied machine learning, such as train or test sets of samples.

Each sample in the set can be thought of as an observation from the domain.

In a set, the order of the observations is not important.

A sequence is different. The sequence imposes an explicit order on the observations.

The order is important. It must be respected in the formulation of prediction problems that use the sequence data as input or output for the model.

Sequence prediction involves predicting the next value for a given input sequence.

For example:

- Given: 1, 2, 3, 4, 5
- Predict: 6
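One common way to frame this for a model, sketched below, is to slide a fixed-size window over the sequence and pair each window with the value that follows it. The function name `to_supervised()` and the window size are assumptions for illustration.

```python
def to_supervised(sequence, n_steps):
    """Split a sequence into (input window, next value) pairs."""
    X, y = [], []
    for i in range(len(sequence) - n_steps):
        X.append(sequence[i:i + n_steps])  # the preceding elements
        y.append(sequence[i + n_steps])    # the element to predict
    return X, y

X, y = to_supervised([1, 2, 3, 4, 5, 6], n_steps=5)
print(X, y)  # [[1, 2, 3, 4, 5]] [6]
```

A smaller window, such as `n_steps=3`, would produce several overlapping training pairs from the same sequence.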

Sequence prediction attempts to predict elements of a sequence on the basis of the preceding elements

— Sequence Learning: From Recognition and Prediction to Sequential Decision Making, 2001.

A prediction model is trained with a set of training sequences. Once trained, the model is used to perform sequence predictions. A prediction consists in predicting the next items of a sequence. This task has numerous applications such as web page prefetching, consumer product recommendation, weather forecasting and stock market prediction.

— CPT+: Decreasing the time/space complexity of the Compact Prediction Tree, 2015

Sequence prediction may also generally be referred to as “*sequence learning*”.

Learning of sequential data continues to be a fundamental task and a challenge in pattern recognition and machine learning. Applications involving sequential data may require prediction of new events, generation of new sequences, or decision making such as classification of sequences or sub-sequences.

— On Prediction Using Variable Order Markov Models, 2004.

Technically, we could refer to all of the following problems in this post as a type of sequence prediction problem. This can make things confusing for beginners.

Some examples of sequence prediction problems include:

- **Weather Forecasting**. Given a sequence of observations about the weather over time, predict the expected weather tomorrow.
- **Stock Market Prediction**. Given a sequence of movements of a security over time, predict the next movement of the security.
- **Product Recommendation**. Given a sequence of past purchases of a customer, predict the next purchase of a customer.

Sequence classification involves predicting a class label for a given input sequence.

For example:

- Given: 1, 2, 3, 4, 5
- Predict: “good” or “bad”

The objective of sequence classification is to build a classification model using a labeled dataset D so that the model can be used to predict the class label of an unseen sequence.

— Chapter 14, Data Classification: Algorithms and Applications, 2015

The input sequence may be comprised of real values or discrete values. In the latter case, such problems may be referred to as discrete sequence classification.

Some examples of sequence classification problems include:

- **DNA Sequence Classification**. Given a DNA sequence of ACGT values, predict whether the sequence is from a coding or non-coding region.
- **Anomaly Detection**. Given a sequence of observations, predict whether the sequence is anomalous or not.
- **Sentiment Analysis**. Given a sequence of text such as a review or a tweet, predict whether the sentiment of the text is positive or negative.
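As a toy illustration of discrete sequence classification, the sketch below labels a new sequence with the class of its nearest neighbor in a small labeled dataset D, using the Hamming distance between equal-length sequences. The dataset, the labels, and the function names are all made up for illustration, and nearest neighbor is only one of many possible models.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError('sequences must have the same length')
    return sum(x != y for x, y in zip(a, b))

def predict_label(dataset, sequence):
    """Return the label of the training sequence closest to the given sequence."""
    _, nearest_label = min(dataset, key=lambda pair: hamming(pair[0], sequence))
    return nearest_label

# made-up labeled dataset D of (sequence, class label) pairs
D = [('ACGTAC', 'good'), ('ACGTAA', 'good'), ('TTGCAT', 'bad'), ('TTGCAA', 'bad')]

print(predict_label(D, 'ACGTTC'))  # good
```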

Sequence generation involves generating a new output sequence that has the same general characteristics as other sequences in the corpus.

For example:

- Given: [1, 3, 5], [7, 9, 11]
- Predict: [3, 5, 7]

[recurrent neural networks] can be trained for sequence generation by processing real data sequences one step at a time and predicting what comes next. Assuming the predictions are probabilistic, novel sequences can be generated from a trained network by iteratively sampling from the network’s output distribution, then feeding in the sample as input at the next step. In other words by making the network treat its inventions as if they were real, much like a person dreaming

— Generating Sequences With Recurrent Neural Networks, 2013.
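The sampling loop described in the quote can be sketched without a neural network. Below, a simple table of observed transitions stands in for the trained network's output distribution; this substitution is an assumption made to keep the example small, and the corpus and function names are hypothetical. The key step is the same: sample the next item, then feed the sample back in as the input for the following step.

```python
import random
from collections import Counter, defaultdict

def fit_transitions(corpus):
    """Count which item follows which across all training sequences."""
    counts = defaultdict(Counter)
    for seq in corpus:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return counts

def generate(counts, start, length, rng):
    """Sample a new sequence one step at a time, feeding each sample back in."""
    seq = [start]
    # stop early if the last item was never followed by anything in the corpus
    while len(seq) < length and counts[seq[-1]]:
        options = counts[seq[-1]]
        items = list(options.keys())
        weights = list(options.values())
        # sample the next item in proportion to how often it was observed
        seq.append(rng.choices(items, weights=weights, k=1)[0])
    return seq

# made-up corpus of sequences that share the same general pattern
corpus = [[1, 3, 5], [5, 7, 9], [7, 9, 11]]
counts = fit_transitions(corpus)
print(generate(counts, start=3, length=3, rng=random.Random(1)))  # [3, 5, 7]
```

In this tiny corpus every item has only one observed successor, so the output is the same for any seed; a larger corpus would give varied samples.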

Some examples of sequence generation problems include:

- **Text Generation**. Given a corpus of text, such as the works of Shakespeare, generate new sentences or paragraphs of text that read like Shakespeare.
- **Handwriting Prediction**. Given a corpus of handwriting examples, generate handwriting for new phrases that has the properties of handwriting in the corpus.
- **Music Generation**. Given a corpus of examples of music, generate new musical pieces that have the properties of the corpus.

Sequence generation may also refer to the generation of a sequence given a single observation as input.

An example is the automatic textual description of images.

- **Image Caption Generation**. Given an image as input, generate a sequence of words that describes the image.

Being able to automatically describe the content of an image using properly formed English sentences is a very challenging task, but it could have great impact, for instance by helping visually impaired people better understand the content of images on the web. […] Indeed, a description must capture not only the objects contained in an image, but it also must express how these objects relate to each other as well as their attributes and the activities they are involved in. Moreover, the above semantic knowledge has to be expressed in a natural language like English, which means that a language model is needed in addition to visual understanding.

— Show and Tell: A Neural Image Caption Generator, 2015

Sequence-to-sequence prediction involves predicting an output sequence given an input sequence.

For example:

- Given: 1, 2, 3, 4, 5
- Predict: 6, 7, 8, 9, 10

Despite their flexibility and power, [deep neural networks] can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality. It is a significant limitation, since many important problems are best expressed with sequences whose lengths are not known a-priori. For example, speech recognition and machine translation are sequential problems. Likewise, question answering can also be seen as mapping a sequence of words representing the question to a sequence of words representing the answer.

— Sequence to Sequence Learning with Neural Networks, 2014

It is a subtle but challenging extension of sequence prediction where, rather than predicting a single next value in the sequence, a new sequence is predicted that may or may not have the same length or be of the same type as the input sequence.

This type of problem has recently seen a lot of study in the area of automatic text translation (e.g. translating English to French) and may be referred to by the abbreviation seq2seq.

seq2seq learning, at its core, uses recurrent neural networks to map variable-length input sequences to variable-length output sequences. While relatively new, the seq2seq approach has achieved state-of-the-art results in not only its original application – machine translation.

— Multi-task Sequence to Sequence Learning, 2016.

If the input and output sequences are a time series, then the problem may be referred to as multi-step time series forecasting.

Some examples of sequence-to-sequence problems include:

- **Multi-Step Time Series Forecasting**. Given a time series of observations, predict a sequence of observations for a range of future time steps.
- **Text Summarization**. Given a document of text, predict a shorter sequence of text that describes the salient parts of the source document.
- **Program Execution**. Given the textual description of a program or mathematical equation, predict the sequence of characters that describes the correct output.
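For the time series case, one possible framing is sketched below: each training sample pairs a window of past observations with the window of observations that follows it. The function name `to_seq2seq()` and the window sizes are assumptions for illustration.

```python
def to_seq2seq(series, n_in, n_out):
    """Split a series into (input sequence, output sequence) pairs."""
    X, y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])                 # observed time steps
        y.append(series[i + n_in:i + n_in + n_out])  # future time steps to predict
    return X, y

X, y = to_seq2seq(list(range(1, 11)), n_in=5, n_out=5)
print(X, y)  # [[1, 2, 3, 4, 5]] [[6, 7, 8, 9, 10]]
```

The input and output lengths are set independently, so the two sequences need not be the same length.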

This section provides more resources on the topic if you are looking to go deeper.

- Sequence on Wikipedia
- CPT+: Decreasing the time/space complexity of the Compact Prediction Tree, 2015
- On Prediction Using Variable Order Markov Models, 2004
- An Introduction to Sequence Prediction, 2016
- Sequence Learning: From Recognition and Prediction to Sequential Decision Making, 2001
- Chapter 14, Discrete Sequence Classification, Data Classification: Algorithms and Applications, 2015
- Generating Sequences With Recurrent Neural Networks, 2013
- Show and Tell: A Neural Image Caption Generator, 2015
- Multi-task Sequence to Sequence Learning, 2016
- Sequence to Sequence Learning with Neural Networks, 2014
- Recursive and direct multi-step forecasting: the best of both worlds, 2012

In this tutorial, you discovered the different types of sequence prediction problems.

Specifically, you learned:

- The 4 types of sequence prediction problems.
- Definitions for each type of sequence prediction problem by the experts.
- Real-world examples of each type of sequence prediction problem.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Making Predictions with Sequences appeared first on MachineLearningMastery.com.

]]>