Teacher forcing is a method for quickly and efficiently training recurrent neural network models that use the output from a prior time step as input.

It is a network training method critical to the development of deep learning language models used in machine translation, text summarization, and image captioning, among many other applications.

In this post, you will discover teacher forcing as a method for training recurrent neural networks.

After reading this post, you will know:

- The problem with training recurrent neural networks that use output from prior time steps as input.
- The teacher forcing method for addressing slow convergence and instability when training these types of recurrent networks.
- Extensions to teacher forcing that allow trained models to better handle open loop applications of this type of network.

Let’s get started.

## Using Output as Input in Sequence Prediction

There are sequence prediction models that use the output from the last time step y(t-1) as input for the model at the current time step X(t).

This type of model is common in language models that output one word at a time and use the output word as input for generating the next word in the sequence.

For example, this type of language model is used in an Encoder-Decoder recurrent neural network architecture for sequence-to-sequence generation problems such as:

- Machine Translation
- Caption Generation
- Text Summarization

After the model is trained, a “start-of-sequence” token can be used to start the process and the generated word in the output sequence is used as input on the subsequent time step, perhaps along with other input like an image or a source text.
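The generation loop described above can be sketched in a few lines. This is only an illustration: `NEXT_WORD` and `predict_next` are hypothetical stand-ins for a trained model (a simple lookup table here), and `generate` is a name chosen for this sketch.

```python
# Hypothetical stand-in for a trained model: maps the previous word to the next word.
NEXT_WORD = {
    "[START]": "Mary",
    "Mary": "had",
    "had": "a",
    "a": "little",
    "little": "lamb",
    "lamb": "[END]",
}

def predict_next(word):
    """Return the 'model' prediction for the word that follows `word`."""
    return NEXT_WORD.get(word, "[END]")

def generate(max_len=20):
    """Generate a sequence by feeding each output back in as the next input."""
    word = "[START]"          # seed the process with the start-of-sequence token
    output = []
    for _ in range(max_len):  # guard against sequences that never end
        word = predict_next(word)
        if word == "[END]":
            break
        output.append(word)
    return output

print(" ".join(generate()))
```

A real model would also condition on the full history and on any other input, such as an image or source text; the single-word lookup is just the smallest thing that shows the output-as-input recursion.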

This same recursive output-as-input process can be used when training the model, but it can result in problems such as:

- Slow convergence.
- Model instability.
- Poor skill.

Teacher forcing is an approach to improve model skill and stability when training these types of models.

## What is Teacher Forcing?

Teacher forcing is a strategy for training recurrent neural networks that use model output from a prior time step as an input.

Models that have recurrent connections from their outputs leading back into the model may be trained with teacher forcing.

— Page 372, Deep Learning, 2016.

The approach was originally described and developed as an alternative technique to backpropagation through time for training a recurrent neural network.

An interesting technique that is frequently used in dynamical supervised learning tasks is to replace the actual output y(t) of a unit by the teacher signal d(t) in subsequent computation of the behavior of the network, whenever such a value exists. We call this technique teacher forcing.

— A Learning Algorithm for Continually Running Fully Recurrent Neural Networks, 1989.

Teacher forcing works by using the actual or expected output from the training dataset at the current time step y(t) as input in the next time step X(t+1), rather than the output generated by the network.

Teacher forcing is a procedure […] in which during training the model receives the ground truth output y(t) as input at time t + 1.

— Page 372, Deep Learning, 2016.
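In practice, this often amounts to shifting the target sequence by one time step to build the model inputs. The sketch below shows one way to do that; the function name `teacher_forcing_io` and the token strings are assumptions made for illustration.

```python
def teacher_forcing_io(target_tokens, start="[START]", end="[END]"):
    """Build teacher-forced inputs and targets from a ground truth sequence.

    The input at time step t+1 is the ground truth output y(t), so the
    input sequence is the target sequence shifted right by one step.
    """
    decoder_input = [start] + list(target_tokens)
    decoder_target = list(target_tokens) + [end]
    return decoder_input, decoder_target

inputs, targets = teacher_forcing_io(["Mary", "had", "a"])
for x, y in zip(inputs, targets):
    print(f"input={x!r:12} expected output={y!r}")
```

Because the inputs come entirely from the training data, they can be prepared before training starts and do not depend on what the model predicts.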

## Worked Example

Let’s make teacher forcing concrete with a short worked example.

Given the following input sequence:

    Mary had a little lamb whose fleece was white as snow

Imagine we want to train a model to generate the next word in the sequence given the previous sequence of words.

First, we must add a token to signal the start of the sequence and another to signal the end of the sequence. We will use “*[START]*” and “*[END]*” respectively.

    [START] Mary had a little lamb whose fleece was white as snow [END]

Next, we feed the model “*[START]*” and let the model generate the next word.

Imagine the model generates the word “*a*“, but of course, we expected “*Mary*“.

    X, yhat
    [START], a

Naively, we could feed in “*a*” as part of the input to generate the subsequent word in the sequence.

    X, yhat
    [START], a, ?

You can see that the model is off track and is going to get punished for every subsequent word it generates. This makes learning slower and the model unstable.

Instead, we can use teacher forcing.

In the first example when the model generated “*a*” as output, we can discard this output after calculating error and feed in “*Mary*” as part of the input on the subsequent time step.

    X, yhat
    [START], Mary, ?

We can then repeat this process for each input-output pair of words.

    X, yhat
    [START], ?
    [START], Mary, ?
    [START], Mary, had, ?
    [START], Mary, had, a, ?
    ...

The model will learn the correct sequence, or correct statistical properties for the sequence, quickly.
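The input-output pairs from this worked example can be generated directly from the training sentence. The following is a minimal sketch; `prefix_pairs` is a hypothetical helper name, and splitting on whitespace is a simplifying assumption in place of a real tokenizer.

```python
def prefix_pairs(sentence, start="[START]", end="[END]"):
    """Return (input prefix, expected next word) pairs for teacher forcing.

    Every prefix is built from the ground truth words, never from
    whatever the model happened to predict at earlier time steps.
    """
    tokens = [start] + sentence.split() + [end]
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = prefix_pairs("Mary had a little lamb whose fleece was white as snow")
for x, y in pairs[:4]:
    print(", ".join(x), "->", y)
```

Even if the model outputs "*a*" for the first pair, the second pair still begins with "*[START]*, *Mary*", which is what keeps training on track.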

## Extensions to Teacher Forcing

Teacher forcing is a fast and effective way to train a recurrent neural network that uses output from prior time steps as input to the model.

However, the approach can also produce models that are fragile or limited in practice, when the generated sequences differ from those seen by the model during training.

This is common in most applications of this type of model as the outputs are probabilistic in nature. This type of application of the model is often called open loop.

Unfortunately, this procedure can result in problems in generation as small prediction error compound in the conditioning context. This can lead to poor prediction performance as the RNN’s conditioning context (the sequence of previously generated samples) diverge from sequences seen during training.

– Professor Forcing: A New Algorithm for Training Recurrent Networks, 2016.

There are a number of approaches to address this limitation, for example:

### Search Candidate Output Sequences

One approach commonly used for models that predict a discrete value output, such as a word, is to perform a search across the predicted probabilities for each word to generate a number of likely candidate output sequences.

This approach is used on problems like machine translation to refine the translated output sequence.

A common search procedure for this post-hoc operation is the beam search.

This discrepancy can be mitigated by the use of a beam search heuristic maintaining several generated target sequences

— Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, 2015.
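To make the idea concrete, here is a minimal beam search sketch. It is not taken from the cited paper: `TOY_MODEL` and `next_probs` are hypothetical stand-ins for a trained model that returns next-token probabilities given the sequence so far, and the probabilities are invented for illustration.

```python
from math import log

# Hypothetical toy "model": next-token probabilities conditioned on the prefix.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}

def next_probs(prefix):
    """Return a dict of token -> probability for the token after `prefix`."""
    return TOY_MODEL[tuple(prefix)]

def beam_search(n_steps, beam_width):
    """Keep the `beam_width` best partial sequences at every time step."""
    beams = [([], 0.0)]  # (sequence, summed log probability)
    for _ in range(n_steps):
        candidates = []
        for seq, score in beams:
            for token, p in next_probs(seq).items():
                candidates.append((seq + [token], score + log(p)))
        # Sort by score, best first, and prune to the beam width.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

print(beam_search(n_steps=2, beam_width=1)[0][0])  # width 1 is a greedy search
print(beam_search(n_steps=2, beam_width=2)[0][0])
```

In this toy setup, a beam width of 1 commits to the most likely first token and cannot recover, whereas a width of 2 keeps the second-best first token alive long enough to find a more probable overall sequence.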

### Curriculum Learning

The beam search approach is only suitable for prediction problems with discrete output values and cannot be used for real-valued outputs.

A variation of teacher forcing is to introduce outputs generated from prior time steps during training to encourage the model to learn how to correct its own mistakes.

We propose to change the training process in order to gradually force the model to deal with its own mistakes, as it would have to during inference.

— Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, 2015.

The approach is called curriculum learning and involves randomly choosing to use the ground truth output or the generated output from the previous time step as input for the current time step.

The curriculum changes over time in what is called scheduled sampling, where the procedure starts with teacher forcing and slowly decreases the probability of a forced input over the training epochs.
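A rough sketch of this idea is shown below. The linear decay is just one simple schedule chosen for illustration, and the function names `teacher_forcing_prob` and `choose_inputs` are hypothetical rather than part of any library.

```python
import random

def teacher_forcing_prob(epoch, n_epochs):
    """Probability of feeding the ground truth, decayed linearly from 1.0 to 0.0."""
    return max(0.0, 1.0 - epoch / max(1, n_epochs - 1))

def choose_inputs(ground_truth, generated, p_forced, rng):
    """For each time step, pick the ground truth with probability `p_forced`,
    otherwise pick the model's own output from the previous time step."""
    return [
        truth if rng.random() < p_forced else guess
        for truth, guess in zip(ground_truth, generated)
    ]

rng = random.Random(1)
ground_truth = ["Mary", "had", "a", "little"]
generated = ["a", "has", "a", "big"]  # made-up model outputs for illustration
for epoch in range(5):
    p = teacher_forcing_prob(epoch, n_epochs=5)
    print(epoch, round(p, 2), choose_inputs(ground_truth, generated, p, rng))
```

Early epochs behave like pure teacher forcing, while later epochs expose the model to more of its own predictions, as it would see during inference.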

There are also other extensions and variations of teacher forcing and I encourage you to explore them if you are interested.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Papers

- A Learning Algorithm for Continually Running Fully Recurrent Neural Networks, 1989.
- Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, 2015.
- Professor Forcing: A New Algorithm for Training Recurrent Networks, 2016.

### Books

- Section 10.2.1, Teacher Forcing and Networks with Output Recurrence, Deep Learning, 2016.

## Summary

In this post, you discovered teacher forcing as a method for training recurrent neural networks that use output from a previous time step as input.

Specifically, you learned:

- The problem with training recurrent neural networks that use output from prior time steps as input.
- The teacher forcing method for addressing slow convergence and instability when training these types of recurrent networks.
- Extensions to teacher forcing that allow trained models to better handle open loop applications of this type of network.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Hi Jason,

Thanks for such informative posts. Does the current version of Keras support ‘teacher forcing’? I know RecurrentShop can be used to do that, but how do I do it in Keras?

Yes, I give examples in a photo captioning example.

Hi Jason,

Thank you for your post!

Currently, I am learning about CNN-LSTMs and LSTM encoder-decoders from chapters 8 and 9 of your book “Long Short-Term Memory Networks with Python”.

I have a task that maps a sequence of 2D inputs to a sequence of classifications. It requires a CNN-LSTM encoder-decoder network, so I combined the two examples as below:

    from keras import regularizers
    from keras.models import Sequential
    from keras.layers import (BatchNormalization, Conv2D, Dense, Flatten, LSTM,
                              MaxPooling2D, RepeatVector, TimeDistributed)

    # img_height, img_width, channels, pool_size, n_out_seq_length and
    # nb_classes are assumed to be defined elsewhere in the script.
    def cnn_lstm(lmda1, lmda2):
        model = Sequential()
        # CNN module, applied to every time step of the input sequence
        model.add(TimeDistributed(Conv2D(filters=8,
                                         kernel_size=(2, 2),
                                         padding='same',
                                         activation='relu',
                                         kernel_regularizer=regularizers.l1_l2(lmda1, lmda2),
                                         name='Conv_1'),
                                  input_shape=(None, img_height, img_width, channels)))
        model.add(TimeDistributed(BatchNormalization(axis=1, name='BN_1')))
        model.add(TimeDistributed(MaxPooling2D(pool_size=pool_size)))
        model.add(TimeDistributed(Conv2D(filters=16,
                                         kernel_size=(2, 2),
                                         padding='same',
                                         activation='relu',
                                         kernel_regularizer=regularizers.l1_l2(lmda1, lmda2),
                                         name='Conv_2')))
        model.add(TimeDistributed(BatchNormalization(name='BN_2')))
        model.add(TimeDistributed(MaxPooling2D(pool_size=pool_size)))
        # Flatten all features from the CNN before passing them to the encoder-decoder LSTM
        model.add(TimeDistributed(Flatten()))
        # LSTM module
        # encoder: compress the input sequence into a fixed-length vector
        model.add(LSTM(50, name='encoder'))
        model.add(RepeatVector(n_out_seq_length))
        # decoder: produce one class distribution per output time step
        model.add(LSTM(50, return_sequences=True, name='decoder'))
        model.add(TimeDistributed(Dense(nb_classes, activation='softmax')))
        model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
        return model

Do you think it is the correct way to do it? Thank you very much!

Perhaps try a few different approaches and see which results in best skill on your dataset.