Teacher forcing is a method for quickly and efficiently training recurrent neural network models that use the ground truth from a prior time step as input.

It is a network training method critical to the development of deep learning language models used in machine translation, text summarization, and image captioning, among many other applications.

In this post, you will discover teacher forcing as a method for training recurrent neural networks.

After reading this post, you will know:

- The problem with training recurrent neural networks that use output from prior time steps as input.
- The teacher forcing method for addressing slow convergence and instability when training these types of recurrent networks.
- Extensions to teacher forcing that allow trained models to better handle open loop applications of this type of network.


Let’s get started.

## Using Output as Input in Sequence Prediction

There are sequence prediction models that use the output from the last time step y(t-1) as input for the model at the current time step X(t).

This type of model is common in language models that output one word at a time and use the output word as input for generating the next word in the sequence.

For example, this type of language model is used in an Encoder-Decoder recurrent neural network architecture for sequence-to-sequence generation problems such as:

- Machine Translation
- Caption Generation
- Text Summarization

After the model is trained, a “start-of-sequence” token can be used to start the process and the generated word in the output sequence is used as input on the subsequent time step, perhaps along with other input like an image or a source text.
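As a minimal sketch of this generation loop, assume a hypothetical `next_word` function standing in for a trained model. Here it is only a lookup table, not a real network, and the names are illustrative:

```python
# Toy stand-in for a trained model: maps the previous word to the next word.
# A real model would return a probability distribution over the vocabulary.
TOY_MODEL = {
    "[START]": "Mary",
    "Mary": "had",
    "had": "a",
    "a": "little",
    "little": "lamb",
    "lamb": "[END]",
}

def next_word(previous_word):
    """Hypothetical model call: predict the next word from the previous one."""
    return TOY_MODEL.get(previous_word, "[END]")

def generate(max_length=10):
    """Generate a sequence by feeding each output back in as the next input."""
    word = "[START]"
    generated = []
    for _ in range(max_length):
        word = next_word(word)  # the output at this step...
        if word == "[END]":
            break
        generated.append(word)  # ...becomes the input at the next step
    return generated

print(generate())  # ['Mary', 'had', 'a', 'little', 'lamb']
```

The loop stops at the end-of-sequence token or at `max_length`, whichever comes first.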

This same recursive output-as-input process can be used when training the model, but it can result in problems such as:

- Slow convergence.
- Model instability.
- Poor skill.

Teacher forcing is an approach to improve model skill and stability when training these types of models.

## What is Teacher Forcing?

Teacher forcing is a strategy for training recurrent neural networks that uses **ground truth as input**, instead of model output from a prior time step as an input.

Models that have recurrent connections from their outputs leading back into the model may be trained with teacher forcing.

— Page 372, Deep Learning, 2016.

The approach was originally described and developed as an alternative technique to backpropagation through time for training a recurrent neural network.

An interesting technique that is frequently used in dynamical supervised learning tasks is to replace the actual output y(t) of a unit by the teacher signal d(t) in subsequent computation of the behavior of the network, whenever such a value exists. We call this technique teacher forcing.

— A Learning Algorithm for Continually Running Fully Recurrent Neural Networks, 1989.

Teacher forcing works by using the actual or expected output from the training dataset at the current time step y(t) as input in the next time step X(t+1), rather than the output generated by the network.

Teacher forcing is a procedure […] in which during training the model receives the ground truth output y(t) as input at time t + 1.

— Page 372, Deep Learning, 2016.
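In practice this often amounts to shifting the target sequence by one step to build the model input. The sketch below assumes integer-encoded tokens and a start-of-sequence id of 0; the helper name `teacher_forcing_pairs` is made up for illustration.

```python
START = 0  # assumed id for the start-of-sequence token

def teacher_forcing_pairs(target):
    """Return (decoder_input, decoder_target) for one target sequence.

    decoder_input[t] is the ground truth from the previous step, so the
    model never sees its own predictions during training.
    """
    decoder_input = [START] + target[:-1]
    decoder_target = list(target)
    return decoder_input, decoder_target

X, y = teacher_forcing_pairs([5, 8, 3, 9])
print(X)  # [0, 5, 8, 3]
print(y)  # [5, 8, 3, 9]
```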

## Worked Example

Let’s make teacher forcing concrete with a short worked example.

Given the following input sequence:

    Mary had a little lamb whose fleece was white as snow

Imagine we want to train a model to generate the next word in the sequence given the previous sequence of words.

First, we must add a token to signal the start of the sequence and another to signal the end of the sequence. We will use “*[START]*” and “*[END]*” respectively.

    [START] Mary had a little lamb whose fleece was white as snow [END]

Next, we feed the model “*[START]*” and let the model generate the next word.

Imagine the model generates the word “*a*”, but of course, we expected “*Mary*”.

    X, yhat
    [START], a

Naively, we could feed in “*a*” as part of the input to generate the subsequent word in the sequence.

    X, yhat
    [START], a, ?

You can see that the model is off track and is going to get punished for every subsequent word it generates. This makes learning slower and the model unstable.

Instead, we can use teacher forcing.

In the first example when the model generated “*a*” as output, we can discard this output after calculating error and feed in “*Mary*” as part of the input on the subsequent time step.

    X, yhat
    [START], Mary, ?

We can then repeat this process for each input-output pair of words.

    X, yhat
    [START], ?
    [START], Mary, ?
    [START], Mary, had, ?
    [START], Mary, had, a, ?
    ...

The model will learn the correct sequence, or correct statistical properties for the sequence, quickly.
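The input-output pairs above can be built directly from the ground truth sentence. This is a small sketch using a hypothetical helper, `training_pairs`, and a shortened version of the example sentence:

```python
def training_pairs(sentence):
    """Build (input words, expected next word) pairs using the ground truth."""
    words = ["[START]"] + sentence.split() + ["[END]"]
    return [(words[:i], words[i]) for i in range(1, len(words))]

pairs = training_pairs("Mary had a little lamb")
for X, y in pairs[:3]:
    print(X, "->", y)
# ['[START]'] -> Mary
# ['[START]', 'Mary'] -> had
# ['[START]', 'Mary', 'had'] -> a
```

Every input is made of ground truth words only, so a wrong prediction at one step never leaks into the input for the next.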

## Extensions to Teacher Forcing

Teacher forcing is a fast and effective way to train a recurrent neural network that uses output from prior time steps as input to the model.

But the approach can also produce models that are fragile or limited in practice, when the generated sequences vary from what the model saw during training.

This is common in most applications of this type of model as the outputs are probabilistic in nature. This type of application of the model is often called open loop.

Unfortunately, this procedure can result in problems in generation as small prediction error compound in the conditioning context. This can lead to poor prediction performance as the RNN’s conditioning context (the sequence of previously generated samples) diverge from sequences seen during training.

– Professor Forcing: A New Algorithm for Training Recurrent Networks, 2016.

There are a number of approaches to address this limitation, for example:

### Search Candidate Output Sequences

One approach commonly used for models that predict a discrete value output, such as a word, is to perform a search across the predicted probabilities for each word to generate a number of likely candidate output sequences.

This approach is used on problems like machine translation to refine the translated output sequence.

A common search procedure for this post-hoc operation is the beam search.

This discrepancy can be mitigated by the use of a beam search heuristic maintaining several generated target sequences

— Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, 2015.
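To make the search concrete, here is a toy beam search sketch. It assumes the per-step word probabilities are fixed in advance, which is a simplification: in a real model the probabilities at each step depend on the words chosen so far. The probabilities below are made up.

```python
from math import log

def beam_search(step_probs, k):
    """Keep the k most likely sequences given per-step word probabilities."""
    beams = [([], 0.0)]  # (sequence of word indices, log-probability)
    for probs in step_probs:
        candidates = []
        for seq, score in beams:
            for index, p in enumerate(probs):
                candidates.append((seq + [index], score + log(p)))
        # keep only the k highest-scoring candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    return beams

# made-up probabilities over a 3-word vocabulary for 3 time steps
step_probs = [
    [0.1, 0.6, 0.3],
    [0.5, 0.2, 0.3],
    [0.2, 0.1, 0.7],
]
for seq, score in beam_search(step_probs, k=2):
    print(seq, round(score, 3))
```

Setting `k=1` reduces this to a greedy search that keeps only the single most likely word at each step.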

### Curriculum Learning

The beam search approach is only suitable for prediction problems with discrete output values and cannot be used for real-valued outputs.

A variation of teacher forcing is to introduce outputs generated from prior time steps during training to encourage the model to learn how to correct its own mistakes.

We propose to change the training process in order to gradually force the model to deal with its own mistakes, as it would have to during inference.

— Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, 2015.

The approach is a form of curriculum learning and involves randomly choosing to use the ground truth output or the generated output from the previous time step as input for the current time step.

The curriculum changes over time in what is called scheduled sampling, where the procedure starts with teacher forcing and slowly decreases the probability of a forced input over the training epochs.
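A rough sketch of the idea follows. It assumes a linear decay of the forcing probability per epoch, chosen here only for simplicity, and the function names are illustrative rather than part of any library.

```python
import random

def teacher_forcing_probability(epoch, n_epochs):
    """Linear decay from 1.0 (always forced) to 0.0 (never forced)."""
    return max(0.0, 1.0 - epoch / max(1, n_epochs - 1))

def choose_input(ground_truth, prediction, p_forced, rng=random):
    """Pick the ground truth with probability p_forced, else the prediction."""
    return ground_truth if rng.random() < p_forced else prediction

n_epochs = 5
for epoch in range(n_epochs):
    p = teacher_forcing_probability(epoch, n_epochs)
    # the choice is made independently at every time step
    next_input = choose_input("Mary", "a", p)
    print(epoch, p, next_input)
```

In a real training loop, `choose_input` would be called at each time step with the true previous token and the model's own prediction.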

There are also other extensions and variations of teacher forcing and I encourage you to explore them if you are interested.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Papers

- A Learning Algorithm for Continually Running Fully Recurrent Neural Networks, 1989.
- Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, 2015.
- Professor Forcing: A New Algorithm for Training Recurrent Networks, 2016.

### Books

- Section 10.2.1, Teacher Forcing and Networks with Output Recurrence, Deep Learning, 2016.

## Summary

In this post, you discovered teacher forcing as a method for training recurrent neural networks that use output from a previous time step as input.

Specifically, you learned:

- The problem with training recurrent neural networks that use output from prior time steps as input.
- The teacher forcing method for addressing slow convergence and instability when training these types of recurrent networks.
- Extensions to teacher forcing that allow trained models to better handle open loop applications of this type of network.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Hi Jason,

Thanks for such informative posts. Does the current version of Keras support ‘teacher forcing’? I know recurrent shop can be used to do that, but how do I use it in Keras?

Yes, I give examples in a photo captioning example.

Hi Jason,

Thank you for your post!

Currently, I am learning CNN-LSTM, LSTM encoder-decoder according to the chapter 8 and 9 in your book “Long Short-Term Memory Networks with Python”.

I have a task that maps a sequence of 2D inputs to a sequence of classifications. It requires a CNN-LSTM encoder-decoder network, so I combined the two examples as below:

    from keras import regularizers
    from keras.layers import (LSTM, BatchNormalization, Conv2D, Dense, Flatten,
                              MaxPooling2D, RepeatVector, TimeDistributed)
    from keras.models import Sequential

    # img_height, img_width, channels, pool_size, n_out_seq_length and
    # nb_classes are assumed to be defined elsewhere
    def cnn_lstm(lmda1, lmda2):
        model = Sequential()
        # CNN module
        model.add(TimeDistributed(Conv2D(filters=8,
                                         kernel_size=(2, 2),
                                         padding='same',
                                         activation='relu',
                                         kernel_regularizer=regularizers.l1_l2(lmda1, lmda2),
                                         name='Conv_1'),
                                  input_shape=(None, img_height, img_width, channels)))
        model.add(TimeDistributed(BatchNormalization(axis=1, name='BN_1')))
        model.add(TimeDistributed(MaxPooling2D(pool_size=pool_size)))
        model.add(TimeDistributed(Conv2D(filters=16,
                                         kernel_size=(2, 2),
                                         padding='same',
                                         activation='relu',
                                         kernel_regularizer=regularizers.l1_l2(lmda1, lmda2),
                                         name='Conv_2')))
        model.add(TimeDistributed(BatchNormalization(name='BN_2')))
        model.add(TimeDistributed(MaxPooling2D(pool_size=pool_size)))
        # Flatten all features from the CNN before feeding them into the encoder-decoder LSTM
        model.add(TimeDistributed(Flatten()))
        # LSTM module
        # encoder
        model.add(LSTM(50, name='encoder'))
        model.add(RepeatVector(n_out_seq_length))
        # decoder
        model.add(LSTM(50, return_sequences=True, name='decoder'))
        model.add(TimeDistributed(Dense(nb_classes, activation='softmax')))
        model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
        return model

Do you think it is the correct way to do it? Thank you very much!

Perhaps try a few different approaches and see which results in best skill on your dataset.

Hi Jason,

Thanks a lot for sharing!

I want to ask whether the teacher forcing method performs badly on multistep forecasting problems, because at prediction time the model cannot access the ground truth for the previous time step?

Teacher forcing is only used during training.

Oh I get it.

So how can I deal with it if we want to use it in practice? Does teacher forcing not have practical meaning?

Sorry, I don’t follow. What is the problem you are having exactly?

Teacher forcing is used to help to keep the model on track during training.

I use teacher forcing to train a seq2seq time series problem and get a low loss on training and validation dataset, but get a poor result on test dataset. Is it normal to have a bad result on test dataset?

Ideally, you want good skill on train and test sets.

Poor skill on a test with good skill on the training set suggests overfitting.

Is https://arxiv.org/pdf/1409.3215.pdf the paper from Sutskever describing teacher forcing? I think not, or am I wrong? Does your implementation at https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/ also use teacher forcing?

Yes, I almost always use teacher forcing.

Hey Jason, thank you for the post!

I have several questions:

1. for Curriculum Learning, do we decide to teacher-force once per entire episode, or at every timestep?

2. Assuming I were to use 100% pure teacher forcing while training my LSTM, how should I deal with the gradient that is supposed to flow from ‘Cell_{t+1}’ to ‘Cell_t’? In other words, what is the gradient that arrives at Cell_t?

As I understood, teacher forcing made us plug-in the Cell values during fwd prop. During Backprop, do we use original value of Cell_t (pretending there never was a swap), and how is that possible to be combined with the gradient from Cell_{t+1}? Especially during Curriculum learning where at Cell_{t+1} or Cell_{t+2} we “played fair” and never swapped anything. (if Question 1 was true)

3. “Teacher forcing is a fast and effective way to train a recurrent neural network that uses output from prior time steps as input to the model. But, the approach can also result in models that may be fragile or limited when used in practice when the generated sequences vary from what was seen by the model during training.”

When you used Curriculum Learning in the past, did you still get slightly fragile networks, or they were just as strong as your non-forced networks? In particular LSTM.

If I use Curriculum Learning, will I be safe while enjoying the speed-ups during training?

4. From experience, how much faster does the training go?

Thanks! 🙂

Thanks for the post.

I have trained my seq2seq model with teacher-forcing. My question is do I also have to compute the validation loss and ppl with the teacher-forcing?

I compute the validation loss without teacher-forcing and it remains almost the same (it decreases a bit until a certain point and stops and I am sure the point it stops is not an overfitting point) and it is generally much larger than my training loss (it is almost in the range of the training loss of the first epoch).

This is an implementation detail that really depends on your code and how you’ve prepared your data.

What problem are you having exactly?

I am looking for the right way to compute the validation loss. Is it necessary to compute the validation loss in exactly the same manner as in training?

Sure.

Hi Jason, is “teacher forcing” the same thing as the concept you showed in your article “How to Develop Word-Based Neural Language Models in Python with Keras” in the section “Model 2: Line-by-Line Sequence”?

https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/

There, you trained an LSTM with training data that looked like:

1. X=(_, _, _, _, _, Jack), y=and

2. X=(_, _, _, _, Jack, and), y=Jill

3. X=(_, _, _, Jack, and, Jill), y=went

4. X=(_, _, Jack, and, Jill, went), y=up

5. X=(_, Jack, and, Jill, went, up), y=the

6. X=(Jack, and, Jill, went, up, the), y=hill

7. X=(and, Jill, went, up, the, hill), y=to

Thanks for any clarification.

Great question.

I use teacher forcing by default, it is just so effective. What is harder is using it sometimes and not others, and allowing the model to correct an off-track input sequence.

Can you confirm that the approach in “Model 2: Line-by-Line Sequence” is the same as teacher forcing? I’m just trying to wrap my head around the terminology.

‘Teacher forcing is a strategy for training recurrent neural networks that uses model output from a prior time step as an input.’

At first reading, I took your phrase ‘output from a prior time step’ to mean the model output (not the ground truth) from time t-1.

Actually, it’s the ground truth at time t-1 which is sent to time t as input during training. Right?

Yes, ground truth.

Great post, thank you!

How is the Scheduled Sampling/Curriculum learning actually implemented? I assume we need to write a custom Keras backend function?

Do you know of any pseudo-code to help implement this?

Thanks!

No, you provide real inputs during training rather than predicted inputs.

I have countless examples, perhaps start here:

https://machinelearningmastery.com/start-here/#lstm

and here:

https://machinelearningmastery.com/start-here/#deep_learning_time_series

Thanks! Maybe I should clarify:

Doesn’t Scheduled Sampling, during training, gradually change from using “real inputs” to “predicted inputs”? This is my understanding from the paper.

Yes, that is the ideal implementation.

To implement this in Keras requires customized code, I don’t have an example of the transition.

Great! Thank you for your reply. I’ll try to code one up!

Hi Jason!

It might be good to mention this:

The first trick is using teacher forcing. This means that at some probability, set by teacher_forcing_ratio, we use the current target word as the decoder’s next input rather than using the decoder’s current guess. This technique acts as training wheels for the decoder, aiding in more efficient training. However, teacher forcing can lead to model instability during inference, as the decoder may not have a sufficient chance to truly craft its own output sequences during training. Thus, we must be mindful of how we are setting the teacher_forcing_ratio, and not be fooled by fast convergence.

Thanks.

Hi Jason, thanks for the article!

I am curious about a point: in section “Using Output as Input in Sequence Prediction”, you mention

”

This same recursive output-as-input process can be used when training the model, but it can result in problems such as:

Slow convergence.

Model instability.

Poor skill.

”

Can you provide a reference to this point? I could not really put my hand on a paper where they report trying this training scheme or studied its effect.

That will be much appreciated :))

Not off hand, perhaps check some of the papers on teacher forcing and extensions?

I will. Thanks Jason 🙂

Hi Jason,

how would I go about implementing teacher forcing in an autoencoder?

Since the input of the decoder are embeddings produced by the autoencoder, how do I supply the target token ?

Thank you for the helpful resources.

It would be just like teacher forcing for any LSTM model. The autoencoder model does not make it different.

Is there an efficient way to do training in the style of teacher forcing, but using yhat as input in lieu of the ground truth y?

Yes, but that is not teacher forcing.

You can do it sample by sample manually.

Hi Jason,

Teacher forcing makes me think of NARX neural networks, a subject which I am interested in. I have yet to find an example of one with Keras. My question is whether you could use teacher forcing with multiple delays of outputs to create a NARX model, which normally would use the model outputs and not ground truth.

Not sure off hand, perhaps explore whether it is viable with some prototypes.

Hi Jason,

Have you ever tried using teacher forcing for the first half of the epochs and not for the second half? When I did that, I noticed that once the first half of the epochs ended, my val_loss suddenly dropped sharply and the train_loss increased, and I didn’t understand why that happened. As far as I understood, when using teacher forcing, both train_loss and val_loss should decrease.

Nice experiment.

I would expect that once teacher forcing is removed that model performance would get worse.

It is a good idea to cycle teacher forcing on and off so the model can slowly learn how to correct its own mistakes.

helpful post, thanks.

but I have a question

How to test a model trained using teacher forcing

I want to train a seq2seq model. The X and y to the model are [X_encoder, X_decoder] and y, i.e. a list of encoder and decoder inputs and labels. (Note that the decoder input, X_decoder, is ‘y’ with one position ahead of the actual y. Basically, teacher forcing.)

So my question is now after training, when it comes to actual prediction where I do not have any labels, how do I provide ‘X_decoder’ to my input? Or do I train on something else?

You’re welcome.

During inference/prediction, the teacher forcing is replaced with the output from the model in the last time step.

Great post, very helpful, thank you!

I have a question and would be very happy if you can enlighten me. I am working on medical image captioning, and I am using custom training and validation functions instead of using model.fit or train_on_batch. (I am also providing the functions below.)

If I use the teacher forcing in the validation step, there is no overfitting or anything unexpected. However, if I use the actual output in the validation, the model is overfitting.

My question is, should we use the teacher forcing in the validation set? Because I believe, the validation step should be the same as the test step (to see whether the model is learning properly or not.)

I could not find the proper explanation anywhere, I would appreciate it if you can answer my question.

Thank you in advance.

    import tensorflow as tf

    # encoder, decoder, tokenizer, optimizer, loss_func, acc_func and units
    # are assumed to be defined elsewhere

    @tf.function
    def train_step(tensor, target, mesh):
        """Subclassed model training step"""
        loss = 0
        accuracy = 0
        # initializing the hidden state for each batch
        hidden = tf.zeros((target.shape[0], units))
        # the start-token key appears as an empty string in the original comment
        dec_input = tf.expand_dims([tokenizer.word_index['']] * target.shape[0], 1)
        with tf.GradientTape() as tape:
            features = encoder(tensor, mesh)
            for i in range(1, target.shape[1]):
                # passing the features through the decoder
                predictions, hidden = decoder([dec_input, features, hidden])
                loss += loss_func(target[:, i], predictions)
                accuracy += acc_func(target[:, i], predictions)
                # using teacher forcing
                dec_input = tf.expand_dims(target[:, i], 1)
        total_loss = (loss / int(target.shape[1]))
        total_acc = (accuracy / int(target.shape[1]))
        trainable_variables = encoder.trainable_variables + decoder.trainable_variables
        gradients = tape.gradient(loss, trainable_variables)
        optimizer.apply_gradients(zip(gradients, trainable_variables))
        return loss, total_loss, total_acc

    #@tf.function
    def val_step(tensor, target, mesh):
        """Subclassed model validation step"""
        loss = 0
        accuracy = 0
        # initializing the hidden state for each batch
        hidden = tf.zeros((target.shape[0], units))
        dec_input = tf.expand_dims([tokenizer.word_index['']] * target.shape[0], 1)
        mesh = tf.fill([target.shape[0], 64], 490)
        features = encoder(tensor, mesh)
        for i in range(1, target.shape[1]):
            # passing the features through the decoder
            predictions, hidden = decoder([tf.cast(dec_input, tf.float32), features, hidden])
            loss += loss_func(target[:, i], predictions)
            accuracy += acc_func(target[:, i], predictions)
            #dec_input = tf.expand_dims(target[:, i], 1)
            #predicted_id = tf.argmax(predictions[0])
            # no teacher forcing: the argmax of the first sample's prediction
            # is fed to every sample in the batch as the next input
            predicted_id = tf.argmax(predictions[0]).numpy()
            dec_input = tf.expand_dims([predicted_id]*target.shape[0], 1)
        total_loss = (loss / int(target.shape[1]))
        total_acc = (accuracy / int(target.shape[1]))
        return loss, total_loss, total_acc

Good question. The choice is yours.

If you want the validation score to be representative of how the model will be used in practice, then don’t use teacher forcing. If you want it to be directly comparable with the score on the training set, use teacher forcing.
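To illustrate why the two scores can differ, here is a toy sketch. It assumes a lookup table standing in for a trained model with one deliberate mistake; `TOY_MODEL` and `count_correct` are made up for illustration and are not taken from the code above.

```python
# Toy stand-in for a trained model with one deliberate mistake ("had" -> "the").
TOY_MODEL = {"[START]": "Mary", "Mary": "had", "had": "the",
             "a": "little", "little": "lamb", "the": "sheep"}

def count_correct(target, teacher_forcing):
    """Count correct next-word predictions over one target sequence."""
    previous = "[START]"
    correct = 0
    for expected in target:
        predicted = TOY_MODEL.get(previous, "[END]")
        correct += predicted == expected
        # teacher forcing feeds the ground truth; otherwise feed the prediction
        previous = expected if teacher_forcing else predicted
    return correct

target = ["Mary", "had", "a", "little", "lamb"]
print(count_correct(target, teacher_forcing=True))   # 4
print(count_correct(target, teacher_forcing=False))  # 2
```

With teacher forcing the single mistake costs one word; without it, the mistake is fed back in and the rest of the sequence goes off track, so the score looks much worse even though the model is the same.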

Hi, I believe the sentence underneath ‘What is Teacher Forcing?’ is wrong:

“Teacher forcing is a strategy for training recurrent neural networks that uses model output from a prior time step as an input.”

This since teacher forcing actually uses the ground truth as input and not the model output from the previous time-step, as you mention later.

Thanks.

Dear Jason Brownlee

Do you have sample LSTM code not using Teacher forcing?

I may, I’m not sure sorry.

Hi, I’m a bit confused about how teacher forcing could be applied to a language translation task. While the concept holds for text generation, where input and output words belong to the same vocabulary, in language translation, say English->Italian, how can I feed a ground truth (Italian) word into an input which expects an English word? Don’t inputs and outputs belong to different vocabularies?

The concept is still valid. The question is what you should feed in. Instead of using the output of the network during training, teacher forcing means you hold on to your ground truth and ignore the model output. This article is about text generation. For translation, I think it doesn’t matter because you will not feed the output back into the network anyway.

Mr. Brownlee, I would like a print copy of the book.