# What is Teacher Forcing for Recurrent Neural Networks?

Last Updated on April 8, 2021

Teacher forcing is a method for quickly and efficiently training recurrent neural network models by feeding the ground truth from the prior time step as input.

It is a network training method critical to the development of deep learning language models used in machine translation, text summarization, and image captioning, among many other applications.

In this post, you will discover teacher forcing as a method for training recurrent neural networks.

After reading this post, you will know:

• The problem with training recurrent neural networks that use output from prior time steps as input.
• The teacher forcing method for addressing slow convergence and instability when training these types of recurrent networks.
• Extensions to teacher forcing that allow trained models to better handle open loop applications of this type of network.


Let’s get started.


## Using Output as Input in Sequence Prediction

Some sequence prediction models use the output from the previous time step, y(t-1), as input to the model at the current time step, X(t).

This type of model is common in language models that output one word at a time and use the output word as input for generating the next word in the sequence.

For example, this type of language model is used in an Encoder-Decoder recurrent neural network architecture for sequence-to-sequence generation problems such as:

• Machine Translation
• Caption Generation
• Text Summarization

After the model is trained, a “start-of-sequence” token can be used to start the process and the generated word in the output sequence is used as input on the subsequent time step, perhaps along with other input like an image or a source text.
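The generation loop described above can be sketched as follows. This is only an illustration under stated assumptions: `toy_next_word` is a hypothetical stand-in for a trained model (here just a lookup table), and the `[START]` and `[END]` token names are assumed for the example.

```python
# Hypothetical stand-in for a trained model: maps the previous word to the next word.
# A real model would return a probability distribution over the vocabulary.
TOY_TRANSITIONS = {
    "[START]": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "[END]",
}


def toy_next_word(previous_word):
    """Return the next word the toy model predicts after `previous_word`."""
    return TOY_TRANSITIONS.get(previous_word, "[END]")


def generate(next_word_fn, max_len=10):
    """Generate a sequence by feeding each output back in as the next input."""
    sequence = []
    current = "[START]"
    for _ in range(max_len):
        # The output at one time step becomes the input at the next time step.
        current = next_word_fn(current)
        if current == "[END]":
            break
        sequence.append(current)
    return sequence


print(generate(toy_next_word))
```

Note that nothing in this loop can correct an early mistake: whatever the model emits is what it sees next.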

This same recursive output-as-input process can be used when training the model, but it can result in problems such as:

• Slow convergence.
• Model instability.
• Poor skill.

Teacher forcing is an approach to improve model skill and stability when training these types of models.

## What is Teacher Forcing?

Teacher forcing is a strategy for training recurrent neural networks that uses ground truth as input, instead of model output from a prior time step as an input.

Models that have recurrent connections from their outputs leading back into the model may be trained with teacher forcing.

— Page 372, Deep Learning, 2016.

The approach was originally described and developed as an alternative technique to backpropagation through time for training a recurrent neural network.

An interesting technique that is frequently used in dynamical supervised learning tasks is to replace the actual output y(t) of a unit by the teacher signal d(t) in subsequent computation of the behavior of the network, whenever such a value exists. We call this technique teacher forcing.

Teacher forcing works by using the actual or expected output from the training dataset at the current time step y(t) as input in the next time step X(t+1), rather than the output generated by the network.

Teacher forcing is a procedure […] in which during training the model receives the ground truth output y(t) as input at time t + 1.

— Page 372, Deep Learning, 2016.
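In practice, this often amounts to preparing the decoder input as the target sequence shifted right by one position. The sketch below shows that data preparation step only; the function name `teacher_forcing_pairs` and the token names are assumptions made for illustration, not part of any library.

```python
def teacher_forcing_pairs(target_tokens, start_token="[START]", end_token="[END]"):
    """Build model inputs and expected outputs for one target sequence.

    The input at each step is the ground-truth token from the previous
    step, i.e. the target sequence shifted right by one position.
    """
    model_inputs = [start_token] + list(target_tokens)
    expected_outputs = list(target_tokens) + [end_token]
    return model_inputs, expected_outputs


inputs, outputs = teacher_forcing_pairs(["y1", "y2", "y3"])
for x_t, y_t in zip(inputs, outputs):
    print(f"input={x_t!r:12} expected output={y_t!r}")
```

Because every input comes from the training data rather than from the model, all time steps can be prepared ahead of training.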

## Worked Example

Let’s make teacher forcing concrete with a short worked example.

Given the following input sequence:

Imagine we want to train a model to generate the next word in the sequence given the previous sequence of words.

First, we must add a token to signal the start of the sequence and another to signal the end of the sequence. We will use “[START]” and “[END]” respectively.

Next, we feed the model “[START]” and let the model generate the next word.

Imagine the model generates the word “a”, but of course, we expected “Mary”.

Naively, we could feed in “a” as part of the input to generate the subsequent word in the sequence.

You can see that the model is off track and is going to get punished for every subsequent word it generates. This makes learning slower and the model unstable.

Instead, we can use teacher forcing.

In the first step, when the model generated “a” as output, we can discard this output after calculating the error and feed in “Mary” as part of the input on the subsequent time step.

We can then repeat this process for each input-output pair of words.

The model will learn the correct sequence, or correct statistical properties for the sequence, quickly.
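The contrast between the naive approach and teacher forcing can be sketched with a toy model. Everything here is an assumption for illustration: the sentence in `TARGET`, and `IMPERFECT_MODEL`, a hypothetical lookup table standing in for a partially trained network that wrongly predicts “a” after “[START]”.

```python
TARGET = ["Mary", "had", "a", "little", "lamb"]  # illustrative sentence (assumption)

# Hypothetical imperfect model: previous word -> predicted next word.
# Its only flaw is predicting "a" instead of "Mary" after "[START]".
IMPERFECT_MODEL = {
    "[START]": "a",
    "Mary": "had",
    "had": "a",
    "a": "little",
    "little": "lamb",
    "lamb": "[END]",
}


def run_training_pass(target, model, teacher_forcing):
    """Return (inputs fed to the model, its predictions, count of wrong predictions)."""
    expected = list(target) + ["[END]"]
    fed, predicted = [], []
    current = "[START]"
    for truth in expected:
        fed.append(current)
        guess = model.get(current, "[END]")
        predicted.append(guess)
        # Teacher forcing feeds the ground truth; otherwise feed the model's own guess.
        current = truth if teacher_forcing else guess
    errors = sum(g != t for g, t in zip(predicted, expected))
    return fed, predicted, errors


for mode in (False, True):
    fed, predicted, errors = run_training_pass(TARGET, IMPERFECT_MODEL, teacher_forcing=mode)
    label = "teacher forcing" if mode else "free running"
    print(f"{label}: inputs={fed} errors={errors}")
```

In this toy setup a single early mistake derails every later input in the free-running pass, while the teacher-forced pass is penalized only for the step it actually got wrong.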

## Extensions to Teacher Forcing

Teacher forcing is a fast and effective way to train a recurrent neural network that uses output from prior time steps as input to the model.

However, the approach can also produce models that are fragile or limited in practice, when the generated sequences differ from those the model saw during training.

This is common in most applications of this type of model as the outputs are probabilistic in nature. This type of application of the model is often called open loop.

Unfortunately, this procedure can result in problems in generation as small prediction error compound in the conditioning context. This can lead to poor prediction performance as the RNN’s conditioning context (the sequence of previously generated samples) diverge from sequences seen during training.

There are a number of approaches to address this limitation, for example:

### Search Candidate Output Sequences

One approach commonly used for models that predict a discrete value output, such as a word, is to perform a search across the predicted probabilities for each word to generate a number of likely candidate output sequences.

This approach is used on problems like machine translation to refine the translated output sequence.

A common search procedure for this post-hoc operation is the beam search.

This discrepancy can be mitigated by the use of a beam search heuristic maintaining several generated target sequences
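A beam search over candidate sequences can be sketched as below. This is a minimal illustration under stated assumptions, not a production decoder: `toy_step` is a hypothetical stand-in for a trained model, it returns hand-picked probabilities, and all probabilities are assumed to be strictly positive so that the logarithm is defined.

```python
import math


def beam_search(step_fn, beam_width, max_len, end_token="[END]"):
    """Keep the `beam_width` highest-scoring partial sequences at each step.

    step_fn(prefix) must return a dict mapping each candidate next token
    to its (strictly positive) probability given the prefix.
    """
    beams = [([], 0.0)]  # (tokens so far, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == end_token:
                candidates.append((tokens, score))  # finished: carry forward unchanged
                continue
            for token, prob in step_fn(tokens).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens and tokens[-1] == end_token for tokens, _ in beams):
            break
    return beams


def toy_step(prefix):
    """Hypothetical model: next-token probabilities for a few fixed prefixes."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"dog": 0.5, "cat": 0.5},
        ("the",): {"cat": 0.9, "dog": 0.1},
    }
    return table.get(tuple(prefix), {"[END]": 1.0})


for tokens, score in beam_search(toy_step, beam_width=2, max_len=5):
    print(tokens, round(math.exp(score), 2))
```

With a beam width of 1 the search is greedy and commits to “a” at the first step; with a width of 2 it keeps “the” alive long enough to find the more probable full sequence.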

### Curriculum Learning

The beam search approach is only suitable for prediction problems with discrete output values and cannot be used for real-valued outputs.

A variation of teacher forcing is to introduce outputs generated from prior time steps during training, to encourage the model to learn how to correct its own mistakes.

We propose to change the training process in order to gradually force the model to deal with its own mistakes, as it would have to during inference.

The approach is called curriculum learning and involves randomly choosing to use the ground truth output or the generated output from the previous time step as input for the current time step.

The curriculum changes over time in what is called scheduled sampling: the procedure starts with teacher forcing and slowly decreases the probability of a forced input over the training epochs.
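The per-step choice between ground truth and generated output can be sketched as follows. The linear decay used here is an assumption chosen for simplicity, and the function names are hypothetical; this shows only the sampling decision, not a full training loop.

```python
import random


def teacher_forcing_probability(epoch, total_epochs):
    """Linearly decay the chance of a forced input from 1.0 towards 0.0."""
    return max(0.0, 1.0 - epoch / total_epochs)


def choose_next_input(ground_truth, model_output, forcing_prob, rng):
    """Pick the ground truth with probability `forcing_prob`, else the model output."""
    return ground_truth if rng.random() < forcing_prob else model_output


rng = random.Random(0)  # fixed seed so repeated runs make the same choices
for epoch in range(0, 10, 3):
    p = teacher_forcing_probability(epoch, total_epochs=10)
    picks = [choose_next_input("truth", "generated", p, rng) for _ in range(1000)]
    print(f"epoch {epoch}: forcing probability {p:.1f}, "
          f"forced {picks.count('truth')} of 1000 inputs")
```

In a real training loop, `choose_next_input` would be called once per time step to decide what the model sees next, so early epochs behave like pure teacher forcing and later epochs increasingly resemble inference.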

There are also other extensions and variations of teacher forcing and I encourage you to explore them if you are interested.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Books

• Section 10.2.1, Teacher Forcing and Networks with Output Recurrence, Deep Learning, 2016.

## Summary

In this post, you discovered teacher forcing as a method for training recurrent neural networks that use output from a previous time step as input.

Specifically, you learned:

• The problem with training recurrent neural networks that use output from prior time steps as input.
• The teacher forcing method for addressing slow convergence and instability when training these types of recurrent networks.
• Extensions to teacher forcing that allow trained models to better handle open loop applications of this type of network.

Do you have any questions?


### 50 Responses to What is Teacher Forcing for Recurrent Neural Networks?

1. Huzefa Calcuttawala March 14, 2018 at 10:24 pm #

Hi Jason,

Thanks for such informative posts. Does the current version of Keras support ‘teacher forcing’? I know recurrent shop can be used to do that but how to use that in Keras?

• Jason Brownlee March 15, 2018 at 6:30 am #

Yes, I give examples in a photo captioning example.

2. jundong May 3, 2018 at 5:25 am #

HI Jason,

Currently, I am learning CNN-LSTM, LSTM encoder-decoder according to the chapter 8 and 9 in your book “Long Short-Term Memory Networks with Python”.

I have a task mapping a sequence of 2D inputs to a sequence of classification. It requires a network CNN-LSTM-encoder-decoder, and I combine the two examples together as below:

    def cnn_lstm(lmda1, lmda2):

        model = Sequential()

        # CNN module
            kernel_size = (2, 2),
            activation='relu',
            kernel_regularizer = regularizers.l1_l2(lmda1, lmda2),
            name = 'Conv_1'),
            input_shape = (None, img_height, img_width, channels)))

            kernel_size = (2, 2),
            activation='relu',
            kernel_regularizer = regularizers.l1_l2(lmda1, lmda2),
            name = 'Conv_2')))

        # Flatten all features from CNN before inputting them into encoder-decoder LSTM

        # LSTM module
        # encoder

        # decoder

        model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

        return model

Do you think it is the correct way to do it? Thank you very much!

• Jason Brownlee May 3, 2018 at 6:38 am #

Perhaps try a few different approaches and see which results in best skill on your dataset.

3. Skye May 29, 2018 at 1:35 am #

Hi Jason,

Thanks a lot for sharing!

I want to ask if the teacher forcing method performs badly on multistep forecasting problems? Because at the prediction stage, the model cannot access the ground truth of the previous value?

• Jason Brownlee May 29, 2018 at 6:28 am #

Teacher forcing is only used during training.

4. Skye May 29, 2018 at 10:46 am #

Oh I get it.

So how can I deal with it if we want to use it in practice? Does teacher forcing not have practical meaning?

• Jason Brownlee May 29, 2018 at 2:52 pm #

Sorry, I don’t follow. What is the problem you are having exactly?

Teacher forcing is used to help to keep the model on track during training.

5. Skye May 29, 2018 at 3:34 pm #

I use teacher forcing to train a seq2seq time series problem and get a low loss on training and validation dataset, but get a poor result on test dataset. Is it normal to have a bad result on test dataset?

• Jason Brownlee May 30, 2018 at 6:31 am #

Ideally, you want good skill on train and test sets.

Poor skill on a test with good skill on the training set suggests overfitting.

6. max June 24, 2018 at 9:14 pm #

Is https://arxiv.org/pdf/1409.3215.pdf, the paper from Sutskever, describing teacher forcing? I think not, or am I wrong? Is your implementation at https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/ also using teacher forcing?

• Jason Brownlee June 25, 2018 at 6:21 am #

Yes, I almost always use teacher forcing.

7. Igor Aherne August 23, 2018 at 9:25 am #

Hey Jason, thank you for the post!

I have several questions:

1. for Curriculum Learning, do we decide to teacher-force once per entire episode, or at every timestep?

2. Assuming I were to use 100% pure teacher forcing while training my LSTM, how should I deal with the gradient that is supposed to flow from ‘Cell_{t+1}’ to ‘Cell_t’? In other words, what is the gradient that arrives into Cell_t?

As I understood, teacher forcing made us plug-in the Cell values during fwd prop. During Backprop, do we use original value of Cell_t (pretending there never was a swap), and how is that possible to be combined with the gradient from Cell_{t+1}? Especially during Curriculum learning where at Cell_{t+1} or Cell_{t+2} we “played fair” and never swapped anything. (if Question 1 was true)

3. “Teacher forcing is a fast and effective way to train a recurrent neural network that uses output from prior time steps as input to the model. But, the approach can also result in models that may be fragile or limited when used in practice when the generated sequences vary from what was seen by the model during training.”

When you used Curriculum Learning in the past, did you still get slightly fragile networks, or they were just as strong as your non-forced networks? In particular LSTM.

If I use Curriculum Learning, will I be safe while enjoying the speed-ups during training?

4. From experience, how much faster does the training go?

Thanks! 🙂

8. pierre November 2, 2018 at 1:09 pm #

Thanks for the post.
I have trained my seq2seq model with teacher-forcing. My question is do I also have to compute the validation loss and ppl with the teacher-forcing?
I compute the validation loss without teacher-forcing and it remains almost the same (it decreases a bit until a certain point and stops and I am sure the point it stops is not an overfitting point) and it is generally much larger than my training loss (it is almost in the range of the training loss of the first epoch).

• Jason Brownlee November 2, 2018 at 2:52 pm #

This is an implementation detail that really depends on your code and how you’ve prepared your data.

What problem are you having exactly?

• pierre November 2, 2018 at 6:34 pm #

I am looking for the way to compute the validation loss. Is that necessary to compute the validation loss in an exact manner as we do in training?

9. Nick March 20, 2019 at 2:45 pm #

Hi Jason, is “teacher forcing” the same thing as the concept you showed in your article “How to Develop Word-Based Neural Language Models in Python with Keras” in the section “Model 2: Line-by-Line Sequence”?

https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/

There, you trained an LSTM with training data that looked like:

1. X=(_, _, _, _, _, Jack), y=and

2. X=(_, _, _, _, Jack, and), y=Jill

3. X=(_, _, _, Jack, and, Jill), y=went

4. X=(_, _, Jack, and, Jill, went), y=up

5. X=(_, Jack, and, Jill, went, up), y=the

6. X=(Jack, and, Jill, went, up, the), y=hill

7. X=(and, Jill, went, up, the, hill), y=to

Thanks for any clarification.

• Jason Brownlee March 21, 2019 at 7:58 am #

Great question.

I use teacher forcing by default, it is just so effective. What is harder is using it sometimes and not others, and allowing the model to correct an off-track input sequence.

• Nick March 22, 2019 at 8:38 am #

Can you confirm that the approach in “Model 2: Line-by-Line Sequence” is the same as teacher forcing? I’m just trying to wrap my head around the terminology.

10. MultiK April 28, 2019 at 7:26 pm #

‘Teacher forcing is a strategy for training recurrent neural networks that uses model output from a prior time step as an input.’

At first reading, I thought your phrase ‘output from a prior time step’ meant the output (not the ground truth) from time t-1.
Actually, it’s the ground truth at time t-1 which is sent to time t as input during training. Right?

11. Ruzbeh June 6, 2019 at 2:03 am #

Great post, thank you!

How is the Scheduled Sampling/Curriculum learning actually implemented? I assume we need to write a custom Keras backend function?

Do you know of any pseudo-code to help implement this?

Thanks!

• Jason Brownlee June 6, 2019 at 6:36 am #

No, you provide real inputs during training rather than predicted inputs.

I have countless examples, perhaps start here:
https://machinelearningmastery.com/start-here/#lstm

• Ruzbeh June 6, 2019 at 6:51 am #

Thanks! Maybe I should clarify:

Doesn’t Scheduled Sampling, during training, gradually change from using “real inputs” to “predicted inputs”? This is my understanding from the paper.

• Jason Brownlee June 6, 2019 at 2:13 pm #

Yes, that is the ideal implementation.

To implement this in Keras requires customized code, I don’t have an example of the transition.

• Ruzbeh June 6, 2019 at 11:06 pm #

Great! Thank you for your reply. I’ll try to code one up!

12. Brando Miranda June 12, 2019 at 7:43 am #

Hi Jason!

It might be good to mention this:

 The first trick is using teacher forcing. This means that at some probability, set by teacher_forcing_ratio, we use the current target word as the decoder’s next input rather than using the decoder’s current guess. This technique acts as training wheels for the decoder, aiding in more efficient training. However, teacher forcing can lead to model instability during inference, as the decoder may not have a sufficient chance to truly craft its own output sequences during training. Thus, we must be mindful of how we are setting the teacher_forcing_ratio, and not be fooled by fast convergence.

 

13. Omar September 5, 2019 at 3:28 am #

Hi Jason, thanks for the article!
I am curious about a point: in section “Using Output as Input in Sequence Prediction”, you mention

This same recursive output-as-input process can be used when training the model, but it can result in problems such as:

Slow convergence.
Model instability.
Poor skill.

Can you provide a reference to this point? I could not really put my hand on a paper where they report trying this training scheme or studied its effect.
That will be much appreciated :))

• Jason Brownlee September 5, 2019 at 6:59 am #

Not off hand, perhaps check some of the papers on teacher forcing and extensions?

• Omar September 5, 2019 at 10:01 pm #

I will. Thanks Jason 🙂

14. Ferdinando Insalata March 12, 2020 at 3:31 am #

Hi Jason,
how would I go about implementing teacher forcing in an autoencoder?
Since the input of the decoder are embeddings produced by the autoencoder, how do I supply the target token ?

Thank you for the helpful resources.

• Jason Brownlee March 12, 2020 at 8:54 am #

It would be just like teacher forcing for any LSTM model. The autoencoder model does not make it different.

15. Johnathan July 4, 2020 at 6:35 am #

Is there an efficient way to do teacher forcing training but using yhat as input in lieu of the ground truth y?

• Jason Brownlee July 5, 2020 at 6:46 am #

Yes, but that is not teacher forcing.

You can do it sample by sample manually.

16. Darryl Fenwick July 11, 2020 at 4:59 am #

Hi Jason,

Teacher forcing makes me think of NARX neural networks, a subject which I am interested in. I have yet to find an example of one with Keras. My question is whether you could use teacher forcing with multiple delays of outputs to create a NARX model, which normally would use the model outputs and not ground truth.

• Jason Brownlee July 11, 2020 at 6:22 am #

Not sure off hand, perhaps explore whether it is viable with some prototypes.

17. daniel August 17, 2020 at 2:38 pm #

Hi Jason,
Have you ever tried using teacher forcing for the first half of the epochs and not for the second half? When I did that, I noticed that when the first half of the epochs ended, my val_loss suddenly dropped sharply and the train_loss increased, and I didn’t understand why that happened. I had thought that when using teacher forcing, both train_loss and val_loss would decrease.

• Jason Brownlee August 18, 2020 at 5:57 am #

Nice experiment.

I would expect that once teacher forcing is removed that model performance would get worse.

It is a good idea to cycle teacher forcing on and off so the model can slowly learn how to correct its own mistakes.

18. T October 28, 2020 at 2:09 pm #

but I have a question
How to test a model trained using teacher forcing

I want to train a seq2seq model. The X and y to the model are [X_encoder, X_decoder] and y i.e. a list of encoder and decoder inputs and labels (Note that the decoder input, X_decoder is ‘y’ with one position ahead than the actual y. Basically, teacher forcing).

So my question is now after training, when it comes to actual prediction where I do not have any labels how do I provide ‘X_decoder’ to my input? Or do I train on something else?

• Jason Brownlee October 29, 2020 at 7:55 am #

You’re welcome.

During inference/prediction, the teacher forcing is replaced with the output from the model in the last time step.

19. Nur March 17, 2021 at 12:11 am #

Great post, very helpful, thank you!

I have a question and would be very happy if you can enlighten me. I am working on medical image captioning, and I am using custom training and validation functions instead of using model.fit or train_on_batch. (I am also providing the functions below.)

If I use the teacher forcing in the validation step, there is no overfitting or anything unexpected. However, if I use the actual output in the validation, the model is overfitting.

My question is, should we use the teacher forcing in the validation set? Because I believe, the validation step should be the same as the test step (to see whether the model is learning properly or not.)

I could not find the proper explanation anywhere, I would appreciate it if you can answer my question.

    @tf.function
    def train_step(tensor, target, mesh):
        """Subclassed model training step"""
        loss = 0
        accuracy = 0
        # initializing the hidden state for each batch
        hidden = tf.zeros((target.shape[0], units))
        dec_input = tf.expand_dims([tokenizer.word_index['']] * target.shape[0], 1)
        features = encoder(tensor, mesh)

        for i in range(1, target.shape[1]):
            # passing the features through the decoder
            predictions, hidden = decoder([dec_input, features, hidden])

            loss += loss_func(target[:, i], predictions)
            accuracy += acc_func(target[:, i], predictions)

            # using teacher forcing
            dec_input = tf.expand_dims(target[:, i], 1)

        total_loss = (loss / int(target.shape[1]))
        total_acc = (accuracy / int(target.shape[1]))

        trainable_variables = encoder.trainable_variables + decoder.trainable_variables

        return loss, total_loss, total_acc

    #@tf.function
    def val_step(tensor, target, mesh):
        """Subclassed model validation step"""
        loss = 0
        accuracy = 0

        # initializing the hidden state for each batch
        hidden = tf.zeros((target.shape[0], units))
        dec_input = tf.expand_dims([tokenizer.word_index['']] * target.shape[0], 1)
        mesh = tf.fill([target.shape[0], 64], 490)
        features = encoder(tensor, mesh)

        for i in range(1, target.shape[1]):
            # passing the features through the decoder
            predictions, hidden = decoder([tf.cast(dec_input, tf.float32), features, hidden])

            loss += loss_func(target[:, i], predictions)
            accuracy += acc_func(target[:, i], predictions)

            #dec_input = tf.expand_dims(target[:, i], 1)
            #predicted_id = tf.argmax(predictions[0])
            predicted_id = tf.argmax(predictions[0]).numpy()
            dec_input = tf.expand_dims([predicted_id]*target.shape[0], 1)

        total_loss = (loss / int(target.shape[1]))
        total_acc = (accuracy / int(target.shape[1]))

        return loss, total_loss, total_acc

• Jason Brownlee March 17, 2021 at 6:08 am #

Good question. The choice is yours.

If you want the validation score to be representative of how the model will be used in practice, then don’t use teacher forcing. If you want it to be representative and a point of comparison with the training set, do teacher forcing.

20. Tim April 7, 2021 at 10:19 pm #

Hi, I believe the sentence underneath ‘What is Teacher Forcing?’ is wrong:

“Teacher forcing is a strategy for training recurrent neural networks that uses model output from a prior time step as an input.”

This since teacher forcing actually uses the ground truth as input and not the model output from the previous time-step, as you mention later.

Dear Jason Brownlee

Do you have sample LSTM code not using Teacher forcing?

22. viduz September 15, 2021 at 12:20 am #

Hi, i’m a bit confused on how teacher forcing could be applied to language translation task. While the concept holds for text generation, where input and output words belong to the same vocabulary, on language translation, say English->Italian, how can i feed a ground truth (Italian) word into input which expects an English word?? Don’t input and outputs belong to different vocabularies?

• Adrian Tam September 15, 2021 at 11:27 pm #

The concept is still valid. The problem is what you should feed in. Instead of using the output of the network during training, teacher forcing means you should hold your ground truth and ignore the model output. This article is about text generation. For translation, I think it doesn’t matter because you will not feed in the output back to the network anyway.