How to Handle Very Long Sequences with Long Short-Term Memory Recurrent Neural Networks

Long Short-Term Memory or LSTM recurrent neural networks are capable of learning and remembering over long sequences of inputs.

LSTMs work very well if your problem has one output for every input, like time series forecasting or text translation. But LSTMs can be challenging to use when you have very long input sequences and only one or a handful of outputs.

This is often called sequence classification.

Some examples include:

  • Classification of sentiment in documents containing thousands of words (natural language processing).
  • Classification of an EEG trace of thousands of time steps (medicine).
  • Classification of coding or non-coding genes for sequences of thousands of DNA base pairs (bioinformatics).

These so-called sequence classification tasks require special handling when using recurrent neural networks, like LSTMs.

In this post, you will discover 6 ways to handle very long sequences for sequence classification problems.

Let’s get started.

1. Use Sequences As-Is

The starting point is to use the long sequence data as-is without change.

This may result in very long training times.

More troubling, attempting to back-propagate across very long input sequences may result in vanishing gradients, and in turn, an unlearnable model.
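To make the vanishing gradient concern concrete, here is a rough sketch for a simple recurrent network. The symbols (hidden state h_t, loss L, sequence length T, and bound gamma) are introduced here for illustration only. The gradient that reaches the first time step is a product of per-step Jacobians, so if each factor shrinks the signal a little, the product shrinks it exponentially:

```latex
\frac{\partial \mathcal{L}}{\partial h_1}
  = \frac{\partial \mathcal{L}}{\partial h_T}
    \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}},
\qquad
\left\| \frac{\partial h_t}{\partial h_{t-1}} \right\| \le \gamma < 1
\;\Longrightarrow\;
\left\| \frac{\partial \mathcal{L}}{\partial h_1} \right\|
  \le \gamma^{\,T-1} \left\| \frac{\partial \mathcal{L}}{\partial h_T} \right\|
```

For example, with gamma = 0.9 and T = 500 the bound is on the order of 10^-23. The gating in an LSTM is designed to soften this effect, but it does not remove it for very long sequences.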

A reasonable limit of 250-500 time steps is often used in practice with large LSTM models.

2. Truncate Sequences

A common technique for handling very long sequences is to simply truncate them.

This can be done by selectively removing time steps from the beginning or the end of input sequences.

This will allow you to force the sequences to a manageable length at the cost of losing data.

The risk of truncating input sequences is that data the model needs in order to make accurate predictions is lost.
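A minimal sketch of truncation in plain Python is shown below. The function name truncate_sequence and its keep argument are made up for this illustration and are not part of any library.

```python
def truncate_sequence(seq, max_len, keep="end"):
    """Return at most max_len time steps from seq.

    keep="end" removes time steps from the beginning of the sequence,
    keep="start" removes time steps from the end.
    (Hypothetical helper, written for illustration only.)
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if keep not in ("start", "end"):
        raise ValueError("keep must be 'start' or 'end'")
    if len(seq) <= max_len:
        return list(seq)  # already short enough, nothing is lost
    return list(seq[-max_len:]) if keep == "end" else list(seq[:max_len])


sequence = list(range(10))
kept_end = truncate_sequence(sequence, 4, keep="end")      # drops the oldest steps
kept_start = truncate_sequence(sequence, 4, keep="start")  # drops the newest steps
```

Which end to cut depends on the problem. If you are working in Keras, the pad_sequences utility provides similar behavior through its maxlen and truncating arguments.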

3. Summarize Sequences

In some problem domains, it may be possible to summarize the input sequences.

For example, in the case where input sequences are words, it may be possible to remove all words from input sequences that are above a specified word frequency (e.g. “and”, “the”, etc.).

This could be framed as keeping only the observations whose frequency rank in the entire training dataset falls beyond some fixed cutoff, that is, dropping the top-ranked, most frequent words.

Summarization may result in both focusing the problem on the most salient parts of the input sequences and sufficiently reducing the length of input sequences.
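As a rough sketch of the word frequency idea, the snippet below counts words across a tiny made-up corpus and drops the most frequent ones. The function name drop_most_frequent, the num_to_drop parameter, and the example documents are all assumptions for illustration.

```python
from collections import Counter


def drop_most_frequent(docs, num_to_drop):
    """Remove the num_to_drop most frequent words (by corpus count) from each doc.

    (Hypothetical helper, written for illustration only.)
    """
    # Count every word across the whole training corpus.
    counts = Counter(word for doc in docs for word in doc)
    too_common = {word for word, _ in counts.most_common(num_to_drop)}
    # Keep the remaining words in their original order.
    return [[word for word in doc if word not in too_common] for doc in docs]


# Tiny made-up corpus: "the" and "was" each occur four times.
docs = [
    "the movie was great and the cast was great".split(),
    "the plot was thin and the pacing was slow".split(),
]
shortened = drop_most_frequent(docs, num_to_drop=2)
```

In a real project the cutoff would need to be tuned, as some frequent words can still carry meaning for the task.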

4. Random Sampling

A less systematic approach may be to summarize a sequence using random sampling.

Random time steps may be selected and removed from the sequence in order to reduce it to a specific length.

Alternately, random contiguous subsequences may be selected to construct a new sampled sequence of the desired length, taking care to handle overlap or non-overlap as required by the domain.

This approach may be suitable in cases where there is no obvious way to systematically reduce the sequence length.

This approach may also be used as a type of data augmentation scheme in order to create many possible different input sequences from each input sequence. Such methods can improve the robustness of models when available training data is limited.
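Both flavors of random sampling can be sketched with the standard library. The helper names random_subsample and random_window are invented for this example, and the seed is fixed only so the run is repeatable.

```python
import random


def random_subsample(seq, target_len, rng):
    """Keep target_len randomly chosen time steps, preserving their order."""
    if target_len >= len(seq):
        return list(seq)
    keep = sorted(rng.sample(range(len(seq)), target_len))
    return [seq[i] for i in keep]


def random_window(seq, target_len, rng):
    """Return one random contiguous subsequence of length target_len."""
    if target_len >= len(seq):
        return list(seq)
    start = rng.randrange(len(seq) - target_len + 1)
    return list(seq[start:start + target_len])


rng = random.Random(42)  # fixed seed so the example is repeatable
sequence = list(range(100))

sampled = random_subsample(sequence, 10, rng)  # scattered time steps
window = random_window(sequence, 10, rng)      # one contiguous slice

# Data augmentation: several different windows drawn from the same sequence.
augmented = [random_window(sequence, 10, rng) for _ in range(5)]
```

Note that the windows in augmented may overlap. Whether that is acceptable depends on the domain.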

5. Use Truncated Backpropagation Through Time

Rather than updating the model based on the entire sequence, the gradient can be estimated from a subset of the last time steps.

This is called Truncated Backpropagation Through Time, or TBPTT for short. It can dramatically speed up the learning process of recurrent neural networks like LSTMs on long sequences.

With TBPTT, the full sequence is still provided as input and processed in the forward pass, but only the last tens or hundreds of time steps are used to estimate the gradients for weight updates.

Some modern implementations of LSTMs permit you to specify the number of time steps to use for updates, separate from the number of time steps used as input sequences. For example, the scan function in Theano takes a truncate_gradient parameter.
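To show the idea without relying on any particular library, below is a toy sketch of TBPTT for a scalar tanh recurrent unit (not an LSTM). The model, the function names, and the parameters w, u, and k are all assumptions made for illustration. The forward pass covers every time step, while the backward pass walks through only the last k.

```python
import math


def forward(xs, w, u):
    """Run a scalar tanh RNN over xs and return all hidden states (hs[0] = 0)."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs


def loss(xs, y, w, u):
    """Squared error between the final hidden state and the target y."""
    return 0.5 * (forward(xs, w, u)[-1] - y) ** 2


def tbptt_gradients(xs, y, w, u, k):
    """Gradient of the loss w.r.t. (w, u), backpropagated through the last k steps only."""
    hs = forward(xs, w, u)  # forward pass over the full sequence
    T = len(xs)
    dh = hs[-1] - y         # d(loss) / d(h_T)
    grad_w = grad_u = 0.0
    for t in range(T, max(T - k, 0), -1):  # t = T, T-1, ..., T-k+1
        da = dh * (1.0 - hs[t] ** 2)       # back through tanh
        grad_w += da * hs[t - 1]
        grad_u += da * xs[t - 1]
        dh = da * w                        # pass the gradient on to h_{t-1}
    return grad_w, grad_u


xs = [math.sin(0.1 * t) for t in range(200)]
full = tbptt_gradients(xs, 1.0, 0.5, 0.8, k=len(xs))  # ordinary BPTT
truncated = tbptt_gradients(xs, 1.0, 0.5, 0.8, k=20)  # last 20 steps only
```

In this toy setting the truncated gradient stays close to the full one because contributions from earlier steps decay quickly. That will not hold for every problem, so the number of steps is worth tuning.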

6. Use an Encoder-Decoder Architecture

You can use an autoencoder to learn a new, shorter representation for long sequences, then a decoder network to interpret the encoded representation into the desired output.

This may involve an unsupervised autoencoder as a pre-processing pass on sequences, or the more recent encoder-decoder LSTM style networks used for natural language translation.

Again, there may still be difficulties in learning from very long sequences, but the more sophisticated architecture may offer additional leverage or skill, especially if combined with one or more of the techniques above.

Honorable Mentions and Crazy Ideas

This section lists some additional ideas that are not fully thought through.

  • Explore splitting the input sequence into multiple fixed-length subsequences and training a model with each subsequence as a separate feature (e.g. parallel input sequences).
  • Explore a Bidirectional LSTM where each LSTM in the pair is fit on half of the input sequence and the outcomes of each layer are merged. Scale from 2 to more to suitably reduce the length of the subsequences.
  • Explore using sequence-aware encoding schemes, projection methods, and even hashing in order to reduce the number of time steps in less domain-specific ways.
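The first idea in the list above, splitting a sequence into fixed-length subsequences that are used as parallel features, might be prepared along these lines. The helper names, the zero padding, and the choice to transpose the chunks are all assumptions for illustration, in keeping with the "not fully thought through" spirit of this section.

```python
def split_into_subsequences(seq, sub_len, pad_value=0):
    """Split seq into consecutive chunks of sub_len, padding the final chunk."""
    if sub_len <= 0:
        raise ValueError("sub_len must be positive")
    chunks = [list(seq[i:i + sub_len]) for i in range(0, len(seq), sub_len)]
    if chunks and len(chunks[-1]) < sub_len:
        chunks[-1] += [pad_value] * (sub_len - len(chunks[-1]))
    return chunks


def as_parallel_features(chunks):
    """Transpose chunks so each time step holds one value per subsequence."""
    return [list(step) for step in zip(*chunks)]


sequence = list(range(1, 11))  # 10 time steps, 1 feature
chunks = split_into_subsequences(sequence, 4)
features = as_parallel_features(chunks)  # 4 time steps, 3 features
```

Here a sequence of 10 time steps becomes 4 time steps with 3 features each, which shortens the path the gradient must travel at the cost of mixing distant parts of the sequence into each step.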

Do you have any crazy ideas of your own?
Let me know in the comments.

Summary

In this post, you discovered how you can handle very long sequences when training recurrent neural networks like LSTMs.

Specifically, you learned:

  • How to reduce sequence length using truncation, summarization, and random sampling.
  • How to adjust learning to use Truncated Backpropagation Through Time.
  • How to adjust the network architecture to use an encoder-decoder structure.

Do you have any questions?
Ask your questions in the comments and I will do my best to answer.

14 Responses to How to Handle Very Long Sequences with Long Short-Term Memory Recurrent Neural Networks

  1. Pranav Goel June 27, 2017 at 1:32 am #

    Hello Jason,

    Thank you for this blog post! I think TBPTT is probably a very suitable method. Do you know if there is something equivalent to the ‘truncate_gradient’ parameter in Theano available for Keras?


  2. Owen June 27, 2017 at 5:03 am #

    “For example, in the case where input sequences are words, it may be possible to remove all words from input sequences that are below a specified word frequency.”

    I think it may be the opposite. Words with extremely low frequencies may be the core of the sentence, and the ones with high frequencies might just be meaningless. For example, ‘the’ is no doubt the most frequent word, but it can’t tell you anything. However, words like ‘covfefe’ might be important.

    • Jason Brownlee June 27, 2017 at 8:36 am #

      I agree that was an error, fixed. Thanks Owen!

      • Michel Lemay July 4, 2017 at 1:09 am #

        In practice, both ends of the frequency distribution need to be truncated. Common sense dictates that very high frequency words tend to be stopwords. However, meaningful words start to appear early on in this list, so one should take care when defining a good threshold for the top terms. At the opposite end, in the long tail, most of the lower frequency terms fall into categories like misspelled words, obscure jargon words, unique IDs or product numbers, etc. Here, we use several approaches to clean up the list: compare local IDF to generic English IDF and select words that have significant divergence, use an unsupervised algorithm to test the usefulness of a word on a given task by removing it, etc.

  3. Ben June 30, 2017 at 5:01 am #

    Hi Jason
    Thanks for the web site
    Do you have an example on symbolic sequences prediction using LSTM?

    • Jason Brownlee June 30, 2017 at 8:18 am #

      What do you mean exactly, can you give an example?

      • Ben June 30, 2017 at 6:24 pm #

        Thanks for your reply.
        For example, having this sequence:
        What symbol can we expect in the future?


        • Jason Brownlee July 1, 2017 at 6:30 am #

          Yes, I would recommend encoding the integers with a one hot encoding as long as you know the scope of input/output.
