How to Handle Very Long Sequences with Long Short-Term Memory Recurrent Neural Networks

Long Short-Term Memory or LSTM recurrent neural networks are capable of learning and remembering over long sequences of inputs.

LSTMs work very well if your problem has one output for every input, like time series forecasting or text translation. But LSTMs can be challenging to use when you have very long input sequences and only one or a handful of outputs.

This is often called sequence classification (or sometimes sequence labeling).

Some examples include:

  • Classification of sentiment in documents containing thousands of words (natural language processing).
  • Classification of an EEG trace of thousands of time steps (medicine).
  • Classification of coding or non-coding genes for sequences of thousands of DNA base pairs (bioinformatics).

These so-called sequence classification tasks require special handling when using recurrent neural networks, like LSTMs.

In this post, you will discover 6 ways to handle very long sequences for sequence classification problems.

Let’s get started.

Photo by Justin Jensen, some rights reserved.

1. Use Sequences As-Is

The starting point is to use the long sequence data as-is without change.

This may result in very long training times.

More troubling, attempting to back-propagate across very long input sequences may result in vanishing gradients, and in turn, an unlearnable model.

A reasonable limit of 250-500 time steps is often used in practice with large LSTM models.

2. Truncate Sequences

A common technique for handling very long sequences is to simply truncate them.

This can be done by selectively removing time steps from the beginning or the end of input sequences.

This will allow you to force the sequences to a manageable length at the cost of losing data.

The risk of truncating input sequences is that data valuable to the model for making accurate predictions is lost.
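
For example, the pad_sequences() utility in Keras can both pad short sequences and truncate long ones to a fixed length. A minimal sketch, where the 4-step limit and toy sequences are illustrative:

    from keras.preprocessing.sequence import pad_sequences

    # three integer-encoded sequences of different lengths
    sequences = [[1, 2, 3, 4, 5, 6], [7, 8], [9, 10, 11, 12]]

    # force every sequence to 4 time steps; truncating='pre' drops time steps
    # from the beginning of long sequences, 'post' drops them from the end
    fixed = pad_sequences(sequences, maxlen=4, truncating='pre', padding='pre')
    print(fixed)
    # [[ 3  4  5  6]
    #  [ 0  0  7  8]
    #  [ 9 10 11 12]]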

3. Summarize Sequences

In some problem domains, it may be possible to summarize the input sequences.

For example, in the case where input sequences are words, it may be possible to remove all words from input sequences that are above a specified word frequency (e.g. “and”, “the”, etc.).

This could be framed as discarding any observation whose frequency rank over the entire training dataset places it among the top k most common, for some fixed k.

Summarization may result in both focusing the problem on the most salient parts of the input sequences and sufficiently reducing the length of input sequences.
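
A minimal sketch of this framing in plain Python, where the toy documents and the cut-off k are illustrative assumptions:

    from collections import Counter

    # toy tokenized documents standing in for a real training dataset
    docs = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
            ['the', 'dog', 'and', 'the', 'cat', 'played']]

    # rank words by frequency over the entire training dataset
    counts = Counter(word for doc in docs for word in doc)

    # drop any word ranked among the k most frequent in the dataset
    k = 1
    too_common = {word for word, _ in counts.most_common(k)}
    summarized = [[w for w in doc if w not in too_common] for doc in docs]
    print(summarized)
    # [['cat', 'sat', 'on', 'mat'], ['dog', 'and', 'cat', 'played']]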

4. Random Sampling

A less systematic approach may be to summarize a sequence using random sampling.

Random time steps may be selected and removed from the sequence in order to reduce it to a specific length.

Alternately, random contiguous subsequences may be selected and concatenated to construct a new sampled sequence of the desired length, taking care to handle overlap or non-overlap as required by the domain.

This approach may be suitable in cases where there is no obvious way to systematically reduce the sequence length.

This approach may also be used as a type of data augmentation scheme in order to create many possible different input sequences from each input sequence. Such methods can improve the robustness of models when available training data is limited.
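
A rough sketch of both variations with NumPy, where the sequence shape and the target length of 100 steps are illustrative assumptions:

    import numpy as np

    sequence = np.random.rand(1000, 3)   # one long sequence: (time steps, features)
    target_length = 100

    # option 1: keep a random subset of time steps, preserving their order
    keep = np.sort(np.random.choice(len(sequence), size=target_length, replace=False))
    sampled = sequence[keep]

    # option 2: take one random contiguous subsequence of the desired length
    start = np.random.randint(0, len(sequence) - target_length + 1)
    window = sequence[start:start + target_length]

    print(sampled.shape, window.shape)   # (100, 3) (100, 3)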

5. Use Truncated Backpropagation Through Time

Rather than updating the model based on the entire sequence, the gradient can be estimated from a subset of the last time steps.

This is called Truncated Backpropagation Through Time, or TBPTT for short. It can dramatically speed up the learning process of recurrent neural networks like LSTMs on long sequences.

This would allow the full sequence to be provided as input and the forward pass to be executed, but only the last tens or hundreds of time steps would be used to estimate the gradients and update the weights.

Some modern implementations of LSTMs permit you to specify the number of time steps to use for updates, separate from the number of time steps used as input. For example, the truncate_gradient argument in Theano.
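
Where no such argument is available, a similar effect can be approximated by splitting each long sequence into shorter chunks and carrying the hidden state across them with a stateful LSTM. The sketch below assumes Keras; the sequence length, chunk size, and layer sizes are illustrative, and the single sequence label is crudely applied at every chunk:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    n_steps, n_features, chunk = 1000, 1, 100
    X = np.random.rand(1, n_steps, n_features)   # one very long input sequence
    y = np.array([[1.0]])                        # a single label for that sequence

    model = Sequential()
    model.add(LSTM(32, batch_input_shape=(1, chunk, n_features), stateful=True))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam')

    for epoch in range(10):
        # feed the sequence chunk by chunk; state is carried forward between
        # chunks, but gradients only flow back within each 100-step chunk
        for start in range(0, n_steps, chunk):
            model.train_on_batch(X[:, start:start + chunk, :], y)
        model.reset_states()   # clear state before the next pass over the sequence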

6. Use an Encoder-Decoder Architecture

You can use an encoder network to learn a new, fixed-length representation of a long sequence, then a decoder network to interpret the encoded representation into the desired output.

This may involve an unsupervised autoencoder as a pre-processing pass on sequences, or the more recent encoder-decoder LSTM style networks used for natural language translation.

Again, there may still be difficulties in learning from very long sequences, but the more sophisticated architecture may offer additional leverage or skill, especially if combined with one or more of the techniques above.
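
As a rough sketch of the unsupervised pre-processing variant, the snippet below fits an LSTM sequence autoencoder in Keras and then reuses the trained encoder to compress each long sequence into a short fixed-length code for a downstream model. The sequence length, code size, and toy data are illustrative assumptions:

    import numpy as np
    from keras.models import Sequential, Model
    from keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

    n_samples, n_steps, n_features = 64, 500, 1
    X = np.random.rand(n_samples, n_steps, n_features)   # toy long sequences

    # sequence autoencoder: compress each 500-step sequence into a 32-length code
    autoencoder = Sequential()
    autoencoder.add(LSTM(32, input_shape=(n_steps, n_features)))   # encoder
    autoencoder.add(RepeatVector(n_steps))
    autoencoder.add(LSTM(32, return_sequences=True))               # decoder
    autoencoder.add(TimeDistributed(Dense(n_features)))
    autoencoder.compile(loss='mse', optimizer='adam')
    autoencoder.fit(X, X, epochs=5, verbose=0)

    # reuse the trained encoder to produce short fixed-length codes that a
    # separate classifier (or decoder) can then interpret
    encoder = Model(inputs=autoencoder.input, outputs=autoencoder.layers[0].output)
    codes = encoder.predict(X)
    print(codes.shape)   # (64, 32)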

Honorable Mentions and Crazy Ideas

This section lists some additional ideas that are not fully thought through.

  • Explore splitting the input sequence into multiple fixed-length subsequences and train a model with each subsequence as a separate feature (e.g. parallel input sequences); a small sketch of this is given after the list.
  • Explore a Bidirectional LSTM where each LSTM in the pair is fit on half of the input sequence and the outcomes of each are merged. Scale up from two to more LSTMs to suitably reduce the length of the subsequences.
  • Explore using sequence-aware encoding schemes, projection methods, and even hashing in order to reduce the number of time steps in less domain-specific ways.
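
For the first idea above, a minimal sketch of splitting one long univariate sequence into fixed-length subsequences treated as parallel input features (the lengths are illustrative assumptions):

    import numpy as np

    n_steps, n_chunks = 1000, 10
    sequence = np.random.rand(n_steps)      # one long univariate sequence

    # split into 10 subsequences of 100 steps and stack them as parallel
    # features; each column is one contiguous 100-step chunk of the original
    parallel = sequence.reshape(n_chunks, n_steps // n_chunks).T
    print(parallel.shape)   # (time steps, features) = (100, 10)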

Do you have any crazy ideas of your own?
Let me know in the comments.

Summary

In this post, you discovered how you can handle very long sequences when training recurrent neural networks like LSTMs.

Specifically, you learned:

  • How to reduce sequence length using truncation, summarization, and random sampling.
  • How to adjust learning to use Truncated Backpropagation Through Time.
  • How to adjust the network architecture to use an encoder-decoder structure.

Do you have any questions?
Ask your questions in the comments and I will do my best to answer.

39 Responses to How to Handle Very Long Sequences with Long Short-Term Memory Recurrent Neural Networks

  1. Pranav Goel June 27, 2017 at 1:32 am #

    Hello Jason,

    Thank you for this blog post! I think TBPTT is probably a very suitable method. Do you know if there is something equivalent to the ‘truncate_gradient’ parameter in theano available for keras?

    Thanks!

  2. Owen June 27, 2017 at 5:03 am #

    “For example, in the case where input sequences are words, it may be possible to remove all words from input sequences that are below a specified word frequency.”

    I think it may be the opposite. Words with extremely low frequencies may be the core of the sentence, and the ones with high frequencies might just be meaningless. For example, ‘the’ is no doubt the most frequent word, but it can’t tell you anything. However, words like ‘covfefe’ might be important.

    • Jason Brownlee June 27, 2017 at 8:36 am #

      I agree that was an error, fixed. Thanks Owen!

      • Michel Lemay July 4, 2017 at 1:09 am #

        In practice, both ends of the frequency distribution need to be truncated. Common sense dictates that very high frequency words tend to be stopwords. However, meaningful words start to appear early on in this list, so one should take care when defining what would be a good threshold for the top terms. At the opposite end, in the long tail, most of the lower frequency terms fall into categories like misspelled words, obscure jargon words, unique ids or product numbers, etc. Here, we use several approaches to clean up the list: compare local IDF to generic English IDF and select words that have significant divergence, use an unsupervised algorithm to test the usefulness of a word on a given task by removing it, etc.

  3. Ben June 30, 2017 at 5:01 am #

    Hi Jason
    Thanks for the web site
    Do you have an example of symbolic sequence prediction using LSTMs?
    Thanks

    • Jason Brownlee June 30, 2017 at 8:18 am #

      What do you mean exactly, can you give an example?

      • Ben June 30, 2017 at 6:24 pm #

        Thanks for your reply.
        For example, having this sequence:
        1,2,2,3,1,6,4,5,7,1,3,2,2
        What symbol can we expect in the future?

        Thanks

        • Jason Brownlee July 1, 2017 at 6:30 am #

          Yes, I would recommend encoding the integers with a one hot encoding as long as you know the scope of input/output.

  4. buzznizz August 4, 2017 at 9:39 pm #

    Hi Jason

    Do you have experience in transforming time series to the time-frequency domain using wavelets and using that as the input instead?

    BR

    • Jason Brownlee August 5, 2017 at 5:45 am #

      Not with LSTMs, sorry. Thanks for the suggestion.

  5. allonbrooks November 21, 2017 at 8:25 pm #

    If I have a sentence with a length of more than 2000 and I want to train a sentiment analysis model with an LSTM, how do I handle this situation?

    • Jason Brownlee November 22, 2017 at 11:11 am #

      Try each method in this post and see what results in models with the best skill.

  6. Saeid November 22, 2017 at 2:16 pm #

    Hi Jason,

    I have a question about timesteps. In all of my samples, the sequences are in the form:

    X1 Y1 Z1 X2 Y2 Z2 X3 Y3 Z3 … … … X100 Y100 Z100

    Then in my input shape, I assume the timesteps would be 100 and features=3. Is that right, or is the timestep 300 in this case?

    • Jason Brownlee November 23, 2017 at 10:25 am #

      Not sure I follow, what do you want to predict, what are the inputs and outputs you want to model?

      • Saeid November 26, 2017 at 11:43 am #

        So I have a classification problem. My labeled data features are x-y coordinates and direction angle (theta) of trajectories of some people walking in a room. The data are time series and there are 2000 people (examples). Labels are 0 and 1, meaning whether a person has a specific intention or not. In other words, the tabular representation of the data would be:

                     X1   Y1   theta1   X2    Y2    theta2   …   X100   Y100   theta100   Label
        Person1       1   -1        3   1.1   -1.1     3.1   …     10    -10        3.9       0
        Person2       5    9       -2   5.1    9.1    -2.1   …     15     19       -2.6       1
        …             …    …        …     …      …       …   …      …      …          …       …
        Person2000    …    …        …     …      …       …   …      …      …          …       0

        Since there are three features and I have a history of these features for each example, I am trying to model this problem with an LSTM. So far I have implemented one LSTM layer and one Dense layer with sigmoid activation. But my major concern is whether I am reshaping my input correctly or not. So my input shape, let's say for the table above, is (2000, 100, 3), but I was thinking maybe the timestep is 100*3?

        I hope I could explain well and would be grateful to know your idea about the input shape and if you have any suggestions for me in general.

        Thanks

  7. JJ November 23, 2017 at 10:36 am #

    Hi Jason,

    I’ve been following your blog for a while, and find it very helpful. I have some confusion regarding LSTMs that all the different sources available do not really address, and I’m hoping that you could clear it up, given your very clear explanations.

    My confusion is related to the ‘memory’ of an LSTM and how the data is fed into it in real-world examples. So let’s say I’m classifying a one-channel EEG of a person (N people in total). Let’s say we have 100 time-points (to keep it simple) for each person and a label which says whether the person is attentive or not during those 100 time-points. For a new set of time-points I want to say whether the person is attentive or not.

    If I take an LSTM of 20 time-steps, then what is the memory based on? Does it mean it can only remember what happened within those 20 time-steps? Or would it keep the memory of a previous chunk of time-steps fed into it? And do training examples now simply become 20-time-step chunks, with the LSTM able to recall things from previous chunks? Because when people talk about, say, predicting the next word at the end of a document after passing the whole document through an LSTM, they mention that LSTMs are good because they can even remember words at the beginning of the text to make such a prediction. So it seems to me it has memory from previous time-step chunks.

    Here is where the confusion comes in for me: if that is the case, then going back to our EEG example, the LSTM may use memory from the EEG of a totally different person for its prediction? Or is there any way for the LSTM to discriminate that the individuals providing the time-series data are different?

    Thank you

  8. Tseo December 9, 2017 at 9:53 pm #

    Hello! Thanks for your posts! Love them!

    I have a question: I want to predict the last n steps in different sequences. I have trained the network with
    Input steps: X(t-n).. X(t-1), X(t)
    Output steps: X(t-n+1).. X(t), X(t+1)

    The loss is almost zero in training and validation, but the next steps t+2, t+3, .., t+n are really bad.

    Maybe I have to use another method? I believe that the network is memorizing the sequences.

    Thanks!

  9. Rui December 21, 2017 at 3:23 am #

    Nice set of ideas! I have read one of your texts about combining conv nets with LSTMs. In that example I understand conv operations were applied to each time sample (in the case of multivariable or image sequences)… I was wondering if it is possible to apply conv operations across the sequence steps?

    Thanks in advance (also for all the quality information given so far ☺)

    • Jason Brownlee December 21, 2017 at 5:29 am #

      Sure. You could pass a batch of time steps through a CNN and the CNN can provide the features from the chunks to an LSTM. Perhaps batched by hour or day. I don’t think it would be effective though. Try it and see.

  10. Rahul January 20, 2018 at 1:36 am #

    I have long sequences of text where all the characters have been changed (in a specific manner). Hence it is not possible for me to use any standard embeddings. As the text is around 1000 characters, it is time consuming to use an LSTM. I’ve tried using a CNN of varying kernel size and then passing it to the LSTM (still it takes a lot of time). What would be a better approach? (I like the random sampling approach; I would split my data into 4 parts and increase my samples.)

    • Jason Brownlee January 20, 2018 at 8:21 am #

      Not sure I have enough info or can get involved deep enough to give you good advice. Sorry.

  11. Daniel January 31, 2018 at 2:27 pm #

    I’m having a really hard time trying to understand how I need to prepare my data to feed the LSTM. In my case, I have a set of texts, each one with one or more labels. To make things easier, let’s suppose I have only three texts:

    1) “i have a green car at my house” (8 words)
    2) “bananas are yellow and apples are green or red” (9 words)
    3) “my name is bond, james bond. thanks for asking my friend” (11 words)

    What I am currently doing is creating a 3D array with shape (3 for samples, 11 for timesteps, which is the number of words in the largest document, 26 for the number of features, which is the number of unique words in all documents).

    The words are tokenized, so each feature will be a 26D array with a “1” indicating the current word.

    The sentences are also tokenized, making each one an 11D array where each row contains one word vector.

    My goal is to feed this 3D array to the model and try to predict labels from other texts.

    The thing is… I have no idea if this is the right way to convert the documents/words. Most examples of LSTMs use preprocessed text data, so I can’t tell what they do with the raw documents. It doesn’t seem like I’m using the timesteps parameter correctly…

    Can you shine a light on this?

    Thank you so much! Really enjoy all your posts!

    • Jason Brownlee February 1, 2018 at 7:15 am #

      Words will need to be mapped to integers. Then padded to the same length. Then mapped to a 3D shape perhaps with some encoding (one hot or word embedding).

      I have many examples on the blog of preparing text for LSTMs, perhaps try the search.

  12. Marco Graziano March 2, 2018 at 2:54 am #

    Has anyone tried 1D convolutions as a preliminary step to reduce very long sequences? Are there examples around to look at?

    • Jason Brownlee March 2, 2018 at 5:36 am #

      Interesting idea.

      Also an autoencoder could be used for the same purposes.

    • DanielDeng April 28, 2018 at 1:08 pm #

      I tried this method at first; it works, but its performance is not better than using only conv layers in the net. So what would be the point of using an RNN on top of a CNN?

  13. Rg March 8, 2018 at 8:12 am #

    Can we train an LSTM on time series where each observation has a different number of timesteps?

    For example:

    First observation: 50 inputs
    Second observation: 60 inputs
    Third observation: 30 inputs
    and so on. The input channels are the same in all of them.

    • Rg March 8, 2018 at 8:23 am #

      Sorry, silly question. I can just use a batch size of 1, I guess. Right?

    • Jason Brownlee March 8, 2018 at 2:53 pm #

      Yes, but you must pad all samples to the same length and use a masking layer to ignore the zero values.
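
      A minimal sketch of that idea, assuming Keras; the layer sizes and toy data are illustrative:

        import numpy as np
        from keras.preprocessing.sequence import pad_sequences
        from keras.models import Sequential
        from keras.layers import Masking, LSTM, Dense

        # three univariate sequences of different lengths
        seqs = [[0.1, 0.2, 0.3, 0.4, 0.5], [0.1, 0.2], [0.3, 0.2, 0.1]]

        # zero-pad to a common length, then add a features dimension: (3, 5, 1)
        X = pad_sequences(seqs, maxlen=5, padding='post', dtype='float32')
        X = X.reshape((X.shape[0], X.shape[1], 1))
        y = np.array([0, 1, 0])

        model = Sequential()
        model.add(Masking(mask_value=0.0, input_shape=(5, 1)))  # skip zero-padded steps
        model.add(LSTM(16))
        model.add(Dense(1, activation='sigmoid'))
        model.compile(loss='binary_crossentropy', optimizer='adam')
        model.fit(X, y, epochs=2, verbose=0)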
