Techniques to Handle Very Long Sequences with LSTMs

Long Short-Term Memory or LSTM recurrent neural networks are capable of learning and remembering over long sequences of inputs.

LSTMs work very well if your problem has one output for every input, like time series forecasting or text translation. But LSTMs can be challenging to use when you have very long input sequences and only one or a handful of outputs.

This is often called sequence classification (and sometimes, more loosely, sequence labeling).

Some examples include:

  • Classification of sentiment in documents containing thousands of words (natural language processing).
  • Classification of an EEG trace of thousands of time steps (medicine).
  • Classification of coding or non-coding genes for sequences of thousands of DNA base pairs (bioinformatics).

These so-called sequence classification tasks require special handling when using recurrent neural networks, like LSTMs.

In this post, you will discover 6 ways to handle very long sequences for sequence classification problems.

Let’s get started.

1. Use Sequences As-Is

The starting point is to use the long sequence data as-is without change.

This may result in the problem of very long training times.

More troubling, attempting to back-propagate across very long input sequences may result in vanishing gradients, and in turn, an unlearnable model.

A reasonable limit of 250-500 time steps is often used in practice with large LSTM models.

2. Truncate Sequences

A common technique for handling very long sequences is to simply truncate them.

This can be done by selectively removing time steps from the beginning or the end of input sequences.

This will allow you to force the sequences to a manageable length at the cost of losing data.

The risk of truncating input sequences is that data the model needs in order to make accurate predictions is lost.
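
As a minimal sketch of truncation, assuming each sequence is a plain Python list and that you pick the target length yourself, the hypothetical helper `truncate` below keeps either the first or the last `max_len` time steps. Keras offers the same choice through the `truncating` argument ("pre" or "post") of its `pad_sequences` utility.

```python
def truncate(sequence, max_len, keep="end"):
    """Return at most max_len time steps from the start or the end of a sequence."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    if len(sequence) <= max_len:
        return list(sequence)
    if keep == "end":
        return list(sequence[-max_len:])   # drop the oldest time steps
    if keep == "start":
        return list(sequence[:max_len])    # drop the most recent time steps
    raise ValueError("keep must be 'start' or 'end'")

sequence = list(range(10))  # stand-in for a long input sequence
print(truncate(sequence, 4, keep="end"))    # [6, 7, 8, 9]
print(truncate(sequence, 4, keep="start"))  # [0, 1, 2, 3]
```

Which end to keep depends on the domain: for a forecasting-style signal the most recent steps usually matter most, while for documents the opening may carry more of the signal.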

3. Summarize Sequences

In some problem domains, it may be possible to summarize the input sequences.

For example, in the case where input sequences are words, it may be possible to remove all words from input sequences that are above a specified word frequency (e.g. “and”, “the”, etc.).

This could be framed as keeping only the observations whose frequency rank across the entire training dataset is greater than some fixed value (with rank 1 being the most frequent word), which drops the most common words.

Summarization may result in both focusing the problem on the most salient parts of the input sequences and sufficiently reducing the length of input sequences.
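
The frequency-based summarization described here can be sketched with the standard library alone. The function name `drop_most_frequent`, the `top_k` threshold, and the two toy documents are assumptions made for illustration; in practice the counts would come from the whole training dataset.

```python
from collections import Counter

def drop_most_frequent(documents, top_k):
    """Remove the top_k most frequent words (counted over all documents) from each document."""
    counts = Counter(word for doc in documents for word in doc)
    too_common = {word for word, _ in counts.most_common(top_k)}
    return [[word for word in doc if word not in too_common] for doc in documents]

documents = [
    "the movie was slow and the ending was weak".split(),
    "the cast was great and the script was sharp".split(),
]
shortened = drop_most_frequent(documents, top_k=3)
print(shortened)
# [['movie', 'slow', 'ending', 'weak'], ['cast', 'great', 'script', 'sharp']]
```

The threshold needs care, because meaningful words can appear fairly high in the frequency ranking.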

4. Random Sampling

A less systematic approach may be to summarize a sequence using random sampling.

Random time steps may be selected and removed from a sequence in order to reduce it to a specific length.

Alternatively, random contiguous subsequences may be selected to construct a new sampled sequence of the desired length, taking care to handle overlap or non-overlap as required by the domain.

This approach may be suitable in cases where there is no obvious way to systematically reduce the sequence length.

This approach may also be used as a type of data augmentation scheme in order to create many possible different input sequences from each input sequence. Such methods can improve the robustness of models when available training data is limited.
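
Both flavors of random sampling can be sketched in a few lines. The helper names `random_drop` and `random_window` are hypothetical, and the fixed seed is only there to make the sketch repeatable.

```python
import random

def random_drop(sequence, target_len, rng):
    """Keep target_len randomly chosen time steps, preserving their original order."""
    if len(sequence) <= target_len:
        return list(sequence)
    keep = sorted(rng.sample(range(len(sequence)), target_len))
    return [sequence[i] for i in keep]

def random_window(sequence, target_len, rng):
    """Return one random contiguous subsequence of target_len time steps."""
    if len(sequence) <= target_len:
        return list(sequence)
    start = rng.randint(0, len(sequence) - target_len)  # both bounds inclusive
    return list(sequence[start:start + target_len])

rng = random.Random(1)  # fixed seed so the sketch is repeatable
sequence = list(range(100))  # stand-in for a long input sequence

sampled = random_drop(sequence, 10, rng)
# Several different windows drawn from one sequence can serve as augmented samples
windows = [random_window(sequence, 10, rng) for _ in range(3)]
print(sampled)
print(windows)
```

Calling `random_window` repeatedly on the same sequence is one simple way to get the augmentation effect mentioned above; the windows here are allowed to overlap.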

5. Use Truncated Backpropagation Through Time

Rather than updating the model based on the entire sequence, the gradient can be estimated from a subset of the last time steps.

This is called Truncated Backpropagation Through Time, or TBPTT for short. It can dramatically speed up the learning process of recurrent neural networks like LSTMs on long sequences.

This allows the entire sequence to be provided as input and processed in the forward pass, while only the last tens or hundreds of time steps are used to estimate the gradients and update the weights.

Some modern implementations of LSTMs permit you to specify the number of time steps to use for updates separately from the number of time steps used as input; one example is the ‘truncate_gradient’ parameter in Theano.
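
To make the idea concrete, here is a toy sketch of truncated backpropagation on a scalar recurrent unit rather than a full LSTM. The recurrence h_t = tanh(w*h_{t-1} + u*x_t), the squared-error loss on the final state, and the names `forward` and `truncated_grad_w` are all assumptions chosen to keep the example small. The forward pass runs over every time step, but the gradient is propagated back through only the last `k` steps.

```python
import math

def forward(xs, w, u):
    """Run h_t = tanh(w*h_{t-1} + u*x_t) over the whole sequence; hs[t] is the state after t inputs."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def truncated_grad_w(xs, y, w, u, k):
    """Gradient of (h_T - y)^2 with respect to w, back-propagated through only the last k steps."""
    hs = forward(xs, w, u)          # full forward pass over all time steps
    T = len(xs)
    dh = 2.0 * (hs[T] - y)          # dLoss/dh_T
    grad = 0.0
    for t in range(T, max(T - k, 0), -1):
        dpre = dh * (1.0 - hs[t] ** 2)  # back through tanh at step t
        grad += dpre * hs[t - 1]        # contribution of w at step t
        dh = dpre * w                   # gradient flowing to h_{t-1}
    return grad

xs = [0.5, -0.1, 0.3, 0.8, -0.4]
w, u, y = 0.7, 0.5, 0.2
full = truncated_grad_w(xs, y, w, u, k=len(xs))  # ordinary BPTT
short = truncated_grad_w(xs, y, w, u, k=2)       # truncated to the last 2 steps
print(full, short)
```

With `k` equal to the sequence length this reduces to ordinary backpropagation through time; smaller values of `k` give a cheaper, approximate gradient.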

6. Use an Encoder-Decoder Architecture

You can use an autoencoder to learn a new, shorter representation of long sequences, then a decoder network to interpret the encoded representation into the desired output.

This may involve an unsupervised autoencoder as a pre-processing pass on sequences, or the more recent encoder-decoder LSTM style networks used for natural language translation.

Again, there may still be difficulties in learning from very long sequences, but the more sophisticated architecture may offer additional leverage or skill, especially if combined with one or more of the techniques above.

Honorable Mentions and Crazy Ideas

This section lists some additional ideas that are not fully thought through.

  • Explore splitting the input sequence into multiple fixed-length subsequences and train a model with each subsequence as a separate feature (e.g. parallel input sequences).
  • Explore a Bidirectional LSTM where each LSTM in the pair is fit on half of the input sequence and the outcomes of each layer are merged. Scale from 2 to more to suitably reduce the length of the subsequences.
  • Explore using sequence-aware encoding schemes, projection methods, and even hashing in order to reduce the number of time steps in less domain-specific ways.
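
As a rough sketch of the first idea in this list, the hypothetical helper `to_parallel_features` below splits one long univariate sequence into equal-length chunks and stacks them so that each chunk becomes a separate feature. It assumes the sequence length is divisible by the number of chunks.

```python
def to_parallel_features(sequence, n_parts):
    """Split a sequence into n_parts equal chunks and stack them as features per time step."""
    if len(sequence) % n_parts != 0:
        raise ValueError("sequence length must be divisible by n_parts")
    part_len = len(sequence) // n_parts
    parts = [sequence[i * part_len:(i + 1) * part_len] for i in range(n_parts)]
    # Row t holds time step t of every chunk, giving shape [part_len, n_parts]
    return [list(step) for step in zip(*parts)]

sequence = list(range(12))  # stand-in for one long input sequence
print(to_parallel_features(sequence, 3))
# [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
```

A 12-step sequence becomes 4 time steps with 3 features each, so the recurrent layer sees a third as many steps per sample.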

Do you have any crazy ideas of your own?
Let me know in the comments.

Summary

In this post, you discovered how you can handle very long sequences when training recurrent neural networks like LSTMs.

Specifically, you learned:

  • How to reduce sequence length using truncation, summarization, and random sampling.
  • How to adjust learning to use Truncated Backpropagation Through Time.
  • How to adjust the network architecture to use an encoder-decoder structure.

Do you have any questions?
Ask your questions in the comments and I will do my best to answer.

93 Responses to Techniques to Handle Very Long Sequences with LSTMs

  1. Avatar
    Pranav Goel June 27, 2017 at 1:32 am #

    Hello Jason,

    Thank you for this blog post! I think TBPTT is probably a very suitable method. Do you know if there is something equivalent to the ‘truncate_gradient’ parameter in Theano available for Keras?

    Thanks!

  2. Avatar
    Owen June 27, 2017 at 5:03 am #

    “For example, in the case where input sequences are words, it may be possible to remove all words from input sequences that are below a specified word frequency.”

    I think it may be the opposite. Words with extremely low frequencies may be the core of the sentence, and the ones with high frequencies might just be meaningless. For example, ‘the’ is no doubt the most frequent word, but it can’t tell you anything. However, words like ‘covfefe’ might be important.

    • Avatar
      Jason Brownlee June 27, 2017 at 8:36 am #

      I agree that was an error, fixed. Thanks Owen!

      • Avatar
        Michel Lemay July 4, 2017 at 1:09 am #

        In practice, both ends of the frequency distribution need to be truncated. Common sense dictates that very high-frequency words tend to be stopwords. However, meaningful words start to appear early on in this list, so one should take care when defining a good threshold for the top terms. At the opposite end, in the long tail, most of the lower-frequency terms fall into categories like misspelled words, obscure jargon, unique IDs or product numbers, etc. Here, we use several approaches to clean up the list: compare local IDF to generic English IDF and select words that have significant divergence, use an unsupervised algorithm to test the usefulness of a word on a given task by removing it, etc.

  3. Avatar
    Ben June 30, 2017 at 5:01 am #

    Hi Jason
    Thanks for the web site
    Do you have an example on symbolic sequences prediction using LSTM?
    Thanks

    • Avatar
      Jason Brownlee June 30, 2017 at 8:18 am #

      What do you mean exactly, can you give an example?

      • Avatar
        Ben June 30, 2017 at 6:24 pm #

        Thanks for your reply.
        For example, having this sequence:
        1,2,2,3,1,6,4,5,7,1,3,2,2
        What symbol can we expect in the future?

        Thanks

        • Avatar
          Jason Brownlee July 1, 2017 at 6:30 am #

          Yes, I would recommend encoding the integers with a one hot encoding as long as you know the scope of input/output.

  4. Avatar
    buzznizz August 4, 2017 at 9:39 pm #

    Hi Jason

    Do you have experience transforming time series to the time-frequency domain using wavelets and using that as input instead?

    BR

    • Avatar
      Jason Brownlee August 5, 2017 at 5:45 am #

      Not with LSTMs, sorry. Thanks for the suggestion.

  5. Avatar
    allonbrooks November 21, 2017 at 8:25 pm #

    If I have a sentence with a length of more than 2000 and I want to train a sentiment analysis model with an LSTM, how should I handle this situation?

    • Avatar
      Jason Brownlee November 22, 2017 at 11:11 am #

      Try each method in this post and see what results in models with the best skill.

  6. Avatar
    Saeid November 22, 2017 at 2:16 pm #

    Hi Jason,

    I have a question about timesteps. In all of my samples, the sequences are in the form of:

    X1 Y1 Z1 X2 Y2 Z2 X3 Y3 Z3 … … … X100 Y100 Z100

    Then in my input shape, I assume the timesteps would be 100 and features=3. Is that right, or is the number of timesteps 300 in this case?

    • Avatar
      Jason Brownlee November 23, 2017 at 10:25 am #

      Not sure I follow, what do you want to predict, what are the inputs and outputs you want to model?

      • Avatar
        Saeid November 26, 2017 at 11:43 am #

        So I have a classification problem. My labeled data features are x-y coordinates and direction angle(theta) of trajectories of some people walking in a room. The data are time series and there are 2000 people(examples). Labels are 0 and 1 meaning whether a person has a specific intention or not. In other words, the tabular representation of the data would be:

        X1, Y1, theta1, X2 , Y2 , theta2 , … ,X100,Y100,Theta100, Label
        Person1 1 -1 3 1.1 -1.1 3.1 ….. 10 -10 , 3.9 0

        Person2 5 9 -2 5.1 9.1 -2.1 …. 15 19 , -2.6 1
        …. … …. … … … … … …. … 0
        ….. …. …. … … …. … … … … 1

        Person2000 …. …. … … … … … … …. 0

        Since there are three features and I have a history of these features for each example, I am trying to model this problem in LSTM. So far I have implemented one LSTM layer and one Dense layer with sigmoid activation. But my major concern is whether I am reshaping my input correctly or not. So my input shape, let’s say for the table above is (2000,100,3) but I was thinking maybe timestep is 100*3?

        I hope I could explain well and would be grateful to know your idea about the input shape and if you have any suggestions for me in general.

        Thanks

  7. Avatar
    JJ November 23, 2017 at 10:36 am #

    Hi Jason,

    I’ve been following your blog for a while, and find it very helpful. I have some confusion regarding LSTMs that the different sources available do not really address, and I’m hoping that you could help, given your very clear explanations.

    My confusion is related to the ‘memory’ of an LSTM and how the data is fed into it with real-world examples. So let’s say I’m classifying a one-channel EEG of a person (N people in total). Let’s say we have 100 time-points (to keep it simple) for each person and a label which says whether the person is attentive or not during those 100 time-points. For a new set of time-points I want to say whether the person is attentive or not.

    If I take an LSTM of 20 time-steps, then what is the memory based on? Does it mean it can only remember what happened within those 20 time-steps? Or would it keep the memory of a previous chunk of time-steps fed into it? And do training examples now simply become 20-time-step chunks, with the LSTM able to recall things from previous chunks? Because when people talk about, say, predicting the next word at the end of a document after passing the whole document through an LSTM, they mention that LSTMs are good because they can even remember words at the beginning of the text to make such a prediction. So it seems to me it has memory from previous time-step chunks.

    The confusion comes here for me: if that is the case, then going back to our EEG example, the LSTM may use memory from the EEG of a totally different person for its prediction? Is there anything that tells the LSTM that the individuals providing the time-series data are different?

    Thank you

  8. Avatar
    Tseo December 9, 2017 at 9:53 pm #

    Hello! Thanks for your posts! Love them!

    I have a question: I want to predict n last steps in different sequences.. I have trained the network with
    Input steps: X(t-n).. X(t-1), X(t)
    Output steps: X(t-n+1).. X(t), X(t+1)

    The loss is almost zero in training and validation but the next steps t+2,t+3..t+n are really bad..

    Maybe I have to use another method? I believe that the network is memorizing the sequences.

    Thanks!

  9. Avatar
    Rui December 21, 2017 at 3:23 am #

    Nice set of ideas! I have read one of your texts about combining conv nets with LSTMs. In that example I understand conv operations were applied to each time sample (in the case of multivariable or image sequences)… I was wondering if it is possible to apply conv operations across the sequence steps?

    Thanks in advance (also for all the quality information given so far☺)

    • Avatar
      Jason Brownlee December 21, 2017 at 5:29 am #

      Sure. You could pass a batch of time steps through a CNN and the CNN can provide the features from the chunks to an LSTM. Perhaps batched by hour or day. I don’t think it would be effective though. Try it and see.

  10. Avatar
    Rahul January 20, 2018 at 1:36 am #

    I have long sequences of text where all the characters have been changed (in a specific manner). Hence it is not possible for me to use any standard embeddings. As the text is around 1000 characters, it is time-consuming to use an LSTM. I’ve tried using CNNs of varying kernel size and then passing the output to the LSTM (it still takes a lot of time). What would be a better approach? (I like the random sampling approach; I would split my data into 4 parts and increase my samples.)

    • Avatar
      Jason Brownlee January 20, 2018 at 8:21 am #

      Not sure I have enough info or can get involved deep enough to give you good advice. Sorry.

  11. Avatar
    Daniel January 31, 2018 at 2:27 pm #

    Im having a really hard time trying to understand how i need to prepare my data to feed the LSTM. In my case, i have a set of texts, each one with one or more labels. To make things easier, lets suppose i have only three texts:

    1) “i have a green car at my house” (8 words)
    2) “bananas are yellow and apples are green or red” (9 words)
    3) “my name is bond, james bond. thanks for asking my friend” (11 words)

    What i am currently doing is create a 3D array with shape ( 3 for samples, 11 for timesteps, which is the number of words in the largest document, 26 for number of features, which is the number of unique words in all documents).

    The words are tokenized, so each feature will be a 26D array with a “1” indicating the current word.

    The sentences are also tokenized, making each one a 11D array where each row contains one word vector.

    My goal is to feed this 3D array to the model and try to predict labels from other texts.

    The thing is… i have no idea if this is the right way to convert the documents/words. Most examples of LSTMs use preprocessed text data, so i cant tell what they do with the raw documents. It doesn’t seem like im using the timesteps parameter correctly…

    Can you shine a light on this?

    Thank you so much! Really enjoy all your posts!

    • Avatar
      Jason Brownlee February 1, 2018 at 7:15 am #

      Words will need to be mapped to integers. Then padded to the same length. Then mapped to a 3D shape, perhaps with some encoding (one hot or word embedding).

      I have many examples on the blog of preparing text for LSTMs, perhaps try the search.

  12. Avatar
    Marco Graziano March 2, 2018 at 2:54 am #

    Has anyone tried 1D convolutions as a preliminary step to reduce very long sequences? Are there examples around to look at?

    • Avatar
      Jason Brownlee March 2, 2018 at 5:36 am #

      Interesting idea.

      Also an autoencoder could be used for the same purposes.

    • Avatar
      DanielDeng April 28, 2018 at 1:08 pm #

      I tried this method at first. It works, but its performance is no better than using only conv layers in the net. So what would be the point of using an RNN on top of a CNN?

  13. Avatar
    Rg March 8, 2018 at 8:12 am #

    Can we train an LSTM on time series where each observation has a different number of timesteps?

    For example:

    First observation: 50 inputs
    Second observation: 60 inputs
    Third observation: 30 inputs
    and so on. The input channels are the same in all of them.

    • Avatar
      Rg March 8, 2018 at 8:23 am #

      Sorry, silly question. I can just use a batch size of 1, I guess. Right?

    • Avatar
      Jason Brownlee March 8, 2018 at 2:53 pm #

      Yes, but you must pad all samples to the same length and use a masking layer to ignore the zero values.

  14. Avatar
    Andrew Medlin July 19, 2018 at 6:12 pm #

    I have read several posts like this one about training an LSTM on multivariate time series data, and have a kinda working LSTM implementation, but none of the articles have quite addressed my questions on the training strategy for my kind of data.

    My data is medical ECG data for a large number of people (over 1000). For each person, I have a time series of several thousand measurements, each sample consisting of 12 variables (features). The measurements are taken at a constant sampling rate, so in my input I can drop the time value.

    Each ECG sample length is different. For some people we might have 1000 time steps in the series, for another person we might have 1500, etc.. Each series is a different length.

    The expected output (label) data provided for the training consists of a probability distribution representing a classification for each time step. For now, I simply have two classes, so my label is one-hot encoded as either [0 1] or [1 0]. One of these classes is much more predominant in the labels than the other.

    I reckon a time window of 25 to 30 samples should be sufficient to capture the temporal correlations within a given series, so this is the number of time_steps in my LSTM input. My number of samples, > 1000, is too large to train on all at once, so I choose a batch size of 256. My shaped LSTM input therefore has shape (256, 25, 12).

    My question is about how to properly subsample these time series for training. Since my series are of variable length, I can’t uniformly shape each sequence. For now, I have been taking random 25 time_steps (sequential) from 256 random people and using that as one training batch, and repeating until the loss is “low enough”.

    Should I instead be using a sliding window on each sample series and train on those? Or should I be subsampling each series and lining up the subsamples (sequential or otherwise) as rows? I feel like I don’t want to have one training batch have only data from one person, for fear that the network may overlearn for that person and then the loss will jump back up when the person changes.

    • Avatar
      Jason Brownlee July 20, 2018 at 5:55 am #

      Hi Andrew,

      The first issue is that the 0th dimension of the shape is not the batch size, but the total number of samples. The batch size is only specified to the model during training; it is not used to reshape the data. The input data must have the shape [samples, timesteps, features].

      A total of 25 time steps seems short for a sequence that has 1500 steps. Perhaps try 200 or 400 as well?

      You can split each sequence into multiple sub-samples, e.g. 1500 steps could be 3 samples of 500 timesteps each.

      As a classification problem, I’d encourage you to try a CNN-LSTM or even a straight CNN or ConvLSTM. I have found them to achieve state of the art results for time series classification and I have posts scheduled on the topic.

      Does that help?

      • Avatar
        Andrew Medlin July 20, 2018 at 9:17 am #

        That definitely helps, thank you. If I have more questions on this application, I’ll follow up in this thread.

      • Avatar
        VISWA February 11, 2020 at 1:28 am #

        @jason. I am also working on a similar Sequence Prediction problem as the ECG. But my dataset varies 6500 timesteps to 20k timesteps for each sequence with 4 features.

        I came to an idea of training the network,

        By Splitting into subsequence and assign the single label available to each subsequence to a Stateful LSTM.

        Training by using a single sequence for each fit separately (Retraining model for each sequence separately) and resetting lstm state after each sequence fit.

        Or is there any other way to reset the LSTM state in between?

        • Avatar
          Jason Brownlee February 11, 2020 at 5:14 am #

          Nice work.

          You can use a stateful LSTM and manually control when state is reset.

  15. Avatar
    Nick September 4, 2018 at 10:42 pm #

    Thank you for this blog post Jason!

    Just wanted to ask if you have seen an example code for “Classification of sentiment in documents containing thousands of words (natural language processing)”.

    My idea is to classify long text (news) to predict an event (binary classification) with the help of sequential models in Keras. I haven’t found any work on that and I am wondering if it makes sense from a technical/statistical point of view?

    • Avatar
      Jason Brownlee September 5, 2018 at 6:40 am #

      Sure, although every effort is often used to reduce the size of the input to make the model faster to train.

      CNNs are fast and perform well, even state of the art.

      I have many tutorials, search the blog.

  16. Avatar
    Faisal October 8, 2018 at 9:23 pm #

    Hey, I am using health data for classification using deep learning techniques. Which hybrid model can I use for the time series data I have? Can I use a CNN LSTM? My data will not be labeled, so which hybrid deep learning technique can be used to classify unlabeled time series data?

    • Avatar
      Jason Brownlee October 9, 2018 at 8:43 am #

      I recommend testing a suite of methods in order to discover what works best for your specific dataset.

  17. Avatar
    Renan Cunha October 18, 2018 at 10:36 pm #

    Would 600 time steps in a seq2seq task be feasible for an LSTM?

  18. Avatar
    Aaron March 12, 2019 at 11:31 am #

    Do you have any papers / studies on the sequence limitations? Long training times are almost obvious, but vanishing gradient – why would a long sequence cause the gradient to diverge?

  19. Avatar
    nandini March 19, 2019 at 4:42 pm #

    What is the difference between dropout and rnn_dropout in a Keras layer? How can I decide the amount of dropout in the network? Is there any process for deciding how much to use in a layer?

    Thanks in advance

  20. Avatar
    M.Hamadan Ghani April 28, 2019 at 1:53 pm #

    “Explore a Bidirectional LSTM where each LSTM in the pair is fit on half of the input sequence and the outcomes of each layer are merged. Scale from 2 to more to suitably reduce the length of the subsequences.”

    Hello Jason, but how can I implement this process of fitting each LSTM in the pair on half of the input sequence?

    Thank you in advance.

  21. Avatar
    Paul Wandeseer July 26, 2019 at 4:00 am #

    Hi, do you have any papers that showcase the length-wise limitations of RNNs/LSTMs ? I cannot find any good references, everybody always just writes smth. like “in practice, a limit of 500 timesteps is a good rule of thumb”.

  22. Avatar
    wezen August 30, 2019 at 12:21 am #

    Hi Jason.
    If you use TimeseriesGenerator to fit an LSTM, is it necessary to truncate sequences or use another sample separation technique? Or does the generator do it for you?
    Thanks

    • Avatar
      Jason Brownlee August 30, 2019 at 6:26 am #

      Yes, I don’t think it makes a difference.

      • Avatar
        wezen August 30, 2019 at 7:33 pm #

        Thank you and thank you very much for the blog. It is very useful

        • Avatar
          Jason Brownlee August 31, 2019 at 6:03 am #

          Thanks, you’re very welcome – I’m glad it helps.

  23. Avatar
    Gordon Kwok October 4, 2019 at 8:58 pm #

    Hi Jason.
    I am trying to classify malware by the order in which it calls APIs. The problem is that some samples have really long sequences (the longest one reached 1.7 million calls). Should I use a CNN or FastText? Is there any algorithm to summarize the sequences?
    Look forward to your reply.

    • Avatar
      Jason Brownlee October 6, 2019 at 8:10 am #

      Perhaps you can use domain expertise to split, truncate or re-define the long sequences?

      • Avatar
        Gordon Kwok October 14, 2019 at 10:35 pm #

        Thank you very much! I will try it as you said.

  24. Avatar
    Kai Gu November 7, 2019 at 3:52 am #

    Thank you for your tutorials, which are very useful and easy to understand for a beginner.

    I am dealing with time series data from a “triggering system”:

    The system is triggered by a pulse signal and records a piece of 1-D data of a fixed length and its time stamp.

    Pulses of different kinds show different periodic patterns, so I want to classify the different kinds of pulses with an LSTM network.

    I am having trouble processing the input for the LSTM network.

    The time stamps have sub-nanosecond accuracy. I am thinking about dividing the timeline into nanoseconds and filling the slots without pulses with zeros. But the repeat cycle of the pulses is about several milliseconds (so I would have a time step for the LSTM of around several milliseconds), in which case I would have a very long input for the network and the input can be very sparse. (Some pulses can happen very frequently, with an interval of a few nanoseconds, so I can’t divide the timeline more coarsely into 10 nanoseconds or microseconds.)

    Do you have any experience or suggestions on similar problem? thank you in advance!

    • Avatar
      Jason Brownlee November 7, 2019 at 6:47 am #

      Not really, perhaps brainstorm a suite of different framings of the problem and test each to see which might best expose the structure of the problem to the learning algorithm.

      Compare results to naive models like persistence and be sure to test alternatives like an MLP, CNN and hybrids.

      • Avatar
        Kai Gu November 8, 2019 at 12:57 am #

        thank you for your reply!

        actually I have already implemented some simple MLP CNN structure to dig some feature within the waveform. and it works just ok.

        but the domain knowledge and experts told me that actually I am missing some macro periodic pattern by just analyzing the 1-d data slices. that’s why I am thinking about to include those time information by using an LSTM network.

        but still, thank you very much for the suggestion!

  25. Avatar
    Austin Bernard January 29, 2020 at 2:18 pm #

    I have a problem where I have data that is made up of weather and temperature inputs for every day of the year. And my goal is to predict when a certain flower will bloom for that given year. Does this problem fall into these type of categories. I’ve been having trouble with this problem for a while and can’t seem to find a solution. Because I have a pretty long input sequence of 365 days and one output which is just the day of the blooming. Any help to point me in the right direction would be appreciated. I’m still a beginner in machine learning and I’m not sure my problem really falls in this category or not.

  26. Avatar
    Krithika March 2, 2020 at 9:17 pm #

    I have an ECG time series dataset. I need to forecast the future signal’s label (either a normal signal, ‘0’, or an anomaly, ‘1’), i.e., without forecasting the future signal values, I just need to classify it directly. Can this be done with a CNN or an LSTM or both? I am having difficulty choosing the loss and optimizer for this model. My validation loss and accuracy seem to be constant from the beginning, while my training loss and accuracy reduce for a while and later change to ‘nan’. Do you have any idea how this can be rectified?

  27. Avatar
    Krithika March 2, 2020 at 9:50 pm #

    Hello Jason,

    I am working with ECG time series data that has 2 feature signals. The frequency rate is 250 samples and each signal is annotated as Normal or Anomaly. My task is to forecast the future signal’s label (0-Normal, 1-Anomaly). So, I need not forecast the next signal values, instead, I need to classify whether the future signal is a normal one or an anomaly.

    Which would be a suitable model for this scenario? either an LSTM or CNN or both?
    Also, while training my model, the validation loss and accuracy are always constant from the beginning while the training loss decreases and accuracy increases for a certain time and later changes to nan.

    I am a bit confused in choosing the loss (should it be a binary crossentropy?) and optimizer for my model as well.

    I have a sliding window of length 245, therefore my input is of the form (nx245x2) while my output is (nx1), where my output label is the label of the next signal (245th label value of the next signal), i.e., inp_seq = X_train[0:245], out_label = label[489].

  28. Avatar
    PorkPy March 18, 2020 at 2:42 am #

    Hi Jason,

    Awesome post.

    I am thinking of using a LSTM to input a sequence of robot trajectory data.

    My theory is similar to that of using multiple samples to ascertain velocity and acceleration information, but I need more than that.
    I need the learning agent to learn the importance of taking specific trajectories to reach a position, because different trajectories will yield very different results.

    I plan to use a sequence of robot position data, where the most recent data is the most important and will consist of 5 or so consecutive data points, concatenated with several earlier data points that are further apart from each other and which give insight into the past trajectory.

    Basically, the further you go back in time, the fewer the number of data points needed, because, I assume, as you go further back, the history has less importance on the current position, but is still relevant.

    I’m yet to decide on the best way to chop my data or if a LSTM is a suitable choice for this problem.
    It would be nice if there was a LSTM that accepted all of the trajectory data, but had a hyperparameter that allowed me, or itself, to tune how important past data is.

    Any thoughts?

    Cheers,

  29. Avatar
    Sophia June 24, 2020 at 3:24 am #

    Hello Jason, I have some questions. If I input a sequence to the encoder consisting of LSTM cells, I wanna generate some words by decoder and insert them to the input sequence. Can I connect a decoder to a LSTM cell at t-th time step and connect another decoder to a LSTM cell at 2t-th time step and so on? Maybe, like your crazy idea, “Explore splitting the input sequence into multiple fixed-length subsequences and train a model with each subsequence as a separate feature (e.g. parallel input sequences).”, can i train three encoder-decoders on three subsequence, then concatenate these three generated words with three original inputs for a new sequence?
    And do you know the related information and researches?
    thanks a lot !

    • Avatar
      Jason Brownlee June 24, 2020 at 6:37 am #

      No, I believe you have to do it manually – e.g. construct the samples from data and predictions.

      • Avatar
        Sophia June 24, 2020 at 7:13 am #

        So, is it practical or not to connect a decoder to an LSTM cell at every time step?

      • Avatar
        Sophia June 24, 2020 at 8:33 am #

        Could you tell me the method is right or not? That’s what I have seen in a model architecture, but there is not validation and specific implementation on it. I have no idea for a long time.

        • Avatar
          Jason Brownlee June 24, 2020 at 1:27 pm #

          I don’t know as I don’t really follow, sorry. Perhaps develop a prototype to see if your ideas work?

  30. Avatar
    QAYSAR September 13, 2020 at 12:31 am #

    Hello Sir.
    Can we use LSTM network for EEG classification?
    if yes, then what should be the input format?

  31. Avatar
    John December 30, 2020 at 1:10 am #

    Thanks for sharing! It’s quite useful. I still have a question. If I want to build a prediction model using an LSTM, can the training dataset be generated through random sampling and then sorting the samples?

    • Avatar
      Jason Brownlee December 30, 2020 at 6:41 am #

      You’re welcome.

      Perhaps try it and see for your use case and compare to an MLP.

  32. Avatar
    Noam April 1, 2021 at 4:27 am #

    What if you have many (say 300) inputs, each of length say 10000.

    Could you split each to 100 chunks of 100, and feed them into an LSTM which takes inputs of length 100, then feeds the last c_n and h_n as input to the LSTM on the next batch?

    Meaning, each batch would consist of say 16 samples OF DIFFERENT SEQUENCES, each of 100 time frames, and we would need 100 such batches to finish 16 samples, for a total of 100 * (300/16) batches.

    Does this make any sense?

    • Avatar
      Jason Brownlee April 1, 2021 at 8:24 am #

      Perhaps. Try it and see if it is more effective than truncating, etc.

  33. Avatar
    Pablo August 6, 2023 at 7:45 pm #

    Hello Jackson, thank you for such excellent content. I have a question, in case you have very long data sequences and you are forced to use a windowing process, how do you ensure that the model trains taking into account that each window belongs to a specific sequence/id? And for predicting? I’ve seen various methods such as using controlled batch training and resetting the model’s state once the id changes, but that would imply a batch size of 1 and slow down the training for days. Do you know another way? Thanks again!

  34. Avatar
    Pablo August 6, 2023 at 7:46 pm #

    Sorry, Jason
