Data Preparation for Variable Length Input Sequences

Deep learning libraries assume a vectorized representation of your data.

In the case of variable length sequence prediction problems, this requires that your data be transformed such that each sequence has the same length.

This vectorization allows code to efficiently perform the matrix operations in batch for your chosen deep learning algorithms.

In this tutorial, you will discover techniques that you can use to prepare your variable length sequence data for sequence prediction problems in Python with Keras.

After completing this tutorial, you will know:

  • How to pad variable length sequences with dummy values.
  • How to pad variable length sequences to a new longer desired length.
  • How to truncate variable length sequences to a shorter desired length.

Discover how to develop LSTMs such as stacked, bidirectional, CNN-LSTM, Encoder-Decoder seq2seq and more in my new book, with 14 step-by-step tutorials and full code.

Let’s get started.

Data Preparation for Variable-Length Input Sequences for Sequence Prediction
Photo by Adam Bautz, some rights reserved.

Overview

This section is divided into 3 parts; they are:

  1. Contrived Sequence Problem
  2. Sequence Padding
  3. Sequence Truncation

Environment

This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this example.

This tutorial assumes you have Keras (v2.0.4+) installed with either the TensorFlow (v1.1.0+) or Theano (v0.9+) backend.

This tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help setting up your Python environment, see this post:

Contrived Sequence Problem

We can contrive a simple sequence problem for the purposes of this tutorial.

The problem is defined as sequences of integers. There are three sequences, with lengths ranging from 1 to 4 timesteps.

These can be defined as a list of lists in Python as follows (with spacing for readability):
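For example, a minimal sketch (the specific integer values are illustrative):

```python
# Three contrived sequences with lengths 4, 3, and 1 timesteps
# (the integer values themselves are illustrative).
sequences = [
    [1, 2, 3, 4],
       [1, 2, 3],
             [1],
]
print(sequences)
```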

We will use these sequences as the basis for exploring sequence padding in this tutorial.


Sequence Padding

The pad_sequences() function in the Keras deep learning library can be used to pad variable length sequences.

The default padding value is 0.0, which is suitable for most applications, although this can be changed by specifying the preferred value via the “value” argument. For example:
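A minimal sketch of the “value” argument, assuming the three contrived sequences from above (the import path for pad_sequences differs between Keras versions, so a fallback is used here):

```python
try:
    from keras.utils import pad_sequences  # Keras 2.9+ / Keras 3
except ImportError:
    from keras.preprocessing.sequence import pad_sequences  # older Keras

sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]

# pad with -1.0 instead of the default 0.0
padded = pad_sequences(sequences, value=-1.0)
print(padded)
# [[ 1  2  3  4]
#  [-1  1  2  3]
#  [-1 -1 -1  1]]
```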

The padding to be applied to the beginning or the end of the sequence, called pre- or post-sequence padding, can be specified by the “padding” argument, as follows.

Pre-Sequence Padding

Pre-sequence padding is the default (padding=’pre’).

The example below demonstrates pre-padding the 3 input sequences with 0 values.
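A minimal sketch, assuming the three contrived sequences from above (the pad_sequences import path varies by Keras version):

```python
try:
    from keras.utils import pad_sequences  # Keras 2.9+ / Keras 3
except ImportError:
    from keras.preprocessing.sequence import pad_sequences  # older Keras

sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]

# padding='pre' is the default
padded = pad_sequences(sequences)
print(padded)
# [[1 2 3 4]
#  [0 1 2 3]
#  [0 0 0 1]]
```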

Running the example prints the 3 sequences pre-pended with zero values.

Post-Sequence Padding

Padding can also be applied to the end of the sequences, which may be more appropriate for some problem domains.

Post-sequence padding can be specified by setting the “padding” argument to “post”.
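A minimal sketch, again assuming the three contrived sequences from above:

```python
try:
    from keras.utils import pad_sequences  # Keras 2.9+ / Keras 3
except ImportError:
    from keras.preprocessing.sequence import pad_sequences  # older Keras

sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]

# append zeros to the end of each shorter sequence
padded = pad_sequences(sequences, padding='post')
print(padded)
# [[1 2 3 4]
#  [1 2 3 0]
#  [1 0 0 0]]
```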

Running the example prints the same sequences with zero-values appended.

Pad Sequences To Length

The pad_sequences() function can also be used to pad sequences to a preferred length that may be longer than any observed sequences.

This can be done by specifying the “maxlen” argument to the desired length. Padding will then be performed on all sequences to achieve the desired length, as follows.
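A minimal sketch, padding the contrived sequences out to 5 timesteps:

```python
try:
    from keras.utils import pad_sequences  # Keras 2.9+ / Keras 3
except ImportError:
    from keras.preprocessing.sequence import pad_sequences  # older Keras

sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]

# pad every sequence to 5 timesteps, longer than any observed sequence
padded = pad_sequences(sequences, maxlen=5)
print(padded)
# [[0 1 2 3 4]
#  [0 0 1 2 3]
#  [0 0 0 0 1]]
```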

Running the example pads each sequence to the desired length of 5 timesteps, even though the maximum length of an observed sequence is only 4 timesteps.

Sequence Truncation

The length of sequences can also be trimmed to a desired length.

The desired length for sequences can be specified as a number of timesteps with the “maxlen” argument.

There are two ways that sequences can be truncated: by removing timesteps from the beginning or the end of sequences.

Pre-Sequence Truncation

The default truncation method is to remove timesteps from the beginning of sequences. This is called pre-sequence truncation.

The example below truncates sequences to a desired length of 2.
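A minimal sketch, assuming the three contrived sequences from above (truncating=’pre’ is the default):

```python
try:
    from keras.utils import pad_sequences  # Keras 2.9+ / Keras 3
except ImportError:
    from keras.preprocessing.sequence import pad_sequences  # older Keras

sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]

# keep only the last 2 timesteps of each sequence; shorter sequences are padded
truncated = pad_sequences(sequences, maxlen=2)
print(truncated)
# [[3 4]
#  [2 3]
#  [0 1]]
```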

Running the example removes the first two timesteps from the first sequence, the first timestep from the second sequence, and pads the final sequence.

Post-Sequence Truncation

Sequences can also be trimmed by removing timesteps from the end of the sequences.

This approach may be more desirable for some problem domains.

Post-sequence truncation can be configured by changing the “truncating” argument from the default ‘pre’ to ‘post’, as follows:
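A minimal sketch, again assuming the three contrived sequences from above:

```python
try:
    from keras.utils import pad_sequences  # Keras 2.9+ / Keras 3
except ImportError:
    from keras.preprocessing.sequence import pad_sequences  # older Keras

sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]

# keep only the first 2 timesteps of each sequence; shorter sequences are padded
truncated = pad_sequences(sequences, maxlen=2, truncating='post')
print(truncated)
# [[1 2]
#  [1 2]
#  [0 1]]
```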

Running the example removes the last two timesteps from the first sequence, the last timestep from the second sequence, and again pads the final sequence.

Summary

In this tutorial, you discovered how to prepare variable length sequence data for use with sequence prediction problems in Python.

Specifically, you learned:

  • How to pad variable length sequences with dummy values.
  • How to pad out variable length sequences to a new desired length.
  • How to truncate variable length sequences to a new desired length.

Do you have any questions about preparing variable length sequences?
Ask your questions in the comments and I will do my best to answer.



78 Responses to Data Preparation for Variable Length Input Sequences

  1. Yuya June 19, 2017 at 6:46 am #

    Are there alternative methods which don’t make use of padding as a way to handle sequences of various lengths?

    • Jason Brownlee June 19, 2017 at 8:49 am #

      Yes, some ideas:

      – You could truncate sequences.
      – You could concatenate sequences.
      – You could write your own inefficient implementation of RNNs.

    • Tom June 19, 2017 at 10:54 am #

      If you know the target length, you can try interpolation or warping, right? But they are more costly than just adding zeros.

  2. Sudipta Kar June 23, 2017 at 7:57 am #

    I am interested in how padding affects the performance of models, specifically when the dataset contains both long texts and very short texts. Let’s assume the median length is 30 sentences, max length is 120, and minimum length is 10. How do pre- and post-sequence padding affect the performance?

    • Rahul June 24, 2017 at 3:06 am #

      I am not wrong you usually batch together inputs of similar length, so in this case you would try to have multiple batches of some batch_size and lengths ranging from 30-120

    • Jason Brownlee June 24, 2017 at 7:52 am #

      Padding and Masking is the best, little to no impact. I would recommend testing though.

    • Ayush November 25, 2018 at 5:28 am #

      Read about bucketing. It is used to handle such issues

  3. Rahul June 24, 2017 at 3:06 am #

    If* I

  4. Patrick September 13, 2017 at 2:11 pm #

    Thank you so much for writing these tutorials. They are remarkably effective in explaining concepts that other websites often gloss over. I have a question about padding outputs in sequence-to-sequence classification problems. Let’s say X has the shape (100, 50, 10), and y has the shape (100, 50, 3). X consists of 100 time series, 50 time steps per time series, and 10 features per time step. The y has three possible one-hot encoded classes per timestep. The samples of X have variable length, so the shorter samples are pre-padded with 0. For the y labels corresponding to the pre-padded X time steps, should they be [0, 0, 0]? Or should a new fourth label, [0, 0, 0, 1], be created for the pre-padded time steps, thus changing the shape of y to (100, 100, 4)?

    • Jason Brownlee September 15, 2017 at 11:56 am #

      Thanks Patrick.

      Why do you need to pad the output if each series is classified as one of 3 labels?

  5. Rafiya October 23, 2017 at 9:29 pm #

    Can padding work for variable length feature vectors for concatenating them together?

    • Jason Brownlee October 24, 2017 at 5:31 am #

      Padding can be used to make all independent variable length inputs the same length.

  6. Siddharth November 17, 2017 at 11:40 pm #

    Hi Tom,

    How about variable length sequences with different labels as well? Like assuming I have 3 possible sets of labelled features that occur in different combinations:

    Featureset 1: A, B, C and it is labelled A
    Featureset 2: D , E,F,V and it is labelled D
    Featureset 3: A, F, J, K, L and it is labelled L. The sequence length varies as well as the labels.

    How about i go with Sequence classification with this?
    Should i Train each featuresets with training/test data of different combinations of elements of that sequence?
    If A, B, C are the elements and A is the labelled target, A being the root element of all 3.I will create training data with different combinations of A, B, C and train LSTM to classify it as A no matter in which order the elements in sequence occur.
    Similarly i repeat above for all featuresets..training LSTM individually., as in future, same set of elements can come in varying sequences.and i want my neural net to predict the root element correctly.

    Can we train incrementally like this, or is there another way to handle this scenario (which I’m sure there is :))? Please forgive me if my question seems naive.

    • Siddharth November 17, 2017 at 11:41 pm #

      I’m Sorry, I have wrongly mentioned you as Tom instead of Jason 🙂

    • Jason Brownlee November 18, 2017 at 10:19 am #

      Treat each sequence as a new sample to learn from, pad all to the same length.

  7. Becca February 20, 2018 at 1:34 am #

    Hi Jason,
    thank you very much for your many useful posts. There is one question I can’t find an answer to, and I’m not sure I handle it correctly in my code. Maybe you can help me out?
    The input to my network is of form (num_samples, num_timesteps, num_features), where num_timesteps varies. In order to be able to feed all samples to the network, I prepad with zeros. So, say the maximum number of timesteps is 3, the features are a and b, and the current sample has only 2 timesteps. Then I would prepad [[a1,b1],[a2,b2]] to [[0.,0.],[a1,b1],[a2,b2]]. In this way, all my samples would be of shape (3,2).
    Since keras expects the output to be of the same length as the input, I feel like I also need to pad the output, say from [[y1],[y2]] to [0.,[y1],[y2]]. Is this correct? If it is, does keras ignore the prepadded zero in the output as well? I worry that the network does not ignore it, and thus tries to learn something different than intended.
    Any help would be greatly appreciated!
    Many thanks in advance 🙂

    • Jason Brownlee February 21, 2018 at 6:28 am #

      Yes, sounds good.

      A next step would be to try models that allow model outputs that differ in length from model inputs like the encoder-decoder architecture. I have a number of really nice posts on the topic.

      Let me know how you go.

  8. Nikhil Thakur March 28, 2018 at 7:50 am #

    Hi Jason,

    I have a doubt. What impact does the type of padding have on the model performance for any task, example sentence classification?

    By types of padding, I mean padding all the sequences to the length of the maximum length sentence and padding the sentences to some other length (smaller or greater)

  9. shubham April 1, 2018 at 11:22 pm #

    Is it possible that each batch will have different max_len of sequence and that can be feed to rnn?

    • Jason Brownlee April 2, 2018 at 5:23 am #

      I believe the number of samples can differ between batches, but the batch size will be the same.

  10. Amber April 10, 2018 at 5:44 am #

    Hi Jason, I am a beginner in machine learning and I try to make sense of all of this. Is it true that padding basically makes the dataset more sparse? So, I should keep that in mind while doing the preprocessing?

    • Jason Brownlee April 10, 2018 at 6:26 am #

      Yes, but we can ignore the padding by using a Masking layer.
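      For example, a minimal sketch of zero-padded input with a Masking layer (using tf.keras; the data and layer sizes are illustrative):

```python
import numpy as np
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense

# 3 pre-padded samples, 4 timesteps, 1 feature per timestep
X = np.array([[1, 2, 3, 4], [0, 1, 2, 3], [0, 0, 0, 1]],
             dtype="float32").reshape((3, 4, 1))

model = Sequential([
    Input(shape=(4, 1)),
    Masking(mask_value=0.0),  # timesteps equal to 0.0 are skipped downstream
    LSTM(8),
    Dense(1),
])
model.compile(loss="mse", optimizer="adam")
print(model.predict(X, verbose=0).shape)
```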

  11. Yidnek May 22, 2018 at 5:36 am #

    Dear Jason, thank you for your amusing posts.
    I’m new to LSTM or DL in general, and I’m trying to write a simple POS tagging code in Python using the LSTM. I saw different LSTM resources including your posts, and what seems missing for me is how I can prepare a statement (such as “John is driving his red car.”, or shorter “he is driving a car”) to be an input to the LSTM. Does padding solve such things, especially for POS tagging?
    I wish I could find something practical to read and test.
    Thank you

  12. Emily June 9, 2018 at 3:27 am #

    Hi Jason,

    Can you give an example of how to use masking after padding with zeros? Some layers have masking built in (e.g. in the embedding layer you can set mask_zero = True), but I’ve run into issues where some layers (e.g. Conv1D) don’t allow for masking. How can we ensure that masking occurs throughout the network?

    Thanks!

    • Jason Brownlee June 9, 2018 at 6:56 am #

      Yes, I have a number of examples on the blog but with LSTM layers, not CNN. As you say, CNN do not support it.

      If you have CNNs on the front-end try learning the padded data directly.

  13. christian June 27, 2018 at 6:43 am #

    Hi Jason, you should have a prof not a PhD title with this knowledge. What to do if truncating and padding do not give good results in the model? Usually I train with smaller sequence length data and need a much longer dataset for prediction.

    • Jason Brownlee June 27, 2018 at 8:25 am #

      Perhaps explore other framings of the problem, get creative.

  14. giovanni July 12, 2018 at 10:57 pm #

    Hi Jason, thanks for your amazing tutorials. A question. Let’s say i train the network with fixed input size, i.e, (num_samples, num_timesteps, num_features). May i use a masking input layer only in the recognition phase because of the variable size of the second dimension, e.g., num_timesteps – x ?

    • Jason Brownlee July 13, 2018 at 7:42 am #

      The input dimensions are generally fixed. It is the values within those dimensions that you can mask.

  15. Melina July 30, 2018 at 1:30 am #

    Thanks, this helps!
    However, I’m wondering if/how one can later compute a confusion matrix?
    When shorter sequences are filled up with zeros?
    Aren’t these erroneously treated as right prediction?
    Or what if one normal element (e.g. label) of the sequence is really a zero?

    • Jason Brownlee July 30, 2018 at 5:52 am #

      When padding output sequences, the zero values can be ignored by the evaluation procedure.

  16. Olivier Blais August 12, 2018 at 10:16 pm #

    Hi Jason, great blog by the way!

    I am currently planning a LSTM with time-series data. My specific problem is that I have days without activities, which mean that it is ignored with conventional padding techniques. I was wondering how I could handle this efficiently?

    Thanks in advance
    Olivier

    • Jason Brownlee August 13, 2018 at 6:17 am #

      What’s the problem with ignoring days with no activity?

  17. Neha Sharma August 13, 2018 at 7:31 pm #

    HI Jason,

    With the Many-to-One LSTM for Sequence Prediction from your sequence post, I have tried to implement the model using my data. Below is the code from your page:

    model = Sequential()
    model.add(LSTM(5, input_shape=(5, 1)))
    model.add(Dense(length))
    model.compile(loss='mean_squared_error', optimizer='adam')
    print(model.summary())

    If I train and predict with the train and test having same number of time steps, it works perfect.

    If I want to predict with lesser number of timesteps, the model throws an error.

    So in this case I assume I need to do Pre Sequence Truncation.
    Kindly confirm if this is the right way and acceptable.

    • Jason Brownlee August 14, 2018 at 6:17 am #

      I’m not sure I understand what you are varying and what the outcomes were.

      Perhaps you can provide more explanation?

  18. Julian Arias September 22, 2018 at 11:58 am #

    Hi Jason,

    If I am interested in to training a sequence to sequence architecture, where both input and output sequences are of variable length, should I apply padding and masking in both encoder and decoder layers? If I decided not to use masking, should I apply pre-sequence padding to the encoder and post-sequence padding to the decoder?

    Thank you.

    • Jason Brownlee September 23, 2018 at 6:36 am #

      Yes, use post-sequence padding, input masking.

  19. jack October 12, 2018 at 6:36 am #

    Hi,

    Thank you for the detailed explanations of LSTM. I would like to know if we can train an LSTM model without a sequence length.

  20. Rick October 13, 2018 at 10:21 am #

    Hi Jason,

    I am training multiple time-series of unequal length for walk-forward prediction. My time-series also have variable starting points.

    With regard to (samples, timesteps, features),
    Sample: I am treating every time-series as a sample.
    Timesteps: I am using the maximum length as the window to capture all the information for that single time-series.

    My data already has 0s which have a specific significance. So that the zero-padding does not interfere with my data, I am using masking instead of zero-padding.

    Do you foresee any problems with my model?

    Regard,
    Rick

    • Jason Brownlee October 14, 2018 at 5:58 am #

      Sounds good to me Rick!

      • Rick October 30, 2018 at 1:29 am #

        Hi Jason,

        Thanks for your previous reply.

        With regard to the problem I mentioned above, I am a bit worried about whether the sequenced nature of the LSTM is being taken into account if I am treating each time-series as a sample.

        I am training multiple time-series for walk-forward prediction.

        With regard to (samples, timesteps, features),
        Sample: I am treating every time-series as a sample.
        Timesteps: I am using the maximum length as the window to capture all the information for that single time-series.

        Suppose I have 4 timeseries and each has 76 steps. So, if I use the first 75 steps as my X and step 76 as my Y, for the training model, the model does not learn the sequential nature of the 75 steps. So, the input is (4,75,1) and output is (4,)

        X Y

        1 2 3 4 5 6 7 8 9 …………………………….75 (timeseries 1) 76
        1 2 3 4 5 6 7 8 9 ……………………………75 (timeseries 2) 76
        1 2 3 4 5 6 7 8 9 ……………………………75 (timeseries 3) 76
        1 2 3 4 5 6 7 8 9 ……………………………75 (timeseries 4) 76

        On the other hand, if I arrange my data for 1 timeseries in the below form, step 1 predicts 2, Step 1-2 predict 3, step 1-2-3 predict 4…..and then step 1-75 predict 76. This takes into account the sequential nature of the timeseries but now, I cannot apply it to multiple timeseries at the same time. I need the timeseries to be able to predict together at the same time and also generalize acorss each other (if there is some general trends across the time series, they must be captured by the model). So, what should I do, is there any way to do this using a simple LSTM? Is there any other sequence model architecture that works well here?

        X Y
        ————————————1 2
        ———————————1 2 3
        —————————–1 2 3 4
        ……………………………………… ..
        …………………………………….. ..
        1 2 3…………………………. 75 76

        Regards,
        Rick

        • Jason Brownlee October 30, 2018 at 6:07 am #

          Generally LSTMs are poor at time series forecasting, so I would encourage you to try an MLP and CNN, as well as linear methods such as ETS and SARIMA as a baseline.

          Ensure you’re using a masking layer when padding all sequences to the same length.
          Perhaps vary the configuration of the model to see if that is an issue.
          Perhaps you can use the walk-forward validation approach and predict all series together (e.g. multivariate prediction).

          I hope that helps as a start.

  21. Lilli November 9, 2018 at 12:28 am #

    Hi Jason,
    Great post. thank you.
    I have a problem and I am wondering if it is the right way to pad 0 as you mentioned.
    I have a dataset of over 5000 videos ranging from 17 frames to 71 frames of 32×48 grayscale. Imagine if I pad 0 to all the data to have the fixed size of 71*32*48; then the LSTM or CNN has to calculate through a bunch of 0s, which costs too much computing and actually does nothing since it’s just padded frames. Is there any way I could process those data without padding? Or is there any sense of padding I’m missing?
    Regards,
    Lilli

    • Jason Brownlee November 9, 2018 at 5:25 am #

      You can use a Masking layer to ignore padded values for those models that support Masking.

      • Lilli November 22, 2018 at 1:01 am #

        thank you 🙂

  22. HW November 27, 2018 at 3:08 am #

    Thank you very much for your consistently high clarity. A question: Suppose the underlying sequence is not numerical, but consists of natural-language words and punctuation symbols. How should the punctuation symbols/marks be accounted for while calculating the length?

  23. Arjun December 7, 2018 at 4:51 pm #

    I actually have a doubt, it may sound silly. Why is the padding required? While googling I found out that it might be helpful for batch operations. If so what are some possible batch operations we perform and what if I do not pad the input before feeding it to the neural network?

    • Jason Brownlee December 8, 2018 at 6:59 am #

      You can use a dynamic RNN where padding is not required.

      It can be more efficient to put all data into fixed sized vectors when training/evaluating.

  24. Abey December 13, 2018 at 1:15 am #

    Dear Jason,

    Thank you once again for the amazing blog.

    I have a question and I’m hoping you can clarify it for me.

    Here is how my dataframe looks.

    Col1 Col_2…, Col_j, Col_M
    [A_11, A_12, …, A_ij,…, A_1M]
    [A_21, A_22, …, A_ij,…, A_2M]
    . . .
    [A_N1, A_N2, …, A_ij,…, A_PM]

    [B_11, B_12, …, B_ij,…, B_1M]
    [B_21, B_22, …, B_ij,…, B_2M]
    . . .
    [B_N1, B_N2, …, B_ij,…, B_QM]
    .
    [C_11, C_12, …, C_ij,…, C_1M],
    [C_21, C_22, …, C_ij,…, C_2M]
    . . .
    [C_N1, C_N2, …, C_ij,…, C_RM]

    So, the lengths of samples A, B, and C are P, Q, and R respectively, where P < Q < R. Basically, all of my samples have a different length.
    So, say I choose a batch_size of six samples; then each of the subsequent batches will have a different length.
    How do I fix this? Or

    Does padding need to be applied to the whole of the dataframe to make P = Q = … = R?

    • Jason Brownlee December 13, 2018 at 7:55 am #

      All dimensions are padded to have fixed lengths, but the lengths of each dimension can vary.

  25. aravind pai December 20, 2018 at 9:06 pm #

    Hi Jason! It’s a great post!

    I have a question! Let’s define an RNN with 50 time steps. Now, when we pad a sequence of length 1 to the max length, then what would be the input to the RNN?

    Is it just one word or the entire sequence padded with zeros?

    • Jason Brownlee December 21, 2018 at 5:28 am #

      Yes, and you can use a Masking layer to ignore the zeros.

  26. Molla Hafizur Rahman December 22, 2018 at 8:14 am #

    Great post, Jason. I always follow your tutorials. I have 30 data points, and each data point has a variable length sequence with 7 features. I want to predict the next action based on the previous actions. In this scenario, won’t the padded 0 sequence impact the prediction accuracy?

    • Jason Brownlee December 23, 2018 at 6:02 am #

      No, you can use a Masking layer to ignore the padded values.

  27. Hassan Ahmed May 9, 2019 at 5:31 pm #

    I have a simple question. I want to train a simple SVM model to perform classification on audio signals. I have padded and sliced that data using numpy. As you know, we have to pad the data to a fixed length. I want to ask: how to select that fixed length? If I select the longest vector length as my standard length, there will be an issue, as some of my signals are too short, so there will be a lot of zeros in the form of padding. Will this many zeros affect my training accuracy or not?
    My second question is: is it possible to equalize the lengths of the feature vectors by padding but ignore these padded zeros during training? (But keep in mind that I am using SVM to perform classification.)

    • Jason Brownlee May 10, 2019 at 8:14 am #

      Perhaps experiment with different lengths and compare how they impact model skill.

      Some algorithms can do this via masking. I suspect SVM could, but you may have to implement it yourself.

  28. Rahul Krishan June 11, 2019 at 6:39 pm #

    Hey Jason,
    Thanks again for a great blog. I am not able to wrap my head around padding the input as a direct consequence of the LSTM block and its architecture. If we don’t mask and carry on with variable size input, why can’t the LSTM block process it, or can it?

    • Jason Brownlee June 12, 2019 at 7:54 am #

      The Masking layer allows the padded values to be skipped in the processing of the input sequence.

  29. Catriona June 12, 2019 at 12:31 am #

    Hi,
    Thanks for this – it looks great.

    I’m having a bit of a problem, though, with the error:

    TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

    do you know how I can fix this so that I can do post-padding anyway?

    Thanks!

    • Jason Brownlee June 12, 2019 at 8:05 am #

      Perhaps work with dense instead of sparse matrices?

  30. Carlos August 16, 2019 at 7:31 pm #

    Hi Jason,

    I’m preparing time series data for a CNN and have windowed my 4 sensor signals into 50-sample windows, e.g. each window is a 50 x 4 matrix. The last window however, often has fewer samples e.g. 7 x 4 or 21 x 4.

    Is it wise to zero pad additional rows to that matrix to give it a dimension of 50 x 4? E.g. my 7 x 4 sample window would become a 50 x 4 but with 43 rows being all zeros. Would that still be okay for a CNN input or would the additional 0 rows bias the training process in some way?

    • Jason Brownlee August 17, 2019 at 5:35 am #

      Yes, zero pad all samples to the same length. Also, try using a masking layer to ignore the padding if its supported by your model.

  31. naren August 23, 2019 at 1:05 pm #

    In my case, I have a timestamp of 8. and each time stamp gets a feature vector of length 10.
    Not always I have data for all time stamps. for example, if its a disease prediction for new patient once admitted in hospital, I don’t have 8 hours of data at hour 1. so, if I give data for timestamp 1 and feed zero vector’s of 7 to other 7 time stamps!! then worried how BTT makes any sense as the training happens. so how to solve this issue?

    • Naren August 23, 2019 at 1:08 pm #

      correction: I meant 0 vectors of length 10 to all other 7 time stamps.

      Do you know a paper/tutorial with link to BTT proof? thanks.

        • naren August 23, 2019 at 2:53 pm #

          thanks for that link. I will read it and derive the BTT. meanwhile, could you answer my other question above?

          In my case, I have a timestamp of 8. and each time stamp gets a feature vector of length 10. goal is to predict the disease possibility in 8 hours. even in the case I have just 1 hour of data, I should be able to detect that the patient would get disease after 8 hours.

          Not always I have data for all time stamps. for example, if its a disease prediction for new patient once admitted in hospital, I don’t have 8 hours of data at hour 1. so, if I give hour 1 data for timestamp 1 and feed zero vector’s of length 10 for remaining 7 time stamps!! then worried how BTT makes any sense as the training happens. so how to solve this issue?

          I am also training an LSTM stateful with a timestamp of 1, and making it stateful fo the number of records I have for that patient and reset the states. for example, have a patient with 11 records, then I generate data in the following way and reset states after every scenario of data possibility. [1],reset [1,2], reset [1,2,3]..reset.. [1,2,…11] . this training is taking too long. but, what would be your approach?

          haven’t gone to Attention nets to solve this. but, if I want to use the general LSTM or GRU, what’s your approach. could you advise, thanks?

          • Jason Brownlee August 24, 2019 at 7:44 am #

            Yes, you can use padding or truncation to ensure all samples have the same shape. I would recommend zero padding and use a masking layer to ignore the padded features/timesteps.

            I would recommend constructing samples that describe a scenario, rather than learning across samples and managing state. It’s just a simpler model.
