# Data Preparation for Variable Length Input Sequences

Last Updated on August 14, 2019

Deep learning libraries assume a vectorized representation of your data.

In the case of variable length sequence prediction problems, this requires that your data be transformed such that each sequence has the same length.

This vectorization allows code to efficiently perform the matrix operations in batch for your chosen deep learning algorithms.

In this tutorial, you will discover techniques that you can use to prepare your variable length sequence data for sequence prediction problems in Python with Keras.

After completing this tutorial, you will know:

• How to pad variable length sequences with dummy values.
• How to pad variable length sequences to a new longer desired length.
• How to truncate variable length sequences to a shorter desired length.

Let’s get started.

Data Preparation for Variable-Length Input Sequences for Sequence Prediction
Photo by Adam Bautz, some rights reserved.

## Overview

This tutorial is divided into 3 parts; they are:

1. Contrived Sequence Problem
2. Sequence Padding
3. Sequence Truncation

### Environment

This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this example.

This tutorial assumes you have Keras (v2.0.4+) installed with either the TensorFlow (v1.1.0+) or Theano (v0.9+) backend.

This tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help setting up your Python environment, see this post:

## Contrived Sequence Problem

We can contrive a simple sequence problem for the purposes of this tutorial.

The problem is defined as sequences of integers. There are three sequences, with lengths of 4, 3, and 1 timesteps.

These can be defined as a list of lists in Python as follows (with spacing for readability):
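A minimal sketch of such a list of lists, assuming placeholder integer values that simply count up from 1. Only the lengths (4, 3, and 1 timesteps) matter for the padding and truncation examples in this tutorial; the values themselves are an assumption.

```python
# Three contrived sequences of different lengths (values are placeholders).
sequences = [
    [1, 2, 3, 4],  # 4 timesteps
    [1, 2, 3],     # 3 timesteps
    [1],           # 1 timestep
]

# Report the length of each sequence to show that they differ.
for seq in sequences:
    print(len(seq), seq)
```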

We will use these sequences as the basis for exploring sequence padding in this tutorial.

## Sequence Padding

The pad_sequences() function in the Keras deep learning library can be used to pad variable length sequences.

The default padding value is 0.0, which is suitable for most applications, although this can be changed by specifying the preferred value via the “value” argument. For example:
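A hedged sketch of changing the padding value. The import path depends on the installed Keras version, so both common locations are tried. The sequence values and the choice of -1 as the padding value are illustrative assumptions.

```python
try:
    from keras.utils import pad_sequences  # newer Keras releases
except ImportError:
    from keras.preprocessing.sequence import pad_sequences  # older Keras 2.x

# Placeholder sequences of length 4, 3 and 1 (assumed values).
sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]

# Pad with -1 rather than the default 0.0 (-1 is an arbitrary choice here).
padded = pad_sequences(sequences, value=-1)
print(padded)
# [[ 1  2  3  4]
#  [-1  1  2  3]
#  [-1 -1 -1  1]]
```

A non-zero padding value can be useful when 0 is a meaningful value in the data itself.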

Padding can be applied to the beginning or the end of each sequence, called pre- or post-sequence padding. This is specified via the “padding” argument, which accepts ‘pre’ (the default) or ‘post’.

The example below demonstrates pre-padding the 3 input sequences with 0 values.
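One way this example might look, assuming the placeholder sequences used for the contrived problem and an import path that varies with the Keras version:

```python
try:
    from keras.utils import pad_sequences  # newer Keras releases
except ImportError:
    from keras.preprocessing.sequence import pad_sequences  # older Keras 2.x

# Placeholder sequences of length 4, 3 and 1 (assumed values).
sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]

# Pre-padding is the default; it is passed explicitly here for clarity.
padded = pad_sequences(sequences, padding='pre')
print(padded)
# [[1 2 3 4]
#  [0 1 2 3]
#  [0 0 0 1]]
```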

Running the example prints the 3 sequences prepended with zero values.

Padding can also be applied to the end of the sequences, which may be more appropriate for some problem domains.

Post-sequence padding can be specified by setting the “padding” argument to “post”.
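A sketch of the post-padding variant under the same assumptions (placeholder sequence values, version-dependent import path):

```python
try:
    from keras.utils import pad_sequences  # newer Keras releases
except ImportError:
    from keras.preprocessing.sequence import pad_sequences  # older Keras 2.x

# Placeholder sequences of length 4, 3 and 1 (assumed values).
sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]

# Append zeros to the end of the shorter sequences.
padded = pad_sequences(sequences, padding='post')
print(padded)
# [[1 2 3 4]
#  [1 2 3 0]
#  [1 0 0 0]]
```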

Running the example prints the same sequences with zero values appended.

The pad_sequences() function can also be used to pad sequences to a preferred length that may be longer than any observed sequences.

This can be done by specifying the “maxlen” argument to the desired length. Padding will then be performed on all sequences to achieve the desired length, as follows.
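A sketch of padding to a fixed length of 5, which is longer than any of the (assumed) placeholder sequences. The import path again depends on the Keras version.

```python
try:
    from keras.utils import pad_sequences  # newer Keras releases
except ImportError:
    from keras.preprocessing.sequence import pad_sequences  # older Keras 2.x

# Placeholder sequences of length 4, 3 and 1 (assumed values).
sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]

# Pad every sequence out to 5 timesteps (pre-padding by default).
padded = pad_sequences(sequences, maxlen=5)
print(padded)
# [[0 1 2 3 4]
#  [0 0 1 2 3]
#  [0 0 0 0 1]]
```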

Running the example pads each sequence to the desired length of 5 timesteps, even though the maximum length of an observed sequence is only 4 timesteps.

## Sequence Truncation

The length of sequences can also be trimmed to a desired length.

The desired length for sequences can be specified as a number of timesteps with the “maxlen” argument.

There are two ways that sequences can be truncated: by removing timesteps from the beginning or the end of sequences.

### Pre-Sequence Truncation

The default truncation method is to remove timesteps from the beginning of sequences. This is called pre-sequence truncation.

The example below truncates sequences to a desired length of 2.
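A sketch of this truncation, assuming the same placeholder sequences. Note that a sequence shorter than the requested length is still padded.

```python
try:
    from keras.utils import pad_sequences  # newer Keras releases
except ImportError:
    from keras.preprocessing.sequence import pad_sequences  # older Keras 2.x

# Placeholder sequences of length 4, 3 and 1 (assumed values).
sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]

# Keep only the last 2 timesteps of each sequence (truncating='pre' is the default).
truncated = pad_sequences(sequences, maxlen=2)
print(truncated)
# [[3 4]
#  [2 3]
#  [0 1]]
```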

Running the example removes the first two timesteps from the first sequence, the first timestep from the second sequence, and pads the final sequence.

### Post-Sequence Truncation

Sequences can also be trimmed by removing timesteps from the end of the sequences.

This approach may be more desirable for some problem domains.

Post-sequence truncation can be configured by changing the “truncating” argument from the default ‘pre’ to ‘post’, as follows:
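A sketch of post-sequence truncation to 2 timesteps under the same assumptions. Padding is left at its default of ‘pre’, so the shortest sequence is still padded at the front.

```python
try:
    from keras.utils import pad_sequences  # newer Keras releases
except ImportError:
    from keras.preprocessing.sequence import pad_sequences  # older Keras 2.x

# Placeholder sequences of length 4, 3 and 1 (assumed values).
sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]

# Keep only the first 2 timesteps of each sequence.
truncated = pad_sequences(sequences, maxlen=2, truncating='post')
print(truncated)
# [[1 2]
#  [1 2]
#  [0 1]]
```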

Running the example removes the last two timesteps from the first sequence, the last timestep from the second sequence, and again pads the final sequence.

## Summary

In this tutorial, you discovered how to prepare variable length sequence data for use with sequence prediction problems in Python.

Specifically, you learned:

• How to pad variable length sequences with dummy values.
• How to pad out variable length sequences to a new desired length.
• How to truncate variable length sequences to a new desired length.

Do you have any questions about preparing variable length sequences?

### 129 Responses to Data Preparation for Variable Length Input Sequences

1. Yuya June 19, 2017 at 6:46 am #

Are there alternative methods which don’t make use of padding as a way to handle sequences of various lengths?

• Jason Brownlee June 19, 2017 at 8:49 am #

Yes, some ideas:

– You could truncate sequences.
– You could concatenate sequences.
– You could write your own inefficient implementation of RNNs.

• Tom June 19, 2017 at 10:54 am #

If you know the target length, you can try interpolation or warping, right? But they are more costly than just adding zeros.

• Jason Brownlee June 20, 2017 at 6:33 am #

Nice tip Tom!

• Deepali Verma November 7, 2020 at 3:49 am #

Sir, how we can apply warp in this respect?

2. Sudipta Kar June 23, 2017 at 7:57 am #

I am interested how does the padding affect the performance of models? Specifically when the dataset contains both long texts and very short texts. Lets assume, median length is 30 sentences, max length is 120 and minimum length is 10. how pre and post sequence padding affects the performance?

• Rahul June 24, 2017 at 3:06 am #

I am not wrong you usually batch together inputs of similar length, so in this case you would try to have multiple batches of some batch_size and lengths ranging from 30-120

• Jason Brownlee June 24, 2017 at 7:52 am #

Padding and Masking is the best, little to no impact. I would recommend testing though.

• Ayush November 25, 2018 at 5:28 am #

3. Rahul June 24, 2017 at 3:06 am #

If* I

4. Patrick September 13, 2017 at 2:11 pm #

Thank you so much for writing these tutorials. They are remarkably effective in explaining concepts that other websites often gloss over. I have a question about padding outputs in sequence-to-sequence classification problems. Let’s say X has the shape (100, 50, 10), and y has the shape (100, 50, 3). X consists of 100 time series, 50 time steps per time series, and 10 features per time step. The y has three possible one-hot encoded classes per timestep. The samples of X have variable length so the shorter samples are pre-padded with 0. For the y labels corresponding to the pre-padded X time steps, should they be [0, 0, 0]? Or should a new fourth label, [0, 0, 0, 1], be created for the pre-padded time steps thus changing the shape of y to (100, 100, 4)?

• Jason Brownlee September 15, 2017 at 11:56 am #

Thanks Patrick.

Why do you need to pad the output if each series is classified as one of 3 labels?

5. Rafiya October 23, 2017 at 9:29 pm #

Can padding work for variable length feature vectors for concatenating them together?

• Jason Brownlee October 24, 2017 at 5:31 am #

Padding can be used to make all independent variable length inputs the same length.

6. Siddharth November 17, 2017 at 11:40 pm #

Hi Tom,

How about variable length sequences with different labels as well?Like assuming I have 3 possible set of labelled features that occur in different combinations.,

Featureset 1: A, B, C and it is labelled A
Featureset 2: D , E,F,V and it is labelled D
Featureset 3: A, F, J, K, L and it is labelled L. The sequence length varies as well as the labels.

How about i go with Sequence classification with this?
Should i Train each featuresets with training/test data of different combinations of elements of that sequence?
If A, B, C are the elements and A is the labelled target, A being the root element of all 3.I will create training data with different combinations of A, B, C and train LSTM to classify it as A no matter in which order the elements in sequence occur.
Similarly i repeat above for all featuresets..training LSTM individually., as in future, same set of elements can come in varying sequences.and i want my neural net to predict the root element correctly.

Can we train incrementally like this or is there other way to handle this scenario,,(Which I’m sure there is :))!..Please forgive if my question seems naive.

• Siddharth November 17, 2017 at 11:41 pm #

I’m Sorry, I have wrongly mentioned you as Tom instead of Jason 🙂

• Jason Brownlee November 18, 2017 at 10:19 am #

Treat each sequence as a new sample to learn from, pad all to the same length.

7. Becca February 20, 2018 at 1:34 am #

Hi Jason,
thank you very much for your many useful posts. There is one question I can’t find an answer to, and I’m not sure I handle it correctly in my code. Maybe you can help me out?
The input to my network is of form (num_samples, num_timesteps, num_features), where num_timesteps varies. In order to be able to feed all samples to the network, I prepad with zeros. So, say the maximum number of timesteps is 3, the features are a and b, and the current sample has only 2 timesteps. Then I would prepad [[a1,b1],[a2,b2]] to [[0.,0.],[a1,b1],[a2,b2]]. In this way, all my samples would be of shape (3,2).
Since keras expects the output to be of the same length as the input, I feel like I also need to pad the output, say from [[y1],[y2]] to [0.,[y1],[y2]]. Is this correct? If it is, does keras ignore the prepadded zero in the output as well? I worry that the network does not ignore it, and thus tries to learn something different than intended.
Any help would be greatly appreciated!

• Jason Brownlee February 21, 2018 at 6:28 am #

Yes, sounds good.

A next step would be to try models that allow model outputs that differ in length from model inputs like the encoder-decoder architecture. I have a number of really nice posts on the topic.

Let me know how you go.

8. Nikhil Thakur March 28, 2018 at 7:50 am #

Hi Jason,

I have a doubt. What impact does the type of padding have on the model performance for any task, example sentence classification?

By types of padding, I mean padding all the sequences to the length of the maximum length sentence and padding the sentences to some other length (smaller or greater)

9. shubham April 1, 2018 at 11:22 pm #

Is it possible that each batch will have different max_len of sequence and that can be feed to rnn?

• Jason Brownlee April 2, 2018 at 5:23 am #

I believe the number of samples can differ between batches, but the batch size will be the same.

10. Amber April 10, 2018 at 5:44 am #

Hi Jason, I am a beginner in machine learning and I try to make sense of all of this. Is it true that padding basically makes the dataset more sparse? So, I should keep that in mind while doing the preprocessing?

• Jason Brownlee April 10, 2018 at 6:26 am #

Yes, but we can ignore the padding by using a Masking layer.

11. Yidnek May 22, 2018 at 5:36 am #

dear Jason, thank you for your amusing posts,
Im new to LSTM or DL in general, and Im trying to write a simple POS Tagging code in python using the LSTM. I saw different LSTM Resources including your posts, and what seems missing for me is how can I prepare a statement (such as “John is driving his red car.”, or shorter “he is driving a car”) to be an input to the LSTM. Does padding solve such specially things for POS tagging.
I wish if I can find something practical to read and test
Thank you

• Jason Brownlee May 22, 2018 at 6:31 am #

Sorry, I don’t have any examples of POS tagging.

12. Emily June 9, 2018 at 3:27 am #

Hi Jason,

Can you give an example of how to use masking after padding with zeros? Some layers have masking built in (e.g. in the embedding layer you can set mask_zero = True), but I’ve run into issues where some layers (e.g. Conv1D) don’t allow for masking. How can we ensure that masking occurs throughout the network?

Thanks!

• Jason Brownlee June 9, 2018 at 6:56 am #

Yes, I have a number of examples on the blog but with LSTM layers, not CNN. As you say, CNN do not support it.

If you have CNNs on the front-end try learning the padded data directly.

• ruhama August 17, 2021 at 11:37 pm #

CAN YOU give hint how to train the BILSTM model using word2vec vocabulary size

• Adrian Tam August 18, 2021 at 3:21 am #

Let’s say you have a passage, presented as a sequence of words (say, N words here). For each word, you use word2vec to convert it into a vector of M numbers. Then the input data will be a MxN matrix. You now feed this into the LSTM model, each input step is the vector of M numbers.

Hope this helps.

13. christian June 27, 2018 at 6:43 am #

Hi Jason, you should have a prof not a phd title with this knowledge.what to do if truncating and padding does not give good results in the model ? usually I trained with a smaller sequence length data and need a much more longer dataset for prediction.

• Jason Brownlee June 27, 2018 at 8:25 am #

Perhaps explore other framings of the problem, get creative.

14. giovanni July 12, 2018 at 10:57 pm #

Hi Jason, thanks for your amazing tutorials. A question. Let’s say i train the network with fixed input size, i.e, (num_samples, num_timesteps, num_features). May i use a masking input layer only in the recognition phase because of the variable size of the second dimension, e.g., num_timesteps – x ?

• Jason Brownlee July 13, 2018 at 7:42 am #

The input dimensions are generally fixed. It is the values within those dimensions that you can mask.

15. Melina July 30, 2018 at 1:30 am #

Thanks, this helps!
However, I’m wondering if/how one can later compute a confusion matrix?
When shorter sequences are filled up with zeros?
Aren’t these erroneously treated as right prediction?
Or what if one normal element (e.g. label) of the sequence is really a zero?

• Jason Brownlee July 30, 2018 at 5:52 am #

When padding output sequences, the zero values can be ignored by the evaluation procedure.

16. Olivier Blais August 12, 2018 at 10:16 pm #

Hi Jason, great blog by the way!

I am currently planning a LSTM with time-series data. My specific problem is that I have days without activities, which mean that it is ignored with conventional padding techniques. I was wondering how I could handle this efficiently?

Olivier

• Jason Brownlee August 13, 2018 at 6:17 am #

What’s the problem with ignoring days with no activity?

17. Neha Sharma August 13, 2018 at 7:31 pm #

HI Jason,

with the Many-to-One LSTM for Sequence Prediction of your sequence post I have tried to implement model using my data. Below code from your page:

model = Sequential()
print(model.summary())

If I train and predict with the train and test having same number of time steps, it works perfect.

If I want to predict with lesser number of timesteps, the model throws an error.

So in this case I assume I need to do Pre Sequence Truncation.
Kindly confirm if this is the right way and acceptable.

• Jason Brownlee August 14, 2018 at 6:17 am #

I’m not sure I understand what you are varying and what the outcomes were.

Perhaps you can provide more explanation?

18. Julian Arias September 22, 2018 at 11:58 am #

Hi Jason,

If I am interested in training a sequence to sequence architecture, where both input and output sequences are of variable length, should I apply padding and masking in both encoder and decoder layers? If I decided not to use masking, should I apply pre-sequence padding to the encoder and post-sequence padding to the decoder?

Thank you.

• Jason Brownlee September 23, 2018 at 6:36 am #

19. jack October 12, 2018 at 6:36 am #

Hi,

Thank you for the detailed explanations of LSTM. I would like to know if we can train an LSTM model without a sequence length.

20. Rick October 13, 2018 at 10:21 am #

Hi Jason,

I am training multiple time-series of unequal length for walk-forward prediction. My time-series also have variable starting points.

With regard to (samples, timesteps, features),
Sample: I am treating every time-series as a sample.
Timesteps: I am using the maximum length as the window to capture all the information for that single time-series.

Do you foresee any problems with my model?

Regard,
Rick

• Jason Brownlee October 14, 2018 at 5:58 am #

Sounds good to me Rick!

• Rick October 30, 2018 at 1:29 am #

Hi Jason,

With regard to the problem I mentioned above, I am a bit worried about whether the sequenced nature of the LSTM is being taken into account if I am treating each time-series as a sample.

I am training multiple time-series for walk-forward prediction.

With regard to (samples, timesteps, features),
Sample: I am treating every time-series as a sample.
Timesteps: I am using the maximum length as the window to capture all the information for that single time-series.

Suppose I have 4 timeseries and each has 76 steps. So, if I use the first 75 steps as my X and step 76 as my Y, for the training model, the model does not learn the sequential nature of the 75 steps. So, the input is (4,75,1) and output is (4,)

X Y

1 2 3 4 5 6 7 8 9 …………………………….75 (timeseries 1) 76
1 2 3 4 5 6 7 8 9 ……………………………75 (timeseries 2) 76
1 2 3 4 5 6 7 8 9 ……………………………75 (timeseries 3) 76
1 2 3 4 5 6 7 8 9 ……………………………75 (timeseries 4) 76

On the other hand, if I arrange my data for 1 timeseries in the below form, step 1 predicts 2, Step 1-2 predict 3, step 1-2-3 predict 4…..and then step 1-75 predict 76. This takes into account the sequential nature of the timeseries but now, I cannot apply it to multiple timeseries at the same time. I need the timeseries to be able to predict together at the same time and also generalize across each other (if there is some general trends across the time series, they must be captured by the model). So, what should I do, is there any way to do this using a simple LSTM? Is there any other sequence model architecture that works well here?

X Y
————————————1 2
———————————1 2 3
—————————–1 2 3 4
……………………………………… ..
…………………………………….. ..
1 2 3…………………………. 75 76

Regards,
Rick

• Jason Brownlee October 30, 2018 at 6:07 am #

Generally LSTMs are poor at time series forecasting, so I would encourage you to try an MLP and CNN, as well as linear methods such as ETS and SARIMA as a baseline.

Ensure you’re using a masking layer when padding all sequences to the same length.
Perhaps vary the configuration of the model to see if that is an issue.
Perhaps you can use the walk-forward validation approach and predict all series together (e.g. multivariate prediction).

I hope that helps as a start.

21. Lilli November 9, 2018 at 12:28 am #

Hi Jason,
Great post. thank you.
I have a problem and I am wondering if it is the right way to pad 0 as you mentioned.
I have a dataset of over 5000 videos ranging from 17 frames to 71 frames of 32×48 grayscale. Imagine if I pad 0 to all the data to have the fixed size of 71*32*48, then the LSTM or CNN has to calculate through a bunch of 0, which costs too much computing and does actually nothing since it’s just padded frame. Is there any way I could process those data without padding? Or is there any sense of padding I’m missing?
Regards,
Lilli

• Jason Brownlee November 9, 2018 at 5:25 am #

• Lilli November 22, 2018 at 1:01 am #

thank you 🙂

22. HW November 27, 2018 at 3:08 am #

Thank you very much for your consistently high clarity. A question: Suppose, the underlying sequence is not numerical, but consists of natural-language words and punctuation symbols. How should the punctuation symbols/marks be accounted for–while calculating the length?

23. Arjun December 7, 2018 at 4:51 pm #

I actually have a doubt, it may sound silly. Why is the padding required? While googling I found out that it might be helpful for batch operations. If so what are some possible batch operations we perform and what if I do not pad the input before feeding it to the neural network?

• Jason Brownlee December 8, 2018 at 6:59 am #

You can use a dynamic RNN where padding is not required.

It can be more efficient to put all data into fixed sized vectors when training/evaluating.

24. Abey December 13, 2018 at 1:15 am #

Dear Jason,

Thank you once again for the amazing blog.

I have a question and I’m hoping you can clarify it for me.

Here is how my dataframe looks.

Col1 Col_2…, Col_j, Col_M
[A_11, A_12, …, A_ij,…, A_1M]
[A_21, A_22, …, A_ij,…, A_2M]
. . .
[A_N1, A_N2, …, A_ij,…, A_PM]

[B_11, B_12, …, B_ij,…, B_1M]
[B_21, B_22, …, B_ij,…, B_2M]
. . .
[B_N1, B_N2, …, B_ij,…, B_QM]
.
[C_11, C_12, …, C_ij,…, C_1M],
[C_21, C_22, …, C_ij,…, C_2M]
. . .
[C_N1, C_N2, …, C_ij,…, C_RM]

So, length of sample A, B, and C are P, Q, and R respectively. where P < Q < R. Basically all of my samples has a different length.
So, say I choose batch_size of six samples, then each of the subsequent batch_size will have a different length.
How do I fix this? or

Does padding need to be applied to the whole of the dataframe to make P = Q = … = R?

• Jason Brownlee December 13, 2018 at 7:55 am #

All dimensions are padded to have fixed lengths, but the lengths of each dimension can vary.

25. aravind pai December 20, 2018 at 9:06 pm #

Hi Jason!Its great post!

I have a question! Lets define a rnn with 50 time steps. Now, When we pad a sequence of length 1 to max length then what would be the input to the rnn?

is it just one word or entire sequence padded with zeros?

• Jason Brownlee December 21, 2018 at 5:28 am #

Yes, and you can use a Masking layer to ignore the zeros.

• Oyku Ozlem Ozen January 21, 2021 at 7:31 am #

Hi Jason,

can you send also an example (code) how to use masking layer, and how to implement in training the padding and masking together?
Thank you

• Jason Brownlee January 21, 2021 at 7:44 am #

Yes, there are many examples of using a masking layer on the blog, use the search box.

• Oyku Ozlem Ozen January 21, 2021 at 9:56 am #

thank you:) Can I use the LSTM with padding masking also for the sequential data for example clickstream data with session IDs with different length of step(click) numbers, like IMDB example you have, by adding some additional features as well? Do you have any example for that?

• Jason Brownlee January 22, 2021 at 7:14 am #

Perhaps try it and see.

Sorry, I don’t have an example of working with click stream data.

26. Molla Hafizur Rahman December 22, 2018 at 8:14 am #

Great post, Jason. I always follow your tutorials. I have 30 data point and each data point has variable lengths of sequence with feature 7. I want to predict the next action based on the previous actions. In this scenario won’t the padded 0 sequence impact the prediction accuracy?

• Jason Brownlee December 23, 2018 at 6:02 am #

No, you can use a Masking layer to ignore the padded values.

27. Hassan Ahmed May 9, 2019 at 5:31 pm #

I have a simple question. I want to train a simple SVM model to perform classification on audio signals. I have padded and sliced that data using numpy. As you know we have to pad the data to fixed length. I want to ask how to select that fixed length? If I select the longest length of vector as my standard length, then there will be an issue as some of my signals are too short in length, so there will be a lot of zeros in the form of padding. So will these many zeros affect my training accuracy or not?
My second question is that is it possible that I equalize the length of the feature vectors by padding but ignore these padded zeros during training. (But keep in mind that I am using SVM to perform classification)

• Jason Brownlee May 10, 2019 at 8:14 am #

Perhaps experiment with different lengths and compare how they impact model skill.

Some algorithms can do this via masking. I suspect SVM could, but you may have to implement it yourself.

28. Rahul Krishan June 11, 2019 at 6:39 pm #

Hey Jason,
Thanks again on a great blog. I am not able to wrap my head around padding the input as a direct consequence to the LSTM block and its architecture. If we don’t mask and carry on with variable size input why cant the LSTM block process it or can it?

• Jason Brownlee June 12, 2019 at 7:54 am #

The Masking layer allows the padded values to be skipped in the processing of the input sequence.

29. Catriona June 12, 2019 at 12:31 am #

Hi,
Thanks for this – it looks great.

I’m having a bit of a problem, though, with the error:

TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

do you know how I can fix this so that I can do post-padding anyway?

Thanks!

• Jason Brownlee June 12, 2019 at 8:05 am #

Perhaps work with dense instead of sparse matrices?

30. Carlos August 16, 2019 at 7:31 pm #

Hi Jason,

I’m preparing time series data for a CNN and have windowed my 4 sensor signals into 50-sample windows, e.g. each window is a 50 x 4 matrix. The last window however, often has fewer samples e.g. 7 x 4 or 21 x 4.

Is it wise to zero pad additional rows to that matrix to give it a dimension of 50 x 4? E.g. my 7 x 4 sample window would become a 50 x 4 but with 43 rows being all zeros. Would that still be okay for a CNN input or would the additional 0 rows bias the training process in some way?

• Jason Brownlee August 17, 2019 at 5:35 am #

Yes, zero pad all samples to the same length. Also, try using a masking layer to ignore the padding if it’s supported by your model.

31. naren August 23, 2019 at 1:05 pm #

In my case, I have a timestamp of 8. and each time stamp gets a feature vector of length 10.
Not always I have data for all time stamps. for example, if its a disease prediction for new patient once admitted in hospital, I don’t have 8 hours of data at hour 1. so, if I give data for timestamp 1 and feed zero vector’s of 7 to other 7 time stamps!! then worried how BTT makes any sense as the training happens. so how to solve this issue?

• Naren August 23, 2019 at 1:08 pm #

correction: I meant 0 vectors of length 10 to all other 7 time stamps.

Do you know a paper/tutorial with link to BTT proof? thanks.

• naren August 23, 2019 at 2:53 pm #

thanks for that link. I will read it and derive the BTT. meanwhile, could you answer my other question above?

In my case, I have a timestamp of 8. and each time stamp gets a feature vector of length 10. goal is to predict the disease possibility in 8 hours. even in the case I have just 1 hour of data, I should be able to detect that the patient would get disease after 8 hours.

Not always I have data for all time stamps. for example, if its a disease prediction for new patient once admitted in hospital, I don’t have 8 hours of data at hour 1. so, if I give hour 1 data for timestamp 1 and feed zero vector’s of length 10 for remaining 7 time stamps!! then worried how BTT makes any sense as the training happens. so how to solve this issue?

I am also training an LSTM stateful with a timestamp of 1, and making it stateful fo the number of records I have for that patient and reset the states. for example, have a patient with 11 records, then I generate data in the following way and reset states after every scenario of data possibility. [1],reset [1,2], reset [1,2,3]..reset.. [1,2,…11] . this training is taking too long. but, what would be your approach?

haven’t gone to Attention nets to solve this. but, if I want to use the general LSTM or GRU, what’s your approach. could you advise, thanks?

• Jason Brownlee August 24, 2019 at 7:44 am #

Yes, you can use padding or truncation to ensure all samples have the same shape. I would recommend zero padding and use a masking layer to ignore the padded features/timesteps.

I would recommend constructing samples that describe a scenario, rather than learning across samples and managing state. It’s just a simpler model.

32. Rana October 20, 2019 at 5:57 pm #

Thank you for the tutorial!

If my input sample consists of two matrices and I construct the training dataset using samples of five different sizes (for each matrix), all zero-padded (to a maximum size, larger than all the original matrices). Will I need to fine-tune my network for every new input size (not included in the training dataset)?

• Jason Brownlee October 21, 2019 at 6:15 am #

I don’t believe so.

• Rana October 24, 2019 at 5:27 pm #

Thank you for your reply. I tried to test on new sizes without fine-tuning but all predictions were zero.

It works very well, though, if I fine-tune only the input layer (with linear activation function) for a few epochs. Is that possible?

Do you think I should use CNN?

• Jason Brownlee October 25, 2019 at 6:37 am #

I recommend testing a suite of framings of the problem and different models to see what works best for your dataset.

1DCNNs can be very effective for sequence prediction. Better and faster than LSTMs often.

Changing the activation function during training does not seem wise. It is possible. Load the weights into two network types and train them sequentially.

33. George October 24, 2019 at 6:54 am #

Hi Jason: great post. Question: what if I have multiple sequences of images? The input of an LSTM is supposed to be (# of sequences, # of timesteps, # of features), right? How do I pad sequences of images before feeding it into an LSTM model?

• Jason Brownlee October 24, 2019 at 2:01 pm #

You would use a CNN-LSTM that would extract the features of an image as a vector to give a feature vector per time step.

[sample, image, featurevector]

Does that help?

34. Siraj Ahmed November 8, 2019 at 8:53 pm #

Jason, Again thank you for the post. Could you plz direct me towards something by which we can do padding on variable length “float” list instead of variable length “integer” list ?

• Jason Brownlee November 9, 2019 at 6:12 am #

You can pad float data using the same function.

35. Ahmet Cihat November 29, 2019 at 7:43 pm #

Assume you have a signal that has 25 samples and you have another signal that has 3 samples. If you pad the signal that has 3 samples you actually create almost a zero vector which I don’t know the effect of it to the network. Is it a really good idea to generate such data?

• Jason Brownlee November 30, 2019 at 6:29 am #

You can use a masking layer in some models to ignore the padded values.

Perhaps compare a few methods and compare the results?

36. Fiorella December 13, 2019 at 10:46 am #

Hi Jason,

Thanks for the great tutorials, they always help me a lot!
I had two questions, one is regarding padding and masking, and the other regards the variable input length.

Q 1: Let’s say the input to the LSTM model is a one hot encoded vector, where the ones and zeros have a specific meaning. Would it then be wise to use padding and masking with a value that is not zero, for example -1, to make sure that the actual zeros don’t get ignored?

Q 2: Is the sequence length a variable that the model learns, i.e. does the sequence length affect the weights of the model? I have a classification problem, in which the variable sequence length might indicate the prediction output, but I want to make sure that the model is only learning the values in the sequence and not the sequence length as an extra underlying influencer. Just wondering if the variable sequence length is something that the model learns as well and which might affect classification results.

• Jason Brownlee December 13, 2019 at 1:43 pm #

You’re welcome.

Yes, you must mask with a value not in the input.

Probably not.
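A small sketch of the first answer, assuming -1 as the padding value and two made-up one-hot sequences. It pads with a value that never occurs in the data and then derives a per-time-step mask. The Keras `Masking` layer applies a similar rule: a time step is skipped when all of its features equal `mask_value`.

```python
import numpy as np

PAD_VALUE = -1.0  # assumed padding value; any value absent from the real data works

# Two one-hot encoded sequences (3 classes) with different numbers of time steps.
seq_a = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype="float32")
seq_b = np.array([[0, 1, 0]], dtype="float32")

max_len = max(len(seq_a), len(seq_b))
batch = np.full((2, max_len, 3), PAD_VALUE, dtype="float32")
for i, seq in enumerate([seq_a, seq_b]):
    batch[i, max_len - len(seq):] = seq  # pre-padding: real steps go at the end

# A time step counts as padding only if every feature equals PAD_VALUE.
is_real_step = ~np.all(batch == PAD_VALUE, axis=-1)
print(is_real_step)
```

In a Keras model, the same padding value would be passed as `mask_value` to a `Masking` layer placed before the LSTM.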

37. James January 15, 2020 at 2:50 am #

Hi Jason,

Big fan of your tutorials, thank you for posting them!

I have a question regarding formatting the test dataset. I have padded my training data to (num_samples, num_timesteps, num_features) so that num_timesteps is the same for all samples. Consider num_timesteps as 1000, so every sample in the training data is of length 1000. I use masking and get a trained LSTM. Now, the test data samples have variable length. Is there an alternative to padding the test data to 1000 timesteps? I ask because, hypothetically, if I were to turn on my LSTM, I would like to run it at each timestep after the first 1000. Essentially, can I keep the test data as variable length and still get predicted probabilities for each time step?

• Jason Brownlee January 15, 2020 at 8:29 am #

Thanks James.

Not really, if I understand your question correctly.

You can if you use a dynamic LSTM, and in that case padding would not be required.

• James January 16, 2020 at 6:08 am #

Thanks. I was hoping not to go the route of dynamic RNN.

38. Jam April 8, 2020 at 1:58 am #

Dear Jason,
Thanks, it is useful. But do you have any idea how to pad sequences along their features,
for data that have the same number of time steps but different feature sizes?
Thanks

• Jason Brownlee April 8, 2020 at 7:57 am #

Yes, the above examples should help directly.

• Jam April 26, 2020 at 8:41 pm #

Sorry to ask again, I cannot find a similar example. Suppose we want to make an array of shape (1,2) into an array of shape (16,2). This can happen when the features of the LSTM are not the same size. Is there a padding solution for that?

• Jason Brownlee April 27, 2020 at 5:33 am #

Yes, you could pad the remaining 15 time steps.
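The reply reads the (1, 2) array as one time step with two features. Under that assumption, a minimal NumPy sketch of padding the remaining 15 time steps with zeros might look like this (the sample values are made up):

```python
import numpy as np

sample = np.array([[0.5, 1.5]])  # shape (1, 2): one time step, two features
target_steps = 16

missing = target_steps - sample.shape[0]
# Pad only along the time axis (axis 0), before the real data; features are untouched.
padded = np.pad(
    sample,
    pad_width=((missing, 0), (0, 0)),
    mode="constant",
    constant_values=0.0,
)
print(padded.shape)
```

If the feature axis is the one that differs in size, the same call can pad it instead by moving the non-zero entry to the second pair in `pad_width`.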

39. Mariana April 10, 2020 at 4:55 am #

Hi Jason! Your blog has solved a lot of my questions.

I just have another question regarding the LSTM that I plan to build.

The idea is to feed the neural network with a 2D tensor(300 timesteps, 5 features)
Every 300 timesteps I will need to classify each timestep with {0,1,2}.

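| t | feature 1 | feature 2 | feature 3 | feature 4 | feature 5 | label |
|---|---|---|---|---|---|---|
| 1 | | | | | | 0 |
| 2 | | | | | | 0 |
| 3 | | | | | | 1 |
| . | | | | | | 1 |
| . | | | | | | 0 |
| . | | | | | | 2 |
| 300 | | | | | | 2 |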

The size is fixed; I will always receive 300 samples. However, I am confused about how to train the model. So far, I have (80,300,5): 80 samples of 300 timesteps by 5 features each.

Input: (300,5)
Output: (300,1) –> Multilabel classification
Training: (80,300,5)

How should I define the model?
# define model
model = Sequential()

40. Hai Yisha June 15, 2020 at 2:51 pm #

Hi, Jason ! thank you for your tutorial.
One question: in text generation, if I pad my sequences with a special value like 0, what should the masking value be in this line of code?

Should it be just 0, or a one-hot encoded vector of 0?

• Jason Brownlee June 16, 2020 at 5:32 am #

If you padded with zero values, then the masked value will be zero.

41. Will Lxg June 27, 2020 at 5:34 am #

Hi, Jason,

Thanks for your post. It is very useful.

I have a question: I want to use an LSTM to do classification. In the training phase, I know the real length of my input, so I pad it with zero values. But in the prediction phase, I do not know the real length of my input; that is, I do not know when a real sequence has finished.

Is it possible to use an LSTM to do that, or must I know the real length of my input?

• Jason Brownlee June 27, 2020 at 5:36 am #

Yes, but maybe you will have to truncate new samples to your pre-defined fixed length.

Or maybe you can explore using a dynamic LSTM.
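As a rough sketch of the first suggestion, a new sample can be forced to the fixed length used in training by truncating it if it is too long and padding it if it is too short. The helper name `fit_to_length`, the zero padding value, and the choice to keep the first steps when truncating are all assumptions made for illustration.

```python
import numpy as np

def fit_to_length(sample, length, value=0.0):
    """Truncate or pre-pad a (timesteps, features) array to exactly `length` steps.

    Hypothetical helper for illustration.
    """
    sample = np.asarray(sample, dtype="float32")
    if sample.shape[0] >= length:
        return sample[:length]  # keep the first `length` steps
    pad_rows = length - sample.shape[0]
    # Add rows of the padding value before the real time steps.
    return np.pad(sample, ((pad_rows, 0), (0, 0)), constant_values=value)

short = np.ones((3, 2))
long = np.ones((12, 2))
fixed_short = fit_to_length(short, 5)
fixed_long = fit_to_length(long, 5)
print(fixed_short.shape, fixed_long.shape)
```

Whichever end is padded or truncated, it should match what was done to the training data.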

42. Will Lxg June 28, 2020 at 2:31 pm #

Thanks, Jason

I will check the dynamic LSTM.

43. Moctar October 2, 2020 at 9:04 pm #

Hi Jason,
So I have a text file with multiple phrases, like thousands, each on one line. For preprocessing I would like to add the at the beginning and at the end of each of them, but it doesn’t seem to work. I was wondering how to do it, then.

“,
n=2))

• Jason Brownlee October 3, 2020 at 6:06 am #

Sorry, I don’t understand, perhaps you could rephrase your question.

• Moctar October 3, 2020 at 12:18 pm #

Sorry my bad, i am fairly new to NLP
what i wanted to say is, i have a .txt file of thousands+ lines, during the preprocessing phase i would like to add special “padding” symbols on each of them, like ……. so i was wondering how to do it from a text file. Most of the examples are done with just one text line so it is easy to do it manually. But in my case i have many lines so that’s not really an option.

• Jason Brownlee October 3, 2020 at 12:31 pm #

You could do it as part of loading each line – e.g. prepend the string to each line, and perhaps save the padded lines for reuse later.
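A minimal sketch of that suggestion in plain Python. The marker strings `[BOS]` and `[EOS]` are placeholders (the actual symbols are not named in the comment above), and the function name and file names are hypothetical.

```python
import os
import tempfile

START, END = "[BOS]", "[EOS]"  # assumed marker strings; substitute your own symbols

def pad_lines(in_path, out_path, start=START, end=END):
    """Wrap every non-empty line of a text file in start/end markers and save the result."""
    count = 0
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            text = line.strip()
            if not text:
                continue  # skip blank lines
            dst.write(f"{start} {text} {end}\n")
            count += 1
    return count

# Small demonstration with a temporary file standing in for the real corpus.
workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "phrases.txt")
padded_path = os.path.join(workdir, "phrases_padded.txt")
with open(raw_path, "w", encoding="utf-8") as f:
    f.write("hello world\n\ngood morning\n")

n_written = pad_lines(raw_path, padded_path)
with open(padded_path, encoding="utf-8") as f:
    padded_lines = f.read().splitlines()
print(n_written, padded_lines)
```

Because the file is processed one line at a time, the same loop works for files with thousands of lines, and the saved output can be reloaded later without repeating the step.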

• Moctar October 3, 2020 at 1:14 pm #

Thank you!

• Jason Brownlee October 4, 2020 at 6:48 am #

You’re welcome.

44. Moctar October 4, 2020 at 2:13 pm #

Hi jason
I wanted to know how we train an N-gram (N = 1 to 3) model in NLTK. From what I understand, it is just the data preprocessing, then writing code using the NLTK ngrams library to get unigrams, bigrams and trigrams. Is that it, or is there a part left?

• Jason Brownlee October 4, 2020 at 2:59 pm #

Sorry, I don’t have a tutorial on this topic.

45. ali May 19, 2021 at 9:26 pm #

hi Jason,
I have a dataset in shape of (#samples, #timesteps, #features) and I pad sequences with zero-padding. I used a BLSTM model for classification task, but when I added a Masking layer with mask_value=0 before the BLSTM layer, the results didn’t change…
could you tell what is happening and why the results are still the same?

model = Sequential()

• Jason Brownlee May 20, 2021 at 5:47 am #

Perhaps you need to tune the model learning hyperparameters or architecture or data preparation?

46. Skyler August 4, 2021 at 10:45 pm #

Hi Jason,
Thank you very much for such an informative article.
I am working on sound event detection task and I have test samples of different length.
So I am 0 padding the samples to make all sample length same i.e., equal to length of longest sample. But, I am not getting a good test score.
1) Do you think this might be because of the 0-padding, as the samples have onset and offset timings and, after 0-padding, the model finds it difficult to learn features?
2) Can you suggest an alternate way to tackle variable length sample problem?

Thank you

• Jason Brownlee August 5, 2021 at 5:18 am #

Yes, try some of the above-mentioned methods.

47. Jinho Ko January 14, 2022 at 9:58 pm #

Thank you so much for writing these tutorials.

I want to recognize the motion with the cam through motion images with different frame lengths, but if the input length is unified through padding, I cannot catch the standard when inputting frames with the webcam.

Is there any way?

48. parisa March 8, 2022 at 8:02 pm #

hi, thanks for your useful posts.
I am confused about my project.
imagine something like speech recognition. for example we have 32 signals for input (this number is fixed) and we have 28 classes for each signal.
after that we have a CTC decoder that creates the output.
the label for first 32 signals is “hello”
for second one is “goodbye”
and….
these labels are used just for ctc loss

• James Carmichael March 9, 2022 at 5:51 am #

Hi Parisa…this discussion may help clarify:

49. dave September 14, 2022 at 12:01 am #

Hi James

– I find myself in the process of extracting bigrams (sequences of two characters) from text data to perform an analysis of frequencies and develop a model to perform language classification

– Now, the frequencies and related quintiles are of course affected by the text length, which can vary both in the train set and in the future score sets

– I searched for a proper approach to keep the variance in text length from affecting model performance.

It seems that one approach is the following:

“Recall that for batching, we need to have all the sequences in a given batch be of uniform length (N of words).

To do that, we either:

• (1) pad the sequences that are shorter than a given length or

• (2) truncate the sequences that are bigger than the given length.

• The question is how do we decide this length? We have several options:

We decide on a global maximum sequence length based on the sequence length characteristics of the training data.”

Meaning that:

1. Within the training set, identify the maximum value or the 75% quantile of the distribution of the number of words per text, and use it as the upper limit
2. Based on this value, truncate all texts with a greater length

The approach seems sound to me in avoiding that text length will affect the bigrams distributions and model performance on the score set

Still, I would like to know what the best approach is, according to your experience on the topic described
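The two steps described above could be sketched as follows. The toy texts are made up, and splitting on whitespace is an assumed stand-in for whatever tokenization the real pipeline uses.

```python
import numpy as np

# Toy training texts; in practice these would be the training documents.
train_texts = [
    "the cat sat",
    "a much longer sentence with quite a few more words in it",
    "short text",
    "another example of medium length here",
]

# Step 1: length distribution of the training set and its 75% quantile.
lengths = [len(text.split()) for text in train_texts]
max_len = int(np.percentile(lengths, 75))

def truncate_words(text, limit):
    """Keep at most `limit` whitespace-separated words."""
    return " ".join(text.split()[:limit])

# Step 2: truncate every text that exceeds the limit.
truncated = [truncate_words(text, max_len) for text in train_texts]
print(lengths, max_len)
```

The limit is computed from the training set only and would then be reused unchanged when truncating texts in later scoring sets.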