How to Reshape Input Data for Long Short-Term Memory Networks in Keras

It can be difficult to understand how to prepare your sequence data for input to an LSTM model.

Often there is confusion around how to define the input layer for the LSTM model.

There is also confusion about how to convert your sequence data that may be a 1D or 2D matrix of numbers to the required 3D format of the LSTM input layer.

In this tutorial, you will discover how to define the input layer to LSTM models and how to reshape your loaded input data for LSTM models.

After completing this tutorial, you will know:

  • How to define an LSTM input layer.
  • How to reshape one-dimensional sequence data for an LSTM model and define the input layer.
  • How to reshape multiple parallel series data for an LSTM model and define the input layer.

Let’s get started.

How to Reshape Input for Long Short-Term Memory Networks in Keras
Photo by Global Landscapes Forum, some rights reserved.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. LSTM Input Layer
  2. Example of LSTM with Single Input Sample
  3. Example of LSTM with Multiple Input Features
  4. Tips for LSTM Input

LSTM Input Layer

The LSTM input layer is specified by the “input_shape” argument on the first hidden layer of the network.

This can make things confusing for beginners.

Below is an example of a network with one hidden LSTM layer and one Dense output layer.
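
As a minimal sketch of such a network, assuming the `keras` package is installed (the 32 memory units and the single output value are illustrative choices, not values taken from this tutorial), it might look like this:

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

# One hidden LSTM layer followed by one Dense output layer.
# 32 units and 1 output are arbitrary, illustrative choices.
model = Sequential()
model.add(LSTM(32))
model.add(Dense(1))
```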

In this example, the LSTM() layer must specify the shape of the input.

The input to every LSTM layer must be three-dimensional.

The three dimensions of this input are:

  • Samples. One sequence is one sample. A batch consists of one or more samples.
  • Time Steps. One time step is one point of observation in the sample.
  • Features. One feature is one observation at a time step.

This means that the input layer expects a 3D array of data when fitting the model and when making predictions, even if specific dimensions of the array contain a single value, e.g. one sample or one feature.

When defining the input layer of your LSTM network, the network assumes you have 1 or more samples and requires that you specify the number of time steps and the number of features. You can do this by specifying a tuple to the “input_shape” argument.

For example, the model below defines an input layer that expects 1 or more samples, 50 time steps, and 2 features.
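
A sketch of that definition, again assuming the `keras` package and an illustrative 32 units, could be written as follows. Note that the tuple holds only the time steps and features; the number of samples is left out.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

# input_shape is (time steps, features); the samples dimension is omitted.
model = Sequential()
model.add(LSTM(32, input_shape=(50, 2)))
model.add(Dense(1))

# The leading None stands for the unspecified number of samples.
print(model.input_shape)
```

Recent Keras releases may emit a warning here and suggest an explicit `Input(shape=(50, 2))` layer as the first layer instead; the expected 3D input is the same either way.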

Now that we know how to define an LSTM input layer and the expectations of 3D inputs, let’s look at some examples of how we can prepare our data for the LSTM.

Example of LSTM with Single Input Sample

Consider the case where you have one sequence of multiple time steps and one feature.

For example, this could be a sequence of 10 values:

We can define this sequence of numbers as a NumPy array.
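
For instance, assuming the ten values run from 0.1 to 1.0 in steps of 0.1 (the same series quoted in the comments further down), the array could be defined like this:

```python
from numpy import array

# A single sequence of 10 values as a one-dimensional NumPy array.
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
print(data.shape)  # (10,)
```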

We can then use the reshape() function on the NumPy array to reshape this one-dimensional array into a three-dimensional array with 1 sample, 10 time steps, and 1 feature at each time step.

The reshape() function, when called on an array, takes the new shape as its argument, usually given as a tuple. We cannot pass in any tuple of numbers; the new shape must hold exactly the same number of elements as the original array.
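
The constraint can be checked directly. In this small sketch, which assumes the same ten-value sequence, a shape with 10 elements is accepted, while a shape with only 4 elements is rejected by NumPy with a ValueError:

```python
from numpy import array

data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])

# 1 * 10 * 1 == 10 elements, which matches the array, so this works.
data_3d = data.reshape((1, 10, 1))

# 1 * 4 * 1 == 4 elements does not match 10, so NumPy raises ValueError.
try:
    data.reshape((1, 4, 1))
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(data_3d.shape, mismatch_rejected)
```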

Once reshaped, we can print the new shape of the array.

Putting all of this together, the complete example is listed below.
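
A minimal version of the complete example, assuming the ten-value sequence 0.1 to 1.0, might read:

```python
from numpy import array

# Define the sequence as a 1D array.
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])

# Reshape to 1 sample, 10 time steps, 1 feature.
data = data.reshape((1, 10, 1))

print(data.shape)  # (1, 10, 1)
```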

Running the example prints the new 3D shape of the single sample.

This data is now ready to be used as input (X) to the LSTM with an input_shape of (10, 1).

Example of LSTM with Multiple Input Features

Consider the case where you have multiple parallel series as input for your model.

For example, this could be two parallel series of 10 values:

We can define these data as a matrix of 2 columns with 10 rows:
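
As a sketch, assuming one series rising from 0.1 to 1.0 and a second falling from 1.0 to 0.1 (the values quoted by a reader in the comments), the two series can be stacked side by side so that each row is a time step and each column is a series:

```python
from numpy import array, column_stack

# Two parallel series of 10 values each.
series1 = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
series2 = array([1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])

# Stack as columns: 10 rows (time steps) by 2 columns (series).
data = column_stack((series1, series2))
print(data.shape)  # (10, 2)
```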

This data can be framed as 1 sample with 10 time steps and 2 features.

It can be reshaped as a 3D array as follows:
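
The sketch below performs the reshape on the same assumed data and then indexes the result to show how the three axes map to samples, time steps, and features:

```python
from numpy import array

data = array([
    [0.1, 1.0], [0.2, 0.9], [0.3, 0.8], [0.4, 0.7], [0.5, 0.6],
    [0.6, 0.5], [0.7, 0.4], [0.8, 0.3], [0.9, 0.2], [1.0, 0.1]])

# Axis 0 = samples, axis 1 = time steps, axis 2 = features.
data_3d = data.reshape(1, 10, 2)

# Both features observed at the first time step of the only sample.
first_step = data_3d[0, 0, :]

# The first feature (first series) across all 10 time steps.
first_series = data_3d[0, :, 0]

print(first_step)
print(first_series)
```

Rows of the original matrix become time steps and columns become features, so no values are reordered by this reshape.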

Putting all of this together, the complete example is listed below.
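
A minimal version of the complete example, using the same assumed two-column data, might read:

```python
from numpy import array

# 10 rows (time steps) by 2 columns (features).
data = array([
    [0.1, 1.0],
    [0.2, 0.9],
    [0.3, 0.8],
    [0.4, 0.7],
    [0.5, 0.6],
    [0.6, 0.5],
    [0.7, 0.4],
    [0.8, 0.3],
    [0.9, 0.2],
    [1.0, 0.1]])

# Reshape to 1 sample, 10 time steps, 2 features.
data = data.reshape(1, 10, 2)

print(data.shape)  # (1, 10, 2)
```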

Running the example prints the new 3D shape of the single sample.

This data is now ready to be used as input (X) to the LSTM with an input_shape of (10, 2).

Longer Worked Example

For a complete end-to-end worked example of preparing data, see this post:

Tips for LSTM Input

This section lists some tips to help you when preparing your input data for LSTMs.

  • The LSTM input layer must be 3D.
  • The 3 input dimensions are: samples, time steps, and features.
  • The LSTM input layer is defined by the input_shape argument on the first hidden layer.
  • The input_shape argument takes a tuple of two values that define the number of time steps and features.
  • The number of samples is assumed to be 1 or more.
  • The reshape() function on NumPy arrays can be used to reshape your 1D or 2D data to be 3D.
  • The reshape() function takes a tuple as an argument that defines the new shape.
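
These tips can be folded into one small helper. The function name `to_lstm_input` is hypothetical (it is not part of NumPy or Keras), and the sketch assumes the data represents a single sample whose rows are time steps:

```python
import numpy as np

def to_lstm_input(data):
    """Hypothetical helper: view 1D or 2D data as one 3D LSTM sample."""
    arr = np.asarray(data, dtype=float)
    if arr.ndim == 1:
        # One series: (time steps,) -> (1, time steps, 1)
        return arr.reshape((1, arr.shape[0], 1))
    if arr.ndim == 2:
        # Rows are time steps, columns are features:
        # (time steps, features) -> (1, time steps, features)
        return arr.reshape((1, arr.shape[0], arr.shape[1]))
    raise ValueError("expected 1D or 2D data, got %dD" % arr.ndim)

one_series = to_lstm_input([0.1, 0.2, 0.3])
two_series = to_lstm_input([[0.1, 1.0], [0.2, 0.9]])
print(one_series.shape)  # (1, 3, 1)
print(two_series.shape)  # (1, 2, 2)
```

The matching input_shape for the first LSTM layer is then everything after the samples axis, for example `one_series.shape[1:]`.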

Further Reading

This section provides more resources on the topic if you are looking to go deeper.


In this tutorial, you discovered how to define the input layer for LSTMs and how to reshape your sequence data for input to LSTMs.

Specifically, you learned:

  • How to define an LSTM input layer.
  • How to reshape one-dimensional sequence data for an LSTM model and define the input layer.
  • How to reshape multiple parallel series data for an LSTM model and define the input layer.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

394 Responses to How to Reshape Input Data for Long Short-Term Memory Networks in Keras

  1. Avatar
    Steven August 31, 2017 at 2:14 am #

    Great explanation of the dimensions! Just wanted to say this explanation also works for LSTM models in Tensorflow as well.

    • Avatar
      Jason Brownlee August 31, 2017 at 6:20 am #

      Thanks Steven.

      • Avatar
        yuan September 1, 2017 at 6:42 pm #

        Hi Jason,

        Thanks a lot for your explanations.
        I have a point of confusion:
        Assuming that we have multiple parallel series as input for our model, the first step is to define the data as a matrix of M columns with N rows. To make it 3D (samples, time steps, and features), does this mean samples: 1 sample, time steps: the number of rows in the matrix, and features: the number of columns in the matrix? Must it be like this? Looking forward to your reply. Thank you.

        • Avatar
          Jason Brownlee September 2, 2017 at 6:06 am #

          Sorry, I’m not sure I follow your question.

          If you have parallel time series, then each series would need the same number of time steps and be represented as a separate feature (e.g. observation at a time).

          Does that help?

          • Avatar
            HelloWorld February 25, 2019 at 8:09 pm #

            Agree with Yuan on this issue

        • Avatar
          Nikhil April 30, 2019 at 8:11 pm #

          Hi Yuan,
          I got your question. You have doubt that can we say the number of rows is the time steps and number of columns is features. Yes you can understand this way also.
          That is , sample, time steps and features is equivalent to sample, number of rows, number of columns respectively.

      • Avatar
        Md. Abul Kalam Azad August 16, 2018 at 12:21 pm #

        Hello Sir,
        I have used your multivariates code for testing lstm model.Its working fine but I did not understand why lstm one row is hidden from train as well as test dataset. If I test two rows in test then got one prediction as a output. And datetime output this one prediction result?

        Please help me.

        Thanks in advance.


        • Avatar
          Jason Brownlee August 16, 2018 at 1:57 pm #

          Sorry, I don’t follow.

          Perhaps you can elaborate on your question?

          • Avatar
            wanghy January 17, 2019 at 4:55 pm #

            Hi Jason, your article is very inspiring.
            I am working on loan risk control. I have each client’s repayment time difference for each period, and I am trying to use a time series model on the earlier time differences to predict whether the client will become overdue.
            Suppose I have 10000 clients, and for each client I have 6 periods of repayment time differences. Following your article, my original input data is (10000*6). Before I send it to the LSTM model, should I reshape it to a 10000*6*1 3D dataset? Is that the right way to do it? If you have any other experience with this problem, please tell me as well.

          • Avatar
            Jason Brownlee January 18, 2019 at 5:29 am #

            Sounds like it would be [10000, 6, 1]
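
A minimal sketch of that reshape, using an array of zeros as placeholder data in place of the real repayment values:

```python
import numpy as np

# Placeholder data: 10000 clients (samples) by 6 repayment periods (time steps).
X = np.zeros((10000, 6))

# Add a trailing features axis of size 1: (samples, time steps, features).
X = X.reshape((X.shape[0], X.shape[1], 1))

print(X.shape)  # (10000, 6, 1)
```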

          • Avatar
            zy April 3, 2019 at 6:41 pm #

            HI, I think wanghy wants to make a loan risk forecast.
            The customer should pay 6 repayments for each loan.
            So if you use [10000,6,1] to train the model,
            how do you feed data when making a prediction after 3 repayments have been made?

          • Avatar
            Jason Brownlee April 4, 2019 at 7:45 am #

            Perhaps you can make 1 step forecasts based on zero or more input time steps and use zero padding to fill out the inputs?


          • Avatar
            zy April 4, 2019 at 1:38 pm #

            thanks jason
            I am still confused about the timesteps.
            Suppose the customer pays 240 repayments for each loan (20 years) and the loan risk is related to the latest year’s repayments.
            So, the train data should be [rows, 12, 1] instead of [rows, 240, 1], right?
            Sometimes, the values of timesteps and features are very large, such as 1024, 512.
            How to choose the value of the timesteps when the memory is not enough?

          • Avatar
            Jason Brownlee April 4, 2019 at 2:17 pm #

            This explains it well:

            Trial and error is a good approach, test different framings of the problem and see what works best.

          • Avatar
            zy April 8, 2019 at 10:41 am #

            thanks very much

      • Avatar
        Nomi October 29, 2019 at 8:36 pm #

        Sir, I am working on a prediction problem. I have experimental data in a 1D column of 8000 elements and also 3D data distributed along spatial coordinates like [x y z value]. I am confused about how to use my 1D data and then my 3D data for prediction.

  2. Avatar
    Oliver August 31, 2017 at 9:23 pm #

    Hi Jason,

    thanks a lot for all the explanations you gave!
    I tried to understand the effect of the reshape parameters and the effect in the spyder/variable explorer. But I do not understand the result shown in the data window.
    I used the code from a different tutorial:

    from numpy import array

    # 10 rows (time steps) by 2 columns (features)
    data = array([
        [0.1, 1.0],
        [0.2, 0.9],
        [0.3, 0.8],
        [0.4, 0.7],
        [0.5, 0.6],
        [0.6, 0.5],
        [0.7, 0.4],
        [0.8, 0.3],
        [0.9, 0.2],
        [1.0, 0.1]])
    # reshape to 1 sample, 10 time steps, 2 features
    data_re = data.reshape(1, 10, 2)

    When checking the result in the variable explorer of spyder I see 3 dimensions of the array but cannot connect them to the parameters sample, timestep, feature.

    On axis 0 of data_re I see the complete dataset
    On axis 1 of the data_re I get 0.1 and 1.0 in column 1
    On axis 2 of the data_re I see the column 1 of axis 0 transposed to row 1

    Would you give me a hint how to interpret it?


    • Avatar
      Jason Brownlee September 1, 2017 at 6:46 am #

      There are no named parameters, I am referring to the dimensions by those names because that is how the LSTM model uses the data.

      Sorry for the confusion.

  3. Avatar
    Saga September 1, 2017 at 6:46 pm #

    Hi Jason,

    Thanks so much for the article (and the whole series in fact!). The documentation in Keras is not very clear on many things on its own.

    I have been trying to implement a model that receives multiple samples of multivariate timeseries as input. The twist is that the length of the series, i.e. the “time steps” dimension is different for different samples. I have tried to train a model on each sample individually and then merge, (but then each LSTM is going to be extremely prone to overfitting). Another idea was to scale the samples to have the same time steps but this comes with a scaling factor of time steps for each sample which is not ideal either.

    Is there a way to provide the LSTM with samples of dynamic time steps? maybe using a lower-level API?


    • Avatar
      Jason Brownlee September 2, 2017 at 6:07 am #

      A way I use often is to pad all sequences to the same length and use a masking layer on the front end to ignore masked time steps.
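
A minimal sketch of the padding step in plain NumPy follows. The helper name `pad_sequences_3d` is hypothetical, and the sketch assumes each sequence is a 2D array of shape (time steps, features) padded at the end with zeros:

```python
import numpy as np

def pad_sequences_3d(sequences, pad_value=0.0):
    """Hypothetical helper: post-pad (time steps, features) arrays to equal length."""
    max_len = max(len(seq) for seq in sequences)
    n_features = sequences[0].shape[1]
    # Start from an array filled with the padding value.
    out = np.full((len(sequences), max_len, n_features), pad_value)
    for i, seq in enumerate(sequences):
        # Copy the real time steps; the remainder stays as padding.
        out[i, :len(seq), :] = seq
    return out

a = np.ones((3, 2))  # 3 time steps, 2 features
b = np.ones((5, 2))  # 5 time steps, 2 features
X = pad_sequences_3d([a, b])
print(X.shape)  # (2, 5, 2)
```

A Keras `Masking` layer with a matching `mask_value` could then be placed in front of the LSTM so that the padded time steps are skipped. One caveat of this approach is that a genuine time step whose features all equal the padding value would be skipped as well.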

      • Avatar
        Mohamad May 13, 2018 at 10:43 pm #

        HI Jason,
        Thank you for this amazing article.
        I have the same problem here. which is the samples have many different lengths.

        I did not get the idea you said.
        “A way I use often is to pad all sequences to the same length and use a masking layer on the front end to ignore masked time steps.”
        can you please provide more details about that?
        or maybe provide articles explain how to solve this problem.

        Thank you in advance.

  4. Avatar
    Shrimanti September 14, 2017 at 2:42 am #

    Hi Jason,

    Thanks very much for your tutorials on LSTM. I am trying to predict one time series from 10 different parallel time series. All of them are different 1D series. So, the shape of my X_train is (50000,10) and Y_train is (50000,1). I couldn’t figure out how to reshape my dataset and the input shape of LSTM if I want to use let’s say 100 time steps or look back as 100.


  5. Avatar
    Anil Pise October 14, 2017 at 6:38 am #

    Respected Sir
    I want to use an LSTM, RNN, or GRU to check changes in the facial expression of a person who is watching a movie. I want to check his mental state: whether he is bored or interested in continuing the movie, and at what time he is bored. Can you please help me with how I can start working on this?

    • Avatar
      Jason Brownlee October 15, 2017 at 5:16 am #

      That sounds like a great problem. I would recommend starting by collecting a ton of training data.

      Then think of using a CNN on the front end of your LSTM.

  6. Avatar
    Rafiya October 23, 2017 at 9:28 pm #

    I have around 12,000 tweets for sentiment classification totally. Do you think 16GB CPU RAM will be enough?

  7. Avatar
    Ravil November 19, 2017 at 10:34 pm #

    Hi Jason

    Thanks for the simple explanation.

    However, I have a doubt. What if you don’t know the no of time steps? How do you proceed then?
    Is that why we use the embedding layer?

    I intend to use it for sentiment analysis of imDb movie review dataset.

  8. Avatar
    Mohammad November 23, 2017 at 5:21 pm #

    Hi Jason,
    I finally understood the input shape requirements.
    Just a quick question: batch_size would be a certain number of samples inside a group e.g if we have 100 samples we can divide it into batches of 10. Batching helps with a faster training time right?

    • Avatar
      Jason Brownlee November 24, 2017 at 9:35 am #

      Correct, and weight updates occur and state is reset at the end of each batch.

  9. Avatar
    Eduardo Andrade November 30, 2017 at 4:24 am #

    Hi Jason,

    About sample (the first argument in reshape): if I have two sequences with different number of values (let’s suppose one with 10 values and another with 8) and want them to be considered as two distinct samples (not 2 features), a zero-padding is necessary?

    series 1: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
    series 2: 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.0, 0.0

    If I do:

    data = data.reshape(2, 10, 1)

    It is going to understand them as 2 different samples?

    • Avatar
      Jason Brownlee November 30, 2017 at 8:23 am #

      Yes, padding to 10 time steps.

      Yes, your reshape looks good.

      Explore pre and post padding to see if it makes a difference for your model.
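
A short sketch of the reshape discussed in this thread, using the two series exactly as quoted in the question (the second one already zero-padded at the end to 10 values):

```python
from numpy import array

# Two sequences, one per row; the second is post-padded with zeros.
data = array([
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.0, 0.0]])

# 2 samples, 10 time steps, 1 feature.
data = data.reshape(2, 10, 1)

print(data.shape)  # (2, 10, 1)
```

Each row of the original matrix becomes its own sample here, which is the difference from a (1, 10, 2) shape where the two series would be treated as two features of a single sample.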

      • Avatar
        Tseo December 8, 2017 at 12:49 am #

        With this input, the model is going to understand two different series?

        Why not use a (1, 10, 2) shape?

        • Avatar
          Jason Brownlee December 8, 2017 at 5:42 am #

          You could treat them as two features as you suggest; I thought they were separate samples.

  10. Avatar
    Ali December 2, 2017 at 10:55 pm #

    Hi Jason,

    Thanks a lot for the tutorial!
    I am trying to understand the input shape for LSTM data (No. of timesteps & no. of features). Could I ask what each will be in the context of the iris dataset, please?

    Am I correct to say that in the iris dataset, the timesteps can be 2, 3, 5, 6 – as long as it neatly divides the dataset into equal number of rows (iris has 150 rows).
    And the number of features will be the number of columns (apart from the target column/class)?

    Thanks ever so much!

    • Avatar
      Jason Brownlee December 3, 2017 at 5:25 am #

      The iris dataset is not a sequence classification problem. It does not have time steps, only samples and features.

  11. Avatar
    Tsep December 7, 2017 at 11:58 am #


    First of all, thank you very much for your posts, I have learned a lot.

    I’m asking because I’m not sure how to approach the following type of problem: multiple sequences of multiple features.

    For example, predicting the amount that a user could spend given the previous purchases (here I can consider different features such as the previous amounts, products, day of week, etc.). If I have a dataset with data for 1000 users and I want to predict the amount for each user, how should this be addressed?
    Can I use a lstm for all users or each user will have a model/lstm?

    I understand that a lstm for all users could see things more interesting.. But I don’t know how to organize the input of different users.. because the example of two sequences (1,10,2) I don’t know how to apply.. I want to include more features for each sequence..

    I’m very lost..
    Thank you in advance

    • Avatar
      Jason Brownlee December 7, 2017 at 3:04 pm #

      Perhaps start off by modeling individual users?

      • Avatar
        Tseo December 7, 2017 at 9:28 pm #

        By modelling individual users do you mean a lstm per user?

        I have users with 200 purchases but others only with only 10.. would be enough?

        I will try!


    • Avatar
      Emi August 11, 2019 at 2:27 am #

      I have the same issue. It’s driving me crazy!

      datetime user_id feture_1 feature_2 feature_3 …

      2018-01-01 0 1 2 3 …
      2018-01-20 0 3 49 15 …
      2018-01-01 1 1 5 8 …
      20118-02-25 1 3 5 15 …


      user_id target

      0 0
      1 1

      I think I have two options: one with a DL model (LSTM maybe), but I’m not sure how to organize the training set. The other could be to group features by user_id, apply a count of the categorical feature (previously one-hot encoded), and apply a Decision Trees model.

      How did you address it?

      • Avatar
        Jason Brownlee August 11, 2019 at 6:04 am #

        You may have to transform the data prior to modeling.

        E.g. sequence of prior user actions on day, week or month intervals and a user target output. Perhaps with zero padding on the input sequence.

  12. Avatar
    Peter December 17, 2017 at 7:50 am #

    Hi Jason,

    Thanks for your tutorial and for your book!
    I am not sure how to design the input shape of the following table or dataframe:

    date, product, store, hasPromotion, attrib1, attrib2, quantity (t)

    The first three columns are the key. We have 50000 products in 20 stores and I would like to predict the quantity (per product per store) at least 14 days ahead with LSTM.

    What is the good start for the 3D input?
    I am wondering if creating new features from date (as there are repetition), like day of week, day of month, month of year, etc. + the existing features + quantity (t), quantity (t+1) would do…

    Thank you for your help in advance!

    • Avatar
      Jason Brownlee December 17, 2017 at 8:56 am #

      Drop date and you have 6 features, does that help?

      • Avatar
        Peter December 18, 2017 at 8:02 am #

        OK, thanks. If there are seasonality and trend in sales, should I remove them before train the LSTM, too?

  13. Avatar
    Vic January 8, 2018 at 11:31 am #


    Thank you so much for the great amount of tutorials on LSTMs
    Im trying to build an LSTM in keras using your examples and keep running into shape issues.

    I have time series data set with prices for different things, and am trying to predict the price of item4 for time t+1
    Item4 is a lagged value so that you can use previous set of prices to predict the next.
    The data set has 400 sequential observations.

    variables: datetime price1 price2 price3 item4_price

    since the data variable has uniform interval of observations and none are missing, i am dropping the datetime variable.
    So now i have 4 variables and 400 observations.

    trainX = train[:, 0:-1] #use first 3 variables
    trainY = train[:,-1] #use the last variable

    so now the trainX data set has price1 price2 and price3 variables (it’s my understanding that this means there are 3 “features” in keras)
    trainY is the predictor data set and only contains item4_price

    trainX = numpy.reshape(trainX, (1, 400, 3)) #reshape, this means there is 1 sample, 400 timestamps, and 3 features

    model = Sequential()
    model.add(LSTM(5, input_shape=(1, 400, 3), return_sequences=True))
    model.compile(loss='mean_squared_error', optimizer='adam')
    model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)

    I keep getting various shape errors all the time, no matter what I do. I tried switching it around, and even omitting the first dimension.
    I was wondering if you could point me in the right direction as to what I keep missing in my understanding of keras/lstm shapes.

    I also don’t know if the trainY set needs shaping? I tried to shape it too but python was also not happy with that.

    Let me know what you think!


    • Avatar
      Jason Brownlee January 8, 2018 at 3:53 pm #

      Perhaps start with one series and really nail down what is required.

      Did you try this tutorial:

      • Avatar
        Vic January 9, 2018 at 1:33 am #

        Hi Dr. Brownlee,

        I have previously read that tutorial and feel as though i understand it fine.
        But when applying what I learned to the problem in a way as described previously, find that Im running into some trouble.

        So i was hoping I was just overlooking something, but at this point im not really sure what. Is what Im doing seem reasonable?


        • Avatar
          Jason Brownlee January 9, 2018 at 5:33 am #

          Perhaps, but I don’t know your problem as well as you and there is no set way to solve any ml problem.

          I would encourage you to brainstorm and try a suite of approaches to see what works best.

  14. Avatar
    sujan Ghimire January 10, 2018 at 12:31 pm #

    Hi Jason,

    I have gone through this tutorial but i have a input size of 1762 X 4 and output size 1762X 1.

    I did as follows but the shape of y train is giving as (1394, 4) , which should be 1394,1

    Can you help me on this?

    • Avatar
      Jason Brownlee January 10, 2018 at 3:43 pm #

      Sorry, I cannot debug your code for you. I simply do not have the capacity, I’m sure you can understand.

      Perhaps post your error to stackoverflow or cross validated?

  15. Avatar
    Siji January 17, 2018 at 6:28 pm #

    I got an exception “ValueError: Input arrays should have the same number of samples as target arrays. Found 1 input samples and 21 target samples”.

    =>print X_train

    [[ 0.15699646 -1.59383227]
    [-0.31399291 -0.03680409]
    [ 0.15699646 -1.59383227]
    [-0.31399291 0.78456757]
    [ 0.15699646 -1.59383227]
    [ 4.39590078 -1.59383227]
    [-0.31399291 1.38764971]
    [-0.31399291 -0.03680409]
    [-0.31399291 -0.32252408]
    [-0.31399291 0.6081381 ]
    [-0.31399291 -0.32252408]
    [-0.31399291 1.38764971]
    [-0.31399291 0.78456757]
    [-0.31399291 -0.03680409]
    [-0.31399291 0.78456757]
    [ 0.15699646 1.24889926]
    [-0.31399291 -0.32252408]
    [-0.31399291 1.24889926]
    [-0.31399291 -0.69488163]
    [-0.31399291 -0.69488163]
    [-0.31399291 0.6081381 ]]

    =>print y_train

    0 1
    1 1
    2 1
    3 1
    4 1
    5 1
    6 1
    7 0
    8 0
    9 0
    10 0
    11 0
    12 0
    13 0
    14 1
    15 1
    16 1
    17 1
    18 0
    19 0
    20 0
    Name: out, dtype: int64



    =>print X_train.shape

    (21, 2)

    =>print X_test.shape

    (8, 2)

    I have reshaped the inputs to 3dimensional input. I have followed you steps.
    =>X_train = X_train.reshape(1,21, 2)

    (1, 21, 2)
    model = Sequential()
    model.add(LSTM(32, input_shape=(21, 2)))


    history = model.fit(X_train, y_train, batch_size=13, epochs=14)

    ValueError Traceback (most recent call last)
    in ()
    ----> 1 history = model.fit(X_train, y_train, batch_size=13, epochs=14)

    /home/siji/anaconda2/lib/python2.7/site-packages/keras/models.pyc in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, **kwargs)
    891 class_weight=class_weight,
    892 sample_weight=sample_weight,
    –> 893 initial_epoch=initial_epoch)
    895 def evaluate(self, x, y, batch_size=32, verbose=1,

    /home/siji/anaconda2/lib/python2.7/site-packages/keras/engine/training.pyc in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
    1553 class_weight=class_weight,
    1554 check_batch_axis=False,
    -> 1555 batch_size=batch_size)
    1556 # Prepare validation data.
    1557 do_validation = False

    /home/siji/anaconda2/lib/python2.7/site-packages/keras/engine/training.pyc in _standardize_user_data(self, x, y, sample_weight, class_weight, check_batch_axis, batch_size)
    1419 for (ref, sw, cw, mode)
    1420 in zip(y, sample_weights, class_weights, self._feed_sample_weight_modes)]
    -> 1421 _check_array_lengths(x, y, sample_weights)
    1422 _check_loss_and_target_compatibility(y,
    1423 self._feed_loss_fns,

    /home/siji/anaconda2/lib/python2.7/site-packages/keras/engine/training.pyc in _check_array_lengths(inputs, targets, weights)
    249 ‘the same number of samples as target arrays. ‘
    250 ‘Found ‘ + str(list(set_x)[0]) + ‘ input samples ‘
    –> 251 ‘and ‘ + str(list(set_y)[0]) + ‘ target samples.’)
    252 if len(set_w) > 1:
    253 raise ValueError(‘All sample_weight arrays should have ‘

    ValueError: Input arrays should have the same number of samples as target arrays. Found 1 input samples and 21 target samples.

    Please solve my problem. I am new in this area. What is the mistake

    • Avatar
      Jason Brownlee January 18, 2018 at 10:05 am #

      Perhaps cut the example back to a few lines to help expose the fault?

  16. Avatar
    Mikel February 15, 2018 at 2:34 am #

    Hi Jason,

    I’m trying to understand the input_shape but I think I’m totally confused about the time step variable. I have a multivariate time series with 18,000 samples and 720 features. I created a 10 lagged observation dataset to forecast the next 5 time steps so my dataset goes from t-10 to t+5, being the feature dataset from t-10 to t and the label dataset from t+1 to t+5.

    Assuming that I take 15,000 samples for training, what will be the values of the reshape function? I think it should be [15000, 1, 7200 (720 features * 10)]. Regarding the time step, is the value “1” correct or it should be the number of lagged observations, that is, 10?

    Thank you in advance.

  17. Avatar
    Amim April 11, 2018 at 7:15 pm #

    Hi Jason!
    Thank you so much for all the tutorials on LSTMs, I’ve learned a lot.

    I’m trying to implement the LSTM Architecture from the paper “Dropout improves Recurrent Neural Networks for Handwriting Recognition” for resolving the handwritten recognition problem.
    Basically I have to train the network giving in input variable-sized images (different W and H but always 3 channels) and to predict what is the word written in the image. What I can’t understand is how to deal with variable sized images? Can I consider images as some sequences (for ex. a 50×30 image considered as 50 sequences with 30 features?). The authors say I give in input a block of image of size 2×2 scaning in 4 different directions (multidirectional multidimensional LSTM).
    What do I have to specify here: input_size(Samples, Time Steps, Features)? Does Samples refer to the number of all images I have in the training set or the number of 2×2 miniblocks? What about time steps and features? I don’t get it and it’s very confusing. Can you please help with any idea? I am new in this area and I’m stuck on this problem.

    Thanks a lot 🙂

    • Avatar
      Jason Brownlee April 12, 2018 at 8:36 am #

      I would recommend padding the inputs to a fixed size.

  18. Avatar
    Jeyel April 19, 2018 at 6:07 pm #

    Hi Jason!
    Thanks so much for your tutorials on LSTM!
    I’m trying to predict trajectory with LSTM and ARIMA now. After reading this tutorial, I’ve got some questions.
    (1) Do we must transfer time series to lag observations if we want to do forecast work with LSTM?
    (2) After transfering time series to supervised learning problems, the forecast is only related to “order” or “lag” rather than “time”(like ARIMA do)? Why the input is not time/date? And the time interval of data must be even?
    Thanks a lot in advance!

    • Avatar
      Jason Brownlee April 20, 2018 at 5:47 am #

      No, LSTMs can work with the time steps directly.

      The order of the observations is sufficient for the model, if the time steps are consistently spaced it does not need the absolute date/time information.

  19. Avatar
    Leonardo April 20, 2018 at 4:34 pm #

    Hello Jason! Congratulations on the LSTM input tutorial!
    Could you please answer three questions?

    I’m working with 500 samples that have varying sizes. My doubts are related to the organization of these 500 samples within this 3-dimensional input, mainly in relation to the Samples dimension.

    The dimension “Features” has already defined that it will have size 26, the dimension “Time Steps” will have to have size 100 but the dimension “Samples” is that I still do not know what its value will be.

    Doubt 1: In these cases of samples with different sizes to know the dimension “Samples” I have to be based on the larger sample and for the other samples I fill in the value 0 (zero) in the additional spaces?

    Doubt 2: Can I have more than one line in the “Samples” dimension representing the same sample?

    Doubt 3: How do samples have varying sizes, there are possibilities to work with 4 dimensions, for example: “Samples” x “Part of Samples” x “Time Steps” x “Features”?

    Thank you for your attention!

  20. Avatar
    Kate April 23, 2018 at 9:17 am #

    Hi Jason,

    Thank you for such a good tutorial! This really helps!

    I am not sure if I understand the model correctly:
    The sequence of samples does matter in lstm because the state of current one is affected by last one in the sequence.

    If this is the case, can you let me know how I should deal with the following scenarios?

    I have non-equally spaced trajectory data. The interval varies from seconds to days. The solutions I came up with are interpolation or adding a time feature. What do you think is a good way to prepare the data?

    Assuming the last problem is solved, how can I organize the input if my data contains trajectories of different people? For example, the trajectory of one person is (100, 5, 2) and the trajectory of another is (200, 5, 2). How can I train on both sequences in one model?

    Thank you very much!

  21. Avatar
    erik May 2, 2018 at 12:13 am #

    Hello Jason, you are doing a good job, Dr.! I am a bit confused about the data shape for my network: I have 300 different samples, where each one is measured, let's say, 1 minute after the previous one, so I have 300 time steps in total, and each file contains 1 column and 2,000 rows. Since I want to reduce the ‘features’ (rows), I am thinking that my input shape must therefore be (1,300,2000) and then I can reduce it to something like 200 with the LSTM decoder?

    • Avatar
      Jason Brownlee May 2, 2018 at 5:45 am #

      How can 1 sample have 300 time steps, 1 column and 2k rows? I do not understand, sorry.

      • Avatar
        erik May 2, 2018 at 7:21 am #

        Sorry for the obscurity. 1 sample has 2000 rows, all in one column, so only one type of value (temperature) is measured. In total I have 300 samples, and the time between their recordings is 1 min.

        • Avatar
          Jason Brownlee May 3, 2018 at 6:27 am #

          I still don’t follow.

          Are the 2000 rows related in time for one feature? Or are the 2000 rows separate features at one time?

          • Avatar
            erik May 4, 2018 at 9:37 am #

            I think, if I understood you right, the 2000 rows are related to one measurement (so I measure the temperature 2000 times in 1 second). But with an LSTM autoencoder I am trying to reduce the “features”, learn from them and then make the prediction. I do not know whether the shape for the LSTM encoder-decoder should be (1, 300,2000) or (2000,300,1), but with the latter I got strange results; the former is closer to the real data. Which one is right?

          • Avatar
            Jason Brownlee May 4, 2018 at 1:32 pm #

            The input to the encoder will be [300, ?, 2000] where ? represents the number of time steps you wish to model.

            The encoder-decoder is not appropriate for all sequence prediction problems; it is suited to sequence output that differs in length from the input. If you are doing straight sequence classification/regression, it might not be appropriate.

  22. Avatar
    paul May 14, 2018 at 3:34 am #

    Hi Jason, the example is really good. Besides this, I have a question about my data. I have temperature values measured over a period of 1 second at a sampling frequency of 10,000. So I measure 10,000 different values in 1 second, all in the same unit (let's say force). I repeat this at certain time intervals. Do I then have 10,000 different features or only one feature as the input dimension?

    • Avatar
      Jason Brownlee May 14, 2018 at 6:40 am #

      Sounds like 10K features at each time step.

      • Avatar
        catherine May 30, 2018 at 10:04 pm #

        Does this mean that one time step can have 10k features?

  23. Avatar
    Hadeer El-Zayat May 18, 2018 at 7:21 am #

    Great tutorial, Jason, but I have a problem reshaping the data for my RNN model.
    This is my code:

    import numpy as np
    from keras.datasets import imdb
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.layers import LSTM
    from keras.layers import Bidirectional
    from keras.preprocessing import sequence
    # fix random seed for reproducibility

    train = np.loadtxt("featwithsignalsTRAIN.txt", delimiter=",")
    test = np.loadtxt("featwithsignalsTEST.txt", delimiter=",")

    x_train = train[:,[2,3,4,5,6,7]]
    x_test = test[:,[2,3,4,5,6,7]]
    y_train = train[:,8]
    y_test = test[:,8]

    # create the model
    model = Sequential()
    model.add(LSTM(20, input_shape=(10,6)))
    model.add(Dense(1415684, activation = 'sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(x_train, y_train, epochs = 2)

    • Avatar
      Jason Brownlee May 18, 2018 at 8:09 am #

      What problem?

      • Avatar
        Hadeer El-Zayat May 18, 2018 at 11:21 am #

        a problem of reshaping the dataset..

        • Avatar
          Hadeer El-Zayat May 18, 2018 at 11:22 am #

          this is a sample of my dataset

          a sample of my dataset (patient number, time in mill/sec., normalization of X Y and Z, kurtosis, skewness, pitch, roll and yaw, label) respectively.





  24. Avatar
    Hadeer El-Zayat May 20, 2018 at 4:03 am #

    I didn’t know how to do it!

    • Avatar
      Jason Brownlee May 20, 2018 at 6:40 am #

      Take it slow, one step at a time.

      • Avatar
        Hadeer El-Zayat May 20, 2018 at 7:24 am #

        this is what i have accomplished

        train = np.loadtxt("featwithsignalsTRAIN.txt", delimiter=",")
        test = np.loadtxt("featwithsignalsTEST.txt", delimiter=",")

        x_train = train[:,[2,3,4,5,6,7]]
        x_test = test[:,[2,3,4,5,6,7]]
        y_train = train[:,8]
        y_test = test[:,8]

        model = Sequential()
        model.add(LSTM(64, activation='relu', batch_input_shape=(100, 10, 1)))
        model.add(Dense(1, activation='linear'))
        model.compile(loss='mean_squared_error', optimizer='adam')

        is that true ??

        • Avatar
          Jason Brownlee May 21, 2018 at 6:21 am #

          Nice work.

          What do you mean by true?

          Our job is to find a model that gives “good enough” results when making predictions. This requires careful experimentation.

      • Avatar
        Hadeer El-Zayat May 21, 2018 at 2:25 am #

        Thank you. I have tried the following code:


        train = np.loadtxt("featwithsignalsTRAIN.txt", delimiter=",")
        test = np.loadtxt("featwithsignalsTEST.txt", delimiter=",")

        x_train = train[:,[2,3,4,5,6,7]]
        x_test = test[:,[2,3,4,5,6,7]]
        y_train = train[:,8]
        y_test = test[:,8]

        x_train = x_train.reshape((-1,1,6))

        model = Sequential()
        model.add(LSTM(64, activation='relu', input_shape=(1, 6)))
        model.add(Dense(1, activation='softmax'))

        metrics=['accuracy'])
        model.fit(x_train, y_train, batch_size = 128, epochs = 10, verbose = 2)

        But it gets very low accuracy with a very high loss:

        Epoch 1/20 – 63s – loss: 15.0343 – acc: 0.0570
        Epoch 2/20 – 60s – loss: 15.0343 – acc: 0.0570
        Epoch 3/20 – 60s – loss: 15.0343 – acc: 0.0570
        Epoch 4/20 – 60s – loss: 15.0343 – acc: 0.0570

  25. Avatar
    Ryan May 20, 2018 at 10:15 am #

    What does 32 in model.add(LSTM(32)) mean?

    • Avatar
      Jason Brownlee May 21, 2018 at 6:24 am #

      It means 32 LSTM units in the layer.

      • Avatar
        Godwin May 25, 2018 at 4:24 am #

        It would have been nice if you had added this info to your article.

      • Avatar
        uciha October 30, 2021 at 11:30 pm #

        Can the layer size be chosen arbitrarily? Or is there a rule?

        • Avatar
          Adrian Tam November 1, 2021 at 1:43 pm #

          Usually it is chosen arbitrarily at first, and then you experiment to confirm it fits the problem (i.e., test with your dataset).

  26. Avatar
    Suzi May 28, 2018 at 9:33 pm #

    Hi, Jason,

    Thank you for the great tutorial. It helps me to predict time series data sequences with the lstm model.

    However, I have a question about how to determine the length of look_back time steps.

    For example, there is a time series sequence X1, X2, X3, …, Xn. When I apply ARIMA to predict Xn+1, I can use the ACF and PACF to determine the parameters pi and qi. The value of pi indicates the look_back time steps. Then, the ARIMA equation can be used to predict Xn+1.

    But for an LSTM, I do not know how to determine the look_back time steps, in other words, the reshape size for a time series sequence. Is there any way to get an appropriate number of look_back time steps when reshaping the time series data for an LSTM? Could you please give me some suggestions about it?

    Thanks a lot.


    • Avatar
      Jason Brownlee May 29, 2018 at 6:26 am #

      Looking at ACF/PACF plots might be a good start to get an idea of the number of lag obs that are significant.

      • Avatar
        Suzi May 29, 2018 at 5:25 pm #

        Thanks for your quick reply.

        I am still confused about your suggestion. Do you mean that I need to plot the ACF/PACF to find the number of time lags for applying the LSTM?

        I do not think the ACF/PACF can be used to determine the look_back time steps for an LSTM. These two criteria explain the linear correlation of a time series.
        For time series with nonlinear correlation, the ACF/PACF neither cuts off nor tails off, and ARIMA cannot be used to model them.

        So I use an LSTM to model time series with nonlinear correlation, which LSTMs are good at. Unfortunately, the ACF/PACF is not able to find the time lag for applying the LSTM.

        Before applying an LSTM to a time series prediction, I must decide the reshape size. However, I cannot find any information on the internet about how to determine it. Is there any book or tutorial that can help me solve this problem?

        Thank you very much.

  27. Avatar
    Eriz May 29, 2018 at 6:39 pm #

    Hi Jason,

    Thanks for the article and for clarifying the dimensions, which some of us have trouble with.

    However, my question goes to something I didn’t find anybody asked in the Q&A:

    Why do you put 32 units in the input LSTM layer?

    I mean, if you have 2 features in each of the 10 time steps and one sample example, why would we want to have more than 10 neurons in the first input layer?

    As I understand LSTMs, each neuron gets fed the features of one specific time step (in the cell images on colah’s blog it is shown as Xt, as you surely know).

    If you feed the first one with “t” and continue like “t+1,t+2,t+3…t+10”, what time step will we use in the case of t+17 for example which would be the 17th neuron?

    In a fully connected ANN, the first input layer has the same number of neurons as features. Is there anything I’m missing, or is there any rule for selecting the number of neurons if our input layer is an LSTM?

    Thanks for the attention and for correcting any error that I may be not understanding.

    • Avatar
      Jason Brownlee May 30, 2018 at 6:38 am #

      The number of units in the hidden layer is unrelated to the number of input or output time steps.

      We configure neural nets by trial and error, more here:

      • Avatar
        Eriz May 30, 2018 at 9:08 am #

        Hi again Jason,

        Thanks for the quick reply.

        Let me please introduce some numbers:

        Input_shape = (300, 10, 2)
        Batch_size = 1
        Num_units in input/first LSTM layer = 32

        So, as you say, if the units in the input LSTM layer (I am supposing that it is the first layer we use) are not related to the time steps, each time we feed a batch of data into that layer through “Xt” we will feed one row (one sample) of those 300 with 10 columns and we will do it two times: one for the first feature and another for the second feature, and the important point, this feeding will be to every unit of those 32 that compose the LSTM layer. Am I getting the point?

        I get confused because in normal feedforward ANN, the first layer (the input layer) has as many nodes as features we have, so that we can feed each feature in one node.

        If you could clarify this for me, you would be doing me a big favour, because there is not much insight into these details elsewhere.

        Thanks in advance,

        • Avatar
          Jason Brownlee May 30, 2018 at 3:08 pm #

          If your batch size is 1, then each batch will contain one sample (sequence).

          Yes, the sequence will be exposed to each unit in the first hidden layer.

          • Avatar
            Eriz May 30, 2018 at 5:43 pm #

            Hi Jason,

            Okay, perfect. Now I get almost all the points.

            Thank you for your kindness,

          • Avatar
            Niklas April 23, 2019 at 5:40 pm #

            Hey Jason,
            I think I have the same question as Eriz had, but I’m not sure whether I understood his explanation correctly, so it would be great if you could tell me if I got it right.
            So the question is: How is the data fed into the first layer of an LSTM/RNN? (I hope there aren’t any differences.)
            Let’s take Eriz’s example: 32 units in the first layer and an input shape of (300,10,2).

            I understood Eriz like this:
            For one example e (from the 300 examples), all 32 units in the first layer get the time series of length 10 of example e. And this happens separately for each feature, one after the other (in this case two times), before the network processes the next example.

            Is this correct?

            Also if we look at this typical illustration of a rnn:
            Am I right that in this case the variable t in the image would be in the range 1 <= t <= 10? (because of the length of the time series)

            Thank you very much in advance, because I couldn't find any detailed description of how this works.

          • Avatar
            Jason Brownlee April 24, 2019 at 7:54 am #

            Each of the 32 units in the hidden layer is separate and does not interact with the others.

            For a given unit, it receives one time step of data at a time with 2 features. This continues until all 10 time steps have been shown, the final activation is then passed on to the next layer.

            Does that help?

          • Avatar
            Niklas April 24, 2019 at 8:58 pm #

            Somewhat. I now understand what happens for a single unit.

            You said that all units are independent. Does that mean each of the units receives the same data (the complete time series of an example) in the way you described in your second sentence?

            Again, thank you for your effort.

          • Avatar
            Jason Brownlee April 25, 2019 at 8:13 am #

            Yes. Units in a layer do not interact, and each receives the entire input.
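
As a rough illustration of the exchange above, the following NumPy sketch runs a plain tanh recurrent layer over input of shape (300, 10, 2) with 32 units. It is a simplification (no LSTM gates) and not Keras' implementation; all names and the random weights are assumptions made for illustration. It shows that the number of units sets the size of the weight matrices and of the output, independently of the 10 time steps.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_steps, n_features, n_units = 300, 10, 2, 32
X = rng.normal(size=(n_samples, n_steps, n_features))

# Simplified tanh recurrent layer (assumption: gates of a real LSTM are omitted).
W = rng.normal(size=(n_features, n_units)) * 0.1  # input weights: features -> units
U = rng.normal(size=(n_units, n_units)) * 0.1     # recurrent weights: previous state -> units
b = np.zeros(n_units)

h = np.zeros((n_samples, n_units))  # one state vector per sample
for t in range(n_steps):
    # At each time step, every unit receives the 2 features of that step.
    h = np.tanh(X[:, t, :] @ W + h @ U + b)

print(W.shape)  # (2, 32)
print(h.shape)  # (300, 32)
```

The weight shapes contain the feature count and the unit count but never the number of time steps, which is why 32 units can process sequences of length 10.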

  28. Avatar
    Manuel Gonçalves June 14, 2018 at 8:16 am #

    Hi Jason, thank you for the articles and books… I just have some open questions about shape. I have 2D multivariate data, e.g. (samples = 1024, features = 6), and if I make a supervised learning dataset with ten (10) lags, the shape will be (samples = 1024, features = 60).
    The question is: the shape for an LSTM is (samples, timesteps, features), so will it be data.reshape(1024, 10, 60)? I don’t understand why some tutorials use something like (1, 10, 1), or how to reshape and split train/test with the new shape. The steps are:

    1 – convert to supervised problem.
    2 – reshape the entire 2D dataset or split here and reshape after?
    3 – how about shape of Y to make predictions?

    I just need the steps for these key points… Thanks for the excellent posts.

    • Avatar
      Jason Brownlee June 14, 2018 at 4:06 pm #

      From your description, there is no need to worry about the lag, the time steps take care of that.

      The shape would be (1024, 10, 6)
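
One way to read the reply above is as a sliding window over the original rows rather than a reshape of lagged columns. The sketch below assumes a stand-in array of 1024 rows and 6 features and a window of 10 time steps; the variable names are hypothetical. Note that a full-length window cannot start in the last 9 rows, so this particular framing yields 1015 samples rather than exactly 1024.

```python
import numpy as np

# Hypothetical stand-in for the commenter's data: 1024 rows, 6 features.
data = np.arange(1024 * 6, dtype=float).reshape(1024, 6)

n_steps = 10  # time steps per sample (assumed window length)

# Slide a window of n_steps rows over the data, moving one row at a time.
windows = [data[i:i + n_steps] for i in range(len(data) - n_steps + 1)]
X = np.stack(windows)  # [samples, timesteps, features]

print(X.shape)  # (1015, 10, 6)
```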

      • Avatar
        Manuel Gonçalves June 22, 2018 at 12:02 am #

        Hi again Jason, thanks for the reply. On your example for one and multiple features, you say:
        – Consider a matrix of 2 columns with 10 rows. This data can be framed as 1 sample with 10 timesteps and 2 features.

        So, this “1 sample” is driving me crazy. When I have 2D data laid out as rows vs. columns (sample, features), I thought the number of samples would be the number of rows of the 2D matrix, so it would always be (sample, features) –> (sample, timesteps, features). In your example, the rows turned into timesteps and I can’t understand the sample = 1 in this post. Why one sample? Why do rows become timesteps here and samples in other examples?
        Another question is: after reshaping the input data, how do I reshape X_train, y_train and new data for predictions?

        • Avatar
          Jason Brownlee June 22, 2018 at 6:11 am #

          It is challenging in the beginning.

          Think about it like this: you are taking a 2D dataset and projecting it into a 3D space.

        • Avatar
          Priyan October 1, 2023 at 1:02 am #

          Hi Manuel,

          Sorry to post this question after years. Hope you discovered the answer. Could you post your observation?

  29. Avatar
    Hajar June 21, 2018 at 6:04 pm #


    Why do we need to reshape data into 3 dimensions for the LSTM?

    Thank you

    • Avatar
      Jason Brownlee June 22, 2018 at 6:03 am #

      Because LSTMs expect data as input in 3 dimensions.

  30. Avatar
    akbar June 26, 2018 at 7:51 pm #

    Hi, thank you so much for this article; it helped me understand Keras and the overall input side.
    I am really confused about how to prepare the output data.

    At one point we use model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)

    I have an input shape of (25000, 15). How can I prepare the overall output?

  31. Avatar
    koho July 3, 2018 at 5:55 pm #

    Thanks for sharing. I am confused about the padding and “sliding window” method. Suppose the dataset contains two sequences s1, s2 and time_step is set to 3. Then s1=(1,2,3,4,5) should have 2 subsequences: [(1,2,3), (2,3,4)], and s2=(6,7,8,9,10,11,12) should have 4 subsequences: [(6,7,8), (7,8,9), (8,9,10), (9,10,11)]. These 6 subsequences have the same length, equal to time_step, so they can be reshaped to a 3D tensor (6, 3, 1) without padding s1 and s2 to the same length. If all the sequence lengths are greater than time_step, then we don’t need to pad the sequences to the same length. Am I right?

    • Avatar
      Jason Brownlee July 4, 2018 at 8:20 am #

      Padding is only required if the number of time steps differ and/or if obs for a time step are missing.
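
The windowing described in the question can be sketched as follows. The helper name `make_windows` is hypothetical, and the sketch assumes each window is paired with the value that follows it as the prediction target, which is why the last value of each sequence never starts a window.

```python
import numpy as np

def make_windows(seq, time_steps):
    """Split one sequence into (input window, next value) pairs."""
    X, y = [], []
    for i in range(len(seq) - time_steps):
        X.append(seq[i:i + time_steps])
        y.append(seq[i + time_steps])
    return X, y

s1 = [1, 2, 3, 4, 5]
s2 = [6, 7, 8, 9, 10, 11, 12]
time_steps = 3

X1, y1 = make_windows(s1, time_steps)
X2, y2 = make_windows(s2, time_steps)

# All windows have the same length, so no padding is needed before stacking.
X = np.array(X1 + X2).reshape(-1, time_steps, 1)  # [samples, timesteps, features]
y = np.array(y1 + y2)

print(X.shape)     # (6, 3, 1)
print(y.tolist())  # [4, 5, 9, 10, 11, 12]
```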

  32. Avatar
    Anna July 10, 2018 at 4:13 am #

    Hi, thanks for the great article!

    Say I have a normalized 2D array data with a shape of (10,2)

    but when I want to reshape the data to a 3D array of (10,3,2), I got an error saying:

    “ValueError: cannot reshape array of size 20 into shape (10,3,2)”

    It seems that the previous 2D array multiply the samples of data of 10 with the input dimensions of 2 before reshaping it to a 3D array, and perhaps that caused the error?

    Thanks in advance,

    • Avatar
      Jason Brownlee July 10, 2018 at 6:52 am #

      You need more data to go from (10,2) to (10,3,2). Think about it, maybe even draw a picture of it. You are inventing dimensions that don’t exist in your original data.

      You would need (10,6) to go to (10,3,2).
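
A small NumPy sketch of the point above: a reshape can only rearrange the values that already exist, so the element count must match. The arrays here are made-up stand-ins, and the sketch assumes each row of 6 values is ordered as the 2 features of time step 1, then time step 2, then time step 3.

```python
import numpy as np

small = np.arange(20).reshape(10, 2)  # 20 values in total
try:
    small.reshape(10, 3, 2)           # would need 10 * 3 * 2 = 60 values
except ValueError as err:
    print(err)

wide = np.arange(60).reshape(10, 6)   # 60 values in total
X = wide.reshape(10, 3, 2)            # [samples, timesteps, features]

print(X.shape)        # (10, 3, 2)
print(X[0].tolist())  # [[0, 1], [2, 3], [4, 5]]
```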

      • Avatar
        Anna July 10, 2018 at 3:17 pm #

        Ok I think I got it, I should actually divide the samples with the timesteps, because doing this finally solved my problem! Thanks Jason!

  33. Avatar
    Anna July 10, 2018 at 3:07 pm #

    OK, I see. So, does first reframing the data with lagged t-n values solve the issue?

  34. Avatar
    Rana July 13, 2018 at 12:45 am #

    If I may, I have a question: I have 20 topics (classes), and each topic has 700 files. Each file represents a document as word embeddings (size of each file: number of words x 300 features). I want to train an LSTM network on this. Is it possible, and how?

    • Avatar
      Jason Brownlee July 13, 2018 at 7:42 am #

      You can get started with LSTMs and text data here:

      • Avatar
        Rana July 17, 2018 at 12:00 am #

        Thank you so much I will look it up …

        • Avatar
          Rana July 18, 2018 at 8:55 pm #

          I have another question, please. For my problem, does your book “Deep Learning for Natural Language Processing” cover LSTMs? I don’t want to use only the word embeddings; I also want to take word order into consideration.
          Or will I also need your book “Long Short-Term Memory Networks With Python”?

          Sorry for bothering you with my questions but I’m really stuck and I don’t have anyone who can help me in this matter.

          Thanks in advance…

          Best Regards,

          • Avatar
            Jason Brownlee July 19, 2018 at 7:50 am #

            I give examples of addressing NLP problems with LSTMs as well as other networks like MLPs and CNNs in “deep learning for nlp”.

  35. Avatar
    Neda July 14, 2018 at 5:24 am #

    Thanks, Jason, for your wonderful blog posts!

    I have a question regarding the input shape which I cannot find a solid answer to. I don’t know how much this question is related to this blog post, but would appreciate to hear your answer to my question:

    I have a training set which contains sequences of images (say n is the number of the images in the sequence and c, h, w are channel , height, and width). I have trained a CNN-LSTM on that with the input shape of (n,c,h,w).

    Now, for predicting through this network, it seems I have to feed sequences of data to it at each time (not a single frame). That is, with each new frame I need to update the sequence and feed it to the network to get the results.

    However, I was under the impression that when dealing with RNN or LSTM, we can feed one frame at a time (because of recurrency), rather than feeding the whole sequence. Was this impression wrong?

    So, briefly, when having an LSTM network for real time prediction, do we need to feed sequences to the network, or are there cases that we may feed a single signal/frame/datapoint?

    Thanks a lot in advance!

    • Avatar
      Jason Brownlee July 14, 2018 at 6:23 am #

      Yes, you can feed one frame of video at a time and have the CNN interpret the frames, then the LSTM put the sequence together and interpret them all.

      I have an example of this in my LSTM book. I have a summary of how to do this in Keras here:

      • Avatar
        Neda July 14, 2018 at 7:08 am #

        Thanks! I went through your other blog post before (and now again). But still I don’t see how I can feed one frame at a time. How about the input size?

        Do you mean I can have a CNN and a separate LSTM, feeding frames one at a time to the CNN and then as a sequence to the LSTM? Does this mean, again, that I have to create the sequence myself to feed to the network?

        What I don’t understand is that the input-shape of the trained network is defined to be (n, c, h, w), how can I feed an input of shape (1,c,h,w) when n is not 1?

        • Avatar
          Neda July 14, 2018 at 7:12 am #

          By the way, I have already wrapped my CNN in a TimeDistributed layer. My code is below:

          model.add(TimeDistributed(Conv2D(24, (5, 5), padding='same', activation='relu', kernel_constraint = maxnorm(3), kernel_initializer='he_normal'), input_shape=(5,1, 125, 150)))
          model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2), strides = (2,2))))

          model.add(TimeDistributed(Conv2D(36, (5, 5), activation='relu', padding='same', kernel_constraint=maxnorm(3))))
          model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2), strides = (2,2))))

          model.add(TimeDistributed(Conv2D(50, (5, 5), padding='same', activation='relu', kernel_constraint = maxnorm(3))))
          model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2), strides = (2,2))))

          model.add(TimeDistributed(Conv2D(70, (5, 5), padding='same', activation='relu', kernel_constraint = maxnorm(3))))
          model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2), strides = (2,2))))


          # define LSTM model
          model.add(LSTM(128, return_sequences=True))
          model.add(Dense(2, activation='sigmoid'))

        • Avatar
          Jason Brownlee July 15, 2018 at 6:02 am #

          No, it is one model. The size to the CNN is the size of one image. Images are exposed to the model in a sequence, something like:
          [samples, frames, width, height, channels]
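
For real-time use, one possible approach (an assumption, not something stated in the thread) is to keep a rolling buffer of the most recent frames and rebuild the sequence each time a new frame arrives. The sketch below uses the layout from the reply above with the frame size from the question; `next_input` and the constant-valued frames are hypothetical. The layout must match whatever `input_shape` the model was trained with.

```python
from collections import deque

import numpy as np

n_frames, width, height, channels = 5, 125, 150, 1
buffer = deque(maxlen=n_frames)  # automatically discards the oldest frame

def next_input(frame):
    """Add a frame; return a (1, n_frames, width, height, channels) array once the buffer is full."""
    buffer.append(frame)
    if len(buffer) < n_frames:
        return None  # not enough history yet
    return np.stack(list(buffer))[np.newaxis, ...]

out = None
for t in range(7):
    # Hypothetical frame whose pixels all equal the frame index t.
    frame = np.full((width, height, channels), t, dtype=np.float32)
    out = next_input(frame)

print(out.shape)  # (1, 5, 125, 150, 1)
```

After 7 frames the buffer holds frames 2 to 6, so each new frame produces one new five-frame sample for prediction.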

  36. Avatar
    WESIN RIBEIRO ALVES July 24, 2018 at 3:12 am #

    Thanks, Jason, for your wonderful post!

    I created a model before reading your post, and I see that I made a mistake: I swapped the time step and feature dimensions. How much can this impact model performance?

    Best regards!

    • Avatar
      Jason Brownlee July 24, 2018 at 6:22 am #

      The model learns over time. Time is key to the model’s understanding of the sequence.

  37. Avatar
    Guy Ben Mayor July 26, 2018 at 5:42 pm #

    Hi, thanks for the great article 🙂

    One thing I didn’t understand about the LSTM network:
    If the output of each time step is supposed to predict the next input,
    how come the input vector dimension (which is related to the number of features) is not equal to the output vector dimension (which is related to the number of units in the layer)?

  38. Avatar
    IM July 31, 2018 at 1:05 am #

    Hi Jason,

    Many thanks for the post – it was really useful!

    I would just like to run my problem through with you just to verify you feel the approach outlined in this tutorial is suitable for me:

    I have two columns of data: one is resonance energies and the other is the corresponding neutron widths. I want to feed 70% of this data to the network, i.e. values of resonance energies and neutron widths. Then ask the network to predict the neutron width values given the remaining 30% of unseen resonance energies.

    I wanted an LSTM layer as it may help use previous computations in its current prediction.

    So I believe I have 2 inputs and one output.

    If I have 300 values of [resonance energies, neutron widths] (i.e. 300 rows of data), would my reshape be:
    reshape(1, 300, 2) or reshape(1, 300, 1)? I’m not sure if the second column is technically a feature, as it’s meant to be the output.

    Also, would I need any explicit pairing, given each resonance energy is related to the neutron width on the same row? Or should I use some key-value pair?

    (This is also the first experiment, I also hope to then use resonance energy and neutron width to predict another variable but essentially in exactly the same way as described in this problem just the new experiment contains one more feature)

  39. Avatar
    IM August 1, 2018 at 9:23 am #

    Thank you very much for the swift reply.

    Ah my data is here

    I have read the link you provided however I am unclear as to whether my data allows me to drop the time variable as mentioned in your article. If so, I could perhaps have the sample of [1,300,1] as opposed to [1,300,2]

    One query I have: I’m getting a score of “Test Score: 0.00 MSE (0.01 RMSE)” for my test set (which is 30% of my samples). Would not having enough samples really show up as such low RMSE scores? If anything, doesn’t that show the predictions are almost too good (or overfitting)?

    Sorry, one final thing from reading this tutorial: if one uses an LSTM layer, is it still possible to use look_back (an argument used quite frequently in your other tutorials when creating a dataset)? If I am correct, an LSTM layer essentially allows previous calculations to be examined when determining the current calculation, but look_back determines how many previous timesteps can be considered at each timestep calculation?

    • Avatar
      Jason Brownlee August 1, 2018 at 2:22 pm #

      I do not have the capacity to shape the data for you. I believe you have everything you need to shape your data for an LSTM model.

      A look-back refers to the number of prior time steps of data to feed to the model in one sample. E.g. the “timesteps” in [samples, timesteps, features].

      • Avatar
        Isaac August 3, 2018 at 5:11 am #

        Hi Jason,

        Sure that sounds good I will have a go at that.

        One quick question I had: when I plot my results, in many of your tutorials you tend to use the lines:

        testPredictPlot = numpy.empty_like(dataset)
        testPredictPlot[:, :] = numpy.nan
        testPredictPlot[len(trainPredict)+(look_back)+1:len(dataset)-1, :] = testPredict

        So the first line simply creates a numpy matrix shaped like dataset,
        but does the second line fill it with NaN values? (If so, why, or is it just a check?)
        Does the third line then shift the test predict plot?

        Many thanks
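
A minimal sketch of what those three lines do, using made-up data and a simplified slice (the exact offsets in the tutorials depend on look_back and are not reproduced here). Matplotlib does not draw NaN points, so filling the array with NaN leaves gaps everywhere except where predictions are placed, which lines the predictions up with their positions in the original series.

```python
import numpy as np

dataset = np.arange(10, dtype=float).reshape(-1, 1)  # hypothetical series of 10 observations
test_predict = np.array([[7.1], [8.2], [8.9]])       # hypothetical predictions for the last 3 rows

plot_array = np.empty_like(dataset)  # same shape and dtype as dataset, contents undefined
plot_array[:, :] = np.nan            # mark every position as "no value to plot"
plot_array[-len(test_predict):, :] = test_predict  # place predictions at their time positions

print(plot_array.ravel().tolist())
```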

  40. Avatar
    Ray August 18, 2018 at 11:26 pm #

    Hi Jason,
    Thank you for this post. I’ve learned a lot from it.
    I have a question about my LSTM model for classification.
    My input data is 4842 samples, 34 time steps, 254 features. In other words, it’s (4842,34,254).
    I have trained it with proper parameters. Although I got a decent result with 98% accuracy on validation data, I got pretty low accuracy at around 20% on test data (from separate data).
    My first thought is overfitting, but I also tried callback functions such as EarlyStopping and ReduceLROnPlateau. They do not give me a better result.
    Could you give me any suggestions on this issue?

    Many thanks!

  41. Avatar
    Amal September 2, 2018 at 5:40 pm #

    Hi Jason,

    I learn a lot from your blog posts. For this specific post, I have three specific questions.

    Q1: How do we decide the value of the FIRST parameter of the LSTM constructor? You used LSTM(32, …). How was the value of 32 decided for the representative problem being addressed here? For word embedding input, is a value between 200 and 500 reasonable?

    Q2: What is the significance of this parameter? Is it the number of LSTM cells, and should it match the dimension of the input layer of the Keras model (in the case of word embeddings, a value between 200 and 500)?

    Q3: What would be the performance impact of choosing a value of 500 for this parameter?

    Thanks and Regards

  42. Avatar
    Joshua Dawson September 4, 2018 at 11:13 am #

    I just don’t understand reshape. I do not see how you calculate the array and enter it to batch size or train data on it. I have been stuck on this all day and googling. Any help from anyone is greatly appreciated

  43. Avatar
    osam September 14, 2018 at 2:25 am #

    hello Jason, thanks for your great explanation
    can you help me with this question:

    • Avatar
      Jason Brownlee September 14, 2018 at 6:38 am #

      Perhaps you can summarize it for me in a sentence?

  44. Avatar
    Sophia September 22, 2018 at 2:48 am #

    Hi Jason,

    I have multidimensional timeseries data with a sample size of 200,000 and 50 dimensions.
    I want to train a sequence to sequence autoencoder on normal data to use it later to detect anomalies. I want to have a go at this task using a LSTM autoencoder, using the example from the keras site:

    inputs = Input(shape=(timesteps, input_dim))
    encoded = LSTM(latent_dim)(inputs)

    decoded = RepeatVector(timesteps)(encoded)
    decoded = LSTM(input_dim, return_sequences=True)(decoded)

    sequence_autoencoder = Model(inputs, decoded)
    encoder = Model(inputs, encoded)

    I am confused about how to convert my data.
    When generating the sequences, let's say with a timestep of 100: do I split the 200k rows into separate sequences of 100, or use a sliding window to generate my sequences?

    Many thanks for your help.

    • Avatar
      Jason Brownlee September 22, 2018 at 6:32 am #

      Yes, you split the long sequence into subsequences.

      No need to overlap, but you can if you want and see if it improves detection.
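
A minimal NumPy sketch of both options mentioned above (separate subsequences versus an overlapping sliding window). The helper name `split_windows` and the 2,000-row stand-in array are assumptions for illustration, not part of the original question.

```python
import numpy as np

def split_windows(series, timesteps, stride=None):
    """Cut a 2D [rows, features] series into a 3D [samples, timesteps, features] array.

    stride == timesteps gives non-overlapping subsequences; a smaller
    stride gives an overlapping sliding window.
    """
    stride = timesteps if stride is None else stride
    starts = range(0, len(series) - timesteps + 1, stride)
    return np.stack([series[s:s + timesteps] for s in starts])

# Stand-in for the real data: 2,000 rows instead of 200,000, still 50 features.
series = np.random.rand(2000, 50)

no_overlap = split_windows(series, timesteps=100)           # stride 100
overlap = split_windows(series, timesteps=100, stride=50)   # 50% overlap

print(no_overlap.shape)  # (20, 100, 50)
print(overlap.shape)     # (39, 100, 50)
```

Either array can be used with an input shape of (100, 50); whether overlap helps detection is something to test on the real data.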

  45. Avatar
    Pawan October 5, 2018 at 4:51 am #

    Hi Jason,
    Excellent tutorial. I have a question that I wanted to ask.
    I have a total of 3 sequences.
    Sequence 1: It has a shape of 800×2500 (800 observations and 2500 features) It falls into category 1
    Sequence 2: It has a shape of 1000×2500. It falls into category 2
    Sequence 3: It has a shape of 600×2500. It falls into category 3.
    I have combined these 3 sequences into 1 array with a shape of 2400×2500. I want to train an LSTM network on this array, so that it learns the pattern of these sequences and predicts the category (1, 2 or 3) given a new test sequence of any length (shape ? × 2500).

    What should my input shape be? Should it be (1,2400,2500)?

    • Avatar
      Jason Brownlee October 5, 2018 at 5:41 am #

      Each sequence would be a separate sample and the number of time steps would have to be padded with zeros to match.

      The shape would be: [3, 1000, 2500]

      Training a model on 3 samples does not sound useful, you might need 3 thousand or 3 million samples.

      Also, 1000 time steps might be a little long, 200-400 is preferred.
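
The zero-padding described above could look roughly like this in NumPy. The helper `pad_sequences_3d` is a hypothetical name, and the feature count is reduced from 2500 to 25 only to keep the example small.

```python
import numpy as np

def pad_sequences_3d(sequences, value=0.0):
    """Zero-pad a list of [time_steps, features] arrays to a common length."""
    max_len = max(len(seq) for seq in sequences)
    n_features = sequences[0].shape[1]
    padded = np.full((len(sequences), max_len, n_features), value, dtype=float)
    for i, seq in enumerate(sequences):
        padded[i, :len(seq), :] = seq   # real data first, padding at the end
    return padded

n_features = 25  # assumed stand-in for the 2500 features in the question
seqs = [np.random.rand(n, n_features) for n in (800, 1000, 600)]
labels = np.array([0, 1, 2])  # one class label per sequence

X = pad_sequences_3d(seqs)
print(X.shape)  # (3, 1000, 25)
```

Each padded sequence is one sample with one label, which is why three sequences give only three training samples.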

  46. Avatar
    Halidi October 16, 2018 at 8:59 am #

    ,trainY,epochs=100,verbose=0)

    This error occurs: ValueError: Error when checking input: expected lstm_2_input to have shape (3, 1) but got array with shape (1, 1)

  47. Avatar
    adel October 17, 2018 at 3:27 am #

    Hi Jason,
    thank you so much for this tutorial. I have a question concerning Conv2D.

    I want to develop a model for binary image classification with images of size (256*256).

    I put my images in a list of NumPy arrays with length 10000, and each element is an np.array of shape (256*256).

    I get an error when I start fitting the model with input_shape = (256,256,1).

    What should I define as the input shape?

    thank you

    • Avatar
      Jason Brownlee October 17, 2018 at 6:55 am #

      A 2D CNN will require input with the shape: rows x cols x channels. [256,256,1] sounds right.
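
One possible cause of the error, offered as a guess since the code is not shown, is that the data is still a Python list of 2D arrays with no channel axis. A small sketch of building the 4D array, using 10 random stand-in images instead of 10,000:

```python
import numpy as np

# Stand-in data: 10 grayscale images of 256 x 256 pixels.
images = [np.random.rand(256, 256) for _ in range(10)]

X = np.stack(images)      # list of 2D arrays -> (10, 256, 256)
X = X[..., np.newaxis]    # add the channel axis -> (10, 256, 256, 1)

print(X.shape)  # (10, 256, 256, 1)
```

The input_shape of (256, 256, 1) describes one sample, so it leaves out the leading samples axis.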

  48. Avatar
    Tomas Bo October 24, 2018 at 7:09 am #

    Hello, thank you for your tutorials.

    I am trying to understand data inputs to LSTM autoencoder, but I am lost.
    I want use autoencoder for anomaly detection in time series (falling detection).

    500 samples (data from gyroscope – walking, jumping, running…)
    600-10000 time steps in each sample (6-100seconds).
    3 features.

    Example of one sample(one file):
    time x y z
    1 1.3 9.6 1.3
    2 1.2 9.3 1.5
    3 0.9 8.0 -2
    . . . .
    . . . .
    . . . .
    1000 1.4 9.8 2

    Train autoencoder to reproduce normal data(walking,jumping,running…).
    Anomaly detection: Reproduction error.
    I want to check anomaly (fall) in 2second time windows(200 time steps) -> Reproduce every 2second window and check reproduction error.

    Can you please explain how to prepare dataset for this task?(dimensions, structure of dataset…)

    How big should the batches and time steps (seq_length) be?

    Should I generate batches randomly, or from the start of the dataset? (batch1: 0 – batch_size, batch2: batch_size – batch_size*2 …) -> I saw that someone generated random batches in every iteration. Couldn’t that cause some data to be used multiple times and other data not to be used at all?

    Thank you.

  49. Avatar
    Jaskaran October 24, 2018 at 5:30 pm #

    Hello, thanks a lot for this tutorial. I’m working on some project but still stuck after reading this. Here’s the description

    I have a dataframe of 2 columns, both text – one is title and other is the label to it.
    Unique label count is around 40k so one hot encode was out of question.
    I used word2vec with size=150 for both, trained and used the created model to encode both title and label.
    e.g. for hello world
    I split them and then use their respective word vectors of size 150, add them and normalize to create a vector that represents hello world.

    So both columns in my dataframe had been changed and each column of each row has a vector.

    dataframe.shape shows (len, 2) where len is length of dataframe
    and then I did

    X = df['title'].values
    y = df['label'].values

    and I got two numpy.ndarray with following shapes

    X.shape –> (len,)
    X[0].shape –> (150,)
    y.shape –> (len,)
    y[0].shape –> (150,)

    After this I’m stuck with input and output shapes for the network.

    I tried with LSTM and I got the error that lstm expected 3 dimensions but got array

    I tried with Dense layers and still got shape errors.

    Basically I’m struggling as to what the input and output should be for the network.

    Any help in this regard is appreciated. I can provide more details and code if needed.

    • Avatar
      Jason Brownlee October 25, 2018 at 7:50 am #

      What do you mean you used word2vec for the label? Does that mean you are predicting a vector that is then mapped to words? Sounds odd.

      I would recommend an embedding on the front end and softmax on the output with 40k one hot encoded vectors.
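
On the shape problem itself: an array of shape (len,) whose elements are each (150,) vectors is an object array, and it usually has to be stacked into a regular 2D array before any reshape. A sketch with 5 made-up rows, assuming every vector has the same length:

```python
import numpy as np

# Stand-in for df['title'].values: an object array whose elements
# are separate 150-dimensional vectors.
n_rows = 5
X_obj = np.empty(n_rows, dtype=object)
for i in range(n_rows):
    X_obj[i] = np.random.rand(150)

print(X_obj.shape)  # (5,)

X = np.stack(list(X_obj))                      # -> (5, 150)
X_lstm = X.reshape(X.shape[0], 1, X.shape[1])  # one time step of 150 features

print(X.shape, X_lstm.shape)  # (5, 150) (5, 1, 150)
```

The 2D version suits Dense layers; the 3D version treats each averaged vector as a single time step, which gives an LSTM no sequence to work with.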

  50. Avatar
    rajesh November 5, 2018 at 6:48 pm #

    hi jason, thank you for the tutorial.

    I am trying to feed 3 column(3316 rows) merged encoded text Train_data and Train_Labels categorizing class 1-9(3316 rows) to an lstm network.
    Train_input last column is output of word embedded vector of 50 dimensions,(3316,50)
    Train_input first and second columns are words – one-hot encoded text data (3316,2)
    After merging three columns the shape is (samples=1, time_steps=3316, features=3)

    TraIN_LABELS categorizing above train_data into class 1-9. encoded it using label encoder.

    train_input=(1,3316,3), train_label=(1,3316,9) facing error with this data shape

    reshaping labels to (1,3316,3) is not happening

    how do i reshape labels to feed to lstm?

    • Avatar
      Jason Brownlee November 6, 2018 at 6:29 am #

      The labels are typically a 1D array with one element per input sample.

      • Avatar
        rajesh November 7, 2018 at 5:44 pm #

        Thank you for the reply jason.

        The LSTM network returned the error “expecting dense 3dimensional shape instead received
        (3316,)” when given a 1D array.
        Is there any other way that I can feed it? Have I made a mistake in reshaping the train data?

  51. Avatar
    cfdcfd November 5, 2018 at 8:26 pm #

    Really appreciated your explanation!
    Could you explain how to feed multi-input the LSTM, let’s say:
    you have: data = data.reshape(1, 10, 2)
    data = array([
    [0.1, 1.0],
    [0.2, 0.9],
    [0.3, 0.8],
    [0.4, 0.7],
    [0.5, 0.6],
    [0.6, 0.5],
    [0.7, 0.4],
    [0.8, 0.3],
    [0.9, 0.2],
    [1.0, 0.1]])

    and model.add(LSTM(32, input_shape=(10, 2)))

    So in 1 iteration of an epoch, the values in the first column (0.1, 0.2, …, 1.0) will be fed into xt, xt+1, …, xt+9 of the input gate of the LSTM. Will the values in the 2nd column (1.0, 0.9, …, 0.1) also be fed into xt, xt+1, …, xt+9, or will they be fed into another input gate of the LSTM: xxt, xxt+1, …, xxt+9?

    • Avatar
      Jason Brownlee November 6, 2018 at 6:30 am #

      Not quite, if there is 1 sample, and your batch size is 1 sample, then all time steps in the 1 input sample will be fed into the model for epoch 1.

      • Avatar
        cfdcfd November 6, 2018 at 11:54 am #

        Sorry, I am still not clear !
        So model.add(LSTM(32, input_shape=(10, 2)))
        What does the number 2 in input_shape=(10, 2) mean?

        • Avatar
          Jason Brownlee November 6, 2018 at 2:19 pm #

          (10,2) means that each sample will have 10 timesteps with 2 features.
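
To make the (10, 2) shape concrete, this sketch walks through the single sample from the question one time step at a time. At each step the layer receives one vector holding both feature values.

```python
import numpy as np

data = np.array([
    [0.1, 1.0], [0.2, 0.9], [0.3, 0.8], [0.4, 0.7], [0.5, 0.6],
    [0.6, 0.5], [0.7, 0.4], [0.8, 0.3], [0.9, 0.2], [1.0, 0.1]])
data = data.reshape(1, 10, 2)  # [samples, time steps, features]

# Walk through the single sample one time step at a time.
for t in range(data.shape[1]):
    x_t = data[0, t, :]  # both features arrive together at step t
    print(t, x_t)
```

So the two columns are not fed to separate inputs; each x_t is a length-2 vector.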

  52. Avatar
    Jungbin November 5, 2018 at 8:35 pm #

    Hello Jason, thanks a lot for your great post.

    I had understood the post and made a dataset as follow.

    x_train_shape : (35849, 100, 3)
    y_train_shape : (35849, )
    x_validation_shape : (8963, 100, 3)
    y_validation_shape : (8963, )
    x_test_shape : (11204, 100, 3)
    y_test_shape : (11204, )

    It is a time series sensor data of three-channel.

    And, the actual form of the data is as follow.

    (2067, 1976, 1964)
    (2280, 1994, 1952)
    (2309, 1976, 1968)
    (2020, 2160, 1979)
    (1994, 2181, 2064)

    I labeled the data for a particular section using [window size: 100, stride: 15] per channel.

    For example, if the particular section is from 251 to 750, 27 windows are cut from the data as follows.

    251~350, 266~365, 281~380, …, 641~740

    With this data, I was able to train a DFN and a CNN that perform effectively.

    However, when I train an LSTM, learning does not progress: the loss does not decrease and the accuracy stays around 50%.

    So I would like to hear your opinion about this phenomenon, which has a broad and deep knowledge in this field.

    Should I change the data structure differently?

    Or, is there a particular LSTM-containing model structure suitable for data like this structure?

    Or, do you have another new opinion?

    Thank you so much for your interest in my problem.

    • Avatar
      Jason Brownlee November 6, 2018 at 6:31 am #

      LSTMs are generally poor at time series forecasting. You may need to carefully tune the model.

      • Avatar
        Jungbin November 6, 2018 at 6:54 pm #

        Thank you for your reply.

        Last night, I was contemplating a lot. I was able to figure out why my dataset is not suitable for this model.

        This is because a unit of data cannot affect the decision.

        So I’m going to find a way to increase the data processing unit on LSTM.

        I have been able to think a lot through your material and reply.

        Thank you so much again.
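
The windowing arithmetic in this thread (window size 100, stride 15, section 251 to 750) can be checked with a few lines of plain Python. The function name `window_bounds` is made up for this sketch.

```python
def window_bounds(first, last, window, stride):
    """Return inclusive (start, end) index pairs for windows inside [first, last]."""
    bounds = []
    start = first
    while start + window - 1 <= last:
        bounds.append((start, start + window - 1))
        start += stride
    return bounds

bounds = window_bounds(first=251, last=750, window=100, stride=15)
print(len(bounds))   # 27
print(bounds[:3])    # [(251, 350), (266, 365), (281, 380)]
print(bounds[-1])    # (641, 740)
```

This reproduces the 27 windows listed in the question, with the last window ending at 740 because the next one would run past 750.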

  53. Avatar
    Soheila November 9, 2018 at 4:58 pm #


    I am really confused with Input shape. Assume I have a data set of some info about houses.
    Features are [size, room_no, floor_no]

    My data set contains 4 samples as follow:
    dataX = [[200, 3, 1], [150, 1, 1], [270, 4, 2], [320, 3, 2]]. Which for example dataX[0] is house number0 with size of 200, 3 rooms and 1 floor.

    Now I want to train my LSTM. Are (samples, time_step, feature) different from the ones I defined here? I mean I have 4 samples with 3 features! How do you tell that to LSTM? For example would you please say that in the first 2 or 3 time steps what data is fed into LSTM?

    Thank you very much. 🙂

    • Avatar
      Jason Brownlee November 10, 2018 at 5:58 am #

      Each of size, rooms and floor are “features”. Features=3

      If you have 4 houses, these are “samples”. Samples=4

      But you did not mention any time steps. If you have no time steps (observations over time), an LSTM is a poor choice and I would recommend an MLP instead.

  54. Avatar
    nuiii November 21, 2018 at 1:04 am #

    Hello Jason,

    First of all, thanks for all your work. We are grateful for the support you give us in all your tutorials!

    On the other hand, I have a question. For LSTMs we need a 3D array (samples, timesteps, features). But I still don’t understand what “timesteps” means. Is it the same as the look_back variable?


    • Avatar
      Jason Brownlee November 21, 2018 at 7:53 am #

      Time steps are observations (features) made over time (e.g. minutes, hours, etc.). Or, they can be items in a sequence like words in a sentence.

      Does that help?
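
In tutorials that use a look_back variable, it typically becomes the time steps dimension. A minimal sketch for a univariate series, assuming the goal is to predict the next value from the previous look_back values (the helper name `make_supervised` is hypothetical):

```python
import numpy as np

def make_supervised(series, look_back):
    """Turn a 1D series into X of shape [samples, look_back, 1] and y of shape [samples]."""
    X, y = [], []
    for i in range(len(series) - look_back):
        X.append(series[i:i + look_back])
        y.append(series[i + look_back])
    X = np.array(X).reshape(-1, look_back, 1)
    return X, np.array(y)

series = np.arange(10, dtype=float)  # 0.0, 1.0, ..., 9.0
X, y = make_supervised(series, look_back=3)

print(X.shape, y.shape)      # (7, 3, 1) (7,)
print(X[0].ravel(), y[0])    # [0. 1. 2.] 3.0
```

Here each sample has 3 time steps of 1 feature, so the matching input shape would be (3, 1).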

  55. Avatar
    Joe November 22, 2018 at 10:10 pm #

    Like others, I wanna say thank you for this and other useful articles.

    I need to connect output of ConvLSTM2D to a LSTM!
    The output of ConvLSTM2D is (samples, time, output_row, output_col, filters) when return_sequences is True.
    I’m confused about how this 5-D output can be fed to an LSTM!
    I would be thankful for any help!

    • Avatar
      Jason Brownlee November 23, 2018 at 7:50 am #

      Are you sure the output of the ConvLSTM2D is as you describe?

      If so, perhaps you can use a lambda to flatten rows/cols and filters.
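
The flattening suggested above amounts to merging the last three axes while keeping samples and time. This NumPy sketch shows the shape change on a random stand-in array; the sizes are arbitrary.

```python
import numpy as np

# Stand-in for a ConvLSTM2D output with return_sequences=True:
# [samples, time, rows, cols, filters]
conv_out = np.random.rand(4, 6, 8, 8, 16)

samples, time_steps = conv_out.shape[:2]
flat = conv_out.reshape(samples, time_steps, -1)  # merge rows, cols, filters

print(flat.shape)  # (4, 6, 1024)
```

Inside a Keras model the same effect needs a layer rather than a NumPy call; wrapping Flatten in TimeDistributed is one option worth trying.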

  56. Avatar
    Ahmed November 30, 2018 at 10:12 pm #

    Hello Jason,

    I have a dataframe of n,p rows (n for the different samples and p for q timesteps and r static feature p = q+r ). For each row, I have q values for the measure of interest and r static features.

    Basically, I want to do a time series classification that is based also on the static features.

    From what I understood, the input shape should be (n,q,r) but I cannot transform my dataframe from (n,p) to (n,q,r) since p = q+r

    Thank you a lot for your help !
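
One option, among several, is to give the measured series one feature per time step and repeat the static features at every step, which yields (n, q, 1 + r) rather than (n, q, r). The sketch assumes the first q columns hold the time steps and the last r columns hold the static features.

```python
import numpy as np

n, q, r = 5, 12, 3                # samples, time steps, static features
frame = np.random.rand(n, q + r)  # stand-in for the (n, p) dataframe values

series = frame[:, :q].reshape(n, q, 1)  # measured value per time step
static = frame[:, q:]                   # (n, r)
static_rep = np.repeat(static[:, np.newaxis, :], q, axis=1)  # (n, q, r)

X = np.concatenate([series, static_rep], axis=2)  # (n, q, 1 + r)
print(X.shape)  # (5, 12, 4)
```

An alternative design is a two-input model, with the series going through the LSTM and the static features joining later through a Dense branch.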

  57. Avatar
    Guido Salimbeni December 7, 2018 at 10:58 pm #

    Dear Jason,
    why do we need to reshape the data in NumPy before feeding the data to the LSTM? Why doesn’t Keras do it automatically?

    if I have a sequence of 10 values and I want to predict the 11th value, I guess Keras LSTM has already all the information in a numpy array of size (1,10) to answer the task of predicting the 11th value . Why do I need to reshape it to (1,10,1) ?


    • Avatar
      Jason Brownlee December 8, 2018 at 7:09 am #

      Why – that is the expectation of the API. Don’t fight it, meet it.
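
One way to see why the reshape cannot be inferred: a (1, 10) array has two equally valid 3D readings, and only the user knows which one is meant.

```python
import numpy as np

seq = np.arange(1, 11, dtype=float).reshape(1, 10)  # shape (1, 10)

as_time_steps = seq.reshape(1, 10, 1)  # 1 sample, 10 time steps, 1 feature
as_features = seq.reshape(1, 1, 10)    # 1 sample, 1 time step, 10 features

print(as_time_steps.shape, as_features.shape)  # (1, 10, 1) (1, 1, 10)
```

For predicting the 11th value from 10 ordered observations, the first reading is the one that presents the data as a sequence.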

  58. Avatar
    an December 10, 2018 at 9:24 pm #

    Hi Jason,
    Thank you for your great post.
    I have problem about padding.

    I use the pre-trained sentence embedding(dim:300),
    and 1 sample means a document in my application.
    Now I have variable sentence number in documents.
    (Not sentence sequences with variable words length)
    E.g., doc1:[sent1, sent2]
    doc3:[sent1, sent2, sent3]

    The zero-padding in variable sentence sequences means pad 0 after sequence of word_index
    But pre-trained sentence embedding can’t map sentence to index

    How do I pad the sentence number to a fixed size?
    Does it just pad 300 zeros in each timestep, and use mask layer to skip those zeros?
    E.g., pad_sent = np.zeros((1,300))
    doc1:[sent1, sent2, pad_sent]
    doc2:[sent1, pad_sent, pad_sent]
    doc3:[sent1, sent2, sent3]

    Thank you.

    • Avatar
      Jason Brownlee December 11, 2018 at 7:43 am #

      The padded values can be mapped to the index of “unknown” in the embedding, which is often also 0.
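
When the inputs are already embedding vectors rather than indexes, the approach proposed in the question (padding with all-zero vectors) can be sketched as follows. The boolean mask is an illustration of which time steps are real; the document sizes are made up.

```python
import numpy as np

emb_dim = 300
# Stand-in documents: each is a [n_sentences, 300] array of sentence embeddings.
docs = [np.random.rand(n, emb_dim) for n in (2, 1, 3)]

max_sents = max(len(d) for d in docs)
X = np.zeros((len(docs), max_sents, emb_dim))
mask = np.zeros((len(docs), max_sents), dtype=bool)
for i, d in enumerate(docs):
    X[i, :len(d)] = d        # real sentences first
    mask[i, :len(d)] = True  # True marks a real sentence, False marks padding

print(X.shape)           # (3, 3, 300)
print(mask.sum(axis=1))  # [2 1 3]
```

In Keras, a Masking layer with mask_value=0.0 placed before the LSTM is the usual way to have such all-zero time steps skipped.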

  59. Avatar
    David King January 18, 2019 at 7:30 am #

    Hi Jason. Hopefully you have not already answered this. To prep the x data for LSTM input,
    the data is reshaped to a 3D array. Should something similar also be done to the y data? Should it also be divided up into subsequences so that the output is synced up with the input?
    If so, should that be done prior to transforming the data with one hot encoding? Or does the
    one hot encoding come after? Thanks for all your help.

    • Avatar
      Jason Brownlee January 18, 2019 at 10:15 am #

      No, not unless you are predicting a sequence output.

      If you are predicting a multi-class label, a one hot encoding is used.
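
For the multi-class case, the target stays 2D: one row per input sample. A small NumPy sketch of one hot encoding integer labels, with made-up label values:

```python
import numpy as np

y = np.array([0, 2, 1, 2])  # one integer class label per sample
n_classes = 3

y_onehot = np.eye(n_classes)[y]  # shape [samples, n_classes]
print(y_onehot.shape)  # (4, 3)
```

Only X gets the 3D [samples, time steps, features] shape here; y has one encoded label per sample.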

  60. Avatar
    Tobias Piechowiak January 31, 2019 at 11:19 pm #

    Hi Jason,

    great article – always come here when needed a boiled-down explanation. Thank you!

    Regarding input shapes – have been using LSTM for a while and didn’t have any problems with it but now I tried 1D convolutional layers for speeding up processing and now I run into trouble – can you see what the problem is with the following? (Dummy data used here)

    #load packages
    import numpy as np
    import pandas as pd
    from keras.models import Sequential
    from keras.layers import Dense, Dropout, Activation, GRU, TimeDistributed
    from keras.layers import Conv1D, MaxPooling1D, Flatten, GlobalAveragePooling1D
    from keras.layers import Conv2D, MaxPooling2D
    from keras.utils import np_utils

    nfeat, kernel, timeStep, length, fs = 36, 8, 20, 100, 100

    #data (dummy)
    data = np.random.rand(length*fs,nfeat)
    classes = 0*data[:,0]
    classes[:int(length/2*fs)] = 1

    #splitting matrix
    X = np.asarray([data[i*timeStep:(i + 1)*timeStep,:] for i in range(0,length * fs // timeStep)])
    Y = np.asarray([classes[i*timeStep:(i + 1)*timeStep] for i in range(0,length * fs // timeStep)])

    #split into training and test set
    from sklearn.model_selection import train_test_split
    trainX, testX, trainY, testY = train_test_split(X,Y,test_size=0.2,random_state=0)

    trainY_OHC = np_utils.to_categorical(trainY)
    trainY_OHC.shape, trainX.shape

    #set up model with simple 1D convnet
    model = Sequential()


    #compile model
    model.compile(loss='mse',optimizer='Adam',metrics=['accuracy'])

    #train model
    model.fit(trainX,trainY_OHC,epochs=5,batch_size=4,validation_split=0.2)

    I get an error for the fitting:

    ValueError: Error when checking target: expected dense_17 to have 2 dimensions, but got array with shape (400, 20, 2)

    I cannot see what is wrong here?!

    • Avatar
      Jason Brownlee February 1, 2019 at 5:38 am #

      I suspect your output variable y does not match the expectation of the model of 2 values per sample. Instead, you’ve provided [400,20,2]
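
If the model is meant to output one class per window, one possible fix is to collapse the per-time-step labels to a single label per window before one hot encoding. The majority rule used here is an assumption; the constants follow the dummy data in the question.

```python
import numpy as np

length, fs, time_step = 100, 100, 20
classes = np.zeros(length * fs)
classes[:length * fs // 2] = 1  # first half is class 1, as in the question

# Per-time-step labels grouped by window: shape (500, 20)
Y_steps = classes.reshape(-1, time_step)

# One label per window: here the majority class within the window.
Y_window = (Y_steps.mean(axis=1) >= 0.5).astype(int)  # shape (500,)
Y_onehot = np.eye(2)[Y_window]                        # shape (500, 2)

print(Y_steps.shape, Y_window.shape, Y_onehot.shape)  # (500, 20) (500,) (500, 2)
```

After an 80/20 split this gives training targets of shape (400, 2), which is the 2D shape the error message asks for.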

  61. Avatar
    WishViv February 1, 2019 at 6:05 pm #

    Hello Jason,

    Thanks for the tutorial. Your tutorials are a life-saver! I have a simple doubt.

    I have extracted speech features from 597 .wav files into arrays. Each array is of shape Nx40, where N varies between 500-1100. In other words, for each of the 597 .wav files, I have an array of N rows (varying between 500-1100) and 40 columns (40 acoustic features).

    1. So, what should be my 3D input shape to the LSTM? Is it (1, ?, 40), where ? = N.

    2. If it is (1, ?, 40), then should I pick a particular length and pad the rest of them?

    Really stuck with this. Any response will be of immense help. Thanks!

    • Avatar
      Jason Brownlee February 2, 2019 at 6:11 am #

      Sounds like you have nearly 600 samples, 500-1100 time steps and about 40 features.

      I’d recommend padding the time steps and having a shape of something like [600, 1100, 40] as a start, then try truncating time steps to see how it impacts the model.

  62. Avatar
    Anusha Prakash February 12, 2019 at 6:11 pm #

    Hey Jason,
    I have a Time series dataset for 38,000 distinct patients, where each patient has 30 physiological parameters recorded for an hour. If I want to extract 48 hours of information of a patient for every hour, then technically I’ll have 48 rows for a single patient, every row containing observations for 30 feature for every hour! suppose I want to extract similar data for the other 29,000 patients, then ill end up with over a lakh rows i.e(29,000 * 48 rows). So should my input shape be (30,000, 48, 30) ?

    • Avatar
      Jason Brownlee February 13, 2019 at 7:55 am #

      Maybe, I’m not sure I follow about “48 hours every hour”. Perhaps try your approach?
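
If the table has exactly 48 rows per patient, sorted by patient and then by hour, a plain reshape gives the 3D array. This sketch uses 4 patients as a stand-in; the sort order is an assumption that would need checking on the real data.

```python
import numpy as np

n_patients, n_hours, n_features = 4, 48, 30  # small stand-in for thousands of patients

# Long format: one row per patient-hour, sorted by patient and then by hour.
rows = np.random.rand(n_patients * n_hours, n_features)

X = rows.reshape(n_patients, n_hours, n_features)
print(X.shape)  # (4, 48, 30)

# Patient 1, hour 0 is row 48 of the long table.
print(np.array_equal(X[1, 0], rows[48]))  # True
```

With the full data the result would have one sample per patient, 48 time steps and 30 features.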

  63. Avatar
    Suraj Pawar March 16, 2019 at 2:37 pm #

    What should be the output layer shape? Particularly, if I have only one sample, do I need to reshape it into three dimensions? For example, if I have the input data of shape [m,n] and output also has [m,n] shape. Do I need to change the input into [m,1,n] shape? Also, can I keep the output shape [m,n] if I am using return_sequences=False? I am dealing with time series data. Below are the input and output at different time steps for example
    Input: Output:
    [1,2] [3,4]
    [2,3] [4,5]
    [4,5] [6,7]
    Thank you

    • Avatar
      Jason Brownlee March 17, 2019 at 6:17 am #

      Output shape is often [samples, features] for a normal model or [sample, time steps, features] for an encoder-decoder model.

  64. Avatar
    kelvin March 25, 2019 at 9:55 pm #

    Hi Jason,

    Your article is so good. I want to apply this to my data set.

    I have 3000 samples and 88 features(columns).In that number of feature columns, I have 20 features called A, B, C, D etc.For each one of them I have 3 columns A1,A2,A3 , B1,B2,B3 , C1,C2,C3 etc as lag columns. Thus 20+20*3 = 80 are the feature columns with lags
    and 8 other features which do not have lag values.

    How to convert that into a shape of 3D array to feed as input to LSTM model?
    Your reply is greatly appreciated.

  65. Avatar
    jessy April 4, 2019 at 9:50 pm #

    hi jason
    Could you write code for the functionality of the input gate, output gate and forget gate for me, sir?

    • Avatar
      Jason Brownlee April 5, 2019 at 6:17 am #

      Sorry, I don’t have an example of coding an LSTM cell from scratch, thanks for the suggestion.

  66. Avatar
    jessy April 5, 2019 at 4:53 pm #

    We are using the LSTM function directly to process the data. Instead, we could write the code for the input gate, forget gate and output gate by creating our own LSTM function.

    • Avatar
      Jason Brownlee April 6, 2019 at 6:41 am #

      I recommend using Keras and not coding an LSTM from scratch as it will almost certainly have bugs and be less efficient.
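
For readers who only want to see what the three gates compute, here is a forward-pass sketch of a single LSTM step in NumPy. It uses random untrained weights, has no training code, and the gate ordering inside the stacked weight matrices is a choice made for this sketch, not a statement about how Keras stores its weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: [features, 4*units], U: [units, 4*units], b: [4*units].
    Blocks are ordered input gate, forget gate, cell candidate, output gate.
    """
    units = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[0 * units:1 * units])  # input gate
    f = sigmoid(z[1 * units:2 * units])  # forget gate
    g = np.tanh(z[2 * units:3 * units])  # candidate cell state
    o = sigmoid(z[3 * units:4 * units])  # output gate
    c = f * c_prev + i * g               # new cell state
    h = o * np.tanh(c)                   # new hidden state
    return h, c

rng = np.random.default_rng(0)
features, units, time_steps = 2, 4, 10
W = rng.normal(scale=0.1, size=(features, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

sequence = rng.random((time_steps, features))  # one sample: [time steps, features]
h, c = np.zeros(units), np.zeros(units)
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape)  # (4,)
```

This is useful for understanding only; for real work the advice above to use the Keras layer still applies.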

  67. Avatar
    Christine April 11, 2019 at 12:59 am #


    Thanks so much for this, your stuff is really helping me get my head around LSTMs

    I’m struggling to understand how to load more than just one sample though. This works fine:

    samples = np.array([
    [0.1, 1.0],
    [0.2, 0.9],
    [0.3, 0.8],
    [0.4, 0.7],
    [0.5, 0.6],
    [0.6, 0.5],
    [0.7, 0.4],
    [0.8, 0.3],
    [0.9, 0.2],
    [1.0, 0.1]])
    samples = samples.reshape(1, 10, 2)

    But how should I present my data to use something with (2,10,2) or (3,10,2)? Is it an array of arrays?

    Ultimately I want to be able to go from an hdf5 file to a numpy array to this 3d shape, but I’m struggling to tell my model that there is more than one sample, even when i’m just making up and typing the data as a toy example. Any tips?

    • Avatar
      Jason Brownlee April 11, 2019 at 6:43 am #

      Load more data into memory, then reshape it.

      Perhaps I don’t understand the problem you’re having.

      Also, this may help with mindset:

      • Avatar
        christine April 12, 2019 at 5:43 am #

        I don’t think I explained it very well. Thanks for taking the time to reply. I think my question is what should my data look like if I have more than one sample and I’m trying to use reshape?

        What I mean is at the moment ‘samples’ is a numpy array representing just one example of 2 features and 10 timesteps. when really, samples should be a dataset of many examples. And I don’t understand how to write that so that it can be reshaped. Should samples be a list of numpy arrays. or does it need to be an array of arrays or something else? I’ve tried both but neither seems to work.

        Reading this suggests it could be that i need to use vstack?

        • Avatar
          Jason Brownlee April 12, 2019 at 7:58 am #

          Yes, a vstack, dstack or hstack will do it.

          I cannot know, I don’t have your code/data and it is completely specific to your data.

          If you’re having trouble, try tinkering with a few contrived samples in a separate python file until you get it right.
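
A contrived example along those lines: three samples of 10 time steps and 2 features, first kept as a list of arrays and then as one long 2D array. The helper `make_sample` is hypothetical and only generates toy values.

```python
import numpy as np

def make_sample(offset):
    """Hypothetical helper: build one [10, 2] sample like the one in the question."""
    up = np.linspace(0.1, 1.0, 10) + offset
    down = np.linspace(1.0, 0.1, 10) + offset
    return np.column_stack([up, down])

sample_list = [make_sample(0.0), make_sample(0.5), make_sample(1.0)]

samples = np.stack(sample_list)  # list of three (10, 2) arrays -> (3, 10, 2)
print(samples.shape)             # (3, 10, 2)

# If the samples are already concatenated row-wise into (30, 2), reshape works too.
flat = np.vstack(sample_list)    # (30, 2)
print(np.array_equal(flat.reshape(3, 10, 2), samples))  # True
```

The reshape route only works when every sample has the same number of time steps and the rows of each sample are contiguous.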

  68. Avatar
    Carlos April 13, 2019 at 9:58 pm #


    I have a df like this

    If I want to use the temporal component for an LSTM, I think that my sequence will be defined by date_col.
    But if I select date_col, I will have an array of new information. I mean, it’s not a typical row sequence; I think this is more complex.

    My sequence will be [day1,day2,day3] and, in each day, I have an array with [product1,product2,product3], and each product has [feat1, feat2].

    For day 1: [[feat1,feat2],[feat1,feat2],[feat1,feat2]]

    The sequence will be:
    [ [[feat1,feat2],[feat1,feat2],[feat1,feat2]],
    [[feat1,feat2],[feat1,feat2],[feat1,feat2]] ]

    Is this correct? Will this work with neural networks?

  69. Avatar
    Rich April 19, 2019 at 1:29 am #

    Hi Jason. Thank you for your informative articles. They have been very helpful to me.
    I was wondering if you could give me your recommendation on setting up a LSTM model. I’m dealing with three features that are measured when a passive RFID tag is read by an RFID reader. I understand that before setting up my model, I need to decide how many time steps I will be dealing with for each input. Let’s say I set that number at 32 time steps. This means my input shape will be (32, 3). What I’m not clear on is if I need to add a feature for a time stamp of when that read occurred. Tag reads can happen at any rate. There could be one read per second which would mean my input instance of (32, 3) would span 32 seconds, or there could be one read per minute which would mean my input instance of (32, 3) would span 32 minutes. Most importantly, the read rate will not be constant. I could get 3 reads in 1 second and then wait 30 seconds for the fourth read to come in.
    Does the LSTM need to know about this, or is it enough to simply give it the time ordered sequence of reads without it knowing the actual time span those 32 reads occurred over?
    If I did have to add a time stamp to my data, the input shape would now be (32, 4). As a follow up question, does the LSTM require me to define a fixed time step between the feature inputs? What I mean is, am I forced to pick a time span of say 1 second between the input feature list? If I am, and I only have reads at time 0 seconds, 4 seconds, and 6 seconds, do I then have to generate my (32, 4) input values as follows:
    (0, x0, y0, z0)
    (1, missing, missing, missing)
    (2, missing, missing, missing)
    (3, missing, missing, missing)
    (4, x4, y4, z4)
    (5, missing, missing, missing)
    (6, x6, y6, z6)
    Or is there no assumption needed on the time between inputs, so that I can simply stack them without inserting missing values, such as:
    (0, x0, y0, z0)
    (4, x4, y4, z4)
    (6, x6, y6, z6)

    Any input you can provide would be appreciated.
    Thank you.

    • Avatar
      Jason Brownlee April 19, 2019 at 6:17 am #

      Perhaps try modeling as-is, then try with padded/missing values and see if there is any difference in model skill.

      Also try other model types. LSTMs are mostly terrible at time series.

      • Avatar
        Rich April 19, 2019 at 9:54 pm #

        Thanks Jason. I was surprised to hear you say LSTM’s are terrible at time series. What I’m trying to implement is a binary sequence classification model using the setup I described above. If you think LSTM’s are not the right approach given what I’ve described, what type of model do you think would work best? I want to process the data as new reads come in and give a classification output every time I get a new read.

        I was planning to use your post on: Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras as my starting point, but now you’ve got me concerned I’m going down the wrong path.
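
One way to try the "as-is" option while still exposing the irregular spacing is to add the time elapsed since the previous read as a fourth feature, instead of a raw time stamp. This is a sketch of that idea with made-up timestamps, not a recommendation from the thread.

```python
import numpy as np

# Made-up reads: timestamps in seconds and three measured features per read.
timestamps = np.array([0.0, 4.0, 6.0, 6.5, 30.0])
reads = np.random.rand(len(timestamps), 3)

# Seconds elapsed since the previous read; the first read gets 0.
delta_t = np.diff(timestamps, prepend=timestamps[0])

X = np.column_stack([delta_t, reads])  # [time steps, 4]
X = X[np.newaxis, ...]                 # [1, time steps, 4]

print(X.shape)  # (1, 5, 4)
```

With 32 reads per input the shape for one sample would be (32, 4), and no missing rows need to be inserted.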

  70. Avatar
    lycan April 24, 2019 at 6:34 pm #

    Hi Jason. I want to add a dense layer before the LSTM layer, but don’t know how to reshape the data. The input data (train_x) is 3D (batch_size, time_step, input_dim). I first want to reshape the data to 2D so that I can apply the dense layer. After the dense layer, I have to reshape its 2D output back to 3D so that it can be fed to the LSTM layer. I am using the Keras functional API, but I cannot find a reshape layer to do that (keras.layers.reshape cannot do that). Do you have any idea?

    • Avatar
      Jason Brownlee April 25, 2019 at 8:09 am #

      A dense layer cannot take 3D data as input, it must be [samples, features].

      • Avatar
        Ziwen Sun April 29, 2019 at 10:05 am #

        So I want to transform the shape of data, maybe this is not a correct idea.

      • Avatar
        Ziwen Sun April 29, 2019 at 10:06 am #

        But thank you very much for your patient help.
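
The reshape to 2D, dense transform, and reshape back to 3D described in the question is equivalent to applying the same dense weights at every time step. The NumPy sketch below shows that equivalence with random weights; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, time_steps, input_dim, dense_units = 8, 5, 3, 16

x = rng.random((batch, time_steps, input_dim))
W = rng.normal(size=(input_dim, dense_units))  # one weight matrix shared by all steps
b = np.zeros(dense_units)

# Explicit version: 3D -> 2D, dense transform, 2D -> 3D.
flat = x.reshape(-1, input_dim)  # (40, 3)
out = (flat @ W + b).reshape(batch, time_steps, dense_units)

# Equivalent shortcut: matmul broadcasts over the leading axes.
out_direct = x @ W + b

print(out.shape)                     # (8, 5, 16)
print(np.allclose(out, out_direct))  # True
```

In Keras, wrapping a Dense layer in the TimeDistributed wrapper is the usual way to express this inside a model, and its output stays 3D for the LSTM that follows.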

  71. Avatar
    Nick April 30, 2019 at 8:20 am #

    Hi Jason, when you write: “How to reshape multiple parallel series data for an LSTM model and define the input layer”

    Does this statement refer to text + tag data, for example text data with parallel IOB (inside – outside – begin) tags for named entity recognition? For example:

    Alex I-PER
    is O
    going O
    to O
    Los B-LOC
    Angeles I-LOC

    If you are talking about something different in this article, do you have another article on preparing data for such parallel text sequences?

  72. Avatar
    Steven Wu May 1, 2019 at 8:15 pm #

    Hi Jason,
    I’m kind of confused about the part of model.add(LSTM(32)). Does the number 32 represent 32 neurons? or more specifically, 32 memory cells for LSTM?



    • Avatar
      Jason Brownlee May 2, 2019 at 8:01 am #

      Yes, 32 is the number of units in the hidden layer. Each unit in the first hidden layer receives all of the inputs.
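
The number of units is independent of the number of time steps and features; it mainly sets the size of the layer's output and its weight count. The formula below is the standard LSTM parameter count with biases, which is assumed here to match what model.summary() reports for a default Keras LSTM layer.

```python
def lstm_param_count(units, n_features):
    """Trainable weights in a standard LSTM layer.

    Each of the 4 gate blocks has an input kernel [n_features, units],
    a recurrent kernel [units, units], and a bias [units].
    """
    return 4 * (units * n_features + units * units + units)

print(lstm_param_count(units=32, n_features=2))     # 4480
print(lstm_param_count(units=32, n_features=300))   # 42624
print(lstm_param_count(units=500, n_features=300))  # 1602000
```

The recurrent term grows with the square of the unit count, which is why a layer with 500 units is far more expensive to train than one with 32.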

  73. Avatar
    Sam May 15, 2019 at 9:23 am #

    Hi Jason,
    I am working on a project for crime prediction. I have a dataset with dates (timestamps) as rows and areas (features) as columns. Each cell contains the count of crimes that happened in a particular area.
    Total no of rows = 1825 days of crime counts per area or 5 years
    here is the dataset.

    date\ Area 111 112 113 114
    0 5 2 2 0
    1 3 3 9 0
    2 5 4 8 0
    3 4 4 3 0
    4 9 11 9 0

    I want to use a sliding window to forecast, which will take 100 days as input and predict
    the 101st day as output, i.e. the crime count for each area.
    Here, I will consider the first 3 rows (0-2) as input and predict the output, i.e. the 4th row (3).
    I will be shifting the dataframe by -3.

    1) what is X_train shape, y_train shape?
    here samples = 1(?), timesteps = 1825 rows, features = 4 columns
    Am I correct?
    What is exactly sample?

    2) model.add(LSTM(4, batch_input_shape=(1,1825,1),

    What will be input of batch_input_shape [Batch_size, sequence_length, features]??
    Should batch_input_shape be same as X_train shape??


    I have wasted 2 days trying find out what is the relation between these two.
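
A sketch of the sliding-window framing described in this question, using random counts in place of the real data. Each window of 100 consecutive days is one sample, so the 1825 rows become 1725 samples rather than one sample of 1825 time steps.

```python
import numpy as np

n_days, n_areas, window = 1825, 4, 100
counts = np.random.randint(0, 12, size=(n_days, n_areas))  # stand-in crime counts

n_samples = n_days - window
X_train = np.stack([counts[i:i + window] for i in range(n_samples)])
y_train = counts[window:]  # the day right after each window

print(X_train.shape)  # (1725, 100, 4)
print(y_train.shape)  # (1725, 4)
```

Under this framing the input shape for one sample is (100, 4), and a batch_input_shape would be (batch_size, 100, 4), where batch_size is chosen separately and is not the number of rows in X_train.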


  74. Avatar
    Aya Tello May 18, 2019 at 4:51 am #

    Hello Jason, thanks for your tutorial,
    I have a question, I have a Time series dataset for about 38.000 patients, where each patient has 38 physiological parameters recorded for one hour, and each patient has at least 25 hours of parameters recorded, for clarification:
    pat1: feat1, feat2, ….. feat38, hour1, label
    feat1, feat2, ….. feat38, hour2, label
    ………………………………,hour25, label
    pat2: feat1, feat2, ….. feat38, hour1, label
    feat1, feat2, …., feat38, hour2, label
    ………………………………,hour25, label
    pat38000: ….
    The model should predict whether the patient has the disease or not, as early as possible.
    My question is how I would shape my input array? I do not understand what the samples will be
    (Samples, time-step, features) -> (?, 7 hours “for example” ?, 38) or what?

  75. Avatar
    Myles May 27, 2019 at 9:11 pm #

    Hi Jason ,
    How does reshaping affect training / target data?
    dataset = dataset.reshape(1,36,66)
    train_data = dataset[:,0:32,:65]
    train_targets = dataset[:,0:32,65:]
    test_data = dataset[:,32:,:65]
    test_targets = dataset[:,32:,65:]
    def build_model():
        model = Sequential()
        model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
        return model
    model = build_model()
    model.fit(train_data, train_targets,
              epochs=30, batch_size=16, verbose=0)
    test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)

    I get this : ValueError: Error when checking target: expected dense_40 to have 2 dimensions, but got array with shape (1, 32, 1)

    If I reshape the targets
    train_targets = train_targets.reshape(32,1)
    test_targets = test_targets.reshape(4,1)

    I get this

    ValueError: Input arrays should have the same number of samples as target arrays. Found 1 input samples and 32 target samples.

    Seem I can’t win. What should I be doing? Thanks

    • Avatar
      Jason Brownlee May 28, 2019 at 8:14 am #

      You must have one output sample for each input sample.
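
One framing that satisfies this rule, assuming each of the 36 rows is meant to be its own sample, is to put the rows on the samples axis instead of the time steps axis. The random array stands in for the real dataset.

```python
import numpy as np

dataset = np.random.rand(36, 66)  # stand-in: 36 rows, 65 inputs + 1 target

X = dataset[:, :65].reshape(36, 1, 65)  # 36 samples, 1 time step, 65 features
y = dataset[:, 65]                      # 36 targets, one per sample

train_data, test_data = X[:32], X[32:]
train_targets, test_targets = y[:32], y[32:]

print(train_data.shape, train_targets.shape)  # (32, 1, 65) (32,)
print(test_data.shape, test_targets.shape)    # (4, 1, 65) (4,)
```

Both arrays now have 32 entries on their first axis for training and 4 for testing, so inputs and targets line up.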

  76. Avatar
    Emma June 4, 2019 at 3:17 am #

    Hi Jason, can you please help me with reshaping data for an LSTM?
    I have a data set with shape (4615, 9): 4615 inputs, 9 features and 3 classes (labels) to predict.
    I want to reshape my data so the input shape has 5 time steps. I tried to do it this way:
    X = np.reshape(X, (923,5,X_train.shape[1]))
    but I get an error when I try to use train_test_split.
    I can only make it work with one time step: X_test = np.reshape(X_test, (X_test.shape[0],1,X_test.shape[1]))
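
One possible cause, offered as a guess since the error message is not shown: after reshaping, X has 923 samples while the label array still has 4615 rows, so the two no longer line up. The sketch below uses placeholder data and assumes (hypothetically) that each group of 5 rows takes the label of its last row.

```python
import numpy as np

X = np.zeros((4615, 9))        # placeholder for the real features
y = np.arange(4615) % 3        # placeholder labels for 3 classes
n_steps = 5

# Group every 5 consecutive rows into one sample
X3d = X.reshape(-1, n_steps, X.shape[1])
# Assumption: keep the label of the last row in each group
y_per_sample = y[n_steps - 1::n_steps]

print(X3d.shape)           # (923, 5, 9)
print(y_per_sample.shape)  # (923,)
```

With matching first dimensions, both arrays can be passed to a splitting function together.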

  77. Avatar
    Charlie June 4, 2019 at 12:20 pm #

    Hi Jason,

    Thanks for your tutorials. I’m trying to build a hydrological model that can predict streamflow with a sample lead time of 10 days, though I’m confused as to how to shape the data. First, I assume I’d shift the x and y data to reflect the lag and end up with 355 (365-10) samples of each. Say I have 4 features: the base shape for x would be (355, 4). Does 355 represent the number of samples or the number of time steps? They seem the same to me, though I think I need to reshape the data to either (1, 355, 4) or (355, 1, 4). Or perhaps (5, 355/5, 1), etc. Or, do I need to generate separate lagged (by 1) sequences to generate something like (100, 355, 4)?

  78. Avatar
    mindis June 11, 2019 at 6:37 pm #

    Hi Jason

    Do you have examples/suggestions how to use contemporaneous conditional context for time series prediction using LSTM and/or CNN (e.g. predicting next week product sales based on given price that week and history of sales and prices). The differences from multivariate prediction is that conditional context is known for the prediction week. Thanks!

    • Avatar
      Jason Brownlee June 12, 2019 at 7:54 am #

      You can frame the prediction problem with any inputs you wish.

  79. Avatar
    Shashank June 27, 2019 at 3:10 am #

    Sir if I use LSTMs(Encoder Decoder model) for summarizing articles , I’ll input the sentence vector , encoder encodes them to fixed sized vectors , but I wanted to know :
    1) how will the model know what keywords or sentences it must keep for summarizing , and then how does the decoder for sentences ?
    2 )can I use neural language model for framing sentences back ?
    but I didn’t get how to solve the (1) problem

    Thanks for your blogpost!

  80. Avatar
    Abreham June 28, 2019 at 4:04 am #

    Dear Sir, thank you a lot for your post. Do you have source code for decision tree regression to predict time series data?

    • Avatar
      Jason Brownlee June 28, 2019 at 6:11 am #


      I don’t have a tutorial on this specific case.

  81. Avatar
    Rahul Roy July 4, 2019 at 11:59 pm #

    From what I understand, samples are the number of rows (either the entire dataset or a subset of it) we have in a data frame; features are the independent variables (x-variables). I am confused about time steps. Is my understanding correct? I have a time series dataset of 10 features, 600 observations (rows). How do I set the input shape? Are there different ways of doing it? Thank you!

  82. Avatar
    Dennis Haijma August 10, 2019 at 12:11 am #

    Thanks for your contributions Jason, I have been visiting your blog numerous times by now and they always leave me with a better understanding of concepts.

    Anyway, I was wondering how to deal with each label having features observed at erratic intervals. I have not yet found a solution for this. Maybe I’m overlooking something.

    Would I have to divide the timesteps in really small intervals and set observations to 0 if not observed at that timestep? Would I compute a regular interval and add an offset feature? Or something else?

    Thanks in advance.

    • Avatar
      Jason Brownlee August 10, 2019 at 7:18 am #

      You can normalize the shape of the data so all samples have the same shape, then use zero padding and a masking layer to ignore the padded values.
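
A rough sketch of the zero-padding step, with NumPy only and made-up samples. In Keras, a `Masking` layer with `mask_value=0.0` placed before the LSTM would then skip time steps whose features are all zero. The boolean mask computed here only illustrates which steps those would be.

```python
import numpy as np

# Three hypothetical samples with 2 features and different numbers of time steps
samples = [np.ones((3, 2)), np.ones((5, 2)), np.ones((2, 2))]

max_steps = max(len(s) for s in samples)
padded = np.zeros((len(samples), max_steps, 2))
for i, s in enumerate(samples):
    padded[i, :len(s)] = s   # real observations first, zeros afterwards

# True where a time step holds real data, False where it is padding
mask = padded.any(axis=-1)

print(padded.shape)      # (3, 5, 2)
print(mask.sum(axis=1))  # [3 5 2]
```

One caveat: a genuine observation whose features are all exactly zero would also be treated as padding, so a different padding value may be needed if zeros occur in the real data.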

  83. Avatar
    Anya August 15, 2019 at 4:03 am #

    Is reshaping similar for the text classification problem? How can one reshape the input_shape when there are 30000 samples with 300 features each? Additionally, what will be the meaning of timestep here when it is not actually the time series data?

  84. Avatar
    saeid August 25, 2019 at 3:34 am #

    Hi jason,

    Thanks a lot for your wonderful tutorial. There is one thing that I do not follow. If I have 7000 datasets (samples), each containing 3 time series with a length of 3000 time steps, and I want to do binary classification (anomaly detection) on each sample using an LSTM,

    my question is, how do I proceed? What does my input look like? (7000, 3000, 2) is too big for an LSTM, because the length of my time series is huge (3000). I have read this tutorial as well, but my problem is that I want to use all the information in my training.

    Second, each sample (data set with 2 time series) is independent of the others.

    Thanks in advance

  85. Avatar
    saeid August 27, 2019 at 1:39 am #

    Hi Jason,

    Thanks again, but I had already read the link you provided. I know what the input for short time series should look like in my case: (7000, num_time_steps, 2). But here each of my time series (each of the 7000 samples) has 3000 time steps.

    So, my question is how we can feed this input shape (7000, 3000, 2) optimally into an LSTM because, as you mentioned in one of your tutorials, the optimum length of a time series should be less than 400 time steps for an LSTM. My case is special since each of the data sets (each of the 7000 samples) is independent of the others and my task is to classify them as anomaly or not anomaly. However, the length of each time series is 3000, and it reduces the accuracy of the model.

    Thanks a lot for your time.

    • Avatar
      Jason Brownlee August 27, 2019 at 6:48 am #

      Naively, you cannot. You must trim the time steps.

      Or, you can, but you must use a dynamic LSTM.

      • Avatar
        saeid August 28, 2019 at 4:47 am #

        Thanks a lot 🙂

  86. Avatar
    SRK September 24, 2019 at 12:53 pm #

    Hello Sir ,
    my model has the following data,
    Xtrain shape : (62, 30, 100)
    Ytrain shape : (62, 1, 100)
    Xtrain shape : (16, 30, 100)
    Ytrain shape : (16, 1, 100)

    when doing,
    model = Sequential()
    model.add(LSTM(units=100, return_sequences= True, input_shape=(x_train.shape[1],100)))
    model.add(LSTM(units=100, return_sequences=True))

    I get ValueError: Error when checking target: expected dense_1 to have 2 dimensions, but got array with shape (62, 1, 100)
    Please Clarify On this.

    • Avatar
      Ragul Kesavan S September 24, 2019 at 1:00 pm #

      Correction :
      Xtrain shape : (62, 30, 100)
      Ytrain shape : (62, 1, 100)
      Xtest shape : (16, 30, 100)
      Ytest shape : (16, 1, 100)

    • Avatar
      Jason Brownlee September 24, 2019 at 1:18 pm #

      You must have a 2d array for target (y), not 3d
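
As a small illustration with NumPy (placeholder zeros, same shapes as in the question), the singleton time axis of the target can be dropped so that y becomes [samples, outputs]:

```python
import numpy as np

y_train = np.zeros((62, 1, 100))   # 3D target, as in the question
y_test = np.zeros((16, 1, 100))

# Collapse everything after the sample axis into one axis
y_train_2d = y_train.reshape(y_train.shape[0], -1)
y_test_2d = y_test.reshape(y_test.shape[0], -1)

print(y_train_2d.shape)  # (62, 100)
print(y_test_2d.shape)   # (16, 100)
```

This assumes the model ends in a Dense layer with 100 outputs fed by an LSTM that returns only its final time step. The posted model code is incomplete, so that part is a guess.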

  87. Avatar
    Hemant September 29, 2019 at 3:51 am #

    What are lstm units like lstm(32, input-shape) ? How are these related to time steps or features or embedding ?

    • Avatar
      Jason Brownlee September 29, 2019 at 6:16 am #

      The number of units is unrelated to the number of input timesteps and features.

      Each unit gets all input.
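
One way to see this is to count the weights of a standard LSTM layer. The helper below is a hypothetical function written for this illustration, not part of Keras. It uses the usual formula of four weight sets (three gates plus the candidate state), each with input weights, recurrent weights and a bias.

```python
def lstm_param_count(units, n_features):
    """Number of trainable weights in a standard LSTM layer."""
    input_weights = units * n_features   # input-to-unit weights
    recurrent_weights = units * units    # previous output to unit weights
    biases = units
    return 4 * (input_weights + recurrent_weights + biases)

# LSTM(32) on input with 2 features; the number of time steps does not appear
print(lstm_param_count(32, 2))   # 4480
```

The count depends on the number of features (each unit has a weight for every feature) and on the number of units, but not on the number of time steps: the same weights are reused at every step.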

  88. Avatar
    shr October 25, 2019 at 8:13 pm #

    hi Jason,

    I am new to ML and I have some confusion about the Dense layer and the Dense output. What is the difference between the two examples below? As shown below, I put the number of features at was one to predict put example two why add dense like LSTM and output dense 32.
    It’s very important for me, can you help me?
    thank you.

    model = Sequential()
    model.add(LSTM(32, input_shape=(10, 2)))

    model = Sequential()
    model.add(Dense(32, input_shape=(16,)))

    • Avatar
      Jason Brownlee October 26, 2019 at 4:39 am #

      The first example outputs a vector with 1 element, the second outputs a vector with 32 elements.

  89. Avatar
    Shiva November 2, 2019 at 11:22 pm #

    Hi Jason,
    I have a dataset that has 3 features, and each feature is represented by a 200-dimension vector.
    ([200d],[200d],[200d]), label.
    I wonder how to reshape the data for LSTM?

  90. Avatar
    ahmed November 4, 2019 at 6:38 am #

    i had to write this code

    model = Sequential()
    model.add(LSTM(4, input_shape=(1,4))
    model.add(Dense(4, activation=’relu’))
    model.add(Dense(1, activation=’sigmoid’)

    but I have seen the following error
    File “”, line 5
    model.add(Dense(4, activation=’relu’))
    SyntaxError: invalid syntax

  91. Avatar
    Nancy November 9, 2019 at 2:29 pm #

    Hi Jason,
    I have a dataset like this.
    2000-01 x1= [1, 2, 3, 4; x2=[5, 6, 9, 8; y=[4, 6, 3, 8;
    5, 6, 7, 8; 1, 6, 5, 4; 8, 5, 2, 9;
    4, 6, 0, 9;] 8, 4, 3, 5;] 7, 5, 3, 6;]
    2000-02 x1= [1, 6, 5, 4; x2=[5, 6, 9, 8; y=[4, 6, 3, 8;
    5, 6, 7, 8; 7, 5, 3, 6; 8, 5, 2, 9;
    8, 4, 3, 5;] 8, 4, 3, 5;] 5, 6, 9, 8;]
    2000-03 x1= [5, 6, 5, 4; x2=[5, 6, 9, 8;
    9, 6, 7, 8; 7, 5, 3, 6;
    1, 4, 3, 5;] 8, 4, 3, 5;]
    I want to know that whether I can predict y using LSTM and above data in 2000-03. If can, I wonder how to reshape the data for LSTM?
    Thank you.

  92. Avatar
    Oleksiy December 20, 2019 at 9:17 pm #

    Hello, Jason. Thanks for good explanation.
    Could you please help to understand following questions:
    1. LSTM with timestep = 1 – is it a simple MLP or it is still using hidden state / cell state? What’s state initial value in case of timestamp = 1?

    2. Recommendation for LSTM autoencoder timestep selection?

    Data example: index – week number, feature1 – number of visits per week, feature2 – number of holidays per week, feature3 – number of room visits.

    • Avatar
      Jason Brownlee December 21, 2019 at 7:10 am #

      Maybe a little similar, although the model maintains state across samples.

      Try different numbers of timesteps and use what gives you the most skill.

  93. Avatar
    Gopi March 31, 2020 at 11:08 pm #

    Hi Jason,
    Do you have a blog on how the shape of the input data changes at each gate of LSTM?
    lets say your input shape is (1000,10,36) and it is passed to LSTM(64) then what will be the shape of output at each gate of LSTM(i.e after forget gate, input gate and output gate)

    • Avatar
      Jason Brownlee April 1, 2020 at 5:50 am #

      We cannot configure each gate, only the input to all nodes in a layer.

  94. Avatar
    Prafull April 3, 2020 at 4:58 pm #

    The example given is reshaped incorrectly.
    It should have the shape (2,10,1) and not (1,10,2), because there are 2 sequences with 10 time step observations and only 1 feature at a given time.

  95. Avatar
    Mariana April 11, 2020 at 3:15 am #

    Hi Jason,

    I understand how to make the 3D shape that the LSTM requires. Right now I have 200 samples of 600 timesteps and 7 features.


    X_train, X_test, y_train, y_test = train_test_split(dataset,test_size = 0.30)

    #Build the LSTM
    model = Sequential()
    model.add(LSTM(100, input_shape=(450, 11)))
    model.compile(loss='categorical_crossentropy', optimizer=Adam(0.001), metrics=['accuracy'])
    model.summary()
    model.fit(X_train, y_train, batch_size=88, epochs=40, validation_split=0.2)

    I get the following error message:
    ValueError: Error when checking target: expected activation_10 to have 2 dimensions, but got array with shape (140, 600, 1). I cant seem to fit the model.

    Hope you can help me find my error.

    • Avatar
      Jason Brownlee April 11, 2020 at 6:24 am #

      You must specify the input_shape that matches your data.

      • Avatar
        Mariana April 11, 2020 at 7:43 am #

        I thought this matched: model.add(LSTM(100, input_shape=(450, 11)))

        • Avatar
          Jason Brownlee April 11, 2020 at 7:57 am #

          But you said the shape of your data is: [200,600,7]

          • Avatar
            Mariana April 11, 2020 at 8:51 am #


            X_train, X_test, y_train, y_test = train_test_split(dataset,test_size = 0.30)

            #Build the LSTM
            model = Sequential()
            model.add(LSTM(100, input_shape=(600, 7)))
            model.compile(loss='categorical_crossentropy', optimizer=Adam(0.001), metrics=['accuracy'])
            model.fit(X_train, y_train, batch_size=140, epochs=40, validation_split=0.2)

            ValueError: Error when checking target: expected activation_10 to have 2 dimensions, but got array with shape (140, 600, 1). I cant seem to fit the model.

            I still get the same value error.

          • Avatar
            Jason Brownlee April 11, 2020 at 11:53 am #

            The y must be 2d, not 3d.

  96. Avatar
    Mariana April 11, 2020 at 6:57 pm #

    But every sample has 600 outputs as a label. (60,600,1)

    x_train(1,600,7): the first sample has 600 timesteps and 7 features that will help the neural network recognize certain patterns.

    y_train(1,600,1): the 600 labels of the first sample. Should I reshape it like this? (1,600)

    • Avatar
      Jason Brownlee April 12, 2020 at 6:18 am #

      The error tells you the model expects output to have a 2d shape – the error even tells you what shape it expects.

      • Avatar
        Mariana April 13, 2020 at 7:50 pm #

        Thanks. I look through all of the examples mentioned in your blog.

        It looks like I was dealing with a Multivariate LSTM of Multiple Inputs. I have one question.

        from numpy import array, hstack

        # split a multivariate sequence into samples
        def split_sequences(sequences, n_steps):
            X, y = list(), list()
            for i in range(len(sequences)):
                # find the end of this pattern
                end_ix = i + n_steps
                # check if we are beyond the dataset
                if end_ix > len(sequences):
                    break
                # gather input and output parts of the pattern
                seq_x, seq_y = sequences[i:end_ix, :-1], sequences[end_ix-1, -1]
                X.append(seq_x)
                y.append(seq_y)
            return array(X), array(y)

        # define input sequence
        in_seq1 = array([10, 20, 30, 40, 50, 60, 70, 80, 90])
        in_seq2 = array([15, 25, 35, 45, 55, 65, 75, 85, 95])
        out_seq = array([in_seq1[i]+in_seq2[i] for i in range(len(in_seq1))])
        # convert to [rows, columns] structure
        in_seq1 = in_seq1.reshape((len(in_seq1), 1))
        in_seq2 = in_seq2.reshape((len(in_seq2), 1))
        out_seq = out_seq.reshape((len(out_seq), 1))
        # horizontally stack columns
        dataset = hstack((in_seq1, in_seq2, out_seq))
        # choose a number of time steps
        n_steps = 3
        # convert into input/output
        X, y = split_sequences(dataset, n_steps)

        The example that you posted creates a 3D shape (7,3,2) from the original sequences of 9 rows and 2 columns.

        In practice, how will it work? My idea was to give a 2D tensor (100,3) of a fixed size and get 100 outputs, (100,), back. I believe that LSTMs predict one output at a time, right?

        Does that mean that if I want to get the 100 outputs, I will need to take this 2D tensor (100,3), reshape it into a 3D shape of, let’s say, (90, 10 timesteps, 3 features), feed that to the neural network, and concatenate all the outputs into an array until I get 90 outputs?

  97. Avatar
    Anugrah April 13, 2020 at 10:27 pm #

    hi, Jason,

    I have read your other blogs and they were very helpful for me as a beginner in machine learning. I haven’t done LSTMs before. this is my first time.

    I am currently trying to train video sequence data to classify the emotions involved. However, by looking at the data I am a little confused about how to model my LSTM.

    My data consists of a single feature (vector of length 8192) extracted from every 90 consecutive frames from each of 615 short video clips. the video clips are having the same sample rate.
    the label for each sample (each video) is a 6 X 1 vector (6 different emotions)

    I know that when I am modeling the LSTM I need not bother about 615 as it is the number of samples. but how would I deal with 90 frames?. is that the number of time steps to LSTM?
    Can u please tell me how I should input the data to LSTM and input shape to LSTM?

    • Avatar
      Jason Brownlee April 14, 2020 at 6:17 am #

      The 8192 would be features in the [samples, timesteps, features] input data.

      • Avatar
        Anugrah April 14, 2020 at 5:54 pm #

        Hi Jason

        Thank you so much for your reply. I was able to build a model and compile it.
        this is the model I made:

        lstm = LSTM(100, activation='relu')(x)
        lstm = Dropout(rate=0.5)(lstm)
        output1 = Dense(1, activation='softmax')(lstm)
        output2 = Dense(1, activation='softmax')(lstm)
        output3 = Dense(1, activation='softmax')(lstm)
        output4 = Dense(1, activation='softmax')(lstm)
        output5 = Dense(1, activation='softmax')(lstm)
        output6 = Dense(1, activation='softmax')(lstm)

        model = Model(x, [output1, output2, output3, output4, output5, output6])


        8192 is the shape of flattened feature output from InceptionV3 for each of the frames in the video.
        Shape of Y_train = (543,6,1)

        the problem now I am facing is that my model accuracy does not go beyond 30.
        I tried adding a dense layer of different shapes before the softmax unit, also tried with different lstm unit shapes as well. changing the batch size or optimizer is also not improving accuracy.

        Can you please look into my model and help me out with where I am making the mistake.
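
One property of the model above worth checking, offered as an observation rather than a confirmed diagnosis: softmax applied to a single unit always returns 1.0, whatever the input, so each `Dense(1, activation='softmax')` head produces a constant. The NumPy sketch below shows this with made-up logits and a hand-written `softmax` helper.

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum for numerical stability
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# A single output unit: softmax has nothing to normalize against
logits_single = np.array([[-3.0], [0.5], [7.2]])
print(softmax(logits_single).ravel())  # [1. 1. 1.]

# Six units in one layer: a proper distribution over six classes
logits_six = np.array([[1.0, 2.0, 0.5, 0.1, 0.0, 3.0]])
probs = softmax(logits_six)
print(probs.shape)  # (1, 6)
```

If the six emotions are mutually exclusive (an assumption), a single Dense layer with 6 units and softmax, trained against targets reshaped from (543, 6, 1) to (543, 6), would be one alternative to try.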

  98. Avatar
    Rahul April 19, 2020 at 4:39 am #

    Hi Jason,
    Thank you for your continued, top-class posts. I’ve a question in reshaping the dimension of the training data for LSTM. My data has 11 features and 544 observations. If I set timesteps = 1, my samples are 544/1 = 544 and the dimensions of data are correctly set to: dim(data) = c(544, 1, 11). However, when I set the timestep to, say, 7 and samples to 544/7, the dimensions of data aren’t set by dim(data) = c(544/7, 7, 11) and I get the error:

    dims [product 5929] do not match the length of the object [5984].

    Based on one of your FAQs, I assumed that the samples X timesteps = number of rows in the data. Can you please advise what am I doing wrong here?
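
The arithmetic behind that error: 544 × 11 = 5984 values, but 544 is not a multiple of 7, so the largest whole number of samples is 77, and 77 × 7 × 11 = 5929. A sketch in Python with NumPy (not R), using placeholder data, that drops the 5 leftover rows before reshaping:

```python
import numpy as np

data = np.arange(544 * 11, dtype=float).reshape(544, 11)  # placeholder data
n_steps = 7

n_samples = len(data) // n_steps        # 77 whole samples
trimmed = data[:n_samples * n_steps]    # keep 539 rows, drop the last 5
X = trimmed.reshape(n_samples, n_steps, data.shape[1])

print(X.shape)  # (77, 7, 11)
```

Note that NumPy fills arrays row by row, so each sample here holds 7 consecutive rows. R fills arrays column by column, so assigning `dim()` directly may not group the rows the same way even when the sizes match.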


  99. Avatar
    Rahul April 19, 2020 at 8:51 pm #

    Hi Jason,

    Can you please clarify how to reshape the data in R using the dim() function so that it matches the LSTM requirement? If I add the 3rd dimension, is it the timesteps? Besides, why is the number of timesteps = 1 a strange move? I’ll be obliged for your help.

    • Avatar
      Jason Brownlee April 20, 2020 at 5:27 am #

      Sorry, I don’t have examples of LSTMs in R. Perhaps try posting your question on stackoverflow?

  100. Avatar
    Piers May 12, 2020 at 8:40 pm #

    Great guidance.
    Clear and simple.
    Thank you.

  101. Avatar
    Kieran May 18, 2020 at 10:36 am #

    Hi Jason,

    I read, but am still not clear about the timesteps.

    For example, I’m using every past 7 days’ data to predict today’s value (price).
    Each day, there is one record of 6 features (Let’s call them feature1 – feature5, and price).
    Suppose I have 1000 days of data.
    Now I take [i:i+7] as my X, [i+7+1][price] as my y. Therefore, the shape of X is 993*7*6, the shape of y is 993.
    Does the shape of X match [samples][timesteps][features]?

    Or should I create 42 features in each row, e.g. t-1, t-2, …, t-7? then timesteps is 1, features is 42?

    • Avatar
      Jason Brownlee May 18, 2020 at 1:26 pm #

      Yes, the first approach sounds reasonable. Although 1000 time steps is too many, consider reducing to 200-400.

      • Avatar
        Kieran May 18, 2020 at 2:38 pm #

        Thanks for your reply.
        What do you mean by reducing to 200-400? Do you mean 7 days’ data is too small? Perhaps I should use, for example, past 500 days’ data as my X?

        • Avatar
          Jason Brownlee May 19, 2020 at 5:54 am #

          Sorry, I meant perhaps change the framing of the problem to have a maximum of 200-400 time steps, which is reported to be a reasonable limit for LSTMs in practice.

          • Avatar
            Kieran May 19, 2020 at 10:37 am #

            Got it. Thank you.

          • Avatar
            Jason Brownlee May 19, 2020 at 1:25 pm #

            You’re welcome!
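
A minimal sketch of the first approach from this thread, with random placeholder data: each sample is a window of the past 7 days, and the target is the price on the day right after the window. The position of the price column (`price_col = 5`) is an assumption.

```python
import numpy as np

n_days, n_features, n_steps = 1000, 6, 7
data = np.random.default_rng(1).random((n_days, n_features))
price_col = 5   # assumption: price is the last of the 6 columns

# Window i covers days i .. i+6; its target is the price on day i+7
X = np.stack([data[i:i + n_steps] for i in range(n_days - n_steps)])
y = data[n_steps:, price_col]

print(X.shape)  # (993, 7, 6)
print(y.shape)  # (993,)
```

Here the number of time steps is 7 (the window length), not 1000; the 1000 days only determine how many samples can be cut from the series.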

  102. Avatar
    neha June 18, 2020 at 12:26 am #

    How would you reshape your data to fit a 1D conv model? Is it the same as LSTM [samples, timesteps, features]?

  103. Avatar
    SDG June 27, 2020 at 2:03 am #

    Dear Jason.
    My question is suppose I have a model lstm/dropout/LSTM/dropout/flatten/dense
    then no. of hidden layers here will be 5 ie dropout/LSTM/dropout/flatten/dense
    and if my data is 2000 features by 62 samples, can I reshape it as (62, 1, 2000)?
    and what should be the no of Input lstm neurons/cells 2000 or not.
    Regards Srirupa

  104. Avatar
    SDG June 27, 2020 at 11:05 am #

    Thank u Jason.
    Now what should be the ans of this question

    suppose I have a model lstm/dropout/LSTM/dropout/flatten/dense
    then no. of hidden layers here will be 5 ie dropout/LSTM/dropout/flatten/dense
    regards Srirupa

    • Avatar
      Jason Brownlee June 27, 2020 at 2:08 pm #

      Only layers with weights are counted.

      • Avatar
        SDG June 28, 2020 at 12:10 pm #

        Jason..Thank u every time for you response.
        Continuing the last query means suppose I have lstm(128)/dropout(0.2)/lstm(64)/dropout(0.2)/flatten()/dense(1)..should there be 3 hidden layers excluding dropout and input lstm layer..
        Please do is really helpful for me
        Regards Srirupa

        • Avatar
          SDG June 29, 2020 at 1:11 am #

          Dear Jason,does my above question make any sense or is it totally wrong
          Regards Srirupa

        • Avatar
          Jason Brownlee June 29, 2020 at 6:28 am #

          You’re welcome.

          That is how I would count.

  105. Avatar
    Kevin July 15, 2020 at 6:57 pm #

    Hi Jason,

    I have n samples, each sample has 49 timesteps (past observations) and 2 features each (x and y coordinates). I want the network to learn the next timestep (50th), so that it can predict the next x and y coordinate. I implemented an LSTM network but it is not learning. I think the problem comes from the shape of the output. Do I have to reshape my target set so that it has shape (n_samples, 2) or do I have to modify the output dense layer and set the layer size to 2? I do not understand how to tell my network that the output is 2D….

    • Avatar
      Jason Brownlee July 16, 2020 at 6:30 am #

      Yes, you can configure the model to have 2 output features, e.g. 2 nodes in the output layer.

      That means the target for each sample will be a vector, not a value. e.g. the shape of y is [n, 2]
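
A small sketch of those shapes with NumPy, assuming (hypothetically) that each track is stored as 50 recorded (x, y) points:

```python
import numpy as np

n_samples = 8   # placeholder number of tracks
tracks = np.random.default_rng(2).random((n_samples, 50, 2))

X = tracks[:, :49, :]   # first 49 time steps as input
y = tracks[:, 49, :]    # 50th time step as a 2-element target vector

print(X.shape)  # (8, 49, 2)
print(y.shape)  # (8, 2)
```

The output layer then needs 2 nodes so that each prediction has the same shape as one row of y.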

  106. Avatar
    Chung-Hao Ku July 21, 2020 at 11:31 am #

    Hello Jason,

    I am currently working with an LSTM architecture with some video skeleton data. I have prepared my data in the same way as above (multiple arrays, 144 features and varying number of frames for different video sequences). However, I am not too sure how I can reshape my training data for the input tensor. The question is, when you are dealing with video sequences that have a different number of frames for different video samples, what would be an appropriate way to reshape the training data (video samples)? I was thinking that since the input time steps are going to be fixed, maybe one way is to sample the same number of video frames for each video sequence, at the cost of losing some information. I would like to ask your opinion on this if possible. Many thanks.

    • Avatar
      Chung-Hao Ku July 21, 2020 at 12:05 pm #

      My takeaway from the idea described in the code above is that maybe the only way is to use the predefined input time steps to sample frames for each video sequence and start from there.

    • Avatar
      Jason Brownlee July 21, 2020 at 1:50 pm #

      Yes, truncating/sampling to the same number of frames would be an easy first thing to try.
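
One simple way to do the sampling, sketched with NumPy: pick evenly spaced frame indices from each video. The helper `sample_frames`, the target of 30 frames and the video lengths are all made up for illustration.

```python
import numpy as np

def sample_frames(video, n_frames):
    """Pick n_frames evenly spaced frames from a [frames, features] array."""
    idx = np.linspace(0, len(video) - 1, n_frames).round().astype(int)
    return video[idx]

# Three hypothetical videos with 144 features and different frame counts
videos = [np.zeros((60, 144)), np.zeros((95, 144)), np.zeros((41, 144))]

X = np.stack([sample_frames(v, 30) for v in videos])
print(X.shape)  # (3, 30, 144)
```

If a video has fewer frames than the target, some indices repeat, so short clips are stretched rather than padded.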

  107. Avatar
    Carohuan August 6, 2020 at 12:56 pm #

    hi, Jason,
    I have been coding an LSTM model with Keras recently, but I have some questions about the data shape. I have data like this:
    [list([[1, 2, 3, 4, 5, 6, 7, 8], [9, 10, 11, 3, 6, 5, 5, 4, 4, 12]])
    list([[13, 14, 15, 16, 17, 18, 1, 19], [16, 14, 18, 1, 20, 14, 21, 22]])
    list([[1, 2, 23, 6, 24, 25, 7, 26, 27], [28, 28, 29, 30, 31, 23, 1, 4, 6, 7, 25, 26], [32, 29, 33, 34, 17, 33, 1, 23, 6, 7, 35, 26]])]
    The first row/list holds the diagnoses for one patient. It consists of two visits: the first visit contains diagnoses 1,2,3,4,5,6,7,8 and the second visit contains diagnoses 9,…,12. Different patients have different numbers of visits, and each visit has a different number of diagnosis codes.
    How should I build an LSTM model with this data? Looking forward to your reply!

  108. Avatar
    EOE August 10, 2020 at 10:32 pm #

    Hello Sir,

    I am trying to train a lstm model with multiple time series of varying length. For purpose of training, I have segmented each clip with a window size of 25ms. For each such segment, I extract 5 features.

    So, in summary #Samples = #Audio Clips, #Timesteps = #Segments (which varies for each file) and #Features = 5 (Fixed).

    My Doubts:

    1) Should I necessarily fix the #Timesteps by padding or dividing the longer files into smaller parts?

    2) If it is not required, then in which data type should I arrange the 3D Input?

  109. Avatar
    Yohan August 29, 2020 at 6:39 pm #

    Hi, nice explanation, but I am struggling with a problem if we assume sample as 1, but my dependent variable has for example shape (50000,1) then how to reshape to match the sample dimension? I cannot change sample to 50000 as that would exceed the matrix integer values, please let me know what to do.

  110. Avatar
    Yong September 1, 2020 at 12:49 am #

    Hi. Jason . Thanks a lot.

    I’m struggling with a problem, if I have the training set as [Samples=1 ,Timestep=6, Feature=6] and I put the lstm input shape as (Samples=1 ,Timestep=6, Feature=1), are the remaining features ignored or repeated until the 6th feature?

    • Avatar
      Jason Brownlee September 1, 2020 at 6:34 am #

      It will not let you reshape the data that way, an error will be thrown.
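
The same thing can be seen with NumPy alone: 36 values cannot be reshaped into 6, so the call fails. Keeping only one feature is a slicing operation, not a reshape. A quick sketch with placeholder data:

```python
import numpy as np

X = np.arange(36).reshape(1, 6, 6)   # [samples=1, timesteps=6, features=6]

message = None
try:
    X.reshape(1, 6, 1)               # 36 values do not fit into 6 slots
except ValueError as err:
    message = str(err)
    print(message)

# To keep just the first feature, slice instead of reshaping
first_feature = X[:, :, :1]
print(first_feature.shape)  # (1, 6, 1)
```

Keras raises its own error when the data passed to the model does not match the declared input shape; the NumPy error here is only an analogy.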

  111. Avatar
    Minya September 27, 2020 at 7:16 pm #

    Hi Jason,thank you for this post. I’ve learned a lot from it.
    I am confused with a problem.

    My dataset is users’ daily measured value for each hour in a day. For example:

    ID | 00:00 | 01:00 | … | 23:00
    1 | v1 | v2 | … | v24
    2 | v1 | v2 | … | v24

    I am doing a time series classification.
    If I want to use the hourly data, which are 24 values per day, to classify the ID, does the input shape will be like (sample number, 24, 1)?
    Or it should be (sample number, 1, 24)?

    I read most of the comments but am a little confused. If the number of neurons is not relevant to the input time steps, does it mean that the network will be fed the data according to the time steps?
    For example, if the number of time steps is 24 and the number of features is 2, will each RNN neuron be fed 24 times, with 2 features fed each time?

  112. Avatar
    Felipe October 3, 2020 at 9:28 am #

    Nice post Jason. Thank you so much!

  113. Avatar
    Nouman Mustafa October 24, 2020 at 2:06 am #

    I have a question. What is the relation between data and its shape with the neural network. For instance if I have data with shape (45,1,700) How should the neural network be like for predicting output?

  114. Avatar
    Azri November 5, 2020 at 1:24 am #

    Hi Jason,thank you for this post.

    I am working on a text classification model, I use the Keras functional API, I want to concatenate the three flatten outputs of a CNN multi-channel for text (size (0.768) ) with an additional features vector that has 16 features for each word and next fed this vector to an LSTM network. I’m a bit confused about how to reshape this vector.

  115. Avatar
    Ihtesham November 9, 2020 at 5:48 am #

    goal is to train a model that can predict patient’s mortality based on four physiological measurements for 48 hours after admitting to ICU.

    Description and Data
    Link to download data:
    In train folder, you are given four csv files (Heart.csv, Temperature.csv, Respiration.csv, Glucose.csv) containing four physiological measurements for the same 2000 patients admitted to ICU in a hospital. Rows represents patients and the columns represent the physiological measurements of the patient taken hourly. There are 48 columns indicating the first 48 hours after the patient is admitted to ICU. 50% of patients survives the ICU and the remaining 50% die at some point after the first 48 hours.
    The class labels are stored in y_train.csv (1: survives, 0: death). y_train.csv has only one column that indicates the class labels. Assume that order of patients in y_train.csv and the measurement csv files are same.
    First, you need to load these four files to a numpy tensor x_train of shape (2000,48,4), which is in the form of (# of training samples, # of time steps, # of features). Here # of time steps is 48 and # of features = 4 (four physiological metrics). This can be done in different ways. One way is to first create an empty x_train tensor of shape (2000,48,4), then load each of the csv files and simply copy the data to x_train along the final axis. For example, x_train[:,:,0] can store the heart measurements, x_train[:,:,1] can store the temperature measurements, and so on. Also, load the class labels to a y_train tensor of shape (2000,).
    In test folder, you are also given four csv files (Heart.csv, Temperature.csv, Respiration.csv, Glucose.csv) containing four physiological measurements for the same test 400 patients. You will need to load these four files to x_test tensor of shape (400,48,4). The class labels are stored in y_test.csv (1: survives, 0: death). Also, load class labels to y_test tensor of shape (400,). y_test.csv has only one column that indicates the class labels. Assume that order of patients in y_test.csv and the measurement csv files are same.
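
A sketch of the stacking step with NumPy. Random arrays stand in for the four CSV files here; with real files, each array could be read with something like `np.loadtxt(path, delimiter=",")`, assuming the files contain only numeric values and no header row.

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-ins for Heart.csv, Temperature.csv, Respiration.csv, Glucose.csv:
# 2000 patients (rows) by 48 hourly measurements (columns)
heart, temperature, respiration, glucose = (rng.random((2000, 48)) for _ in range(4))

# Stack along a new final axis: [samples, time steps, features]
x_train = np.stack([heart, temperature, respiration, glucose], axis=-1)

print(x_train.shape)  # (2000, 48, 4)
```

After stacking, `x_train[:, :, 0]` holds the heart measurements, `x_train[:, :, 1]` the temperature measurements, and so on.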

    Create a keras CNN model that contains the following layers
    • 1D CNN layer
    • ReLU activation layer.
    • 1D MaxPooling layer of window size = 4, stride = 1.
    • Flatten layer.
    • Output Dense Layer with one node and sigmoid activation.
    • Use binary cross entropy as the loss function
    • Use early stopping criteria, with patience = 3.
    Hyperparameter optimization
    In the model above, perform hyperparameter optimization for the following hyperparameters. The different options for each hyperparameter to be explored are shown. Only one method of hyperparameter optimization is necessary.
    • Kernel length = 3, 5.
    • Number of kernels = 6, 32, 64, 92.
    • Batch size = 20, 30, 50.
    • Optimizers = Adam, SGD, RMSprop.
    • Learning rate = 1*lr, 0.1*lr, 10*lr, where lr is the default learning rate of the optimizer in Keras.

  116. Avatar
    Maya December 14, 2020 at 10:45 pm #


    first of all all of your articles are very helpful, thanks!

    I have the problem that even though the reshape went fine, I get an error that the test data does not fit the train data in terms of dimensions.

    So I have a dataframe with [5000,140]. Through an 80/20 split I get train_data=(4000,140) and test_data=(1000,140). Through reshaping, if I understood it correctly, this should be train_data=(1,4000,140) and test_data=(1,1000,140)?

    Now even if the reshaping is correct, when running the autoencoder I get an error at validation_data, that the time steps of the train and test data are not equal but must be. What does that mean? How can I reshape the dimensions correctly so that this error does not occur?

  117. Avatar
    Parth December 15, 2020 at 12:39 am #

    Great post Jason. I have a question. My data set is 600 time-sequences of two features. So (600,2) dimensional array. I reshaped it as suggested by you to (1,600,2) or (3,200,2). This is my X_train. I’m not sure what my y_train should be?

    As per the structure of LSTM, this must be a single value (for (1,600,2)) or 3 values (3,200,2). What exact values should there be in this array?

    I am trying to develop a model to classify a given sequence of values. Would really appreciate any guidance.

    • Avatar
      Jason Brownlee December 15, 2020 at 6:27 am #

      The target (y) is the output of each sample that you want to predict or classify.

      • Avatar
        Parth December 21, 2020 at 9:01 pm #

        Thanks so much for your prompt response, Jason.

        If I understand it correctly when my test data is (1,600,2), the y_train should be (1,) with the class label as an element. How should I handle this when I split this into (3,200,2) as you suggested?

        My class labels wouldn’t work here because they stand for an entire class of 600 time steps.

        I hope I have explained the scenario properly and looking forward to your response.

        • Avatar
          Jason Brownlee December 22, 2020 at 6:44 am #

          Not sure I follow, sorry.

          Also, a dataset with just 1 or 3 samples does not sound reasonable for training an LSTM model.

  118. Avatar
    yassmine December 16, 2020 at 9:57 am #

    Greetings sir, I am a computer science student specializing in software engineering, and I am currently working on my final year project. The project is to predict the life cycle of a given polymer (until it ruptures). In my dataset there are 3 columns: the force applied to the polymer during specific hours, which gives us the stress on the polymer. The output is one numerical value, the stress value (non-linear regression). I tried working with a CNN but couldn’t get it to work, so I am trying to use an LSTM network. My data needs to be split randomly, as the train_test_split function does, but every LSTM network I have seen uses a manual pattern that splits the data sequentially, which makes it easier to append the time sequence to the training and testing data. So I am wondering whether an LSTM works well for this regression problem, and whether there is a tutorial on how to append the time sequence to the output variables of the train_test_split function (ndarrays). I tried converting them to arrays and lists, but unfortunately an error always pops up. I really need the answer, and sorry for my English, it’s not my language. Thanks.

  119. Avatar
    Abdo December 25, 2020 at 6:18 am #

    Hello Jason and thanks for your support for all of us.

    I am wondering if I can use LSTM to predict the outcome of the following situation.

    I have the following dataset structure:

    3 13 0.1 70
    1 26 0.5 67
    2 … … …
    . … … …
    . … … …
    1 20 0 87
    3 26 1 66
    2 12 1.2 90
    3 27 1.8 88

    The MSE ranges from 0 to 30 (0..30), the CR ranges from 0 to 5 (0..5) and the Age ranges from 30 to 45 (30..45).

    As you can see, we have different rows for the same device (for example, in the table above we have three different readings for device #3). My goal is to use an LSTM to predict the future readings for all the devices in our dataset; in other words, what are the MSE and CDR for device #3?

    Can you please tell me whether this is applicable and, if so, what data preparation I need to do before feeding the data to an LSTM?

    Thanks in advance

    • Avatar
      Jason Brownlee December 26, 2020 at 5:02 am #

      LSTM might be appropriate, perhaps try it and compare results to other algorithms like MLP, CNN and other ML algorithms.

  120. Avatar
    Ananthakrishnan CG January 18, 2021 at 2:57 pm #

    I have a dataset of 444 rows and 81 columns (features). I need to input this to an LSTM. How do I convert this data to 3D so that it can be input to an LSTM model?

    Kindly waiting for your reply.
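
    Two common ways to turn a 2D table like this into the 3D [samples, time steps, features] format are sketched below with NumPy. The window length of 4 is a hypothetical choice, and which option fits depends on whether consecutive rows really form a sequence.

```python
import numpy as np

data = np.zeros((444, 81), dtype="float32")  # placeholder for the loaded table

# Option A: every row is its own sample with a single time step.
one_step = data.reshape((data.shape[0], 1, data.shape[1]))

# Option B: overlapping windows of consecutive rows (window length is hypothetical).
n_steps = 4
windows = np.stack([data[i:i + n_steps] for i in range(len(data) - n_steps + 1)])

print(one_step.shape)  # (444, 1, 81)
print(windows.shape)   # (441, 4, 81)
```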

  121. Avatar
    Alex February 5, 2021 at 12:55 pm #

    Hi Jason,
    I have a doubt. Is it possible for the input and output dimensions of a bidirectional RNN (GRU/LSTM) to be equal?

    E.g., if the input layer dimension is (128,16,128), followed by a bidirectional GRU with a hidden layer dimension of 32, can the output also be equal to the input, i.e. (128,16,128)?

    • Avatar
      Jason Brownlee February 5, 2021 at 1:03 pm #

      You can design the model any way you like.

      • Avatar
        Alex February 5, 2021 at 1:09 pm #

        Hi Jason,
        Thanks for the swift response. I tried but when I use 32 as hidden units in Bi-GRU I am getting output dimension as (None, 64,32) for input dimension (None, 64,16).

        • Avatar
          Jason Brownlee February 6, 2021 at 5:42 am #

          Perhaps an encoder-decoder type model would be more appropriate for you if you want to control the number of output steps of your model.

  122. Avatar
    Alex February 5, 2021 at 1:12 pm #

    I am using the code as shown below:

    output = Bidirectional(GRU(32, return_sequences=True), merge_mode='mul')(input)
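
    The size of the last output dimension follows from the merge mode of the Keras Bidirectional wrapper: "concat" joins the forward and backward outputs and doubles the number of units, while "sum", "mul" and "ave" combine them element-wise and keep it unchanged. The helper below is a hypothetical pure-Python sketch of that arithmetic, not part of Keras.

```python
def bidirectional_output_features(units, merge_mode="concat"):
    """Feature size of a Bidirectional RNN output for a given merge mode."""
    if merge_mode == "concat":
        return 2 * units  # forward and backward outputs side by side
    if merge_mode in ("sum", "mul", "ave"):
        return units      # element-wise merge keeps the size
    raise ValueError(f"merge mode not covered by this sketch: {merge_mode!r}")

print(bidirectional_output_features(32, "mul"))     # 32
print(bidirectional_output_features(64, "concat"))  # 128
```

    With 32 units and merge_mode='mul' the last dimension is 32, as observed. To return 128 features for a 128-feature input, 64 units with "concat" or 128 units with "mul" would do it. With return_sequences=True the time step dimension is passed through unchanged.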

  123. Avatar
    farheen February 27, 2021 at 5:08 am #

    pls help me with following

    import keras
    from keras.models import Sequential
    from keras.layers import LSTM, Dense, Dropout
    from keras.optimizers import Adam
    import pandas as pd
    #dataset import
    dataset = pd.read_csv('C:\\Users\\Fatima\\Downloads\\train_subject1_raw01.csv') #You need to change the directory accordingly
    X = dataset.iloc[:,:32].values
    y = dataset.iloc[:,32:33].values

    #Normalizing the data
    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    X = sc.fit_transform(X)
    from sklearn.preprocessing import OneHotEncoder
    ohe = OneHotEncoder()
    y = ohe.fit_transform(y).toarray()

    from sklearn.model_selection import train_test_split
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2)

    print("train size: {}".format(X_train.shape))
    print("train Label size: {}".format(y_train.shape))
    print("test size: {}".format(X_test.shape))
    print("test Label size: {}".format(y_test.shape))

    X_train = X_train.reshape(1, 97893, 32)
    y_train = y_train.reshape(1, 97893, 3)

    X_test = X_test.reshape(1, 24474, 32)
    y_test = y_test.reshape(1, 24474, 3)

    print("train size: {}".format(X_train.shape))
    print("train Label size: {}".format(y_train.shape))
    print("test size: {}".format(X_test.shape))
    print("test Label size: {}".format(y_test.shape))

    #Initializing the classifier Network
    classifier = Sequential()
    classifier.add(LSTM(128, input_shape=(X_train.shape[1:]), return_sequences=True))

    #Adding a second LSTM network layer

    #Adding a dense hidden layer
    classifier.add(Dense(64, activation='relu'))

    #Adding the output layer
    classifier.add(Dense(3, activation='softmax'))
    #Compiling the network
    classifier.compile( loss='categorical_crossentropy',
    optimizer=Adam(lr=0.001, decay=1e-6),
    metrics=['accuracy'] )

    #Fitting the data to the model
    classifier.fit(X_train, y_train, epochs=3, validation_data=(X_test, y_test))

    print("train size: {}".format(X_train.shape))
    print("train Label size: {}".format(y_train.shape))
    print("test size: {}".format(X_test.shape))
    print("test Label size: {}".format(y_test.shape))
    classifier.fit(X_train, y_train, batch_size=batch_size, epochs=1, verbose=5)

    test_loss, test_acc = classifier.evaluate(X_test, y_test)
    print('Test Loss: {}'.format(test_loss))
    print('Test Accuracy: {}'.format(test_acc))

    error at classifier.fit(X_train, y_train, batch_size=64, epochs=1, verbose=5)

    ValueError: Shapes (None, 97893, 3) and (None, 3) are incompatible

  124. Avatar
    Sangita May 1, 2021 at 7:00 pm #

    hi jason
    please help me with the codes.

    #Import the libraries
    import math
    from math import sqrt
    import numpy as np
    from numpy import concatenate
    import pandas as pd
    from pandas import DataFrame
    from pandas import concat
    from pandas import read_csv
    from sklearn.preprocessing import MinMaxScaler
    #from sklearn.preprocessing import LabelEncoder
    from sklearn.metrics import mean_squared_error
    from keras.models import Sequential
    from keras.layers import Dense, LSTM
    import matplotlib.pyplot as plt
    from matplotlib import pyplot
    plt.style.use('fivethirtyeight')

    # load dataset
    dataset = read_csv('feature5_ctu_nil_targ.csv', header=0, index_col=0)
    values = dataset.values

    # ensure all data is float
    values = values.astype('float32')
    # normalize features
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaled = scaler.fit_transform(values)

    # split the dataset into 75/25 for train and test
    training_dataset_length = math.ceil(len(dataset) * .75)
    train = values[:training_dataset_length, :]
    test = values[training_dataset_length:, :]

    # split into input and outputs
    train_X, train_y = train[:, :-1], train[:, -1]
    test_X, test_y = test[:, :-1], test[:, -1]

    # reshape input to be 3D [samples, timesteps, features]
    train_X = np.reshape(train_X, (train_X.shape[0], 1, train_X.shape[1]))
    #train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
    #train_y = train_y.reshape((train_y.shape[0], 1, train_y.shape[1]))
    test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
    #test_X = np.reshape(test_X (test_X.shape[0], 1, test_X.shape[1]))
    #test_y = test_y.reshape((test_y.shape[0], 1, test_y.shape[1]))
    print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

    # design network
    model = Sequential()
    model.add(LSTM(50, return_sequences=True, input_shape=(train_X.shape[1], train_X.shape[2])))
    model.add(LSTM(units=30, return_sequences=True))
    model.compile(loss='mae', optimizer='adam')

    # fit network
    history = model.fit(train_X, train_y, epochs=10, batch_size=50, validation_data=(test_X, test_y), verbose=2, shuffle=False)

    # plot history
    '''pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')

    '''
    #check predicted values
    predictions = model.predict(test_X)
    #Undo scaling
    predictions = scaler.inverse_transform(predictions)

    #Calculate RMSE score
    rmse=np.sqrt(np.mean(((predictions- test_y)**2)))

    # make a prediction
    yhat = model.predict(test_X)
    test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))

    # invert scaling for forecast
    inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
    inv_yhat = scaler.inverse_transform(inv_yhat)
    inv_yhat = inv_yhat[:,0]

    # invert scaling for actual
    test_y = test_y.reshape((len(test_y), 1))
    inv_y = concatenate((test_y, test_X[:, 1:]), axis=1)
    inv_y = scaler.inverse_transform(inv_y)
    inv_y = inv_y[:,0]

    # calculate RMSE
    rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
    print('Test RMSE: %.3f' % rmse)

    ValueError: operands could not be broadcast together with shapes (81367,3) (4,) (81367,3)

  125. Avatar
    Mahan May 19, 2021 at 11:07 am #

    Hi Jason,
    Thank you for your great post. I have a dataset of 5000 patients. For each patient, a biosignal has been recorded for 60 seconds with a frequency of 80Hz, i.e. I have 5000 univariate samples with each sample containing 4800 data points. I want to classify my time series into two categories (mortality vs. non-mortality). If I want to use LSTMs or CNNs for classification, could you please let me know how I should structure X_train and what the input_shape looks like? As far as I’ve understood, the time step should be between 200-400 for LSTMs. Let’s say I pick 200. Is either of these two situations correct?

    1) Do I have to split each signal into 200 segments? Since each segment would have 24 data points (4800/200 = 24), do I have to convert each segment to a single number (for example by averaging the 24 values)? Is it correct to say that:
    X_train = (5000, 200)
    input_shape = (5000, 200, 1)?

    2) Do I have to divide each patient’s signal into 24 samples (with each sample containing 200 datapoints)? Is it correct to say that:
    X_train = (24*5000, 200)
    input_shape = (24*5000, 200, 1)?

    I would appreciate your help.
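
    The two options in this comment can be sketched with NumPy as follows. A small hypothetical number of patients keeps the example light, and the all-zero labels are placeholders. In both cases the array passed to the model is 3D, while the input_shape argument would only hold the time steps and features, (200, 1), because the number of samples is not part of it.

```python
import numpy as np

n_patients = 50  # hypothetical; the comment describes 5000 patients
signals = np.random.default_rng(0).normal(size=(n_patients, 4800)).astype("float32")

# Option 1: average every 24 consecutive points -> 200 time steps per patient.
option_1 = signals.reshape(n_patients, 200, 24).mean(axis=2)[..., np.newaxis]

# Option 2: cut each signal into 24 segments of 200 points; each segment is a sample.
option_2 = signals.reshape(n_patients * 24, 200, 1)

# In option 2 every segment inherits its patient's label, so labels repeat 24 times.
labels = np.zeros(n_patients, dtype="int64")  # placeholder labels
labels_2 = np.repeat(labels, 24)

print(option_1.shape, option_2.shape, labels_2.shape)
```

    Option 1 keeps one sample per patient, while option 2 yields 24 samples per patient that cannot be treated as independent when splitting into train and test sets.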

  126. Avatar
    khaoula July 27, 2021 at 10:54 pm #

    hi jason
    Could the input be defined with an Input layer rather than on the first hidden layer?

  127. Avatar
    marco ripamonti September 11, 2021 at 3:24 am #

    Hello Jason thank you for the tutorial on LSTM.

    I tried to apply the concepts to a pretty simple example.
    I have the close price of a stock for 5 years (i.e. 1500 values), and I am trying to predict the close price of the stock for the next day.
    I’d like to take the 10 previous samples into consideration in order to predict the next day.

    So the time series is one feature, and Y (the target) is the same time series shifted by one day.

    X= [2.376227, 2.376227, 2.360171, …, 4.983 , 4.973 , 5.004 ]

    Y= [ 2.376227, 2.360171, …, 4.983 , 4.973 , 5.004 ]

    Then, after normalizing both and splitting them into training and test sets:
    X_scaled, Y_scaled -> X_scaled_train, Y_scaled_train

    Now I have to reshape the X_scaled_train before passing it to the LSTM, but not the Y_scaled_train (correct?)

    The input to every LSTM layer must be three-dimensional.
    The three dimensions of this input are:
    • Samples. One sequence is one sample. A batch is comprised of one or more samples.
    • Time Steps. One time step is one point of observation in the sample.
    • Features. One feature is one observation at a time step.
    In this case 1500 samples should be reshaped to 150 (1500/10)

    X_scaled_train = np.reshape(X_scaled_train, (150, 10, 1))

    Build the model
    model = Sequential()
    model.add(LSTM(32, input_shape=(10, 1)))

    optimizer = tf.keras.optimizers.Adam(learning_rate=2e-04)

    Fit the model
    history = model.fit(X_scaled_train, y_scaled_train, epochs=100, batch_size=1, callbacks=[early_stopping])

    Could you please tell whether the procedure is correct?
    Thanks a lot.
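
    One thing to check in this procedure is the number of samples: a non-overlapping reshape gives 150 input samples, while the shifted target still has one value per day, so X and y would not line up. A common alternative is a sliding window, where every run of 10 consecutive prices is paired with the price that follows it. This is a minimal NumPy sketch with a placeholder series, not the code from the comment.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(1500, dtype="float32")  # placeholder for the scaled close prices
n_steps = 10

# All windows of 10 consecutive values: shape (1491, 10).
windows = sliding_window_view(series, n_steps)

# Drop the last window (it has no next value) and add the feature axis.
X = windows[:-1][..., np.newaxis]
# The target of each window is the value that immediately follows it.
y = series[n_steps:]

print(X.shape, y.shape)  # (1490, 10, 1) (1490,)
```

    X and y now have the same number of samples, and the target stays 1D (one value per sample), so only the input needs the 3D reshape.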

    • Avatar
      Adrian Tam September 11, 2021 at 6:48 am #

      Sorry, it is too long for me to read. Can you try this out and see if the result fits your expectation?

  128. Avatar
    Sophia September 15, 2021 at 11:32 pm #

    Hi Jason! Thank you for your work!! I have a question: In my work I have a dataframe in which each line has a patient record over the years and each column is a feature. The output is 0 or 1, depending on whether the patient has the disease or not.
    Example for 2 years with 5 features: X is [[1 2 3 4 5], [1 3 5 6 4]] and the output [0 1]

    My goal is through an LSTM to be able to predict whether a next patient will have the disease or not, but I don’t understand how I should reshape the data. For 10 years and 5 features, I have a y with 10 values (one for each year) and should I put the X in what shape?

  129. Avatar
    leili October 10, 2021 at 6:43 pm #

    Hello Jason, thank you for the awesome tutorials.

    You have a tutorial about Human Activity Recognition with LSTM. In that example, the input shape is (128,9). It means 128 time steps and 9 features. Also the batch size equals 64. As far as I understood, it indicates that in each process, 64 samples are fed to the network, and each sample contains 128 time steps and 9 features. Am I right?
    In that tutorial, it is mentioned that there is a 50% window overlap because 128/64 = 2. I am confused about the meaning of overlap here. When 64 samples are fed into the layer in each process, how does window overlapping occur?

    Thanks again for your time.

    • Avatar
      Adrian Tam October 13, 2021 at 7:07 am #

      You’re right on the batch.

      Overlap is a concept specific to time series. Imagine you have time 1, 2, 3, etc., and a window of size 10. Obviously the first window of data is 1 to 10, but what about the second window? In the case of 50% overlap, that is 6 to 15. In the case of 0% overlap, you use 11 to 20.
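
      The window arithmetic above can be reproduced with a few lines of NumPy. The make_windows helper is a hypothetical name used only for this sketch.

```python
import numpy as np

def make_windows(series, size, step):
    """Cut a 1D series into windows of `size`, moving `step` points each time."""
    starts = range(0, len(series) - size + 1, step)
    return np.stack([series[s:s + size] for s in starts])

t = np.arange(1, 21)  # time points 1..20

half_overlap = make_windows(t, size=10, step=5)   # 50% overlap
no_overlap = make_windows(t, size=10, step=10)    # 0% overlap

print(half_overlap[1].tolist())  # [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
print(no_overlap[1].tolist())    # [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
```

      The overlap is decided when the windows are cut from the raw series, before any batching, so it is independent of the batch size.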

  130. Avatar
    farzaneh October 15, 2021 at 10:37 pm #

    Hi Jason
    I changed the code to be compatible with my data (a CSV file with 405 time steps and 920 features).
    When I run the code, after epoch 1 the loss is nan and the output vector is completely nan.
    I have tested it on a small file (a CSV file with 405 time steps and 4 features) and it works correctly.
    Can you help me, if possible?

  131. Avatar
    Anthony The Koala October 24, 2021 at 11:24 am #

    Dear Dr Jason,
    Thank you for your tutorial.
    The input data to the LSTM model is 3D as in the simple example

    The array is shaped in the form of:
    number of samples = batch size = 1
    number of time steps = number of rows = 10
    number of features = number of columns = 1

    Questions for clarification:
    (1) number of samples = batch size = 1: is that a one-step-ahead prediction?
    2(a) number of samples = batch size = 2: is that a two-steps-ahead prediction?
    2(b) Can we predict one step ahead even though batch size = 2?
    2(c) Setting up a model with batch_size >= 2: does input_shape for the LSTM take 3 parameters?
    model.add(LSTM(32, input_shape=(2, 10, 1)))

    Thank you,
    Anthony of Sydney

    • Avatar
      Adrian Tam October 27, 2021 at 2:04 am #

      The batch size here should be how much your LSTM should remember in the hidden layer. You always get one step ahead prediction but the batch size tells when you should reset the hidden state.

  132. Avatar
    Miguel August 13, 2022 at 12:39 am #

    Dear Jason and Team,

    Thanks so much for all this content, it is the best website I have seen on the internet.

    I have a question regarding a problem I want to solve. I have a periodic signal (voltage over time), and I want to detect with a neural network when its period decreases by a certain percentage so I can activate an actuator (an on-off actuator). I have read your book about LSTMs, and I think that I can use a Sequence Classification LSTM architecture. Does this make sense?

    Also, I don’t know how to prepare the data for the x and y inputs when fitting the model. My guess is that x would be the voltage over time, and y would be the boolean values of this actuator (i.e. when the period decreases). Is this correct?

    Thanks a lot and greetings from Spain!

  133. Avatar
    Omer Birinci October 17, 2022 at 12:27 am #

    Hello. I have a problem related to these things. I have sequential data, and the input of each sample is composed of 4 different time series. As an output, I have one time series for each sample that has the same length as the inputs. When I rearrange the matrices, the input naturally has 4x as many rows as the output, and Keras gives the error “ValueError: Data cardinality is ambiguous:
    x sizes: 504
    y sizes: 126
    Make sure all arrays contain the same number of samples.”

    How can I solve this problem? The number of rows must be the same, but I have 4 input rows and one output row for every sample.
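
    One way to make the sample counts agree is to treat the 4 input series of a sample as 4 features rather than 4 rows. The NumPy sketch below assumes the 504 input rows are ordered so that each group of 4 consecutive rows belongs to one sample, and uses a hypothetical series length of 30; both are assumptions about this data.

```python
import numpy as np

n_samples, n_series, n_steps = 126, 4, 30  # n_steps = 30 is hypothetical
x_rows = np.arange(n_samples * n_series * n_steps).reshape(n_samples * n_series, n_steps)
y = np.zeros((n_samples, n_steps))  # one output series per sample

# Group the 4 rows of each sample, then move the series axis last
# so the 4 series become 4 features per time step.
x = x_rows.reshape(n_samples, n_series, n_steps).transpose(0, 2, 1)

print(x.shape, y.shape)  # (126, 30, 4) (126, 30)
```

    Both arrays now have 126 samples in the first dimension, which is what the cardinality check compares.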


  134. Avatar
    Alessia November 24, 2022 at 7:22 pm #

    Good morning sir, thank you for your explanation.
    My question is: how should I handle more than one sample when each has a different number of time steps? Should I reshape them? And then, what if I apply the model in real life? Must the number of time steps match, or is it just a training problem?
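
    One common approach to samples of different lengths is to pad them to a shared number of time steps before stacking them into a single 3D array. Keras ships a pad_sequences utility and a Masking layer for this purpose. The NumPy sketch below shows the idea with a hypothetical pad_to_length helper that pads at the front with zeros.

```python
import numpy as np

def pad_to_length(sequences, max_len, value=0.0):
    """Pre-pad (or truncate from the front) each [time steps, features] array to max_len."""
    n_features = sequences[0].shape[1]
    out = np.full((len(sequences), max_len, n_features), value, dtype="float32")
    for i, seq in enumerate(sequences):
        seq = seq[-max_len:]               # keep the most recent steps if too long
        out[i, max_len - len(seq):] = seq  # real data sits at the end
    return out

samples = [np.ones((3, 2)), np.ones((5, 2)), np.ones((4, 2))]
batch = pad_to_length(samples, max_len=5)
print(batch.shape)  # (3, 5, 2)
```

    A model trained on a fixed number of time steps would need new sequences prepared the same way at prediction time.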

  135. Avatar
    Rafael Eder February 16, 2023 at 4:28 pm #

    Is it also correct to use: data = data.reshape(-1, 10, 2)?
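
    In NumPy, a -1 in reshape asks for that dimension to be inferred from the total number of elements, so it can stand in for the number of samples. A small sketch with placeholder data:

```python
import numpy as np

data = np.arange(60).reshape(30, 2)  # 30 time steps, 2 features (placeholder values)
reshaped = data.reshape(-1, 10, 2)   # -1 lets NumPy infer the number of samples

print(reshaped.shape)  # (3, 10, 2)
```

    This only works when the number of elements is divisible by 10 * 2; otherwise NumPy raises a ValueError.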

  136. Avatar
    M. Ramos March 31, 2023 at 7:11 am #

    Dear Dr. Jason,

    I’ve been using your examples and explanations in my learning and job activities.

    I have a question regarding the shape, that’s why I came back to the basics, but I still didn’t find my answer and would like to share my example.

    I have a Multi-Channel Multi Step Input
    ( 1000 samples , 20 timesteps, 4 features).

    Training on this 3D input with LSTM and CNN works fine, but I was wondering if I could improve the speed and save some resources in my model, since 3 of these features don’t change within a sample.

    Imagine that only the first feature (curve) changes during all 20 time steps (like a sine curve with noise), while the other 3 features are relatively constant (temperature, pressure, humidity).
    curve = [0,18,36,54,72 …. 344]
    feature_temp = [10,10,10,10,11…10]
    feature_press = [990,991,991,991,….,991]
    feature_humidity = [60,60,60,60…..60]

    I was wondering:
    Is there any way to feed a model with a mixed input of (1000,20,1) and (1000,1,3)? I would like to feed the curve and the mean of the features.
    I was wondering if I could just add the mean of the features as the last time steps of each curve, but I’m afraid that’s not the right approach.

    Thank you very much for your courses, books, blog entries and of course for a reply

  137. Avatar
    Wenchao Xu July 7, 2023 at 5:36 pm #


    First of all, thank you so much for the work you do! This provides a lot of learning materials to enable me to learn this piece of content.

    I’m using LSTMs for system identification. First, let me describe the system identification task: numerical integration (e.g. a 4th-order Runge-Kutta integrator) is used to discretize nonlinear state-space models (ODEs: xdot = f(x, u, t)) with control inputs, which yields the time series data of each state.

    Before this, I roughly used NARX-MLP for system identification and achieved good prediction accuracy. When I want to replace the predictive model with an LSTM, I run into some problems:

    1. In NARX-MLP, the prediction model is xk = fnn(xk-1, …, xk-na, uk-1, …, uk-nb), where fnn is approximated by an MLP. This simply requires transforming the state time series through a sliding window to enable supervised training. The dynamic model is approximated by a sliding window and a static MLP. But an LSTM can learn time dependence by itself, so do we still need to use sliding windows? I see that the author mentioned that “many-to-one” has better performance. Does this mean that sliding windows can still be used?

    2. If I use sliding windows, then how do I convert these time series into valid LSTM inputs? Taking single-state, single-control input as an example, at time t=tk, let na=nb=2, then: features=(xk-1, xk-2, uk-1, uk-2); label=xk. All sequence data is converted to this form. In PyTorch, the LSTM input data format is (batch, seq_len, hidden_size). Here, is seq_len 2 or 1?

    There is a lot to write, thanks again!
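
    One plausible arrangement for the windowing in point 2 is to let each lagged time step carry the pair (x, u), which gives a sequence length of 2 with 2 input features per step. Note that PyTorch’s nn.LSTM with batch_first=True expects (batch, seq_len, input_size) for its input. The NumPy sketch below uses placeholder trajectories and is one possible layout under these assumptions, not the only valid one.

```python
import numpy as np

n = 100
x = np.arange(n, dtype="float32")        # placeholder state trajectory x_0..x_99
u = np.arange(n, dtype="float32") * 0.1  # placeholder control input u_0..u_99
lag = 2                                  # na = nb = 2

# Pair state and input at each time step: shape (n, 2).
xu = np.stack([x, u], axis=1)

# Window k holds steps k-2 and k-1 (oldest first); the label is x_k.
inputs = np.stack([xu[k - lag:k] for k in range(lag, n)])
labels = x[lag:]

print(inputs.shape, labels.shape)  # (98, 2, 2) (98,)
```

    Flattening the last two axes of inputs recovers the 4-value feature vector used by the NARX-MLP formulation, up to ordering.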

  138. Avatar
    Vins September 16, 2023 at 4:11 am #

    Thank you for the explanation.
    I still have a doubt.
    I have 12 different time series datasets with the same time interval of 0.002 s and a different number of rows in each dataset.
    It is sensor data.
    My approach is to add a new column that identifies each dataset and combine all the datasets to build a generalized model that can correctly predict an output of 0s and 1s.
    My train data shape is (5111548, 60).
    Can you tell me what my input data shape should be for an LSTM, and whether there is any other approach to this problem?
    Thank you

Leave a Reply