How to Convert a Time Series to a Supervised Learning Problem in Python

Last Updated on August 21, 2019

Machine learning methods like deep learning can be used for time series forecasting.

Before machine learning can be used, time series forecasting problems must be re-framed as supervised learning problems: a single sequence must be restructured into pairs of input and output sequences.

In this tutorial, you will discover how to transform univariate and multivariate time series forecasting problems into supervised learning problems for use with machine learning algorithms.

After completing this tutorial, you will know:

  • How to develop a function to transform a time series dataset into a supervised learning dataset.
  • How to transform univariate time series data for machine learning.
  • How to transform multivariate time series data for machine learning.

Kick-start your project with my new book Time Series Forecasting With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Photo by Quim Gil, some rights reserved.

Time Series vs Supervised Learning

Before we get started, let’s take a moment to better understand the form of time series and supervised learning data.

A time series is a sequence of numbers that are ordered by a time index. This can be thought of as a list or column of ordered values.

For example:
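For instance, a contrived series of made-up measurements (the values here are purely illustrative):

```python
# a contrived time series: one observation per time step, ordered by a time index
time_index = [1, 2, 3, 4, 5]
observations = [100, 110, 108, 115, 120]
for t, obs in zip(time_index, observations):
    print(t, obs)
```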

A supervised learning problem is comprised of input patterns (X) and output patterns (y), such that an algorithm can learn how to predict the output patterns from the input patterns.

For example:
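For instance, with made-up values, each input pattern could be the prior observation and each output the observation to predict:

```python
# contrived input patterns (X) and output patterns (y) for supervised learning
X = [[100], [110], [108], [115]]  # each input is the previous observation
y = [110, 108, 115, 120]          # each output is the value to predict
for inputs, output in zip(X, y):
    print(inputs, output)
```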

For more on this topic, see the post:

Pandas shift() Function

A key function to help transform time series data into a supervised learning problem is the Pandas shift() function.

Given a DataFrame, the shift() function can be used to create copies of columns that are pushed forward (rows of NaN values added to the front) or pulled back (rows of NaN values added to the end).

This is the behavior required to create columns of lag observations as well as columns of forecast observations for a time series dataset in a supervised learning format.

Let’s look at some examples of the shift function in action.

We can define a mock time series dataset as a sequence of 10 numbers, in this case a single column in a DataFrame as follows:
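A sketch of such a dataset using pandas (the single column is named 't' here purely for illustration):

```python
from pandas import DataFrame

# mock time series: a single column of 10 ordered observations
df = DataFrame()
df['t'] = [x for x in range(10)]
print(df)
```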

Running the example prints the time series data with the row indices for each observation.

We can shift all the observations down by one time step by inserting one new row at the top. Because the new row has no data, we can use NaN to represent “no data”.

The shift function can do this for us and we can insert this shifted column next to our original series.
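For example, shifting by one step and inserting the result as a new column might look like this (column names are illustrative):

```python
from pandas import DataFrame

# mock time series of 10 observations
df = DataFrame()
df['t'] = [x for x in range(10)]
# push the observations down one step; a NaN appears at the front
df['t-1'] = df['t'].shift(1)
print(df)
```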

Running the example gives us two columns in the dataset: the first with the original observations and a new, shifted second column.

We can see that shifting the series forward one time step gives us a primitive supervised learning problem, although with X and y in the wrong order. Ignore the column of row labels. The first row would have to be discarded because of the NaN value. The second row shows the input value of 0.0 in the second column (input or X) and the value of 1 in the first column (output or y).

We can see that if we repeat this process with shifts of 2, 3, and more, we could create long input sequences (X) that can be used to forecast an output value (y).

The shift operator can also accept a negative integer value. This has the effect of pulling the observations up by inserting new rows at the end. Below is an example:
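Again with illustrative column names, the negative shift might look like this:

```python
from pandas import DataFrame

# mock time series of 10 observations
df = DataFrame()
df['t'] = [x for x in range(10)]
# a negative shift pulls the observations up; a NaN appears at the end
df['t+1'] = df['t'].shift(-1)
print(df)
```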

Running the example shows a new column with a NaN value as the last value.

We can see that the first column can be taken as the input (X) and the second, forecast column as the output value (y). That is, the input value of 0 can be used to forecast the output value of 1.

Technically, in time series forecasting terminology the current time (t) and future times (t+1, t+n) are forecast times and past observations (t-1, t-n) are used to make forecasts.

We can see how positive and negative shifts can be used to create a new DataFrame from a time series with sequences of input and output patterns for a supervised learning problem.

This permits not only classical X -> y prediction, but also X -> Y where both input and output can be sequences.

Further, the shift function also works on so-called multivariate time series problems. That is where instead of having one set of observations for a time series, we have multiple (e.g. temperature and pressure). All variates in the time series can be shifted forward or backward to create multivariate input and output sequences. We will explore this more later in the tutorial.

The series_to_supervised() Function

We can use the shift() function in Pandas to automatically create new framings of time series problems given the desired length of input and output sequences.

This would be a useful tool as it would allow us to explore different framings of a time series problem with machine learning algorithms to see which might result in better performing models.

In this section, we will define a new Python function named series_to_supervised() that takes a univariate or multivariate time series and frames it as a supervised learning dataset.

The function takes four arguments:

  • data: Sequence of observations as a list or 2D NumPy array. Required.
  • n_in: Number of lag observations as input (X). Values may be between [1..len(data)]. Optional. Defaults to 1.
  • n_out: Number of observations as output (y). Values may be between [0..len(data)-1]. Optional. Defaults to 1.
  • dropnan: Boolean whether or not to drop rows with NaN values. Optional. Defaults to True.

The function returns a single value:

  • return: Pandas DataFrame of series framed for supervised learning.

The new dataset is constructed as a DataFrame, with each column suitably named by both variable number and time step. This allows you to design a variety of different time step and sequence forecasting problems from a given univariate or multivariate time series.

Once the DataFrame is returned, you can decide how to split the rows of the returned DataFrame into X and y components for supervised learning any way you wish.

The function is defined with default parameters so that if you call it with just your data, it will construct a DataFrame with t-1 as X and t as y.

The function is confirmed to be compatible with Python 2 and Python 3.

The complete function is listed below, including function comments.
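Below is a sketch of the function, reconstructed to be consistent with the arguments and the var1(t-1)-style column naming described above:

```python
from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or 2D NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values introduced by the shifting
    if dropnan:
        agg.dropna(inplace=True)
    return agg
```

Called with just a sequence of data and the defaults, it constructs a DataFrame with t-1 as X and t as y.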

Can you see obvious ways to make the function more robust or more readable?
Please let me know in the comments below.

Now that we have the whole function, we can explore how it may be used.

One-Step Univariate Forecasting

It is standard practice in time series forecasting to use lagged observations (e.g. t-1) as input variables to forecast the current time step (t).

This is called one-step forecasting.

The example below demonstrates a one lag time step (t-1) to predict the current time step (t).
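A complete, runnable example; the series_to_supervised() function described in the previous section is repeated here (as a reconstruction consistent with its documented arguments) so the listing stands alone:

```python
from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    # frame a univariate or multivariate series as a supervised learning dataset
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values introduced by the shifting
    if dropnan:
        agg.dropna(inplace=True)
    return agg

# one-step univariate framing: one lag (t-1) as input, current step (t) as output
values = [x for x in range(10)]
data = series_to_supervised(values, 1)
print(data)
```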

Running the example prints the output of the reframed time series.

We can see that the observations are named “var1” and that the input observation is suitably named (t-1) and the output time step is named (t).

We can also see that rows with NaN values have been automatically removed from the DataFrame.

We can repeat this example with an arbitrary-length input sequence, such as 3. This can be done by specifying the length of the input sequence as an argument, for example: data = series_to_supervised(values, 3).

The complete example is listed below.
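As above, the reconstructed series_to_supervised() helper is repeated so the example runs standalone:

```python
from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    # frame a univariate or multivariate series as a supervised learning dataset
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    agg = concat(cols, axis=1)
    agg.columns = names
    if dropnan:
        agg.dropna(inplace=True)
    return agg

# three lags (t-3, t-2, t-1) as input, current step (t) as output
values = [x for x in range(10)]
data = series_to_supervised(values, 3)
print(data)
```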

Again, running the example prints the reframed series. We can see that the input sequence is in the correct left-to-right order with the output variable to be predicted on the far right.

Multi-Step or Sequence Forecasting

A different type of forecasting problem is using past observations to forecast a sequence of future observations.

This may be called sequence forecasting or multi-step forecasting.

We can frame a time series for sequence forecasting by specifying another argument. For example, we could frame a forecast problem with an input sequence of 2 past observations to forecast 2 future observations as follows: data = series_to_supervised(values, 2, 2).

The complete example is listed below:
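Once again the reconstructed series_to_supervised() helper is repeated so the example runs standalone:

```python
from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    # frame a univariate or multivariate series as a supervised learning dataset
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    agg = concat(cols, axis=1)
    agg.columns = names
    if dropnan:
        agg.dropna(inplace=True)
    return agg

# two past observations as input, two future observations as output
values = [x for x in range(10)]
data = series_to_supervised(values, 2, 2)
print(data)
```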

Running the example shows the differentiation of input (t-n) and output (t+n) variables with the current observation (t) considered an output.

Multivariate Forecasting

Another important type of time series is called multivariate time series.

This is where we may have observations of multiple different measures and an interest in forecasting one or more of them.

For example, we may have two sets of time series observations obs1 and obs2 and we wish to forecast one or both of these.

We can call series_to_supervised() in exactly the same way.

For example:
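A two-variable example (the series_to_supervised() reconstruction is repeated so the listing runs standalone; the ob1/ob2 values are contrived):

```python
from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    # frame a univariate or multivariate series as a supervised learning dataset
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    agg = concat(cols, axis=1)
    agg.columns = names
    if dropnan:
        agg.dropna(inplace=True)
    return agg

# two parallel series of observations, framed with defaults (t-1 as X, t as y)
raw = DataFrame()
raw['ob1'] = [x for x in range(10)]
raw['ob2'] = [x for x in range(50, 60)]
values = raw.values
data = series_to_supervised(values)
print(data)
```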

Running the example prints the new framing of the data, showing an input pattern with one time step for both variables and an output pattern of one time step for both variables.

Again, depending on the specifics of the problem, the division of columns into X and Y components can be chosen arbitrarily, such as if the current observation of var1 was also provided as input and only var2 was to be predicted.

You can see how this may be easily used for sequence forecasting with multivariate time series by specifying the length of the input and output sequences as above.

For example, below is an example of a reframing with 1 time step as input and 2 time steps as forecast sequence.
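The same two-variable setup, asking for 1 input time step and 2 forecast time steps (again repeating the reconstructed function so the listing runs standalone):

```python
from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    # frame a univariate or multivariate series as a supervised learning dataset
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    agg = concat(cols, axis=1)
    agg.columns = names
    if dropnan:
        agg.dropna(inplace=True)
    return agg

# multivariate sequence forecasting: 1 input time step, 2 forecast time steps
raw = DataFrame()
raw['ob1'] = [x for x in range(10)]
raw['ob2'] = [x for x in range(50, 60)]
values = raw.values
data = series_to_supervised(values, 1, 2)
print(data)
```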

Running the example shows the large reframed DataFrame.

Experiment with your own dataset and try multiple different framings to see what works best.


Summary

In this tutorial, you discovered how to reframe time series datasets as supervised learning problems with Python.

Specifically, you learned:

  • About the Pandas shift() function and how it can be used to automatically define supervised learning datasets from time series data.
  • How to reframe a univariate time series into one-step and multi-step supervised learning problems.
  • How to reframe multivariate time series into one-step and multi-step supervised learning problems.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Want to Develop Time Series Forecasts with Python?

Introduction to Time Series Forecasting With Python

Develop Your Own Forecasts in Minutes

...with just a few lines of Python code

Discover how in my new Ebook:
Introduction to Time Series Forecasting With Python

It covers self-study tutorials and end-to-end projects on topics like: Loading data, visualization, modeling, algorithm tuning, and much more...

Finally Bring Time Series Forecasting to
Your Own Projects

Skip the Academics. Just Results.

See What's Inside

398 Responses to How to Convert a Time Series to a Supervised Learning Problem in Python

  1. Mikkel May 8, 2017 at 7:07 pm

    Hi Jason, thanks for your highly relevant article 🙂

    I am having a hard time following the structure of the dataset. I understand the basics of t-n, t-1, t, t+1, t+n and so forth. However, what exactly are we describing in the t and t-1 columns? Is it the change over time for a specific explanatory variable? In that case, wouldn’t it make more sense to transpose the data, so that time is described in the rows rather than the columns?

    Also, how would you then characterise following data:

    Customer_ID Month Balance
    1 01 1,500
    1 02 1,600
    1 03 1,700
    1 04 1,900
    2 01 1,000
    2 02 900
    2 03 700
    2 04 500
    3 01 3,500
    3 02 1,500
    3 03 2,500
    3 04 4,500

    Let’s say that we want to forecast their balance using supervised learning, or classify the customers as “savers” or “spenders”.

    • Jason Brownlee May 9, 2017 at 7:40 am

      Yes, it is transposing each variable, but allowing control over the length of each row back into time.

      • Mostafa March 2, 2018 at 2:43 am

        Hi Jason, thanks for the very helpful tutorials. I have the same question as Mikkel.

        How would you then characterise the following data?

        Let’s suppose we have a dataset like the one below, and we want to predict the Balance of each Customer in the fourth month. How should I deal with this problem?

        Thanks a bunch in advance

        Customer_ID Month Balance
        1 01 1,500
        1 02 1,600
        1 03 1,700
        1 04 1,900
        2 01 1,000
        2 02 900
        2 03 700
        2 04 500
        3 01 3,500
        3 02 1,500
        3 03 2,500
        3 04 4,500

        • Jason Brownlee March 2, 2018 at 5:35 am

          Test different framing of the problem.

          Try modeling all customers together as a first step.

          • Raha August 13, 2020 at 12:35 am

            Hi Jason I also have a similar dataset where we are looking at deal activity over a number of weeks and noting whether they paid early or not in a particular time period. I am trying to predict who is likely to pay early (0 for No and 1 for Yes). Can you explain a bit more what you mean by modeling all customers together as a first step. Please see sample data below:

            Deal Date Portfolio Prepaid
            1 1/1/18 A 0
            1 1/8/18 A 0
            1 1/15/18 A 0
            1 1/22/18 A 1
            2 1/1/18 B 0
            2 1/8/18 B 0
            2 1/15/18 B 0
            2 1/22/18 B 0
            3 1/1/18 A 0
            3 1/8/18 A 0
            3 1/15/18 A 0
            3 1/22/18 A 1
            4 1/1/18 B 0
            4 1/8/18 B 0
            4 1/15/18 B 0
            4 1/22/18 B 1

          • Jason Brownlee August 13, 2020 at 6:18 am

            The idea is whether it makes sense to model across subjects/sites/companies/etc. Or to model each standalone. Perhaps modeling across subjects does not make sense for your project.

          • JOJO January 9, 2023 at 8:02 pm

            The time series prediction problem is: the input X is distance and angle, and the predicted result y is two-dimensional coordinates. Should each column hold a single value, or should distance and angle be combined into one column z = (distance, angle)? If they are combined into one column, how do I generate supervised sequence pairs?

          • James Carmichael January 10, 2023 at 8:06 am

            Hi JOJO…I would recommend that you investigate sequence to sequence techniques for this purpose:


        • Yavuz June 21, 2018 at 11:08 pm

          Hi Mostafa,

          I am dealing with a similar kind of problem right now. Have you found any simple and coherent answer to your question? Any article, code example or video lecture?

          I would appreciate it if you could let me know if you found something.

          Thanks, regards.

        • Sandipan Banerjee March 19, 2019 at 12:20 am

          This is similar to my fluid mechanics problem, where the customer id is replaced by the location of a unique point in the 2-D domain (the x, y coordinates of the point), and the balance can be replaced by velocities. I, too, could not find any help online regarding handling this type of data.

          • Ahzam Ejaz August 22, 2023 at 10:54 pm

            I would recommend using LSTM for this purpose. You would supply the (t-1) input with [Customer_ID, Month, Balance] for each individual time step, and each unit of the LSTM outputs a hypothesis value in one dimension that would correspond to Balance.

    • WangGang June 25, 2018 at 10:16 pm

      I would like to ask: if I have the data for the first 5 hours, how do I get the data for the sixth hour? Thanks

    • Abhimanyu September 20, 2019 at 1:39 am

      How can I detect patterns in time series data? Suppose I have a time series InfluxDB instance where I am storing the total number of online players every minute, and I want to know when the number of players shows flat-line behavior. The flat line could be at 1 million, or at 100, or at 1000...

      • Jason Brownlee September 20, 2019 at 5:48 am

        Perhaps you can model absolute change in an interval?

  2. Daniel May 9, 2017 at 4:56 pm

    Hey Jason,

    this is an awesome article! I was looking for that the whole time.

    The only thing is I generally program in R, so I only found something similar to your code, but I am not sure if it is the same. I have got this from and it deals with lagged and leaded values. Also the output includes NA values.

    shift <- function(x, shift_by) {
      if (length(shift_by) > 1)
        return(sapply(shift_by, shift, x = x))
      out <- NULL
      abs_shift_by <- abs(shift_by)
      if (shift_by > 0)
        out <- c(tail(x, -abs_shift_by), rep(NA, abs_shift_by))
      else if (shift_by < 0)
        out <- c(rep(NA, abs_shift_by), head(x, -abs_shift_by))
      else
        out <- x
      out
    }

    x df_lead2 df_lag2
    1 1 3 NA
    2 2 4 NA
    3 3 5 1
    4 4 6 2
    5 5 7 3
    6 6 8 4
    7 7 9 5
    8 8 10 6
    9 9 NA 7
    10 10 NA 8

    I also tried to recompile your code in R, but it failed.

    • Jason Brownlee May 10, 2017 at 8:44 am

      I would recommend contacting the authors of the R code you reference.

      • chris May 12, 2017 at 3:28 am

        Can you answer this in Go, Java, C# and COBOL as well????? Thanks, I really don’t want to do anything

        • Jason Brownlee May 12, 2017 at 7:45 am

          I do my best to help, some need more help than others.

          • José Luis Sydor September 19, 2019 at 3:51 am


          • Jason Brownlee September 19, 2019 at 6:06 am

            I know. You should see some of the “can you do my assignment/project/job” emails I get 🙂

  3. Lee May 9, 2017 at 11:40 pm

    Hi Jason, good article, but could be much better if you illustrated everything with some actual time series data. Also, no need to repeat the function code 5 times 😉 Gripes aside, this was very timely as I’m just about to get into some time series forecasting, so thanks for this article!!!

  4. Christopher May 12, 2017 at 9:16 pm

    Hi Jason,
    thank you for the good article! I really like the shifting approach for reframing the training data!
    But my question about this topic is: what do you think is the next step for one-step univariate forecasting? Which machine learning method is most suitable for that?
    Obviously a regressor is the best choice but how can I determine the size of the sliding window for the training?

    Thanks a lot for your help and work
    ~ Christopher

  5. tom June 8, 2017 at 4:06 pm

    hi Jason:
    In this post, you create new framings of time series, such as t-1, t, t+1. But what’s the use of these time series? Do you mean these time series can have a good effect on the model? Maybe my question is too simple, because I am a newcomer, please understand! Thank you!

    • Jason Brownlee June 9, 2017 at 6:19 am

      I am providing a technique to help you convert a series into a supervised learning problem.

      This is valuable because you can then transform your time series problems into supervised learning problems and apply a suite of standard classification and regression techniques in order to make forecasts.

      • tom June 9, 2017 at 11:40 am

        Wow, your answers always make me learn a lot. Thank you Jason!

        • Jason Brownlee June 10, 2017 at 8:12 am

          You’re welcome.

          • Josh August 26, 2021 at 6:03 am

            Hi Jason. Fantastic article & useful code. I have a question. Once we have added the additional features, so we now have t, t-1, t-2, etc., can we split our data into train/test sets in the usual way (i.e. with a shuffle)? My thinking is yes, as the temporal information is now included in the features (t-1, t-2, etc.).
            Would be great to hear your thoughts.
            Love your work!

          • Adrian Tam August 27, 2021 at 5:44 am

            That’s correct. The whole point of the conversion is to create intervals from the time series, so that the model considers only the interval and nothing more (no memory of data outside of the interval). In this case, shuffling the intervals is fine, but shuffling within an interval is not.

  6. Brad Suzon June 23, 2017 at 11:32 pm

    If there are multiple variables varXi to train and only one variable varY to predict will the same technique be used in the below way:
    varX1(t-1) varX2(t-1) varX1(t) varX2(t) … varY(t-1) varY(t)
    .. .. .. .. .. ..
    and then use linear regression with varY(t) as the response?

    Thanks in advance

    • Jason Brownlee June 24, 2017 at 8:03 am

      Not sure I follow your question Brad, perhaps you can restate it?

    • Brad June 25, 2017 at 4:47 pm

      In case there are multiple measures and we make the transformation in order to forecast only varN:

      var1(t-1) var2(t-1) var1(t) var2(t) … varN(t-1) varN(t)

      should linear regression use varN(t) as the response variable?

  7. Geoff June 24, 2017 at 8:10 am

    Hi Jason,
    I’ve found your articles very useful during my capstone at a bootcamp I’m attending. I have two questions that I hope you could advise where to find better info about.
    First, I’ve run into an issue with running PCA on the newly supervised version of the data. Does PCA recognize that the lagged series are actually the same data? If one wants to do PCA, does it need to be performed before supervising the data?
    Secondly, what do you propose as the best learning algorithms and proper ways to perform train test splits on the data?
    Thanks again,

  8. Kushal July 1, 2017 at 1:31 pm

    Hi Jason

    Great post.

    Just one question. What if some of the input variables are continuous and some are categorical (with one binary), predicting two output variables?

    How does the shift work then?


    • Jason Brownlee July 2, 2017 at 6:26 am

      The same, but consider encoding your categorical variables first (e.g. number encoding or one hot encoding).

      • Kushal July 15, 2017 at 5:22 pm


        Should I then use the lagged versions of the predictors?


        • Jason Brownlee July 16, 2017 at 7:57 am

          Perhaps, I do not follow your question, perhaps you can restate it with more information?

  9. Viorel Emilian Teodorescu July 8, 2017 at 9:45 am

    great article, Jason!

  10. Chinesh August 10, 2017 at 5:15 pm

    very helpful article !!

    I am working on developing an algorithm which will predict the future traffic for a restaurant. The features I am using are: day, whether there was a festival, temperature, climatic condition, current rating, whether there was a holiday, service rating, number of reviews, etc. Can I solve this problem using time series analysis along with these features? If yes, how?
    Please guide me

  11. Hossein August 23, 2017 at 1:16 am

    Great article Jason. Just a naive question: how is this method different from moving average smoothing? I’m a bit confused!

    • Jason Brownlee August 23, 2017 at 6:56 am

      This post is just about the framing of the problem.

      Moving average is something to do to the data once it is framed.

  12. pkl520 August 26, 2017 at 10:29 pm

    Hi , Jason! Good article as always~

    I have a question.

    “Running the example shows the differentiation of input (t-n) and output (t+n) variables with the current observation (t) considered an output.”

    values = [x for x in range(10)]
    data = series_to_supervised(values, 2, 2)

    var1(t-2) var1(t-1) var1(t) var1(t+1)
    2 0.0 1.0 2 3.0
    3 1.0 2.0 3 4.0
    4 2.0 3.0 4 5.0
    5 3.0 4.0 5 6.0
    6 4.0 5.0 6 7.0
    7 5.0 6.0 7 8.0
    8 6.0 7.0 8 9.0

    So above example, var1(t-2) var1(t-1) are input , var1(t) var1(t+1) are output, am I right?

    Then,below example.

    raw = DataFrame()
    raw[‘ob1’] = [x for x in range(10)]
    raw[‘ob2’] = [x for x in range(50, 60)]
    values = raw.values
    data = series_to_supervised(values, 1, 2)
    Running the example shows the large reframed DataFrame.

    var1(t-1) var2(t-1) var1(t) var2(t) var1(t+1) var2(t+1)
    1 0.0 50.0 1 51 2.0 52.0
    2 1.0 51.0 2 52 3.0 53.0
    3 2.0 52.0 3 53 4.0 54.0
    4 3.0 53.0 4 54 5.0 55.0
    5 4.0 54.0 5 55 6.0 56.0
    6 5.0 55.0 6 56 7.0 57.0
    7 6.0 56.0 7 57 8.0 58.0
    8 7.0 57.0 8 58 9.0 59.0

    var1(t-1) var2(t-1) are input, var1(t) var2(t) var1(t+1) var2(t+1) are output.

    Can you answer my question? I would really appreciate it!

    • Jason Brownlee August 27, 2017 at 5:48 am

      Yes, or you can interpret and use the columns any way you wish.

  13. Thabet August 30, 2017 at 7:36 am

    Thank you Jason!!
    You are the best teacher ever

  14. Charles September 29, 2017 at 12:24 am


    I love your articles! Keep it up! I have a generalization question. In this data set:

    var1(t-1) var2(t-1) var1(t) var2(t)
    1 0.0 50.0 1 51
    2 1.0 51.0 2 52
    3 2.0 52.0 3 53
    4 3.0 53.0 4 54
    5 4.0 54.0 5 55
    6 5.0 55.0 6 56
    7 6.0 56.0 7 57
    8 7.0 57.0 8 58
    9 8.0 58.0 9 59

    If I was trying to predict var2(t) from the other 3 data, would the input data X shape be (9, 1, 3) and the target data Y be (9, 1)? To generalize, what if this was just one instance of multiple time series that I wanted to use? Say I have 1000 instances of time series. Would my data X have the shape (1000, 9, 3)? And the input target set Y would have shape (1000, 9)?

    Is my reasoning off? Am I framing my problem the wrong way?


  15. Sean Maloney October 1, 2017 at 5:24 pm

    Hi Jason!

    I’m really struggling to make a new prediction once the model has been built. Could you give an example? I’ve been trying to write a method that takes the past time data and returns the yhat for the next time step.

    Thank you.

  16. Sean Maloney October 1, 2017 at 5:28 pm

    P.S. I’m the most stuck at how to scale the new input values.

    • Jason Brownlee October 2, 2017 at 9:38 am

      Any data transforms performed on training data must be performed on new data for which you want to make a prediction.

      • Vikram August 1, 2019 at 7:20 pm

        But what if we don’t have that target variable in the dataset? Take the air pollution problem as an example: I want to predict future values based on expected values of the other variables, just as in regression where we train our model on a training dataset, test it, and then make predictions for new data where we don’t know anything about the target variable. But in LSTM with Keras, when we make a prediction on new data that has one variable less than the training dataset (like air pollution), we get a shape mismatch...

        I have been struggling with this for the last week and haven’t found a solution yet....

        • Jason Brownlee August 2, 2019 at 6:47 am

          You can frame the problem anyway you wish.

          Think about it in terms of one sample, e.g. what are the inputs and what is the output.

          Once you have that straight, shape the training data to represent that and fit the model.

  17. Nish October 23, 2017 at 11:42 am

    Hi Jason,
    This is great, but what if I have around ten features (say 4 categorical and 6 continuous), a couple of thousand data points per day, around 200 days worth of data in my training set? The shift function could work in theory but you’d be adding hundreds of thousands of columns, which would be computationally horrendous.
    In such situations, what is the recommended approach?

  18. Shud November 1, 2017 at 5:37 pm

    Hey Jason,

    I converted my time series problem into a regression problem and used GradientBoostingRegressor to model the data. I see my adjusted R-squared keep changing every time I run the model. I believe this is because of the correlation that exists between the independent variables (lag variables). How do I handle this scenario? Though the range of fluctuation is small, I am concerned that this might be a bad model.

  19. Nitin Gupta November 13, 2017 at 10:10 pm

    Hey Jason,

    I applied the concept that you have explained to my data and used linear regression. Can I expand this concept to polynomial regression also, by squaring the t-1 terms?

  20. Samuel November 15, 2017 at 9:43 pm

    Hey Jason,

    thanks a lot for your article! I already read a lot of your articles. These articles are great, they really helped me a lot.

    But I still have a rather general question, that I can’t seem to wrap my head around.

    The question is basically:
    In which case do I treat a supervised learning problem as a time series problem, or vice versa?

    For further insight, this is my problem I am currently struggling with:
    I have data out of a factory (hundreds of features), which I can use as my input.
    Additionally I have the energy demand of the factory as my output.
    So I already have a lot of input-output-pairs.
    The energy demand of the factory is also the quantity I want to predict.
    Each data point has its own timestamp.
    I can transform the timestamp into several features to take trends and seasonality into account.
    Subsequently I can use different regression models to predict the energy demand of the factory.
    This would then be a classical supervised regression problem.

    But as I understood it from your time series articles, I could as well treat the same problem as a time series problem.
    I could use the timestamp to extract time values which I can use in multivariate time series forecasting.

    In most examples you gave in your time series articles, you had the output over time.
    And in this article you shifted the time series to get an input, in order to treat the problem as a supervised learning problem.

    So let’s suppose you have the same number of features in both cases.
    Is it a promising solution to change the supervised learning problem to a time series problem?
    What would be the benefits and drawbacks of doing this?

    As most regression outputs are over time.
    Is there a general rule, when to use which framing(supervised or time series) of the problem?

    I hope, that I could phrase my confusion in an ordered fashion.

    Thanks a lot for your time and help, I really appreciate it!

    Cheers Samuel

    • Jason Brownlee November 16, 2017 at 10:29 am

      To use supervised learning algorithms you must represent your time series as a supervised learning problem.

      Not sure I follow what you mean by turning a supervised learning problem into a series?

      • Samuel November 29, 2017 at 10:19 pm

        Dear Jason,

        thank you for your fast answer.
        I’m sorry that I couldn’t frame my question comprehensibly, I’m still new to ML.
        I’ll try to explain what I mean with an example.

        Let’s suppose you have the following data, I adapted it from your article:

        input1(time), input2, output
        1, 0.2, 88
        2, 0.5, 89
        3, 0.7, 87
        4, 0.4, 88
        5, 1.0, 90

        This data is, what you would consider a time series. But as you already have 2 inputs and 1 output you could already use the data for supervised machine learning.
        In order to predict future outputs of the data you would have to know input 1 and 2 at timestep 6. Let’s assume you know from your production plan in a factory that the input2 will have a value of 0.8 at timestep 6 (input1). With this data you could gain y_pred from your model. You would have treated the data purely as a supervised machine learning problem.

        input1(time), input2, output
        1, 0.2, 88
        2, 0.5, 89
        3, 0.7, 87
        4, 0.4, 88
        5, 1.0, 90
        6, 0.8, y_pred

        But you could do time-series forecasting with the same data as well, if I understood your articles correctly.

        input1(time), input2, output
        nan, nan, 88
        1, 0.2, 89
        2, 0.5, 87
        3, 0.7, 88
        4, 0.4, 90
        5, 1.0, y_pred

        This leads to my questions:

        In which case do I treat the data as a supervised learning problem and in which case as a time series problem?
        Is it a promising solution to change the supervised learning problem to a time series problem?
        What would be the benefits and drawbacks of doing this?
        As my regression outputs are over time, is there a general rule for when to use which framing (supervised or time series) of the problem?

        I hope that I stated my questions more clearly.

        Thanks a lot in advance for your help!

        Best regards Samuel

        • Jason Brownlee November 30, 2017 at 8:16 am #

          I follow your first case mostly, but time would not be an input, it would be removed and assumed. I do not follow your second case.

          I believe it would be:

          What is best for your specific data, I have no idea. Try a suite of different framings (including more or less lag obs) and see which models give the best skill on your problem. That is the only trade-off to consider.
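For example, the two framings from the comment above can be built side by side with the Pandas shift() function. This is a minimal sketch on the toy values from the comment; the frame names are purely illustrative:

```python
import pandas as pd

# toy data from the comment: one exogenous input and the output over time
df = pd.DataFrame({'input2': [0.2, 0.5, 0.7, 0.4, 1.0],
                   'output': [88, 89, 87, 88, 90]})

# framing 1: plain supervised learning, predict output(t) from input2(t)
framing1 = df.copy()

# framing 2: time series framing, predict output(t) from input2(t-1)
framing2 = pd.DataFrame({'input2(t-1)': df['input2'].shift(1),
                         'output': df['output']}).dropna()

print(framing2)
```

Comparing model skill on framing1 versus framing2 (evaluated with walk-forward validation) is one concrete way to decide which framing suits the data.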

  21. MJ November 18, 2017 at 12:46 am #

    Very helpful, thanks!

  22. Michael November 30, 2017 at 6:47 am #

    Thank you for all the time and effort you have expended to share your knowledge of Deep Learning, Neural Networks, etc. Nice work.

    I have altered your series_to_supervised function in several ways which might be helpful to other novices:
    (1) the returned column names are based on the original data
    (2) the current period data is always included so that leading and lagging period counts can be 0.
    (3) the selLag and selFut arguments can limit the subset of columns that are shifted.

    There is a simple set of test code at the bottom of this listing:

    • Jason Brownlee November 30, 2017 at 8:30 am #

      Very cool Michael, thanks for sharing!

    • MonkeeYe June 25, 2019 at 4:54 pm #

      Very Helpful, THX!

    • Tillmann June 28, 2022 at 5:53 pm #

      @Michael Thank you for sharing your extended version of Jason's function. I encountered, however, a small limitation, as the actual values are positioned in the first column of the result, i.e. the resulting order of the columns looks like:

      values values(t-N) […] values(t-1) values(t+1) […] values(t+M)

      In Jason's version you can easily select the first N columns as input features (for example here:) and the others as targets (including the actual values). Using your code, however, the following columns are selected as input:

      values values(t-N) […] values(t-2)

      and as target:

      values(t-1) values(t+1) […] values(t+M)

      Solution: Move lines 26-28 between the two for-loops, i.e. to line 41.

  23. Maciej December 1, 2017 at 7:11 am #

    When I do forecasting, let's say only one step ahead, as the first input I should use a value that belongs, e.g., to the validation data (in order to set up the initial state of forecasting). In the second, third and subsequent prediction steps I should use the previous forecast output as the input of the NN. Do I understand correctly?

    • Jason Brownlee December 1, 2017 at 7:46 am #

      I think so.

      • Maciej December 2, 2017 at 4:15 am #

        Ok, so another question. In the blog post here:, you use test values as the input for the NN. The predictions are only saved to a list and are not used to predict further values of the time series.

        My question is: is it possible to predict a series of values knowing only the first value?
        For example, I train a network to predict values of a sine wave. Is it possible to predict the next N values of the sine wave, starting from value zero and feeding the NN with the result of each prediction to predict t+1, t+2, etc.?

        • Maciej December 2, 2017 at 4:18 am #

          If my above understanding is incorrect then it means that if your test values are completely different than those which were used to train network, we will get even worse predictions.

          • Jason Brownlee December 2, 2017 at 9:05 am #

            Yes. Bad predictions in a recursive model will give even worse subsequent predictions.

            Ideally, you want to get ground truth values as inputs.

          • Maciej December 3, 2017 at 5:34 am #

            Does it mean that using multi-step forecast (let’s say I will predict 4 values) I can predict a timeseries which contains 100 samples providing only initial step (for example providing only first two values of the timeseries) ?

          • Jason Brownlee December 4, 2017 at 7:40 am #

            Yes, but I would expect the skill to be poor – it’s a very hard problem to predict so many time steps from so little information.
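The recursive scheme discussed in this exchange can be sketched in a few lines. Note that recursive_forecast and MeanModel below are illustrative stand-ins, not code from the post; any object with a predict() mapping the last n_lag values to the next value would slot in:

```python
import numpy as np

# recursive multi-step forecast: each one-step prediction is fed back
# in as an input lag for the next step
def recursive_forecast(model, history, n_steps, n_lag):
    history = list(history)
    predictions = []
    for _ in range(n_steps):
        x = np.array(history[-n_lag:]).reshape(1, n_lag)
        yhat = float(model.predict(x)[0])
        predictions.append(yhat)
        history.append(yhat)  # forecast errors can compound from here on
    return predictions

# toy stand-in 'model' that predicts the next value as the mean of the lags
class MeanModel:
    def predict(self, x):
        return [x.mean()]

print(recursive_forecast(MeanModel(), [1.0, 2.0, 3.0], n_steps=2, n_lag=3))
```

With a real model, predictions degrade as forecast errors are fed back in, which matches the caveat about poor skill above.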

  24. Liz January 12, 2018 at 7:02 am #

    Hello Mr. Brownlee,

    thank you for all of your nice tutorials. They really help!
    I have two questions about the input data for an LSTM for multi-step predictions.
    1. If I have multiple features that I use as input for the prediction and at a point (t) I have no new values for any of them, do I have to predict all my input features in order to make a multi-step forecast?
    2. If some of my input data is binary data and not continuous can I still predict it with the same LSTM? Or do I need a separate Classification?

    Sorry if its very basic, I am quite new to LSTM.
    Best regards Liz

    • Jason Brownlee January 12, 2018 at 11:49 am #

      No, you can use whatever inputs you choose.

      Sure you can have binary inputs.

      • Liz January 13, 2018 at 1:18 am #

        Thank you for your quick answer.
        Unfortunately I still have some trouble with the implementation.
        If I use feature_1 and feature_2 as input for my LSTM but only predict feature_1 at time (t+1), how do I make the next step to know feature_1 at time (t+2)?
        Somehow I seem to miss feature_2 at time (t+1) for this approach.
        Could you tell me where I am off?
        Best regards Liz

        • Jason Brownlee January 13, 2018 at 5:34 am #

          Perhaps double check your input data is complete?

  25. strawberry lv January 31, 2018 at 6:41 pm #

    Hello, thank you for the article; I have learned a lot from it.
    Now I have a question about it.
    The method can be understood as using previous values to forecast the next value. If I need to forecast the values at t+1, …, t+N, do I need to use the model to first forecast the value at t+1, and then use that value to forecast t+2, and so on until t+N?
    Or do you have another method?

    • Arslan Ahmed March 17, 2018 at 9:01 am #

      I am working on energy consumption data and I have the same question. Did you get to know any efficient method to forecast the value at t+1, t+2, t+3 + …… t+N?

  26. Sameer January 31, 2018 at 11:05 pm #

    Hello Dr.Brownlee,

    I’m planning to purchase your Introduction to Time Series Forecasting book. I just want to know whether you’ve covered multivariate, multi-step LSTMs.

  27. Victor February 21, 2018 at 5:10 am #

    Hi Jason,

    Thanks for the article. I have a question about going back n periods in terms of choosing the features. If I have a feature and make for example 5 new features based off of some lag time, my new features are all very highly correlated (between 0.7 and 0.95). My model is resulting in training score of 1 and test score of 0.99. I’m concerned that there is an issue with multicollinearity between all the lag features that is causing my model to overfit. Is this a legitimate concern and how could I go about fixing it if so? Thanks!

    • Jason Brownlee February 21, 2018 at 6:42 am #

      Try removing correlated features, train a new model and compare model skill.

  28. Ram Seshadri February 21, 2018 at 12:06 pm #

    Dear Jason:

    My sincere thanks for all you do. Your blogs were very helpful when I started on the ML journey.

    I read this blog post for a Time Series problem I was working on. While I liked the “series_to_supervised” function, I typically use data frames to store and retrieve data while working in ML. Hence, I thought I would modify the code to send in a dataframe and get back a dataframe with just the new columns added. Please take a look at my revised code.


    Please take a look and let me know. Hope this helps others,

    • Jason Brownlee February 22, 2018 at 11:14 am #

      Very cool Ram, thanks for sharing!

    • Varun Gupta February 14, 2021 at 4:11 am #

      Thanks a ton Ram! You’re a saviour

  29. Marius Terblanche February 26, 2018 at 11:33 pm #

    Dear Jason,
    great article, as always!
    May I ask a question, please?
    Once the time series data (say for multi-step, univariate forecasting) have been prepared using code described above, is it then ready (and in the 3D structure) required for feeding into the first hidden layer of a LSTM RNN?
    May be dumb question!
    Many thanks in advance.

  30. MikeF March 7, 2018 at 12:49 pm #

    Hi Jason, thanks for this post. It’s simple enough to understand. However, after converting my time series data I found some feature values are from the future and won’t be available when trying to make predictions. How do you suggest I work around this?

  31. Adarsh March 27, 2018 at 3:11 pm #

    I have a dataset like this:

    accno dateofvisit
    12345 12/05/15 9:00:00
    123345 13/06/15 13:00:00
    12345 12/05/15 13:00:00

    How will I forecast when that customer will visit again?

  32. Fatima April 10, 2018 at 6:32 pm #


    I need to develop input vector which uses every 30 minutes prior to time t for example:

    input vector is like (t-30,t-60,t-90,…,t-240) to predict t.

    If I want to use your function for my task, is it correct to change the shift function to df.shift(3*i)?


    • Jason Brownlee April 11, 2018 at 6:33 am #

      One approach might be to selectively retrieve/remove columns after the transform.
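The shift(3*i) idea can also be sketched directly, shifting in multiples of 3 rows on a toy 10-minute series (the column names here are illustrative, and the sketch assumes one row per 10 minutes):

```python
import pandas as pd

# lag features at 30-minute spacing built from a 10-minute series:
# shift by m // 10 rows for each lag of m minutes
series = pd.Series(range(20), name='y')
cols = {'y(t-%dmin)' % m: series.shift(m // 10) for m in (30, 60, 90)}
cols['y(t)'] = series
frame = pd.DataFrame(cols).dropna()
print(frame.head())
```

Extending the tuple of lags to (30, 60, …, 240) gives the full input vector described in the question.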

      • fatima April 12, 2018 at 7:14 pm #


        So I should take these steps:

        1. Transform for different lags.
        2. Select the column related to the first lag (for example, 30 min).
        3. Transform for the other lags.
        4. Concatenate along axis=1.

        When I perform these steps, the result seems equivalent to shifting by 3.
        I have some questions:
        Which one is better to use? (Shift by 3, or do the above steps?)
        Should I remove time t after each transform and just keep time t for the last lag?


        • Jason Brownlee April 13, 2018 at 6:37 am #

          Use an approach that you feel makes the most sense for your problem.

  33. vishwas April 16, 2018 at 3:15 pm #

    Hi Jason,

    Amazing article for creating supervised series. But I have a doubt,
    Suppose I wanted to predict sales for the next 14 days using daily sales historical data. Would that require me to take 14 lags to predict the next 14 days?
    Ex: (t-14, t-13, …, t-1) to predict (t, t+1, t+2, …, t+14)

    • Jason Brownlee April 17, 2018 at 5:53 am #

      No, the in/out obs are separate. You could have 1 input and 14 outputs if you really wanted.

      • Vishwas April 17, 2018 at 3:31 pm #

        Thanks for the quick response Jason!!

      • Shaun November 19, 2019 at 8:39 am #

        Still confused.
        Please explain in more detail.
        For example, if we are now at time t and want to predict t+5,
        do we need data at t+1, t+2, t+3, t+4 first?

        Thanks, Jason

  34. Sanketh Nagarajan April 17, 2018 at 8:37 am #

    Hi Jason,

    I want to predict if the next value will be higher or lower than the previous value. Can I use the same method to frame it as a classification problem?
    For example:

    V(t) class

    0.2 0
    0.3 1
    0.1 0
    0.5 0
    2.0 1
    1.5 0

    where class zero represents a decrease and class 1 represents an increase?
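One way to sketch that classification framing is below. Note it labels each value with whether the NEXT value increases, which is a slightly different convention from the table above:

```python
import pandas as pd

# frame 'did the value go up?' as supervised classification: the input
# is the value at t, the class is 1 if the value at t+1 is higher, else 0
values = pd.Series([0.2, 0.3, 0.1, 0.5, 2.0, 1.5])
frame = pd.DataFrame({'V(t)': values,
                      'class': (values.shift(-1) > values).astype(int)})
frame = frame.iloc[:-1]  # the last row has no t+1 to compare against
print(frame)
```

The resulting frame can be fed to any classifier in the usual supervised way.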


  35. brandon May 7, 2018 at 11:17 pm #

    Hi Jason, really nice explanations in your blog. When I have the shape, e.g. (180, 20), of a shifted dataframe, how can I get back to my original data with shape (200, 1)?

    • Jason Brownlee May 8, 2018 at 6:14 am #

      You will have to write custom code to reverse the transform.

  36. Farooq Arshad May 8, 2018 at 8:23 pm #

    Hi Jason,

    Amazing article.
    I have a question: suppose I want to move the window by 24 steps instead of just one step, what modifications do I have to make in this case?
    I have energy data with a one-hour interval and I want to predict the next 24 hours (1 day) looking at the last 21 days (504 hours); then for the next prediction I want to move the window by 24 hours (1 day).

    • Jason Brownlee May 9, 2018 at 6:23 am #

      Perhaps re-read the description of the function to understand what arguments to provide to the function.

  37. Alex May 19, 2018 at 5:43 am #

    Models blindly fit on data specified like this are guaranteed to overfit.

    Suppose you estimate model performance with a cross-validation procedure and you have folds:

    Fold1 (January)
    Fold2 (February)
    Fold3 (March)
    Fold4 (April)

    Consider a model fit on folds 1, 2 and 4. Now you are predicting some feature for March based on the value of that feature in April!

    If you choose to use a lagged regressor matrix like this, please please please look into appropriate model validation.

    One good resource is Hyndman’s textbook, available freely online:

  38. marc May 23, 2018 at 10:28 am #

    Hi Jason, really nice blog; I have learned much from you. I implemented an LSTM encoder-decoder with sliding windows. The prediction was nearly the same as the input; is it usual that this happens with sliding windows? I am a bit surprised because the model saw only a small part of the data in training, yet later predicted almost the same as the input. That makes me think I might be doing something wrong. I do not want to post the code, it is just standard LSTM encoder-decoder code, but the fact that the model saw only a small part of the data in training confuses me.

  39. james May 23, 2018 at 1:15 pm #

    Hi Jason, your code is very good, but I have a question: when I change the window size (from reframed = series_to_supervised(scaled, 1, 1) to reframed = series_to_supervised(scaled, 2, 1)), I get bad predictions. How can I solve this, or what causes it?
    Please take a look at my revised code.

    from math import sqrt
    from numpy import concatenate
    from matplotlib import pyplot
    from pandas import read_csv
    from pandas import DataFrame
    from pandas import concat
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.preprocessing import LabelEncoder
    from sklearn.metrics import mean_squared_error
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.layers import LSTM
    # convert series to supervised learning
    def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
        n_vars = 1 if type(data) is list else data.shape[1]
        df = DataFrame(data)
        cols, names = list(), list()
        # input sequence (t-n, ... t-1)
        for i in range(n_in, 0, -1):
            cols.append(df.shift(i))
            names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
        # forecast sequence (t, t+1, ... t+n)
        for i in range(0, n_out):
            cols.append(df.shift(-i))
            if i == 0:
                names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
            else:
                names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
        # put it all together
        agg = concat(cols, axis=1)
        agg.columns = names
        # drop rows with NaN values
        if dropnan:
            agg.dropna(inplace=True)
        return agg

    # load dataset
    dataset = read_csv('pollution.csv', header=0, index_col=0)
    values = dataset.values

    # integer encode direction
    encoder = LabelEncoder()
    values[:,4] = encoder.fit_transform(values[:,4])

    # ensure all data is float
    values = values.astype('float32')

    # normalize features
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaled = scaler.fit_transform(values)

    # frame as supervised learning
    reframed = series_to_supervised(scaled, 2, 1)
    # drop columns we don't want to predict
    reframed.drop(reframed.columns[[9,10,11,12,13,14,15]], axis=1, inplace=True)

    # split into train and test sets
    values = reframed.values
    n_train_hours = 365*24
    train = values[:n_train_hours, :]
    test = values[n_train_hours:, :]

    # split into input and outputs
    train_X, train_y = train[:, :-1], train[:, -1]
    test_X, test_y = test[:, :-1], test[:, -1]

    # reshape input to be 3D [samples, timesteps, features]
    train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
    test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
    print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

    # design network
    model = Sequential()
    model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
    model.add(Dense(1))
    model.compile(loss='mae', optimizer='adam')
    # fit network
    history = model.fit(train_X, train_y, epochs=50, batch_size=72, validation_data=(test_X, test_y), verbose=2, shuffle=False)
    # plot history
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    pyplot.show()
    # make a prediction
    yhat = model.predict(test_X)
    test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
    # invert scaling for forecast
    inv_yhat = concatenate((yhat, test_X[:, 1:8]), axis=1)
    inv_yhat = scaler.inverse_transform(inv_yhat)
    inv_yhat = inv_yhat[:,0]
    # invert scaling for actual
    inv_y = scaler.inverse_transform(test_X[:,:8])
    inv_y = inv_y[:,0]
    # calculate RMSE
    rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
    print('Test RMSE: %.3f' % rmse)
    # plot prediction and actual
    pyplot.plot(inv_yhat[:100], label='prediction')
    pyplot.plot(inv_y[:100], label='actual')
    pyplot.legend()
    pyplot.show()

    • Jason Brownlee May 23, 2018 at 2:40 pm #

      The model may require further tuning for the change in problem.

      • james May 23, 2018 at 4:51 pm #

        I noticed that your code takes into account the effect of the last point in time on the current point in time. But this is not applicable in many cases. What are the optimization ideas?

        • Jason Brownlee May 24, 2018 at 8:08 am #

          Most approaches assume that the observation at t is a function of prior time steps (t-1, t-2, …). Why do you think this is not the case?

          • james May 24, 2018 at 11:45 am #

            Oh, maybe I didn’t describe my question clearly. My question is: why consider just t-1? When considering (t-1, t-2, t-3), the example you gave has poor performance.

          • Jason Brownlee May 24, 2018 at 1:51 pm #

            No good reason, just demonstration. You may change the model to include any set of features you wish.

  40. Ishrat Sarwar May 24, 2018 at 2:21 pm #

    Dear Sir:
    I have 70 input time series. I only need to predict 1, 2 or 3 time series out of the 70 input time series (features). Here are my questions.

    -> Should I use LSTM for such problem?
    -> Should I predict all 70 of time series?
    -> If not LSTM then what approach should I use?

    (Forex trading prediction problem)

    • Jason Brownlee May 25, 2018 at 9:17 am #

      Great questions!

      – Try a suite of methods to see what works.
      – Try different amounts of history and different numbers of forward time steps, find a sweet spot suitable for your project goals.
      – Try classical ts methods, ml methods and dl methods.

  41. marc June 1, 2018 at 6:50 pm #

    Hi Jason, I have a huge dataset with small steps between the time series values; they change very little until the last cycles. I thought that maybe shifting by more than 1 could help: how can I shift more, e.g. t-1 and t by 20 steps? Does this also make sense?

    • Jason Brownlee June 2, 2018 at 6:27 am #

      Not sure I follow, sorry. Perhaps give a small example?

      • marc June 2, 2018 at 9:55 pm #

        lets say I have this data:

        and usually if you make sliding windows, shifting them by 1 from t-2 to t
        5 6 7
        6 7 8
        7 8 9
        8 9 10
        9 10 11
        10 11 12
        11 12

        how can I do shifting not by 1 but maybe 3 looking at the first row (in this case) or more from t-2 to t:
        5 8 11
        6 9 12
        7 10
        8 11
        9 12

        I ask that because my data range is so small that shifting by 1 does not have much effect, and I thought something like this could help. How do I have to adjust your supervised learning code to do that? And do you think this is a good idea?
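One way to sketch that wider spacing is with strided indexing (strided_windows is an illustrative helper, not code from the post):

```python
import numpy as np

# windows where the lagged values are spaced several steps apart:
# each row holds (t-2s, t-s, t) for spacing s
def strided_windows(series, n_lags, spacing):
    series = np.asarray(series)
    span = (n_lags - 1) * spacing
    return np.array([series[start:start + span + 1:spacing]
                     for start in range(len(series) - span)])

print(strided_windows([5, 6, 7, 8, 9, 10, 11, 12], n_lags=3, spacing=3))
```

On the values above this reproduces the rows 5 8 11 and 6 9 12 from the example; whether the wider spacing actually helps is something to test against the shift-by-1 framing.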

  42. Bootstrap June 17, 2018 at 10:44 am #

    Hi Jason!

    Once I apply this function to my data, what’s the best way to split the data between train and test set?

    Normally I would use sklearn’s train_test_split, which can shuffle the data and apply a split based on a user-set proportion. However, intuitively, something tells me this is incorrect; rather, I would need to split the data based on the degree of shift(). Could you please clarify?
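For reference, the split for transformed time series rows is usually done in time order rather than with a shuffling train_test_split. A minimal sketch:

```python
# split the reframed rows in time order, never shuffled, so the test
# set always comes after the training set
rows = list(range(100))  # stand-in for the rows of the transformed data
n_train = int(len(rows) * 0.8)
train, test = rows[:n_train], rows[n_train:]
print(len(train), len(test))
```

Shuffling would leak future observations into training, which defeats the point of the time-ordered framing.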

  43. brad June 27, 2018 at 5:15 am #

    When I give the function a sliding window of 20, series_to_supervised(values, 20), my new data shape is (None, 21), where None is variable. Why do I get 21? Do I need to remove the last column, or how do I move on? Thanks a lot for your posts.

    • Jason Brownlee June 27, 2018 at 8:22 am #

      I would guess 20 for the input 1 for the output.

      Confirm this by printing the head() of the returned data frame.

  44. lara June 28, 2018 at 7:54 am #

    Why must we convert the time series into a supervised learning problem for an LSTM?

    • Jason Brownlee June 28, 2018 at 2:04 pm #

      Because the LSTM is a supervised learning algorithm.

  45. vinsondo July 11, 2018 at 9:55 am #

    Hi Jason, I love the articles. Thank you very much.

    I have seen you have the multiple time series inputs to predict time series output.
    I have a different input feature setup and try to figure it out how to implement them and use RNN to predict the time series output.

    Let’s say I have 7 input features, feature1 to feature7 in which feature1 is a time series.
    feature2 to feature5 are scalar values, and feature6 and feature7 are vectors.

    Another way to describe the problem, for a given single value from feature2 to feature5, (ex, 2,500, 7Mhz, 10000, respectively), and a given range of values in Feature6 and Feature7, (ex, feature6 is array [2,6,40,12,….,100] and feature7 is array [200,250,700,900,800,….,12]. Then, I need to predict the times series output from the time series input feature1.

    How do I design all these 7 feature inputs to the RNN?
    If you have a book that cover this, please let me know. Thank you.

    • Jason Brownlee July 11, 2018 at 2:55 pm #

      If you have a series and a pattern as input (is that correct?), you can have a model with an RNN for the series and another input for the pattern, e.g. a multi-headed model.

      Or you can provide the pattern as an input with each step of the series along with the series data.

      Try both approaches, and perhaps other approaches, and see what works best for your problem.

  46. James Adams July 26, 2018 at 11:04 pm #

    Thank you for this helpful article, Jason.

    In case it’s helpful to others, I’ve modified the function to be used for converting time series data over an entire DataFrame, for use with multivariate data when a DataFrame contains multiple columns of time series data, [available here](

  47. James August 1, 2018 at 7:05 am #

    Hi Jason,

    This article was really helpful as a starting point in my adventure into LSTM forecasting. Along with a couple of your other articles I was able to create a multivariate multiple time step LSTM model. Just a thought on your article itself: you used really complicated data structure (I think I ended up with array of arrays and individual values very quickly) when something simpler would do and be more easily adaptable. Over all, though, this was very good tutorial and was helpful to understand the basics of my own project.

    • Jason Brownlee August 1, 2018 at 7:51 am #

      Thanks James.

      Do you have a suggestion of something simpler?

  48. Martin Šomodi August 15, 2018 at 8:27 pm #

    Love and appreciate the article – helped me a lot with my master’s work in the beginning. I still have lot of work and studying to do, but this tutorial along with “Multivariate Time Series Forecasting with LSTMs in Keras” helped me to understand basics of working with keras and data preparation. Keep up the good work 🙂

  49. Juan Carlos Vargas Sosa August 16, 2018 at 5:52 am #

    Hi Jason,

    Thanks for the effort you put in all the blogs that you have shared with all of us.
    I want to share a small contribution: a simpler series_to_supervised function. I think it only works in Python 3.

  50. Xu August 17, 2018 at 1:57 pm #

    Hi Jason,

    Thanks for your posts. My question is: for a classification problem, is it OK to reframe the data in the same way?

  51. Carlos B August 21, 2018 at 2:07 am #

    Hi Jason,

    Your site is always so helpful! I’m slightly confused here though. If I have a time series dataset that already consists of some input variables (VarIn_1 to VarIn_3) and the corresponding output values (Out_1 and Out_2), do I still need to run the dataset through the series_to_supervised() function before fitting to my LSTM model?

    Example dataset:
    Time Win, VarIn_1, VarIn_2, VarIn_3, Out_1, Out_2
    1, 5, 3, 7, 2, 3
    2, 6, 2, 4, 3, 1
    3, 4, 4, 6, 1, 4
    …, …, …, …, …, …,

    Best wishes,

  52. Julien August 29, 2018 at 8:24 am #

    Dear Jason,
    Thank you so much for your great efforts.

    I am trying to predict a day ahead using the h2o package in R, with the glm model below.

    glm_model <- glm(RealPtot ~ ., data= c(input3, target), family=gaussian)

    Then I calculate the MAPE for each day using :

    mape_calc <- function(sub_df) {
      pred <- predict.glm(glm_model, sub_df)
      actual <- sub_df$Real_data
      mape <- 100 * mean(abs((actual - pred)/actual))
      new_df <- data.frame(date = sub_df$date[[1]], mape = mape)
      new_df
    }

    df_list <- by(test_data, test_data$date, mape_calc)

    final_df <-, df_list)

    I am trying to implement the same above code using h2o, but I am facing difficulties in data conversion in the h2o environment. Any thoughts will be appreciated. Thanks in advance.

    • Jason Brownlee August 29, 2018 at 9:18 am #

      Sorry, I don’t have any experience with h2o, perhaps contact their support?

  53. BenniEvolent September 10, 2018 at 5:46 pm #

    Jason your articles are great. I do not mind code repetition, it does take care of issues newbies might face. The Responses section is also a big help. Thanks!

  54. Aladji Diallo September 13, 2018 at 12:13 am #

    I wonder how you get rid of the dates. I am trying to use your method to make predictions for time series, but I have the date as the index.

    • Jason Brownlee September 13, 2018 at 8:05 am #

      Remove the column that contains the dates. You can do this in code or in the data file directly (e.g. via excel).
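A minimal sketch of dropping the date column in code (the DataFrame and column names here are illustrative):

```python
import pandas as pd

# move the date column to the index so only the values remain
df = pd.DataFrame({'date': ['2017-01-01', '2017-01-02', '2017-01-03'],
                   'value': [1.0, 2.0, 3.0]})
df = df.set_index('date')
print(df.values)  # the bare values can now be framed as supervised data
```

The same effect can be had at load time with read_csv(..., index_col=0).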

  55. Andy September 14, 2018 at 12:22 am #

    Hello Jason,
    nice post, I have a question regarding the train/test split in this case:
    E.g. I now take the first 80 % of the rows as training data and the rest as test data.
    Would it be considered data leakage since the last two samples in the training data contain the first values of the test set as targets (values for t, t+1)?

    • Jason Brownlee September 14, 2018 at 6:37 am #


      • Andy September 18, 2018 at 4:46 am #

        Hi Jason,

        thanks for your response, but why is that?
        Maybe I wasn’t clear, but I found what I wanted to say in a post on medium:

        See their second visualization, they call it “look ahead gap” which excludes the ground truth data of the last prediction step in the training set from the test set.

        What do you think about that? Is that common practice?

        • Jason Brownlee September 18, 2018 at 6:23 am #

          I have seen many many many papers use CV to report results for time series and they are almost always invalid.

          You can use CV, but be very careful. If results really matter, use walk-forward validation. You cannot mess it up.

          It’s like coding: you can use “goto”, but don’t.

          • Andy September 18, 2018 at 9:25 am #

            They also argue against classical CV, they actually do use walk-forward validation (I think their usage of the term “walk forward cross validation” is a little misleading).
            So yes, I am definitely using walk forward validation!

            Let me illustrate my question with a simplified example:

            If we have this time series:
            [1, 3, 4, 5, 6, 1]

            I would split the data into training set
            [1, 3, 4, 5]

            … and test set
            [ 6, 1]

            I would do this before converting it into a supervised problem.
            So if I do the conversion to a supervised problem now, I will end up with this for my training set:

            t | t+1
            1 | 3
            3 | 4
            4 | 5
            5 | NaN

            For the 4th row, I do not have a value for t+1, since it is not part of the training set. If I took the value 6 from my test set here, I would include information about the test set.
            So here I would only train up to the 3rd row, since that is the last complete entry.

            For the test I would then use this trained model to predict t+1 following the value 6.
            This leads to a gap, since I will not receive a prediction for the fourth row in this iteration (the “look ahead gap”?).

            If I were to convert the series into a supervised problem before the split, this issue (is it one?) doesn’t become as clear, but I would remove the last row of the training set in this case, since it contains the first value of my test set as a target.

            So, can I convert first and then split or do I need to split first, then convert like in the example?
            The underlying question is, if “seeing” or not “seeing” the value of following time step as a target, has an influence on the performance of the prediction in following time step?

          • Jason Brownlee September 18, 2018 at 2:23 pm #

            Sounds like you’re getting caught up.

            Focus on this: you want to test the model the way you intend to use it.

            If data is available to the final model prior to the need for a prediction, the model should/must make use of that data in order to make the best possible prediction. This is the premise of walk-forward validation.

            Concerns of train/test data only make sense at the point of a single prediction and its evaluation. It is not leakage to “see” data that was part of the test set for the prior forecast, unless you do not expect to use the model in that way. In which case, change the configuration of walk-forward validation from one-step to two-step or whatever.

            Does that help at all?

          • Avatar
            Andy September 18, 2018 at 5:54 pm #

            I was caught up and it helps to think about what will be available when making predictions.

            My problem was that I am doing a direct 3-step-ahead forecast, so there are three "dead" rows before each further prediction step, since I need 3 future values for a complete training sample (they are not really dead, since I use the entries as t+1, t+2, and t+3 at t).

            Thank you for your patience!

          • Avatar
            Jason Brownlee September 19, 2018 at 6:16 am #

            Yes, they are not dead/empty.

            They will have real prior obs at the time a prediction is being made. So train and eval your model under that assumption.
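A minimal sketch of the split-first framing discussed in this thread (toy series, illustrative names): splitting before conversion means the final training row loses its target and is dropped, exactly as Andy describes.

```python
import pandas as pd

# toy series from the discussion above
series = [1, 3, 4, 5, 6, 1]

# split the raw series first, then frame each piece as supervised pairs
train, test = series[:4], series[4:]

def to_supervised(seq):
    df = pd.DataFrame({'t': seq})
    df['t+1'] = df['t'].shift(-1)  # next value becomes the target
    return df.dropna()             # drop the final, incomplete row

print(to_supervised(train))
```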

  56. Avatar
    SRIKANTH October 31, 2018 at 9:49 pm #

    Thank you sincerely for this information. Great job! I hope for more articles from you in the future.

    I understood the concepts from these two articles:
    Convert-time-series-supervised-learning-problem-python and Time-series-forecasting-supervised-learning.

    Now I want to predict a boolean value, either TRUE or FALSE, based on either latitude and longitude or a geohash value. How can I use multivariate forecasting for this?
    I am completely new to this area; please suggest some directions and I will follow them.

    Thanks in advance. I am doing this in Python 3 on my Mac mini.

  57. Avatar
    FP November 4, 2018 at 3:02 pm #

    Hi Jason,

    How can I apply the lag only to variable 1 in a multivariate time series? In other words, I have 5 variables, but would like to lag only variable 1.

    • Avatar
      Jason Brownlee November 5, 2018 at 6:09 am #

      One approach is to use the lag obs from one variable as features to a ML model.

      Another approach is to have a multi-headed model, one for the time series and one for the static variables.
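A sketch of the first approach, lagging only one column of a multivariate frame (the column names and values here are made up):

```python
import pandas as pd

df = pd.DataFrame({'var1': [10, 20, 30, 40],
                   'var2': [1, 2, 3, 4]})

# add lagged copies of var1 only; the other variables stay unlagged
for lag in (1, 2):
    df['var1(t-%d)' % lag] = df['var1'].shift(lag)

df = df.dropna()  # remove rows without a full set of lags
print(df)
```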

  58. Avatar
    Babak December 18, 2018 at 5:27 am #


    I guess the following line in the code samples above:

    n_vars = 1 if type(data) is list else data.shape[1]

    should be rewritten as:

    n_vars = 1 if type(data) is list else data.shape[0]

    • Avatar
      Babak December 18, 2018 at 6:00 am #

      OK I see, it is actually correct the way it is with data.shape[1], but if you pass a numpy array, then its rank should be 2, not 1.

      So this doesn’t work (the program will crash):

      values = array([x for x in range(10)])

      But this one does:

      values = array([x for x in range(10)]).reshape([10, 1])

      Sorry for confusion.

    • Avatar
      Jason Brownlee December 18, 2018 at 6:05 am #

      No, shape[1] refers to columns in a 2d array.
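To see why, compare the shapes of a 1D and a 2D NumPy array (a quick check, not code from the post):

```python
from numpy import array

values_1d = array([x for x in range(10)])
print(values_1d.shape)   # (10,) - one dimension, so shape[1] raises IndexError

values_2d = values_1d.reshape((10, 1))
print(values_2d.shape)   # (10, 1) - shape[1] is the number of columns
```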

  59. Avatar
    mk December 21, 2018 at 3:58 pm #

    One-step univariate forecasting uses the previous time step (t-1) as the input variable to forecast the current time step (t).
    If we don't know t-1, we cannot forecast the current time step (t).
    E.g. 1: given 1, 2, 3, 4, 5, 6, there is no 7; how do we forecast the 9?

    Random missing values:
    E.g. 2: given 1, 3, 4, 6, there are no 2 and 5; how do we forecast the 7?


  60. Avatar
    Dazhi December 25, 2018 at 12:40 am #

    I have one question about multivariate multi-step forecasting. For example, another air pollution forecast (not the one your tutorial showed), with 9 features in total. I want the output to be just the air pollution. Using 3 time steps of history to predict the next 3 time steps, so:
    train_X and test_X are: ‘var1(t-3)’, ‘var2(t-3)’, ‘var3(t-3)’, ‘var4(t-3)’, ‘var5(t-3)’,
    ‘var6(t-3)’, ‘var7(t-3)’, ‘var8(t-3)’, ‘var9(t-3)’, ‘var1(t-2)’,
    ‘var2(t-2)’, ‘var3(t-2)’, ‘var4(t-2)’, ‘var5(t-2)’, ‘var6(t-2)’,
    ‘var7(t-2)’, ‘var8(t-2)’, ‘var9(t-2)’, ‘var1(t-1)’, ‘var2(t-1)’,
    ‘var3(t-1)’, ‘var4(t-1)’, ‘var5(t-1)’, ‘var6(t-1)’, ‘var7(t-1)’,
    ‘var8(t-1)’, ‘var9(t-1)’,
    train_y and test_y are: ‘var1(t)’, ‘var1(t+1)’, ‘var1(t+2)’ (I dropped the columns I did not want).
    I used MinMaxScaler for normalization. If the output were one step, I could easily invert the value. However, I have three output values and I don't know how to inverse the scaling with 3 outputs. Could you please give me some advice? Thank you very much!

    • Avatar
      Jason Brownlee December 25, 2018 at 7:24 am #

      Perhaps use the function to get the closest match, then modify the list of columns to match your requirements.
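One common workaround for the multi-output inverse-scaling question above (a sketch, not code from the tutorial): fit a separate MinMaxScaler on just the target column, so the 3-step predictions can be inverted directly.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy stand-in for the pollution (target) column only
target = np.arange(10.0, 20.0).reshape(-1, 1)

y_scaler = MinMaxScaler()
scaled = y_scaler.fit_transform(target)

# pretend this is one 3-step-ahead prediction in the scaled space
yhat = scaled[:3].reshape(1, 3)

# the scaler was fit on 1 feature, so flatten to (n, 1) before inverting,
# then restore the (samples, 3) shape
inv_yhat = y_scaler.inverse_transform(yhat.reshape(-1, 1)).reshape(yhat.shape)
print(inv_yhat)  # [[10. 11. 12.]]
```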

  61. Avatar
    Prajwal Shrestha December 28, 2018 at 12:02 am #

    Hi! I'm a novice at best at this, and am trying to create a forecasting model. I have no idea what to do with the "date" variable in my dataset. Should I just remove it and add a row index variable instead for the purpose of modeling?

    • Avatar
      Jason Brownlee December 28, 2018 at 5:58 am #

      Discard the date and model the data, assuming a consistent interval between observations.

  62. Avatar
    Prajwal Shrestha December 28, 2018 at 12:24 am #

    One more question, how do I export the new dataframe with t+1 and t-1 variables to a csv file?
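Exporting the reframed DataFrame is a one-liner with pandas; a minimal sketch (the column names here assume a single shift in each direction):

```python
import pandas as pd

df = pd.DataFrame({'t': [2.0, 3.0, 4.0]})
df['t-1'] = df['t'].shift(1)
df['t+1'] = df['t'].shift(-1)

# write to disk; omit index=False to keep the row index in the file
df.to_csv('reframed.csv', index=False)
```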

  63. Avatar
    Rajesh January 5, 2019 at 7:50 am #

    Hello Jason,

    what you’re doing for machine learning should earn you a Nobel peace prize. I constantly refer to multiple entries on your website and slowly expand my understanding, getting more knowledgeable and confident day-by-day. I’m learning a ton, but there is still a lot to learn. My goal is to get good at this within the next 5 months, and then unleash machine learning on a million projects I have in mind. You’re enabling this in clear and concise ways. Thank you.

    • Avatar
      Jason Brownlee January 6, 2019 at 10:13 am #

      Thanks, I’m very happy that the tutorials are helpful!

  64. Avatar
    Jayashree January 11, 2019 at 4:13 pm #

    Hi Jason,
    Thanks for the nice tutorial 🙂 I am working on a prototype of a student evaluation system where I have the scores of students for term 1 and term 2 for the past 5 years, along with 3 dependent features. I need to predict a student's score from the second year onward up to one year in the future. I need your guidance on how to create a model that takes whatever data is available from the past to predict the current score.


  65. Avatar
    Leen January 22, 2019 at 4:46 am #

    Hi Jason,

    I have my data in a time series format (t-1, t, t+1), where the days (the time component in my data) are chronological (one following the other). However, in my project I’m required to subset this one data frame into 12 sub data frames (according to certain filtering criteria – I am filtering by some specific column values), and after I do the filtering and come up with these 12 data frames, I am required to do forecasting on each one separately.

    My question is: the time component in each of these 12 data frames is not chronological anymore (days are not following each other. Example: the first row’s date is 10-10-2015, the second row’s date is 20-10-2015 or so), is that okay? and will it create problems in forecasting later on ? If it will, what shall I do in this case?

    I’ll highly appreciate your help. Thanks in advance.

    • Avatar
      Jason Brownlee January 22, 2019 at 6:28 am #

      I’m not sure I follow, sorry.

      As long as the model is fit and makes predictions with input-output pairs that are contiguous, it should be okay.

  66. Avatar
    daniele January 25, 2019 at 12:51 pm #

    Hi Jason, what better way to split the data set into training and testing?

    • Avatar
      Jason Brownlee January 26, 2019 at 6:07 am #

      It depends on your data, specifically how much you have and what quality. Perhaps test different sized splits and evaluate models to see what looks stable.
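Whatever split size is chosen, for time series the split must preserve temporal order, e.g. the first 80% for training (a generic sketch):

```python
def ordered_split(values, train_frac=0.8):
    # no shuffling: earlier observations train, later ones test
    split = int(len(values) * train_frac)
    return values[:split], values[split:]

train, test = ordered_split(list(range(10)))
print(train, test)  # [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```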

  67. Avatar
    Sk January 28, 2019 at 10:32 am #

    Hi Jason,

    Say I have a classification problem. I have 100 users and I have their sensing data for 60 days. The sensing data is aggregated over each day. For each user, I have, say, 2 features. I am trying to perform binary classification: I ask the users to choose a number at the start, either 0 or 1, and I am trying to classify each user into one of those classes based on their 60 days of sensing data.

    So I got the idea that I could convert it to a supervised problem like you suggested in following way:

    day 60 feat 1, day 60 feat 2, day 59 feat 1, day 59 feat 2.. day 1 feat 1, day 1 feat 2, LABEL

    Is this how I should be converting my dataset? But that would mean that I’ll only have 100 unique rows to train on, right?

    So far, I was thinking I could structure the problem like this, but I wonder if I’m violating the independent assumption of supervised learning. For each user, I have their record for each day as a row and the same label they selected at the start as a label column.

    Example: for User 1:

    date, feat 1, feat 2, label
    day 1, 1, 2, 1
    day 2, 2, 1, 1
    day 60, 1, 2, 1

    This way I’d have 100×60 records to train on.

    My question is: Is the first way I framed the data correct and the second way incorrect? If that is the case, then I’d have only 100 records to train on (one for each user) and that’d mean that I cannot use deep learning models for that. In such a case, what traditional ml approach can you recommend that I can try looking into? Any help is appreciated.

    Thank you so much!

  68. Avatar
    Daniel January 31, 2019 at 4:49 am #

    Hi Jason, thanks a lot for the article!

    I have two questions:

    1) In "n_vars = 1 if type(data) is list else data.shape[1]", should n_vars not be the length of the collections in the list, like "n_vars = len(transitions[0]-1) if type(transitions) is list else transitions.shape[1]"?

    2) In "for i in range(0, n_out): cols.append(df.shift(-i))", should it not be "df.shift(-i+1))"?


    • Avatar
      Jason Brownlee January 31, 2019 at 5:37 am #

      Why do you suggest these changes, what issues do they fix exactly?

  69. Avatar
    Mike February 11, 2019 at 9:12 am #

    Hi Jason
    Your posts are awesome. They have saved me a ton of time on understanding time series forecasting. Thanks a lot.

    I have tested all types of time series forecasting using your code (multi-step, multi-variate, etc.) and it works fine, but I have a problem getting the actual predictions back from the model.

    For instance, on the pollution data, trying the approach in the wonderful post at:
    I am looking to predict three time steps ahead (t, t+1 and t+2) of not one but two features (the observations labeled var1(t), var2(t), var1(t+1), var2(t+1), var1(t+2), var2(t+2)). I arrange everything (the dimensions and so on) to fit the structure of what I am looking for; for instance I use:
    reframed = series_to_supervised(scaled, 3, 3)
    and I drop the columns related to features other than the two that I want to predict (I do this for all three time steps ahead).

    But after the model is fit (which also looks really fine), when I try the command:
    yhat = model.predict(test_X)
    I figure out that the yhat number of columns is always 1, which is weird since it is expected to be 6 (the predicted values for var1(t), var2(t), var1(t+1), var2(t+1), var1(t+2), var2(t+2)).
    Am I missing something here?

    • Avatar
      Jason Brownlee February 11, 2019 at 2:09 pm #

      The shape of the input for one sample when calling predict() must match the expected shape of the input when training the model.

      This means, the same number of timesteps and features.

      Perhaps this will help:

      • Avatar
        Mike February 11, 2019 at 9:11 pm #

        Thanks for replying Jason.
        It matches that expected shape, it is in fact the same test_X used for validation when fitting the model.
        The reframed data looks like this:
        var1(t-3) var2(t-3) var3(t-3) var4(t-3) var5(t-3) var6(t-3) \
        3 0.129779 0.352941 0.245902 0.527273 0.666667 0.002290
        4 0.148893 0.367647 0.245902 0.527273 0.666667 0.003811
        5 0.159960 0.426471 0.229508 0.545454 0.666667 0.005332
        6 0.182093 0.485294 0.229508 0.563637 0.666667 0.008391
        7 0.138833 0.485294 0.229508 0.563637 0.666667 0.009912

        var7(t-3) var8(t-3) var1(t-2) var2(t-2) … var5(t-1) \
        3 0.000000 0.0 0.148893 0.367647 … 0.666667
        4 0.000000 0.0 0.159960 0.426471 … 0.666667
        5 0.000000 0.0 0.182093 0.485294 … 0.666667
        6 0.037037 0.0 0.138833 0.485294 … 0.666667
        7 0.074074 0.0 0.109658 0.485294 … 0.666667

        var6(t-1) var7(t-1) var8(t-1) var1(t) var2(t) var1(t+1) var2(t+1) \
        3 0.005332 0.000000 0.0 0.182093 0.485294 0.138833 0.485294
        4 0.008391 0.037037 0.0 0.138833 0.485294 0.109658 0.485294
        5 0.009912 0.074074 0.0 0.109658 0.485294 0.105634 0.485294
        6 0.011433 0.111111 0.0 0.105634 0.485294 0.124748 0.485294
        7 0.014492 0.148148 0.0 0.124748 0.485294 0.120724 0.470588

        var1(t+2) var2(t+2)
        3 0.109658 0.485294
        4 0.105634 0.485294
        5 0.124748 0.485294
        6 0.120724 0.470588
        7 0.132797 0.485294

        [5 rows x 30 columns]

        So, the shape of the train and test data prior to fitting the model are like this:
        print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
        (8760, 1, 24) (8760,) (35035, 1, 24) (35035,)

        And after fitting the model I am calling test_X with shape (35035, 1, 24) to be predicted but still it gives me yhat with shape (35035, 1).

        What’s wrong?

        • Avatar
          Mike February 11, 2019 at 9:34 pm #

          I just realized the second dimension (8760, 1, 24) and (35035, 1, 24) should be set to 3. But doing this and fitting the model again does not change the dimension of yhat.

          • Avatar
            Jason Brownlee February 12, 2019 at 8:01 am #

            No, exactly. Input shape and output shape are unrelated.

        • Avatar
          Mike February 11, 2019 at 10:51 pm #

          Is it because of the dense layer? Since maybe dense(1) at the end of the sequential model returns dimension 1?
          However, if I change this 1 in the dense layer to, for example, 3, I get an error about mismatched dimensions. So confused right now, and a lot of searching did not help.

          • Avatar
            Jason Brownlee February 12, 2019 at 8:03 am #


            If you change the model to predict a vector with 3 values per sample, you must change the training data to have 3 values per sample in y.

        • Avatar
          Jason Brownlee February 12, 2019 at 7:59 am #

          That suggests one output value for each input sample, exactly how you have designed the model.

          Perhaps I don’t understand your intent?

  70. Avatar
    Mike February 12, 2019 at 2:25 am #

    I figured it out, Jason. It was because of the dense layer: I had to set it to Dense(6), along with some modifications to the shape of the train and test data.
    Thanks again
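The fix Mike describes boils down to matching shapes: a Dense(6) output layer needs a target row of six values per sample, i.e. var1(t), var2(t), var1(t+1), var2(t+1), var1(t+2), var2(t+2). A shape-only sketch with toy arrays (no Keras required; sizes are illustrative):

```python
import numpy as np

n_samples, n_timesteps, n_features = 100, 3, 8

# inputs for an LSTM layer: (samples, timesteps, features)
train_X = np.zeros((n_samples, n_timesteps, n_features))

# targets for a Dense(6) output layer: six values per sample
train_y = np.zeros((n_samples, 6))

print(train_X.shape, train_y.shape)  # (100, 3, 8) (100, 6)
```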

  71. Avatar
    Areej February 14, 2019 at 7:05 am #


    How can I introduce a sequence of images into the forecasting problem?


  72. Avatar
    Alex Torex February 23, 2019 at 4:38 am #

    Hi, my problem is how to classify time series.

    I have a series of user events which happen at various distances in time and I want to classify the type of user by the events he is producing.

    How do I pass the data to the LSTM?

  73. Avatar
    Henry Lawson March 19, 2019 at 6:06 am #

    Hi Jason,

    What approach would you recommend for a modelling problem where I have many time series (in this case each for a different patient), but the measurements are not taken at regular intervals. In fact, there are many concurrent time series, all with different, varying, sample times. The data is sometimes very sparse (no measurements for days) and sometimes very dense (many measurements in one hour), so I don’t want to lose information by interpolating.

    I want to train a model on a subset of the patients and use it to predict for other patients. How would you recommend formatting the data?

    • Avatar
      Jason Brownlee March 19, 2019 at 9:04 am #

      I would recommend testing a range of different framings of the problem.

      For example, try normalizing the intervals, try zero padding, try persistence padding, try ignoring the intervals, try limiting history to a fixed window, try only modeling a fixed window, etc. See what is learnable/useful.

  74. Avatar
    Gauranga Das March 24, 2019 at 4:53 pm #

    I am trying to get a line plot of the final results but, I get a bar graph instead.


    # plot history
    plt.plot(inv_yhat, label=’forecast’)

    • Avatar
      Jason Brownlee March 25, 2019 at 6:42 am #

      Perhaps confirm that the data is an array of floats?

  75. Avatar
    josh malina March 30, 2019 at 5:22 am #

    For your function "series_to_supervised", I like the dropna feature, but I could imagine that a user would not want to drop rows in the middle of her dataset that just happen to be NaNs. Instead, they might just like to chop the first few and the last few.

    • Avatar
      Jason Brownlee March 30, 2019 at 6:33 am #

      Yes, it is better to take control over the data preparation process and specialize it for your dataset.

  76. Avatar
    josh malina April 2, 2019 at 5:44 am #

    What's an easy way to convert this to the input required by a Keras LSTM? I would assume we would use multi-indices.

  77. Avatar
    Juan_A April 5, 2019 at 1:34 am #

    Hi Jason,

    Thanks for your incredible tutorials.

    I have a doubt about your function “series_to_supervised”…in my case, I have a time series of speed data with an extra column of “datetime” information in which the traffic measures were taken.

    I want to keep the “datetime” also as an input feature within your function, but without adding lagged variables for it. Any idea about how to proceed?


    • Avatar
      Jason Brownlee April 5, 2019 at 6:18 am #

      You may have to write some custom code, which will require a little design/trial and error/ and unit tests.

  78. Avatar
    Emin April 8, 2019 at 2:58 am #

    Hello Jason,

    Thank you for the post. I have a question regarding classification task. Let’s say we take this series_to_supervised approach. But in that case, our goal is to predict our original values at time ‘t’, correct? What if the target is binary, let’s say? Thank you.

    • Avatar
      Emin April 8, 2019 at 3:04 am #

      I would also like to add that, if this approach is taken, our original target that contains 0/1 classes will have more samples than the transformed data frame (due to the dropna command).

    • Avatar
      Jason Brownlee April 8, 2019 at 5:57 am #

      It sounds like you are describing time series classification.

      I have an example here that may help as a starting point:

      • Avatar
        Emin April 8, 2019 at 8:02 am #

        Well, while I agree with you that this is a classification problem (see my first post), if there is a need to predict a class (0/1) in advance, this becomes a prediction problem, correct?

        I went through the linked URL before, and if I remember correctly, you have a couple of time-series classification examples but none of the "let's try to predict class 0 one day in advance" kind.

        • Avatar
          Jason Brownlee April 8, 2019 at 1:56 pm #

          Yes, whether the classification is for the current or future time step, is just a matter of framing – e.g. little difference.

          What problem are you having exactly?

          • Avatar
            Emin April 8, 2019 at 11:46 pm #

            My problem is the following . We have a set of targets associated with every time step. For example:

            X y
            0.2 0
            0.5 1
            0.7 1
            0.8 0

            We perform shift once and we get:

            X (t-1) X(t) y
            NAN (or 0) 0.2 0
            0.2 0.5 1
            0.5 0.7 1
            0.7 0.8 0


            Now, my problem is the following: If I use X(t-1) as my input, my target sample will be larger than X(t-1). So in this case, how can I relate/connect my lag timesteps (X(t-1), X(t-2) and so on) to my classes?

          • Avatar
            Jason Brownlee April 9, 2019 at 6:26 am #

            Each row connects the inputs and outputs.

          • Avatar
            Emin April 9, 2019 at 7:29 am #

            So, I guess the correct thing would be to apply series_to_supervised to my target class as well, and use y (t) as the target variable, while y (t-1), y (t-2),…,y(t-k) will be used as my inputs, along with X(t-1), X(t-2), …,X(t-k).

            Does my approach sound like a correct one? Thank you.

          • Avatar
            Jason Brownlee April 9, 2019 at 2:37 pm #

            Perhaps try it and see if it is viable.

        • Avatar
          aimendezl August 28, 2019 at 9:51 pm #

            Hi Emin, I'm working on a similar problem. I have a daily series, and for each day I have a label (0,1), and I would like to use an LSTM to predict the label for the next day. Did you manage to solve the issue of how to transform the input variables? Could you shed some light on the matter if you did?
          Thanks in advance!

  79. Avatar
    Emin April 8, 2019 at 11:54 pm #

    Btw, it will be larger, because in several cases (at least in mine) adding 0 is not a correct thing to do, as 0 represents something else related to the domain. So, if we have NaN values and we drop them, our input will be of size (3,) and our target will be of size (4,).

  80. Avatar
    Emin April 13, 2019 at 3:12 am #

    Jason, I have one more question. In the case of framing the problem with lags, our time series has to be in descending order, correct? So, 11:00 AM (row 1), 10:00 AM (row 2). In that case, when we shift down by 1, we essentially try to predict the original value at time t.

    To demo with an example:
    12:00 PM 0.5
    11:00 AM 1.2
    10:00 AM 0.3

    Once shift is applied with lag_step=1 =>

    12:00 PM NaN
    11:00 AM 0.5
    10:00 AM 1.2
    9:00 AM 0.3

    By doing so, we essentially shift all the values into the past and try to minimize the error between the real observed values at the original time (t) and the modeled values at the same original time (t).

    Unfortunately, all the examples I have found so far, model the time in ascending order:

    It will be great if you can clarify.

    Thank you.

    • Avatar
      Jason Brownlee April 13, 2019 at 6:40 am #

      Yes, data must be temporally ordered in a time series.

      Typically ascending order, oldest to newest in time.

      • Avatar
        Emin April 13, 2019 at 9:23 am #

        Well, in that case, if we take (t-1) as our input, don't we face a problem with data leakage? We are pushing the value one step down, which essentially means that we push our data into the future.

        • Avatar
          Jason Brownlee April 13, 2019 at 1:48 pm #

          No. It comes down to how you choose to define the problem – what you are testing and how.

  81. Avatar
    Ahmed May 4, 2019 at 1:45 am #


    Do we still have to worry about removing trend and seasonality if we use this approach?

    • Avatar
      Jason Brownlee May 4, 2019 at 7:11 am #

      It depends on the model used.

      Typically, removing trend and seasonality prior to the transform makes the problem simpler to model.

  82. Avatar
    Alla May 27, 2019 at 5:21 am #

    I tried this simple code to do the example in your book (I just started reading it today):

    x, y = list(), list()
    d = np.arange(12)
    for i in range(len(d)):
        if i + 3 <= 11:
            x.append(d[i:i+3])
            y.append(d[i+3])

    which gives:

    x = [array([0, 1, 2]), array([1, 2, 3]), array([2, 3, 4]), array([3, 4, 5]), array([4, 5, 6]), array([5, 6, 7]), array([6, 7, 8]), array([7, 8, 9]), array([8, 9, 10])]

    y = [3, 4, 5, 6, 7, 8, 9, 10, 11]

  83. Avatar
    Alla Abdella May 27, 2019 at 6:38 am #

    Thank you Jason. I really enjoyed this tutorial.

  84. Avatar
    Carlos June 6, 2019 at 12:20 am #

    Hi Jason,
    Thanks for the interesting article.
    I’m having trouble with univariate forecasts that have multiple observations. I don’t see that case here and don’t see how to translate it to TimeSeriesGenerator.
    Say, I have series of 100 observations and I want to train a RNN to predict the next observation after 50. This is easily done with TimeSeriesGenerator.
    But now, what if I have 1000 of these series of 100 observations? Concatenating them is no good, as observation 101 has no relation with observation 99. Can I still use TimeSeriesGenerator in this scenario?

  85. Avatar
    Ala June 22, 2019 at 7:25 am #

    Hi Jason. Your tutorials are amazing.
    1- Do you have any book about time series prediction using LSTMs? (I know you have a book about time series prediction with classical methods.)

    2- Do you know how to predict multiple values for a multivariate time series? Do you have any tutorial, or can you tell me what setting I should change in the Keras LSTM? Here is the detailed description:

    Assume my multivariate series, after converting it to the supervised learning problem, is (the first 4 numbers are input and the last 4 are output):

    1 2 3 4 5 6 7 8

    5 6 7 8 9 10 11 12

    9 10 11 12 13 14 15 16

    I would like to learn all 4 of the output values. Do you know if this is possible with your methods, and what I need to change so it is multivariate with more than one output, where my target is 4-dimensional? I would appreciate it if you could help.

  86. Avatar
    Mat June 22, 2019 at 1:48 pm #

    Hi Jason. One question that I could not find any tutorial about on your website.

    Assume I have multiple, very similar time series generated for the same event, say a dataset you use a lot, like the shampoo dataset. Assume that instead of one dataset I have 100 separate datasets (consisting of time series from 1993-2000). How can I use all of these for training? I know how to train an LSTM on one, but how do I keep training on all of them? I cannot concatenate the time series or sort them by value, since they might be multivariate.

    I will be grateful if you can help me solve this problem

  87. Avatar
    George July 29, 2019 at 4:38 pm #

    How can I use AR model coefficients to generate feature inputs for a machine learning algorithm such as SVM?

    I can already find the AR coefficients with this code:

    model = AR(train)
    model_fit = model.fit()
    print('Lag: %s' % model_fit.k_ar)
    print('Coefficients: %s' % model_fit.params)

    but I don't know how to use the coefficients as feature inputs for the SVM.
    Thank you.

    • Avatar
      Jason Brownlee July 30, 2019 at 6:02 am #

      Sorry, i’m not familiar with the approach that you’re describing.

  88. Avatar
    George July 30, 2019 at 1:54 am #

    How can I use AR model coefficients to generate feature inputs for machine learning such as SVM?

    I can find the AR coefficients with this code:

    model = AR(train)
    model_fit = model.fit()
    print('Lag: %s' % model_fit.k_ar)
    print('Coefficients: %s' % model_fit.params)

    How do I create feature inputs (model_fit.params) for the SVM from the AR model coefficients? Thank you.

    • Avatar
      Jason Brownlee July 30, 2019 at 6:18 am #

      Sorry, i don’t have a tutorial on this topic, I cannot give you good off the cuff advice.

  89. Avatar
    Soumya Sourav August 3, 2019 at 5:13 am #

    Hello Jason. An excellent article to get started with. I just have two questions. Let's say I have three variables: time, current, and voltage, with voltage as the target variable to be predicted. If I transform my dataset according to the techniques mentioned, train the model, and then get new data (validation) with just time and current as input, how would I transform the data so that my model is able to predict the output voltage over the period of time given in the validation set?
    Also, how would you suggest moving ahead if I have panel data in my dataset?

    • Avatar
      Jason Brownlee August 3, 2019 at 8:16 am #

      Design the model based on one sample, e.g. what data you want to provide as input and what you need as output.

      Then prepare data to match, fit and model and evaluate.

  90. Avatar
    aimendezl August 28, 2019 at 10:11 pm #

    Hi Jason, thank you for all your tutorials. These have been a tremendous help so far. I am now working on a classification problem with time series and I would like to know if you could help me with some advice.
    I have a daily time series, and for each day there's a label (0,1). I would like to reframe the problem as binary classification (predict the label of the next day), but I am having trouble with the format of the input variables and the target variable.

    Do you think the function series_to_supervised() can be applied here? And if so, how could I do it?

    I am thinking the following as a very naive experiment to check if this works. Label the days as (0) if the value of the variable dropped or (1) if it goes up (This can of course be done by a simple if statement but it’s for this example’s sake only)

    My data then would look something like this:


    date var1 label
    1 10 nan
    2 20 1
    3 9 0
    4 8 0

    if I apply series_to_supervised(data,1,1) the data set would look something like


    date var1(t-1) var1(t) label(t)
    1 10 20 1
    2 20 9 0
    3 9 8 0

    then to define my input/output pairs:

    X = supervised.drop(columns=['label(t)'])
    y = supervised['label(t)']

    Is this approach correct? It seems to me the label should be very easy to learn for any ANN, but I'm not sure this format of the input/output is correct.

    I would appreciate any advice on this topic. And thanks again for the amazing books and tutorials.

    • Avatar
      Jason Brownlee August 29, 2019 at 6:10 am #

      It may be, only you will know for sure – it’s your data.
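A sketch of the framing aimendezl describes, with made-up values; note that dropping a column needs columns= (or axis=1):

```python
import pandas as pd

data = pd.DataFrame({'var1': [10, 20, 9, 8],
                     'label': [None, 1, 0, 0]})

# lag the value so yesterday's observation predicts today's label
data['var1(t-1)'] = data['var1'].shift(1)
supervised = data.dropna().rename(columns={'var1': 'var1(t)',
                                           'label': 'label(t)'})

X = supervised.drop(columns=['label(t)'])
y = supervised['label(t)']
print(list(X.columns))  # ['var1(t)', 'var1(t-1)']
print(list(y))          # [1, 0, 0]
```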

  91. Avatar
    sara September 8, 2019 at 4:44 am #

    Hi Jason ,
    Thanks for the tutorial.
    I have a financial time series that I turned into a supervised learning problem. I want to predict the t+1 value using the previous 60 days. My question is about the prediction method that I have to use, after trying xgboost and SVR. Is what I am trying to do OK?

    • Avatar
      Jason Brownlee September 8, 2019 at 5:21 am #

      I recommend testing a suite of methods in order to discover what works best for your specific dataset.

  92. Avatar
    Jaydeep September 12, 2019 at 6:24 pm #

    Hi Jason,

    As always, a great article; thanks a ton. I just want to ask how the approach of converting a time series problem to a supervised learning problem compares against treating it as a time series problem. I think that with advanced techniques such as LSTMs, we should no longer need to convert a time series problem into a supervised learning problem.

  93. Avatar
    buttonpol September 30, 2019 at 2:01 am #

    Great article, it helped me a lot.

    I’ve added a little modification (I had a similar but very simpler function).

    I wanted to use the data at times t-n1, t-n2, …, t-nx to forecast values at (let's say) time t+n7, and I wanted to avoid an extra step for deleting the intermediate t+n1, t+n2, …, t+n6 columns. So I added a window parameter to do that.

    It is not a great improvement, but it helped me.

    For future versions, it would be nice to handle the actual column names of the input data, for automated post-processing (e.g. if the original column name is "date", the output would be "date t-1").
    Later I will give it a try.
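
    A minimal sketch of what buttonpol describes: a single target column offset by a chosen horizon (skipping the intermediate steps), plus real column names in the output. The function name and the `horizon` parameter are my own illustration, not buttonpol's actual code:

```python
from pandas import DataFrame, concat

def series_to_supervised_named(df, n_in=1, horizon=1):
    # Lag inputs t-n_in .. t-1, plus ONE target column at t+horizon,
    # skipping the intermediate t+1 .. t+horizon-1 columns entirely.
    cols, names = [], []
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += ['%s t-%d' % (c, i) for c in df.columns]
    cols.append(df.shift(-horizon))
    names += ['%s t+%d' % (c, horizon) for c in df.columns]
    agg = concat(cols, axis=1)
    agg.columns = names
    return agg.dropna()

# e.g. two lags of a column named "temp" to predict "temp" 3 steps ahead
demo = series_to_supervised_named(DataFrame({'temp': range(10)}), n_in=2, horizon=3)
print(demo.columns.tolist())  # ['temp t-2', 'temp t-1', 'temp t+3']
```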

  94. Avatar
    Eli September 30, 2019 at 3:24 pm #

    Hey Jason,

    I've been going through all your LSTM examples and I've been using your 'series_to_supervised' function for preprocessing. For the life of me, though, I can't wrap my head around what I'm supposed to do when I need to reshape data for n_in and n_out values greater than 1.

    For instance I used the function with 50 timesteps (n_in = 50) to predict another 50 values in the future (n_out = 50). I have over 300,000 individual samples with 19 observations each, so the output of the function, ‘agg’, is understandably quite large.

    Onto reshaping. My intuition tells me to input the tuple (train_X.shape[0], 50, train_X.shape[1]) for reshaping my training X data. This throws an error. What am I doing wrong here? Is your 'series_to_supervised' function the right way to approach this in the first place? I believe it is, but I'm at a loss for a workaround.

    I’m particularly interested in how to frame this for both predicting 50 single observations, or even 50 sequences of all observations, assuming my terminology is correct. And for reference I’ve looked through just about every LSTM post you have, but perhaps I overlooked something. Regardless, your work has been incredibly helpful thus far–thank you for all the hard work you’ve put into your site!

  95. Avatar
    Karan Sehgal October 5, 2019 at 2:19 am #

    Hi Jason,

    Up to how many lags should we create the new variables? And do we only need to create lags of the target variable?

    • Avatar
      Jason Brownlee October 6, 2019 at 8:12 am #

      Perhaps test a few different values and see what works best for your model and dataset.

  96. Avatar
    Karan Sehgal October 7, 2019 at 5:24 am #

    Hi Jason,

    Thank you so much, Jason, for your inputs on the above query. I have a few more queries, please.

    1) Should we make the data stationary before using supervised machine learning for time series analysis?

    2) Does introducing a lag (t-1) or (t-2) in the dataset make the dataset stationary or not?

    3) Machine learning models cannot simply 'understand' temporal data, so we must explicitly create time-based features. Here we create the temporal features day of week, calendar date, month and year from the date field using the substr function. These new features can be helpful in identifying cyclical sales patterns. Is this true?

    4) There are a wide variety of statistical features we could create here. Here we will create three new features using the unit_sales column: lag_1 (1-day lag), avg_3 (3-day rolling mean) and avg_7 (7-day rolling mean). Can we also create these kinds of features?

    • Avatar
      Jason Brownlee October 7, 2019 at 8:32 am #

      Yes, try making the series stationary prior to the transform, and compare results to ensure it adds skill to the model.

      Adding lag variables does not make it stationary, you must use differencing, seasonal adjustment, and perhaps a power transform.

      New features can be helpful, try it and see for your dataset.
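
      To make the distinction concrete (with illustrative numbers only): differencing changes the values of the series, while a lag column merely copies past values and leaves the series as non-stationary as before. The lag_1/avg_3/avg_7 features from question 4 are one-liners in pandas:

```python
import pandas as pd

s = pd.Series([10, 12, 15, 14, 18, 21, 20, 25], name='unit_sales')

# Differencing transforms the level of the series (and can remove a trend);
# a lag column only copies past values and does not change stationarity.
diffed = s.diff()  # value(t) - value(t-1); the first entry is NaN

# Statistical features like those described in question 4:
features = pd.DataFrame({
    'lag_1': s.shift(1),           # 1-step lag
    'avg_3': s.rolling(3).mean(),  # 3-step rolling mean
    'avg_7': s.rolling(7).mean(),  # 7-step rolling mean
    'y': s,
}).dropna()                        # drop rows without a full history
```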

  97. Avatar
    Karan Sehgal October 7, 2019 at 5:39 am #

    Hi Jason,

    I forgot to post some more queries on the above post.

    1) Do lags always need to be created for the Y variable (i.e. the dependent variable) only?

    2) Apart from introducing lags into the dataset, do we also need to add a differenced column for the Y variable to make the data stationary?

    Karan Sehgal

    • Avatar
      Jason Brownlee October 7, 2019 at 8:33 am #

      No, but lag obs from the target are often very useful.

      Adding variables does not make a series stationary.

      • Avatar
        Karan Sehgal October 7, 2019 at 9:59 pm #

        Hello Jason,

        Thanks for your inputs.

        1) So I need to create both kinds of variables, i.e. lags and differencing, because differencing will help make the dataset stationary and lags can be very useful for the model, and then we need to consider both of these variables together in the algorithm?

        2) Above you also mentioned seasonal adjustment and perhaps a power transform. How can we apply a seasonal adjustment or power transform? Do we need to create separate columns for the seasonally adjusted and transformed data, or can we create both in one column?


  98. Avatar
    Radhouane Baba October 23, 2019 at 10:29 pm #

    Hi Jason,

    I am trying to forecast one day into the future (1 day = 144 output values) based on data from the last week (144*7 = 1008 timesteps).
    In each timestep I have 10 variables, such as temperature, etc.

    That means I have a very big input vector (144*7*10 = 10,080).

    Isn't that too much data at once? (10,080 input values -> 144 output values)

  99. Avatar
    Andrei February 20, 2020 at 11:30 pm #

    This is different from what you do in the LSTM tutorial:

    Shouldn’t the output from here be usable in the LSTM model?

    I’m a bit confused, can you explain the difference?

  100. Avatar
    Manju February 25, 2020 at 10:17 pm #

    Can we use this approach for weather forecasting?

    • Avatar
      Jason Brownlee February 26, 2020 at 8:19 am #

      Yes, but physics models perform better than ml models for weather forecasting.

  101. Avatar
    manjunath February 26, 2020 at 5:55 am #

    Are a sliding window and a rolling window the same thing?
    Are a sliding window and an expanding window the same thing?
    Please share some information; it will help a lot.


  102. Avatar
    David March 24, 2020 at 1:56 am #

    Hi, if we use this method of converting a time series to a supervised learning problem, do we still need to use the timestamp as a feature, or could we simply delete that column?
    Since we only establish a window, we do not need the timestamp, right? If I am wrong, is there a correct way to prepare this feature?

    Another question I want to ask: is there a problem if the data does not have the same time interval (samples at completely random, though sequential, intervals)?

    Thanks, I really enjoy your tutorials.

  103. Avatar
    uthman April 3, 2020 at 6:20 pm #

    One Hell of a blog you have Jason.
    Much appreciated

  104. Avatar
    Matt May 11, 2020 at 3:36 am #

    Hi Jason,

    Great and informative post. I’m in the midst of trying to tackle my first time series forecasting problem and had a couple questions for you. So once the series_to_supervised() function has been called and the data has been appropriately transformed, can we use this new dataframe with traditional ML algorithms (i.e. RandomForestRegressor)? Or do we then set up an ARIMA model or some other time series specific forecasting model? And then how do we make the transition to forecasting multiple variables or steps ahead in time? Sorry if these are bad questions, I’m a newbie here.


  105. Avatar
    Srinivas Gummadi May 11, 2020 at 12:43 pm #

    Great blog Jason. Learnt a lot and code works as advertised by you!!!

    Four questions: and one suggestion.

    1) If I have daily forecast data and my target is to forecast the upcoming 6 weeks, I thought this is the way to do it:
    a) resample the original data on a weekly basis and sum the sales - so my sample size becomes 1/7
    b) convert to supervised data as suggested here
    c) then model it with my data - train/test and save the model
    Question: how do I use this model to forecast the future 6 weeks, given that the whole model depends on previous events? Will the learning be good?

    2) If I have other potentially influencing items like sales promotions, rebates, marketing events, etc., how does this model comprehend them? The ARIMA model takes only a sequence of diffs. Can you please provide guidance on how to incorporate additional features into the model?

    3) If there are extraordinary items like outliers - way off the pattern - can I prune the data and remove them?

    4) Do we ever scale the data?

    I wish you used DataFrames more often than Series, as more and more users work with that data structure and the code could be reused verbatim - just a suggestion.


  106. Avatar
    Giselle May 18, 2020 at 3:29 am #

    Hi Jason,

    As far as I understand, if I would like to predict the next 24 hours based on 5 variables' previous observations (a multivariate, multi-step case), I would use series_to_supervised(dataframe, 1, 24) and then drop the unwanted columns. Am I mistaken?

    Otherwise, if I would like to use the previous month's values, should I set n_in=720 instead of n_in=1, or use something else?

    Thank you

    • Avatar
      Jason Brownlee May 18, 2020 at 6:21 am #

      Looks like you are predicting 24 steps based on one step. That is aggressive. You might want to provide more context to the model when making a prediction – e.g. give more history/inputs.

      • Avatar
        Giselle May 18, 2020 at 9:58 am #

        Exactly, it won't be accurate. To fix that, I would like to include more previous values, but I don't know how to do that using the same function.

        • Avatar
          Jason Brownlee May 18, 2020 at 1:25 pm #

          Change the "n_in" argument to a larger number to have more lag observations as input.
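
          To make this answer concrete, here is a compact restatement of the tutorial's series_to_supervised() function (univariate case) called with a larger n_in; the data is a toy sequence, not Giselle's 720-in/24-out sizes:

```python
from pandas import DataFrame, concat

# Compact restatement of the tutorial's series_to_supervised() (univariate).
def series_to_supervised(data, n_in=1, n_out=1):
    df = DataFrame(data)
    cols = [df.shift(i) for i in range(n_in, 0, -1)]   # inputs: t-n_in .. t-1
    cols += [df.shift(-i) for i in range(0, n_out)]    # outputs: t .. t+n_out-1
    return concat(cols, axis=1).dropna()

# A larger n_in gives the model more history per sample, e.g. 3 lag
# observations as input to predict the next 2 steps.
agg = series_to_supervised(list(range(10)), n_in=3, n_out=2)
print(agg.shape)  # (6, 5): 3 inputs + 2 outputs per row
```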

  107. Avatar
    md faiz June 4, 2020 at 1:35 am #

    First of all i thank you for writing such a detailed articles.

    my question is:

    I have 3 independent variables and 1 dependent variable (hourly call volume data).
    I have created 24 lags for the dependent variable and trained the model.

    But now I have to forecast 1 week into the future (24 rows per day * 7 days = 168).
    My problem is that I do not see how to create lags for the forecast period, as I have trained the model up to, for example, today. Now I have to forecast until 10th June, 2020.

    How do I create lags for the dependent variable when I have no future data for call volume, which is exactly what I have to forecast?

    In the forecast data, I have created future values for the 3 independent variables because they were all derived from the date (such as DayOfWeek), but how do I create the lags?

    I have been stuck here for 4 days... please help.

  108. Avatar
    Morgan June 12, 2020 at 3:03 am #

    Hi Jason, I adore your articles! Dense and fluid.

    Is series_to_supervised() synonymous with 'windowizing' data in preparation for the LSTM?

  109. Avatar
    Eric June 27, 2020 at 8:30 am #

    What will happen when we make predictions? Does the model expect a sequence of data as well since that is what it was trained on?

    • Avatar
      Jason Brownlee June 27, 2020 at 2:08 pm #

      New examples are provided in the same form, e.g. past observations to predict future observations.

  110. Avatar
    Rajiv July 7, 2020 at 5:59 pm #

    Hi Mr.Jason,

    My question is regarding the “Multi-Step or Sequence Forecasting” section.
    Suppose we have to forecast next “m” time steps, with some particular lag, say “n” in my sequence, I will have my dataset like:

    v1(t-n), …, v1(t-3), v1(t-2), v1(t-1), v1(t), v1(t+1), v1(t+2), v1(t+3), v1(t+4), v1(t+5), …, v1(t+m).

    Now. Let me know which case is relevant to my problem:

    1) Will I have 'm' separate models, one for each time step ahead, i.e.:
    v1(t+1) = f(v1(t), v1(t-1), v1(t-2), ….. v1(t-n))
    v1(t+2) = f(v1(t), v1(t-1), v1(t-2), ….. v1(t-n))
    v1(t+3) = f(v1(t), v1(t-1), v1(t-2), ….. v1(t-n))
    v1(t+m) = f(v1(t), v1(t-1), v1(t-2), ….. v1(t-n))

    2) Or should I feed each previous predicted value back in to predict the next value as a sequence:
    v1(t+1) = f(v1(t), v1(t-1), v1(t-2), ….. v1(t-n))
    v1(t+2) = f(v1(t+1)hat,v1(t), v1(t-1), v1(t-2), ….. v1(t-n-1))
    v1(t+3) = f(v1(t+2)hat,v1(t+1)hat,v1(t), v1(t-1), v1(t-2), ….. v1(t-n-2))
    v1(t+m) = f(v1(t+m-1)hat,v1(t+m-2)hat,v1(t+m-3)hat,….v1(t+m-n)hat)

    I am confused in choosing between the approaches:

    * In case-1 the number of models will be a big number and I feel the model maintenance part might be problematic if “m” is a big number…!!!
    * In case-2, I will have one single model, But as I go down the time line, my predictions will more depend on the previous predicted values which will make my inputs more fragile…

    Thanks In advance and Thank you for the wonderful post.

    • Avatar
      Jason Brownlee July 8, 2020 at 6:27 am #

      You can do either, the choice is yours, or whichever results in the best performance/best meets project requirements.
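
      For reference, Rajiv's case 2 (the recursive strategy) can be sketched as a short loop; `predict_one` stands in for any fitted model's predict call, and the linear-trend lambda below is purely illustrative, not a trained network:

```python
def recursive_forecast(predict_one, history, n_lags, m):
    # Case 2: a single model; each prediction is appended to the history
    # and fed back in as a lag input for the next step.
    history = list(history)
    preds = []
    for _ in range(m):
        yhat = predict_one(history[-n_lags:])
        preds.append(yhat)
        history.append(yhat)
    return preds

# Stand-in for a trained model: extrapolates a linear trend from 2 lags.
model = lambda x: 2 * x[-1] - x[-2]
print(recursive_forecast(model, [1, 2, 3, 4], n_lags=2, m=3))  # [5, 6, 7]
```

      Case 1 (the direct strategy) would instead fit m separate models, one per horizon, each trained on the same lag inputs.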

  111. Avatar
    Anon August 1, 2020 at 1:11 pm #

    Thanks for the article Jason, pleasure to read. I have a question: how is this different from making a window the “normal” way, over rows? Is there any benefit to doing it this way, or can I just as easily have M timesteps for my X, and 1-N timesteps for my Y, both having 1 timestep per row?

    • Avatar
      Jason Brownlee August 1, 2020 at 1:29 pm #

      You’re welcome.

      Sorry, I don’t follow what you’re comparing this approach to. Perhaps you can elaborate.

      This is a sliding window approach generally used when converting a time series to a supervised learning problem.

      • Avatar
        Anon August 3, 2020 at 1:12 am #

        As I understand it, the sliding window approach in this article has the window progress across columns:

        Where X(t+1) is the target output, for as many features that are there in the original dataset.

        On the other hand, what if the window progresses across the rows, like so:

        So that if you have a window of size 3, you’d have N-3 sliding windows composed of (X(t), X(t+1), X(t+2)) to predict X(t+3), all the way up to (X(t+N-3), X(t+N-2), X(t+N-1)) to predict X(t+N)?

        Is there any difference in the two strategies? The reason I ask this is because I was training an LSTM using the first method (sliding windows across columns) and kept encountering out-of-memory errors when using pandas’ shift() for a large window, but it was a relatively trivial matter when sliding across rows without using shift() as no preprocessing was necessary.
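
        For what it's worth, the row-wise windowing Anon describes can be built with NumPy strided views (NumPy 1.20+), which avoids the per-column copies that can make shift() blow up in memory for large windows; a sketch:

```python
import numpy as np

series = np.arange(10)  # X(0) .. X(9)
w = 3                   # window size

# sliding_window_view returns a VIEW into the original array: no copies,
# so no shift()-style memory blow-up for large windows.
windows = np.lib.stride_tricks.sliding_window_view(series, w)
X, y = windows[:-1], series[w:]   # (X(t), X(t+1), X(t+2)) -> X(t+3)
print(X[0], y[0])  # [0 1 2] 3
```

        The two strategies frame the same samples; the view-based form just builds them without materializing shifted copies.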


  112. Avatar
    kourosh August 10, 2020 at 5:46 pm #

    Hi, Mr Brownlee

    I have 276 files (from 1993-2015) with dimensions of 213*276. Each file belongs to one month. I want to predict the last year (the last 12 months).

    How can I split the data, and what are the time steps?

    • Avatar
      kourosh August 10, 2020 at 5:57 pm #

      I mean, should I reshape each file into a column and concatenate all years into one long column? Because the data are like pixel values (like a heat map) and this is confusing to me.

      • Avatar
        Jason Brownlee August 11, 2020 at 6:30 am #

        Perhaps. Try it and see if it makes sense for your dataset.

    • Avatar
      Jason Brownlee August 11, 2020 at 6:28 am #

      Perhaps load all data into memory at once or all into one file, then perhaps fit a model on all but the last year, then use walk-forward validation on the data in the last year month by month.

  113. Avatar
    Mohammad August 21, 2020 at 3:16 am #

    Hey Jason,
    Thanks for the wonderful materials on your website.

    The series_to_supervised() function is great and straightforward to use. However, I see that when we use shift and transform the data, the data type changes from integer to float. Isn't it better to adjust the dtype of the columns in the function as well?

    Thanks again.

    • Avatar
      Jason Brownlee August 21, 2020 at 6:35 am #

      You’re welcome!

      We should be working with floats – almost all ML algorithms expect floats, except maybe trees.

  114. Avatar
    Darrell K August 28, 2020 at 12:34 pm #

    Hi Jason, thank you for the awesome article. I’ve been looking for something like this for quite some time.

    In my case, I’m trying to build an AR model with exogenous inputs, so I need to train the net with all the training data (X and y) and then make forecasts based on X only. My data is highly nonlinear and I want to make forecasts for many steps ahead.

    Reading other sites, I understood that I would need to refer to the last N values in the X and y_hat vectors (not the whole, lagged series), slightly different from what you taught here. I would appreciate very much if you could offer any hint on how to achieve this.

    Thank you in advance.

  115. Avatar
    James September 7, 2020 at 8:35 am #

    I’m wondering why we need to do all this in the first place. Why can’t we just treat the Date column as any old feature, X1, and then predict y?

    For ex: What’s wrong with just plugging in features X1, X2, X3 (‘Date’, ‘Temp’, ‘Region’), and training a random forest to predict y (‘Rainfall’)? If recent data is more important, shouldn’t the model be able to figure this out?

    • Avatar
      Jason Brownlee September 7, 2020 at 8:38 am #

      Great question!

      You can if you like. Try it and compare.

      We do this because recent lag observations are typically highly predictive of future observations. This representation is called an autoregressive model/representation.

  116. Avatar
    Senthilkumar Radhakrishnan September 7, 2020 at 2:14 pm #

    Hi Jason,

    I have training data with 143 instances and test data with 30 instances, with additional features like temperature and others, and my target in the training data.
    So if I create lag values, they should be beyond the 30th lag, right? Because we will not have lag 2 for all those 30 instances, as we have to forecast all 30 of them.
    In this case, what is the best solution, and how can I add lag components to get a result?

    • Avatar
      Jason Brownlee September 8, 2020 at 6:44 am #

      Generally, it is a good idea to either use an ACF/PACF plot to choose an appropriate lag or grid search different values and discover what results in the best performance.
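
      In practice, statsmodels provides plot_acf()/plot_pacf() for this; a dependency-free sketch of what an ACF computes, on a toy series (the high values at small lags suggest those lags carry predictive information):

```python
import numpy as np

def acf(x, max_lag):
    # Sample autocorrelation of the series with itself at lags 1 .. max_lag.
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.dot(x, x)
    return [np.dot(x[:-k], x[k:]) / var for k in range(1, max_lag + 1)]

# A smooth, strongly autocorrelated toy series: recent lags dominate.
series = np.sin(np.linspace(0, 8 * np.pi, 200))
correlations = acf(series, 3)
```

      Lags whose autocorrelation is near zero add little as inputs; the grid-search alternative simply tries several n_in values and keeps the best-scoring one.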

  117. Avatar
    Jeff Hernandez September 16, 2020 at 8:09 am #

    Great tutorial! This open source tool is helpful for labeling time series data.

  118. Avatar
    Carlos September 24, 2020 at 10:18 am #

    Hi Jason,

    Any idea how to use this series_to_supervised function with a PySpark DataFrame, or how I can handle that?

    Thanks a lot!


  119. Avatar
    Joao Silva September 26, 2020 at 1:12 am #

    Hi Jason,

    I have a question related to dividing the time series into X (input) and y (output).

    I've noticed that most people shift the series one step, independent of the number of steps they want to forecast (Option 1).

    Option 1
    – – – – – – –
    1 2 3 4 5 –> 6 7 8
    2 3 4 5 6 –> 7 8 9
    3 4 5 6 7 –> 8 9 10

    Wouldn't that make the model forecast the same values (7, 8 and 9)?

    And does that make Option 2 more feasible, since the model would have to predict new values every time?

    Option 2
    – – – – – – –
    1 2 3 4 5 –> 6 7 8
    9 10 11 12 13 –> 14 15 16
    17 18 19 20 21 –> 22 23 24

    Thank you for helping the ML community!

    • Avatar
      Jason Brownlee September 26, 2020 at 6:21 am #

      You are training a function to map examples of inputs to examples of outputs. We have to create these examples from the sequence.

      You can design these mapping examples any way you like. Perhaps one-step prediction is not appropriate for your specific dataset. That's fine, change it.

  120. Avatar
    May October 7, 2020 at 8:07 pm #

    Hi Jason, thanks for this tutorial. I noticed in the comments that many are using LSTMs for time series prediction. Do you know if it is also possible to use other models, such as logistic regression, for this problem? Say, for example, we would like to predict whether the energy consumption of a home appliance will be low or high for the next hour, based on its energy consumption in the last 2 to 3 hours? Thank you.

    • Avatar
      Jason Brownlee October 8, 2020 at 8:30 am #

      Yes, you can use any algorithm you like once the data is prepared.

  121. Avatar
    Yannick October 8, 2020 at 1:08 am #

    Hello Jason,
    I have short series of 5 data values in the range 0 to 2, of the form, let's say, [1,2,1,0,0], that represent the last five results of a given soccer team, 1 being a win, 2 a draw and 0 a loss. I want to predict the probability that the next value is a 1; once I predict the next one, the first value (here 1) is dropped to form a new series.
    I have a good intuition on what the next one should be, but would like to create a model.
    Please keep in mind I'm a beginner, so I just guess I should use time series.
    I would like to weight each value, since I know each one depends on many parameters like the ranking of the opponent, the number of shots the team made in the last game, and so on.
    I am trying to wrap my mind around it but it's hard. Thanks for what you do!

  122. Avatar
    Andreas October 13, 2020 at 1:00 am #

    “Again, depending on the specifics of the problem, the division of columns into X and Y components can be chosen arbitrarily, such as if the current observation of var1 was also provided as input and only var2 was to be predicted.”

    In the case where we want to predict var2(t) and var1(t) is also available:

    var1(t-2), var2(t-2), var1(t-1), var2(t-1), var1(t), var2(t)

    LSTM networks want a 3D input. What shape should we give to train_X?

    Do I have to give it shape [X, 1, 5]?

    In the case where we had an even number of columns for train_X (when we don't have var1(t)), we had to shape it like this,


    But now it's not an even number and I cannot shape it like that, because we have 5 features for train_X.

    Is the only solution to give shape [X, 1, 5]?

    *X = length of dataset

  123. Avatar
    Rob October 16, 2020 at 7:09 am #

    Wow, your webpage is such a great help, thanks!
    I'm having difficulty understanding something that I assume is basic.
    It centers on the question of the rolling window. In other tutorials you show how to implement LSTM models for time series forecasting, often with examples in which the next timestep is forecast from the previous one. In the example above you talk about rolling windows and the possibility of forecasting based on n previous values. As the optimal window is very dataset dependent, I'm still wondering whether an LSTM could ever function well with a window size of 1. How would the network learn about different movement patterns? Or am I missing something?
    Maybe you could help me understand or point me to one of your resources.
    Thanks a lot!

    • Avatar
      Jason Brownlee October 16, 2020 at 8:10 am #


      It could, but it would put pressure on the model to capture the required knowledge in its internal state variables, and would require that you not reset state across samples.

  124. Avatar
    Derni November 13, 2020 at 7:01 pm #

    Hi, thank you for this tutorial. I want to make a time series prediction related to power consumption and would probably like to adopt this concept in my project and my thesis. Do you have any reference for this method, so that I can cite it from your paper (if you have one)?

    thanks in advance!

  125. Avatar
    ATW December 8, 2020 at 1:40 am #

    About the input_shape.

    When I want to use 20 input steps and 50 output steps, I specify this in the series_to_supervised() method and reshape my data/columns accordingly. But the model doesn't accept any input_shape other than steps of 1. I was wondering how this is possible; if you want to predict 50 steps into the future, wouldn't that mean the model needs to know this?

    Kind Regards.

    • Avatar
      Jason Brownlee December 8, 2020 at 7:45 am #

      Yes, there are many ways to make a multi-step forecast, perhaps start here:

      • Avatar
        ATW December 8, 2020 at 11:41 pm #

        Thanks, I have read that, but how does it pertain to LSTMs? I've made a time series LSTM model following it, and the MSE metrics all show a value under 1. I don't have the y for the future steps, so I don't think I can predict the y for the 4 strategies mentioned above, as that requires running the predicted data through the model over and over.

        To be clear, my y is an int value, and X consists of 4 columns of data extracted from the date, including the y value. My timesteps are inconsistent, which is why I chose to extract min/hour/weekday values from the date to connect to the y value. I could leave the y out of X, but I think that would mean it's not an LSTM model anymore? In that case the predicted values would just show an average over the week.

        I appreciate any input.

        • Avatar
          ATW December 9, 2020 at 12:08 am #

          To be even clearer, I am trying to forecast how many people will be present at a given time in the future. I have tried binary classification and an LSTM so far, but I just can't seem to output any real "future" values.

        • Avatar
          Jason Brownlee December 9, 2020 at 6:23 am #

          You can use an LSTM as a direct model, a recursive model, and more. That is how it is related.

          You will have y (target values) for the future in your training dataset, or you will be able to construct them. If this is not the case – then you cannot train any model as data is required to train a model.

  126. Avatar
    engimp March 11, 2021 at 9:18 pm #

    wonderful extensive elaboration

  127. Avatar
    mhr April 4, 2021 at 2:05 pm #

    Hi, you are using a -1 lag for the data and then splitting the train and test sets, right? So I can say that some behavior of the test data is present in the training data for supervised learning. You can't predict the unknown future, right? Like, if I have time series data up until today, we cannot predict the future. Am I correct?

    • Avatar
      Jason Brownlee April 5, 2021 at 6:09 am #

      If we use past observations to predict future observations, then it is supervised learning.

  128. Avatar
    Abdulafeez April 28, 2021 at 12:20 pm #

    Hi, I am currently working on anomaly detection on a data stream, and I need your assistance on how to use a sliding-window-based approach to detect outliers in multivariate time series data. How can I compute the correlation coefficient for multivariate data? Thanks.

  129. Avatar
    sajad April 30, 2021 at 4:01 am #

    Hi, thanks for the good explanations.
    In my work, the input of the LSTM is a sequence of images from a video, and the output is one image.
    In fact, I want to segment the objects in the image in order to detect them.
    I don't know what shape the input must be. For example, for ten images of size 240*320, should I define one matrix of size 10*76800 as the LSTM input, writing the features of each image in one row of the matrix?
    Thanks in advance.
    Best regards

  130. Avatar
    Gunjan Gautam July 15, 2021 at 10:08 am #

    Hi Jason

    Many thanks for this wonderful explanation. However I have a question:

    How do we deal with categorical variables where the prediction is to be made for each category?

    For example, there are three columns, "week_date", "products" and "sales", and the prediction is to be made for weekly sales for each product. Do you think one-hot encoding will be helpful in this scenario? Or is there another way to transform the data for supervised learning?

    • Avatar
      Jason Brownlee July 16, 2021 at 5:20 am #

      You’re welcome!

      I suspect you would specify the date rather than predict it. You can also develop one model per product, so that it does not need to be specified either. Or the product would be provided as input.

  131. Avatar
    Will Ciog September 1, 2021 at 4:27 pm #

    Hi Jason,

    Thanks a lot for this tutorial, this is very helpful.

    I am currently working on fish farm data. I aim to test whether I can use a model to predict the concentration of ammonia in the water in the tanks of a farm, using as inputs information such as:

    Each line has the following columns:

    – Site Name (there are 3 main sites with data in farm) for 5 consecutive days
    – date
    – water temperature
    – Dissolved O2
    – Water pH
    – other, easy to measure water parameters

    – then, after the 5 days,
    – date (day after the 5th)
    – ammonia mg/l

    And predict the ammonia concentration in the tank on the 6th day.

    I tried to implement your approach, but I only managed to build a model for one day (the first day). I deleted all other days for each tank.

    In this case, what is the best solution? Can you please show me the way? How can I use all 5 consecutive days for each row? I couldn't do that.

    Kind regards

    • Avatar
      Jason Brownlee September 2, 2021 at 5:07 am #

      That sounds like a great project.

      You can use the code above to prepare your data directly.

      Perhaps you can model each site separately or all together, this may help:

      I recommend experimenting with different framing of the problem in order to discover what works best, e.g. maybe you only need lag ammonia as input, maybe you only need the observations for the day or many prior days, etc.

      • Avatar
        Will Ciog September 7, 2021 at 4:05 pm #

        Thank you so much for your suggestions. If I build models on one site, they may work poorly when used to predict on the other sites, so I need to use all of the sites' data. The problem is I couldn't fit my data into your tutorial.

        How can I use the data? For example, I have 10,000 rows, with temp_day_1, temp_day_2, …, temp_day_5 columns. For all features I have 5 days of consecutive data columns. In this case, should all of them be (t-1), including the 5 days of ammonia_day_1-5, and then my output (the 6th day's ammonia) be (t)?

        It was easy for one day: I would have temp_1, ammonia_1, num_fish_1, … as input and predict ammonia as output. But with the whole dataset, I couldn't implement any of your suggestions above.

        Thanks a lot

        • Avatar
          Adrian Tam September 8, 2021 at 1:28 am #

          It should be just the same structure: use the N features of day t as input to predict one particular feature of day t+1; or use one particular feature of days t, …, t+4 to predict the same feature of day t+5.

  132. Avatar
    Ace October 11, 2021 at 1:08 am #

    Does this function behave like the GenerateTimeSeries function in keras or there are differences?

    • Avatar
      Adrian Tam October 13, 2021 at 7:12 am #

      Is there a GenerateTimeSeries function in Keras?

  133. Avatar
    Alaa Abutabaq October 22, 2021 at 4:44 am #

    Great Job Jason.

    I have made only one modification, adding column names to make the supervised dataset more readable. Please check whether you find it convenient.

    • Avatar
      Adrian Tam October 26, 2021 at 11:35 pm #

      Thanks Alaa. That's useful. I've formatted your code so other people can access it more easily.

  134. Avatar
    SJ October 24, 2021 at 11:12 pm #

    Hi Jason,
    I have one question.

    What is the difference between a normal sliding window and an overlapping moving window with an overlap of 50%? Will the 50% overlap mean less data than the normal sliding window when applied to a multivariate time series dataset?

    • Avatar
      Adrian Tam October 27, 2021 at 2:07 am #

      You're right: normally a sliding window means we move the window by one time step, while the overlapping window here moves by half the window's size each time.
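
      To make the sample-count point concrete (the `stride` parameter below is my own naming for the step between windows):

```python
import numpy as np

def windows(series, size, stride):
    # stride=1 -> standard sliding window; stride=size//2 -> 50% overlap;
    # stride=size -> non-overlapping windows.
    return np.array([series[i:i + size]
                     for i in range(0, len(series) - size + 1, stride)])

s = np.arange(8)
print(len(windows(s, 4, 1)), len(windows(s, 4, 2)), len(windows(s, 4, 4)))
# 5 3 2 - larger strides yield fewer training samples from the same series
```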

  135. Avatar
    Miranda October 28, 2021 at 7:32 am #

    Hi Adrian,

    Thank you for this wonderful tutorial! I have a question regarding up-sampling and lagging. I have a time series that was sampled every 5 minutes. I realized that I don’t need that high resolution for prediction. Can I first upsample my data (for example every hour) and then perform lagging? Something like this:

    df_us = df.resample('1h').mean()
    X = df_us[cols].shift(1)[1:]
    y = df_us.iloc[1:, -1]  # output is the last column

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

    And then pass train and test data to a model for training and evaluation. Here, df is the given dataframe and cols is a list of column names in the dataframe. At each time t, my goal is to use the data at the previous instant (time=t-1) to predict the output at time t. I removed the first row of data after shifting because of the nan values. When mapping a time series prediction problem to a supervised learning problem, does resampling and then lagging the data (like what I did here) cause data leakage for prediction?

    • Avatar
      Adrian Tam October 28, 2021 at 2:02 pm #

      Resampling and then lagging is OK. The few lines of code you posted seem fine to me; I can’t see leakage.
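      The resample-then-lag step can be sketched with pandas; everything below (column names, frequencies, sizes) is invented for illustration:

      ```python
      import numpy as np
      import pandas as pd

      # Hypothetical 5-minute data with one output column ('target').
      idx = pd.date_range('2021-01-01', periods=48, freq='5min')
      df = pd.DataFrame({'x1': np.arange(48.0),
                         'target': np.arange(48.0) * 2}, index=idx)

      # 1) Resample to hourly means first...
      df_h = df.resample('1h').mean()

      # 2) ...then lag the inputs by one step, so X at time t holds values from t-1.
      X = df_h[['x1']].shift(1).iloc[1:]
      y = df_h['target'].iloc[1:]

      print(X.shape, y.shape)  # (3, 1) (3,)
      ```

      Because the lag is applied after resampling, each row of X only contains information from before the time step it predicts. Note that many practitioners also prefer a chronological train/test split (no shuffling) for time series rather than a random split.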

  136. Avatar
    Luca February 10, 2022 at 6:28 am #

    Hi Jason,

    congratulations on the article, which I find very useful.

    May I ask you why in your codes you apply MinMaxScaler() to the entire dataset (before splitting it into train and test)?

    I have learned that it is best practice to first split the dataset in train & test, then use .fit_transform() on the training set and the .transform() on the test set. This is done to avoid biasing the standardisation of the training set with data that should not be available in that moment (i.e. the test set).

    Thanks for your clarification.
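
    The practice Luca describes can be sketched as follows (synthetic data, chosen only to show the pattern of fitting the scaler on the training split alone):

    ```python
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    data = np.arange(20.0).reshape(-1, 1)

    # Chronological split: the test set is the "future".
    split = int(len(data) * 0.7)
    train, test = data[:split], data[split:]

    scaler = MinMaxScaler()
    train_scaled = scaler.fit_transform(train)  # learn min/max from train only
    test_scaled = scaler.transform(test)        # apply the same scaling to test

    print(train_scaled.min(), train_scaled.max())  # 0.0 1.0
    print(test_scaled.max() > 1.0)                 # True: test lies outside train's range
    ```

    Fitting on the full dataset would let the test set’s min/max leak into the scaling, which is exactly the bias Luca’s comment warns about.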

  137. Avatar
    Zhongxian Men February 14, 2022 at 9:04 am #

    Hi Jason,

    Thanks for the tutorial. But the series_to_supervised() function has a problem. The first line

    n_vars = 1 if type(data) is list else data.shape[1]

    does not work when data is a one-dimensional np.array, because data.shape[1] raises an IndexError.
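
    One way to make that first line robust to lists, 1-D arrays, 2-D arrays, and DataFrames is to normalize the input with np.asarray first; a minimal sketch (the helper name n_columns is invented):

    ```python
    import numpy as np
    import pandas as pd

    def n_columns(data):
        """Number of variables in list / 1-D array / 2-D array / DataFrame input."""
        arr = np.asarray(data)
        return 1 if arr.ndim == 1 else arr.shape[1]

    print(n_columns([1, 2, 3]))                           # 1
    print(n_columns(np.array([1, 2, 3])))                 # 1
    print(n_columns(np.array([[1, 2], [3, 4]])))          # 2
    print(n_columns(pd.DataFrame({'a': [1], 'b': [2]})))  # 2
    ```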


    • Avatar
      James Carmichael February 14, 2022 at 12:13 pm #

      Thank you for the feedback Zhongxian! My team will review and make any necessary corrections to the code listings.

  138. Avatar
    Thierno Diallo February 18, 2022 at 10:15 pm #

    Hi James,

    Thank you for your blog posts. I find them very useful.

    Currently I am working on a kind of time series problem, I have to predict the number of people present in a school canteen. So I have two questions:

    1. First, I would like to solve the problem as a supervised learning problem. In addition to the number of people present (the target value), I have the number of reservations for the day, known in advance, and calendar features (day, month, season). I would like to know if it is possible to create another input variable from the number of people present (by using previous time steps of the number of people present as input variables and the next time step as the output variable).

    2. Second, I would like to know if it is possible to do a grid search with a data split adapted to time series (if there is a solution).

    Thanks for your answer,


    • Avatar
      James Carmichael February 20, 2022 at 12:41 pm #

      Hi Thierno…Please narrow your post to a single question so that I may better assist you.

  139. Avatar
    Thierno Diallo February 21, 2022 at 10:43 pm #

    Hi James,

    Thank you for your response. In fact, I would like to know if it’s possible to do a grid search with a split adapted to time series data.
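
    For reference, scikit-learn’s TimeSeriesSplit can be passed to GridSearchCV for exactly this; a minimal sketch on made-up data (model, parameter grid, and sizes are invented):

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

    # Hypothetical supervised data already derived from a time series.
    # Rows must stay in time order for this split to be valid.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X[:, 0] + rng.normal(scale=0.1, size=100)

    # TimeSeriesSplit always trains on the past and validates on the future.
    cv = TimeSeriesSplit(n_splits=5)
    grid = GridSearchCV(RandomForestRegressor(random_state=0),
                        param_grid={'n_estimators': [10, 50]},
                        cv=cv, scoring='neg_mean_absolute_error')
    grid.fit(X, y)
    print(grid.best_params_)
    ```

    Unlike ordinary k-fold cross-validation, each validation fold here comes strictly after its training fold in time, so the grid search never evaluates a model on data from its own past.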



  140. Avatar
    Mika Sie March 8, 2022 at 12:23 am #

    Hi Jason! Thank you very much for this tutorial. It has helped me quite a bit. But right now I’m struggling with a project I am doing. I have found this dataset on Kaggle:

    I want to use this dataset to make a prediction model using Keras and an LSTM.
    My plan is to train the model using only the meteorological data for now. Later I will try to use the soil data as well, but I have to get my model working first. Right now I am struggling mostly with my data processing. The set is already cleaned up, but I don’t know how to process my data.

    Let me explain the data first:
    -There are 20 columns with meteorological data about each day and its location.
    -Once every week a drought score is assigned. This score ranges from 0 (no drought) to 5 (severe drought), so we have 6 labels.
    -But not every day has a drought score, so each week has only one score assigned.

    Now I am struggling with how to process this data. I first tried to use the ‘create data_set’ function, but I see that it creates a dataset with only one column. This made sense to me, since that function isn’t suitable for multivariate time series.

    I have now also read this article, but I was wondering if I could use the functions you defined here. If not, how would I have to do this? I was doubting whether I could use your functions, since I have 7 rows of input for just one label, but I don’t have labels for the other 6 rows of input.

    I would love to hear from you, since I have been stuck on this for a few days!


    • Avatar
      James Carmichael March 9, 2022 at 6:00 am #

      Hi Mika…Please narrow you query to a specific question so that I may better assist you.

  141. Avatar
    Keeva March 20, 2022 at 9:36 pm #

    Hi Jason,

    Thank you for this great tutorial. In multi-step forecasting, why did you shift the input features and keep the output feature? Would it be correct to shift the output (t+1) and keep the input data instead?

  142. Avatar
    masoud naghshbandi May 30, 2022 at 6:17 pm #

    I have a time series dataset, like car sales, and I have this function to make the data supervised:

    def to_supervised(train, n_input, n_out):
        # flatten data
        for _ in range(len(data)):
            in_end = in_start + n_input
            out_end = in_end + n_out
            if out_end <= len(data):
                x_input = data[in_start:in_end, 0]
        return array(X), array(y)

    The dataset looks like (date issued, value):
    1960-01 65

    My question is: what will happen with this function?

  143. Avatar
    Lahan Olawale August 24, 2022 at 3:37 pm #

    Hello Jason,

    Thanks a bunch for these tutorials.
    I had a problem trying to replicate the prediction accuracy (approximately) of an LSTM model that takes n time steps of input data, using the same LSTM model but with the data transformed to a supervised learning structure as you have done in this tutorial.
    I’m new to this; however, I imagine that the same LSTM model would not be able to perform as well on this transformed data.
    Are there any particular model types/structures/architectures that can replicate the way LSTMs learn sequential time-stepped data but with supervised data instead, essentially treating the newly created columns of data the way an LSTM treats time steps?
    Thanks for your time.

  144. Avatar
    Omar November 14, 2022 at 9:41 pm #

    Hi Jason,

    I have been following your work since 2019 and it helped me graduate from my undergrad. Now, I am pursuing my master’s degree in Business Analytics. Thank you very much.

    I have a question about time series forecasting. I saw other people’s code and sometimes they not only used lag values but also rolling averages (such as 7 steps or 28 steps). The questions are:

    1. Do you have any scholarly references for this kind of approach (either only the lag values or also with the rolling averages)?

    2. Do we also need to check the stationarity of the data before performing machine learning? Basically, do we need to do checks like in ARIMA?

    Thank you very much.
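
    For context, lag and rolling-average features like those Omar describes can be built with the pandas shift() and rolling() functions; a minimal sketch on synthetic data (the series and feature names are invented):

    ```python
    import numpy as np
    import pandas as pd

    s = pd.Series(np.arange(40.0), name='sales')

    features = pd.DataFrame({
        'lag_1': s.shift(1),                          # value one step back
        'lag_7': s.shift(7),                          # value seven steps back
        'roll_mean_7': s.shift(1).rolling(7).mean(),  # mean of the previous 7 values
    })
    target = s

    # Drop the leading rows with NaNs created by shifting/rolling.
    dataset = pd.concat([features, target], axis=1).dropna()
    print(dataset.shape)  # (33, 4)
    ```

    Note the shift(1) before rolling(): it keeps the current value out of its own rolling mean, so each feature only uses information available before the step being predicted.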