Feature Selection for Time Series Forecasting with Python

The use of machine learning methods on time series data requires feature engineering.

A univariate time series dataset consists only of a sequence of observations. These must be transformed into input and output variables in order to use supervised learning algorithms.

The problem is that there is little limit to the type and number of features you can engineer for a time series problem. Classical time series analysis tools like the correlogram can help with evaluating lag variables, but do not directly help when selecting other types of features, such as those derived from the timestamps (year, month or day) and moving statistics, like a moving average.

In this tutorial, you will discover how you can use the machine learning tools of feature importance and feature selection when working with time series data.

After completing this tutorial, you will know:

  • How to create and interpret a correlogram of lagged observations.
  • How to calculate and interpret feature importance scores for time series features.
  • How to perform feature selection on time series input variables.

Let’s get started.

Tutorial Overview

This tutorial is broken down into the following 6 steps:

  1. Monthly Car Sales Dataset: Describes the dataset we will be working with.
  2. Make Stationary: Describes how to make the dataset stationary for analysis and forecasting.
  3. Autocorrelation Plot: Describes how to create a correlogram of the time series data.
  4. Time Series to Supervised Learning: Describes how to reframe the series into input and output variables for supervised learning.
  5. Feature Importance of Lag Variables: Describes how to calculate and review feature importance scores for time series data.
  6. Feature Selection of Lag Variables: Describes how to calculate and review feature selection results for time series data.

Let’s start off by looking at a standard time series dataset.


Monthly Car Sales Dataset

In this tutorial, we will use the Monthly Car Sales dataset.

This dataset describes the number of car sales in Quebec, Canada between 1960 and 1968.

The units are a count of the number of sales and there are 108 observations. The source data is credited to Abraham and Ledolter (1983).

You can download the dataset from DataMarket.

Download the dataset and save it into your current working directory with the filename "car-sales.csv". Note, you may need to delete the footer information from the file.

The code below loads the dataset as a Pandas Series object.

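A minimal sketch of this step, assuming the file has a single header line with the month in the first column and the sales count in the second (the tutorial-era idiom was Series.from_csv, which has since been removed from Pandas, so read_csv is used here):

from pandas import read_csv
from matplotlib import pyplot

# load the monthly car sales data as a Series
series = read_csv('car-sales.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
print(series.head())
# line plot of the raw observations
series.plot()
pyplot.show()
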
Running the example prints the first 5 rows of data.

A line plot of the data is also provided.

Monthly Car Sales Dataset Line Plot

Make Stationary

We can see a clear seasonality and increasing trend in the data.

The trend and seasonality are fixed components that can be added to any prediction we make. They are useful, but need to be removed in order to explore any other systematic signals that can help make predictions.

A time series with seasonality and trend removed is called stationary.

To remove the seasonality, we can take the seasonal difference, resulting in a so-called seasonally adjusted time series.

The period of the seasonality appears to be one year (12 months). The code below calculates the seasonally adjusted time series and saves it to the file "seasonally-adjusted.csv".

Because the first 12 months of data have no prior data to be differenced against, they must be discarded.

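A sketch of this step, assuming the same loading idiom as above; the 12-month difference subtracts from each observation the value from the same month one year earlier:

from pandas import read_csv, Series
from matplotlib import pyplot

series = read_csv('car-sales.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
X = series.values
# seasonal difference: current observation minus the observation 12 months before
diff = [X[i] - X[i - 12] for i in range(12, len(X))]
# save the differenced series without a header, then plot it
Series(diff).to_csv('seasonally-adjusted.csv', header=False)
pyplot.plot(diff)
pyplot.show()
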
The stationary data is stored in "seasonally-adjusted.csv". A line plot of the differenced data is created.

Seasonally Differenced Monthly Car Sales Dataset Line Plot

The plot suggests that the seasonality and trend information was removed by differencing.

Autocorrelation Plot

Traditionally, time series features are selected based on their correlation with the output variable.

This is called autocorrelation. It can be assessed by plotting an autocorrelation plot, also called a correlogram, which shows the correlation of each lagged observation with the current observation and whether or not that correlation is statistically significant.

For example, the code below plots the correlogram for all lag variables in the Monthly Car Sales dataset.

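A sketch, assuming the differenced series was saved without a header as above:

from pandas import read_csv
from matplotlib import pyplot
from statsmodels.graphics.tsaplots import plot_acf

# load the seasonally adjusted series and plot its correlogram
series = read_csv('seasonally-adjusted.csv', header=None, index_col=0).squeeze('columns')
plot_acf(series)
pyplot.show()
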
Running the example creates a correlogram, or Autocorrelation Function (ACF) plot, of the data.

The plot shows lag values along the x-axis and correlation on the y-axis, between -1 and 1 for negatively and positively correlated lags respectively.

Dots outside the blue shaded confidence region indicate statistical significance. The correlation of 1 at lag 0 indicates the 100% positive correlation of an observation with itself.

The plot shows significant lag values at 1, 2, 12, and 17 months.

Correlogram of the Monthly Car Sales Dataset

This analysis provides a good baseline for comparison.

Time Series to Supervised Learning

We can convert the univariate Monthly Car Sales dataset into a supervised learning problem by taking lag observations (e.g. t-1) as inputs and using the current observation (t) as the output variable.

We can do this in Pandas using the shift function to create new columns of shifted observations.

The example below creates a new time series with 12 months of lag values to predict the current observation.

The shift of 12 months means that the first 12 rows of data are unusable as they contain NaN values.

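A sketch of the reframing, assuming the seasonally adjusted series saved in the previous step:

from pandas import read_csv, DataFrame

series = read_csv('seasonally-adjusted.csv', header=None, index_col=0).squeeze('columns')
# build one column per lag, from t-12 down to t-1, plus the current observation t
dataframe = DataFrame()
for i in range(12, 0, -1):
    dataframe['t-' + str(i)] = series.shift(i)
dataframe['t'] = series.values
print(dataframe.head(13))
# drop the first 12 rows, which contain NaN values, and save the result
dataframe = dataframe[12:]
dataframe.to_csv('lags_12months_features.csv', index=False)
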
Running the example prints the first 13 rows of data showing the unusable first 12 rows and the usable 13th row.

The first 12 rows are removed from the new dataset and the result is saved in the file "lags_12months_features.csv".

This process can be repeated with an arbitrary number of time steps, such as 6 months or 24 months, and I would recommend experimenting.

Feature Importance of Lag Variables

Ensembles of decision trees, like bagged trees, random forest, and extra trees, can be used to calculate a feature importance score.

This is commonly used in machine learning to estimate the relative usefulness of input features when developing predictive models.

We can use feature importance to help estimate the relative importance of contrived input features for time series forecasting.

This is important because we can contrive not only the lag observation features above, but also features based on the timestamp of observations, rolling statistics, and much more. Feature importance is one method to help sort out what might be more useful when modeling.

The example below loads the supervised learning view of the dataset created in the previous section, fits a random forest model (RandomForestRegressor), and summarizes the relative feature importance scores for each of the 12 lag observations.

A large-ish number of trees is used to ensure the scores are somewhat stable. Additionally, the random number seed is initialized to ensure that the same result is achieved each time the code is run.

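A sketch of this step; the choice of 500 trees and a seed of 1 are reasonable defaults here, not requirements:

from pandas import read_csv
from matplotlib import pyplot
from sklearn.ensemble import RandomForestRegressor

# load the supervised learning view of the data
dataframe = read_csv('lags_12months_features.csv', header=0)
array = dataframe.values
X = array[:, 0:-1]  # the 12 lag columns
y = array[:, -1]    # the current observation
# fit a random forest and report the importance of each lag feature
model = RandomForestRegressor(n_estimators=500, random_state=1)
model.fit(X, y)
print(model.feature_importances_)
# plot the importance scores as a bar graph
names = dataframe.columns.values[0:-1]
ticks = [i for i in range(len(names))]
pyplot.bar(ticks, model.feature_importances_)
pyplot.xticks(ticks, names)
pyplot.show()
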
Running the example first prints the importance scores of the lagged observations.

The scores are then plotted as a bar graph.

The plot shows the high relative importance of the observation at t-12 and, to a lesser degree, the importance of observations at t-2 and t-4.

It is interesting to note a difference with the outcome from the correlogram above.

Bar Graph of Feature Importance Scores on the Monthly Car Sales Dataset

This process can be repeated with different methods that can calculate importance scores, such as gradient boosting, extra trees, and bagged decision trees.

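For example, swapping in extra trees is a one-line change; a sketch reusing the X and y arrays prepared above:

from sklearn.ensemble import ExtraTreesRegressor

# same X, y as in the random forest example above
model = ExtraTreesRegressor(n_estimators=500, random_state=1)
model.fit(X, y)
print(model.feature_importances_)
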
Feature Selection of Lag Variables

We can also use feature selection to automatically identify and select those input features that are most predictive.

A popular method for feature selection is called Recursive Feature Elimination (RFE).

RFE works by creating predictive models, weighting features, and pruning those with the smallest weights, then repeating the process until the desired number of features is left.

The example below uses RFE with a random forest predictive model and sets the desired number of input features to 4.

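A sketch of the RFE step, reusing the supervised learning dataset prepared earlier:

from pandas import read_csv
from matplotlib import pyplot
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

dataframe = read_csv('lags_12months_features.csv', header=0)
array = dataframe.values
X = array[:, 0:-1]
y = array[:, -1]
# recursively eliminate features until 4 remain
rfe = RFE(RandomForestRegressor(n_estimators=500, random_state=1), n_features_to_select=4)
fit = rfe.fit(X, y)
# report the selected features by name
names = dataframe.columns.values[0:-1]
print('Selected Features:')
for i in range(len(fit.support_)):
    if fit.support_[i]:
        print(names[i])
# plot the rank of each feature (a rank of 1 means selected)
ticks = [i for i in range(len(names))]
pyplot.bar(ticks, fit.ranking_)
pyplot.xticks(ticks, names)
pyplot.show()
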
Running the example prints the names of the 4 selected features.

Unsurprisingly, the results match features that showed a high importance in the previous section.

A bar graph is also created showing the feature selection rank (smaller is better) for each input feature.

Bar Graph of Feature Selection Rank on the Monthly Car Sales Dataset

This process can be repeated with different numbers of selected features, and with models other than random forest, as in the sketch below.

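For example, a sketch selecting 6 features with gradient boosting instead; the count of 6 is arbitrary, and X, y, and names are as prepared above:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFE

# a different base model and a different number of selected features
rfe = RFE(GradientBoostingRegressor(n_estimators=500, random_state=1), n_features_to_select=6)
fit = rfe.fit(X, y)
print([name for name, selected in zip(names, fit.support_) if selected])
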
Summary

In this tutorial, you discovered how to use the tools of applied machine learning to help select features from time series data when forecasting.

Specifically, you learned:

  • How to interpret a correlogram for highly correlated lagged observations.
  • How to calculate and review feature importance scores in time series data.
  • How to use feature selection to identify the most relevant input variables in time series data.

Do you have any questions about feature selection with time series data?
Ask your questions in the comments and I will do my best to answer.


50 Responses to Feature Selection for Time Series Forecasting with Python

  1. Andrewcz March 29, 2017 at 5:33 pm #

    Hi Jason, big fan! I was wondering if you are going to do a series on multivariate time series forecasting.
    Many thanks,
    Best,
    Andrew

  2. Benson Dube April 2, 2017 at 6:13 am #

    Hello Jason,

    Many thanks for this blog. I will be so interested to see how the multivariate Time Series Forecast is dealt with.

    Keep up the good work,

    Best Regards

    Ben

    • Jason Brownlee April 2, 2017 at 6:33 am #

      Thanks Ben, I hope to cover multivariate time series soon.

  3. Kélian April 13, 2017 at 2:05 am #

    Hello Jason,

    I wondered about your choice to keep only the last 12 lags for the feature importance and feature selection study.

    Because, as I understand it, the correlogram suggests you should push the study out to lag 17 (the correlogram showed lags 1, 2, 12, and 17 are correlated with the current state).

    Am I right?

    Thanks for your work!

  4. Mehrdad May 26, 2017 at 5:18 am #

    The output of these lines
    'plot_acf(series)'
    'pyplot.show()'
    is not like yours. It just shows a straight line.
    Could you please check it?
    Thanks

    • Merlin June 1, 2017 at 8:58 pm #

      Yeah, the plot_acf thing is not working properly.

      • Jason Brownlee June 2, 2017 at 12:57 pm #

        What problem do you see exactly?

        What version of statsmodels are you using?

    • Jason Brownlee June 2, 2017 at 11:50 am #

      I can confirm the example, please check that you have all of the code and the same source data.

      • porter October 27, 2017 at 4:11 am #

        I had a similar issue. It is due to the footer: if you do not delete it from the dataset, you need to drop the last row in the series after import.

  5. Ralph Li June 30, 2017 at 6:09 pm #

    Hello Jason!

    Can you recommend some references about recursive feature selection and random forest on feature selection for time series?

    Thanks!

    • Jason Brownlee July 1, 2017 at 6:29 am #

      No. My best advice: try it, get results and use them in developing better models.

  6. Saurav Sharma July 27, 2017 at 2:38 am #

    Hi Jason!

    I am still unable to understand the importance of the lag variable.

    Is a lag applied to a feature variable to find its correlation with the target variable?

    Thanks!

    • Jason Brownlee July 27, 2017 at 8:11 am #

      A lag is a past observation, an observation at a prior time step.

      We can use these as input features to learning models. So abstractly we can predict today based on what happened yesterday.

      Yesterday’s observation is a lag variable.

      Does that help?

  7. Mert August 26, 2017 at 6:43 pm #

    Dear Jason,
    I am trying to run your code above with an X of size (358,168) and a y of size (358,24), and I am getting the error “ValueError: bad input shape (358, 24)”. I would like to find the most relevant 12 features from the 168 features in X (358,168) based on the 24 outputs of y (358,24).

    My y matrix has 24 outputs instead of 1. What might be the reason for the error?

    X = array[:,0:168]
    y = array[:,168:192]
    rfe = RFE(RandomForestRegressor(n_estimators=500, random_state=1), 12)
    fit = rfe.fit(X, y)

    • Jason Brownlee August 27, 2017 at 5:48 am #

      That might be too many output variables, most algorithms expect a single output variable in sklearn.

      I can’t think of any that support multiple, but I could be wrong.

      You might like to explore a neural network model instead?

      • Mert August 28, 2017 at 10:49 am #

        Thanks for your comment Jason.
        Actually, what I would like to do is determine the most relevant features with RFE, then train a neural network model with these features. Do you think it is a reasonable approach?
        For the multiple output error, I will run RFE for each of the 24 outputs, one by one.

        • Jason Brownlee August 29, 2017 at 5:00 pm #

          You could try it and it would make sense if there is one highly predictive feature, but I would encourage you to test many configurations.

  8. Orry October 9, 2017 at 9:59 pm #

    Thanks for the great tutorial.

    I was wondering if you could explain the logic of why ACF might show some lags as statistically significant, while feature selection might show totally different lags as having predictive power.

    • Jason Brownlee October 10, 2017 at 7:44 am #

      Different methods operate under different assumptions and, in turn, produce differing results. This is to be expected.

  9. lingxiao November 15, 2017 at 1:10 pm #

    hello Jason,

    Thank you for the post loved it!

    I’m a bit confused about the following:

    “This is important because we can contrive not only the lag observation features above, but also features based on the timestamp of observations, rolling statistics, and much more.”

    Would it make sense for me to add “month” to the set of features (“X”) if I have removed the seasonality from the time series already? Also, about the “much more” part, does stationarity still mean anything if I add extra features to “X”?

    If it is not a problem, why do we require the data to be stationary in the first place?
    If it is a problem, how do we make sure that the data is still stationary after we add extra features to “X”?

    • Jason Brownlee November 16, 2017 at 10:25 am #

      Yes, but you can also explore non-linear methods that offer more flexibility when it comes to stationarity requirements.

  10. Ali November 17, 2017 at 8:16 pm #

    Great tutorial! I have moderate experience with time series data. I am interested in detecting the most important features in financial time series data for a binary classification task. And I have about 400 features (many of them highly correlated after I make the data stationary). How could I apply the method you show above? By taking, let’s say, the 10 previous days of each feature? Or do you have other suggestions?

    Thanks in advance!

    • Jason Brownlee November 18, 2017 at 10:15 am #

      I would recommend exploring a suite of approaches and see what features result in the best model skill.

  11. Francisco January 11, 2018 at 2:45 am #

    Hi Jason,

    This is great! How would you go about feature selection for time series using LSTM/keras. In that case, there won’t be a need to deconstruct the time series into the different lag variables from t to t-12.

    I’m currently working on a time series problem with multiple predictors. I need to know which predictors are important. Is the process the same as what you would do here or can I use a randomforest’s importance feature?

    Thanks!

    • Jason Brownlee January 11, 2018 at 5:53 am #

      Good question.

      There may be specialised methods, but I’m not across them right now – perhaps do a little research.

      I’d suggest grid searching models across different subsets of features to see what is important/results in better model skill. Basically an RFE approach.

  12. MLbeginner96 March 25, 2018 at 12:43 am #

    Hi Jason,

    I’m assuming we can extend this feature importance and selection beyond lag variables:
    – “temporal/seasonal features” such as hour of day, month of year, etc.
    – external variables that depend on the problem
    – rolling features such as the min, max, and mean of a value (temperature in this case) over the past n days, for example

    Essentially, given the features you provided in the link below, we can then perform feature importance and selection. Would you agree?

    https://machinelearningmastery.com/basic-feature-engineering-time-series-data-python/

    • Jason Brownlee March 25, 2018 at 6:32 am #

      Sure. I don’t have a lot of material on multivariate time series though, I hope to cover it more in the future.

      • MLbeginner96 April 3, 2018 at 9:37 pm #

        Am I right in saying the process of feature selection/importance/etc occurs AFTER fitting the model to the training data?

        • Jason Brownlee April 4, 2018 at 6:12 am #

          Features should be chosen prior to fitting a model.

          Note though that the process of working through is iterative. Lots of looping back to prior steps.

  13. otw June 14, 2018 at 5:38 am #

    The observations in your training data are not IID. Do you think that is OK for your model?

    • Jason Brownlee June 14, 2018 at 6:13 am #

      Making the series stationary removes the time dependence.

  14. Leonildo August 21, 2018 at 8:25 am #

    RandomForestRegressor does bootstrapping. Wouldn’t this be data leakage, considering that the example is a time series?

  15. Vishal August 31, 2018 at 1:00 am #

    Hi, Jason

    I am using RandomForest for forecasting a rainfall variable. I have around 15 predictors with 50 years of data. When I am predicting rainfall values based on the predictors (variables), I am getting very low values compared to the original rainfall. I mean, I am totally missing the extreme values. Please suggest.

    Regards,
    Vishu

  16. zb September 4, 2018 at 5:31 pm #

    Hi Jason,

    Thanks for the blog. I learned a lot thanks to you.

    I’m looking for a method of selecting variables for time series like the RFE. But after reading this new post (https://machinelearningmastery.com/how-to-predict-whether-eyes-are-open-or-closed-using-brain-waves/), I have doubts about whether it is possible to apply a method that uses bootstrap.

    I think that when using RFE, the evaluation of the models does not respect the temporal ordering of the observations, as happens in your post about how to predict whether eyes are open or closed, which uses future information for the selection of variables. What do you think? Thanks!!

    Regards

    • Jason Brownlee September 5, 2018 at 6:29 am #

      It is a challenge. You could try classical feature selection methods, like RFE and correlation, knowing there is bias, then build models from the suggestions and compare the performance to using all features.

  17. Hamza September 15, 2018 at 5:06 am #

    Hi Jason,
    Many thanks for this blog.

    I use Simple Linear Regression in Sklearn.
    I get this error: (could not convert to float: '(TOP (S (S (NP *')

    I think it’s necessary to encode the categorical data!

    But my dataset is for natural language processing (data from CoNLL-2012).
    Should I use another algorithm that accepts string variables, or is there another solution?

  18. Hossein October 18, 2018 at 4:35 am #

    Hi Jason,

    Thank you for your great tutorials.

    Unfortunately, I got a problem running the code. The result of the code on my computer is exactly the same as yours up until the Autocorrelation Plot section, where my result just shows a straight line at zero.
    Then there is an error as follows.

    runfile('C:/Users/Hossein/.spyder-py3/temp.py', wdir='C:/Users/Hossein/.spyder-py3')
    Traceback (most recent call last):

    File "C:\Users\Hossein\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

    File "", line 1, in
    runfile('C:/Users/Hossein/.spyder-py3/temp.py', wdir='C:/Users/Hossein/.spyder-py3')

    File "C:\Users\Hossein\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

    File "C:\Users\Hossein\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

    File "C:/Users/Hossein/.spyder-py3/temp.py", line 48
    dataframe['t-'+str(i)] = series.shift(i)
    ^
    IndentationError: expected an indented block

    • Jason Brownlee October 18, 2018 at 6:38 am #

      Looks like you did not copy the code with the indenting, here’s how to copy code:
      https://machinelearningmastery.com/faq/single-faq/how-do-i-copy-code-from-a-tutorial

      Also, I recommend running code from the command line:
      https://machinelearningmastery.com/faq/single-faq/how-do-i-run-a-script-from-the-command-line

      • Hossein October 18, 2018 at 8:47 am #

        Thank you for your prompt response.

        Unfortunately, I still have the same problem. I even tried your code on https://repl.it and it showed the same error.

        dataframe['t-'+str(i)] = series.shift(i)
        ^
        IndentationError: expected an indented block

        • Jason Brownlee October 18, 2018 at 2:32 pm #

          Perhaps try copy-pasting the code again and indenting it manually in your text editor?

    • Hossein October 18, 2018 at 2:15 pm #

      I figured it out, finally.

      The autocorrelation plot doesn’t show since there are two “nan”s at the end of the series.
      Add “series = series[1:-2]” after reading the series with the following line:
      series = Series.from_csv('seasonally_adjusted.csv', header=None)

      Another comment regarding the error in Time Series to Supervised Learning.

      the code needs an indented line just after the “for” statement, as follows:

      for i in range(12,0,-1):
          dataframe['t-'+str(i)] = series.shift(i)

  19. hk November 24, 2018 at 11:54 pm #

    The code doesn’t work. I get “Length of values does not match length of index” when creating the dataframe with the shifted columns. I don’t know how you could produce the results with this code.
