Time series forecasting can be framed as a supervised learning problem.

This re-framing of your time series data gives you access to the suite of standard linear and nonlinear machine learning algorithms for your problem.

In this post, you will discover how you can re-frame your time series problem as a supervised learning problem for machine learning. After reading this post, you will know:

- What supervised learning is and how it is the foundation for all predictive modeling machine learning algorithms.
- The sliding window method for framing a time series dataset and how to use it.
- How to use the sliding window for multivariate data and multi-step forecasting.

Let’s get started.

## Supervised Machine Learning

The majority of practical machine learning uses supervised learning.

Supervised learning is where you have input variables (**X**) and an output variable (**y**) and you use an algorithm to learn the mapping function from the input to the output.

`Y = f(X)`

The goal is to approximate the real underlying mapping so well that when you have new input data (**X**), you can predict the output variables (**y**) for that data.

Below is a contrived example of a supervised learning dataset where each row is an observation comprised of one input variable (**X**) and one output variable to be predicted (**y**).

| X | y |
|---|---|
| 5 | 0.9 |
| 4 | 0.8 |
| 5 | 1.0 |
| 3 | 0.7 |
| 4 | 0.9 |
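As a minimal sketch of what "learning the mapping" can look like on the contrived dataset above, the following fits a straight line by ordinary least squares. The linear model and the helper names `fit_line` and `predict` are assumptions chosen for illustration; nothing here prescribes a particular algorithm.

```python
# Contrived dataset from the table above: one input (X) and one output (y).
X = [5, 4, 5, 3, 4]
y = [0.9, 0.8, 1.0, 0.7, 0.9]

def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a * x + b (single input)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (t - mean_y) for x, t in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

a, b = fit_line(X, y)

def predict(x):
    """Approximation of the mapping f learned from the training rows."""
    return a * x + b

print(predict(6))  # prediction for a new, unseen input
```

Any other regression method could be swapped in for `fit_line`; the point is only that the rows of (X, y) pairs are all the algorithm needs.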

It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process.

We know the correct answers; the algorithm iteratively makes predictions on the training data and is corrected by making updates. Learning stops when the algorithm achieves an acceptable level of performance.

Supervised learning problems can be further grouped into regression and classification problems.

- **Classification**: A classification problem is when the output variable is a category, such as “*red*” and “*blue*” or “*disease*” and “*no disease*.”
- **Regression**: A regression problem is when the output variable is a real value, such as “*dollars*” or “*weight*.” The contrived example above is a regression problem.


## Sliding Window For Time Series Data

Time series data can be phrased as supervised learning.

Given a sequence of numbers for a time series dataset, we can restructure the data to look like a supervised learning problem. We can do this by using previous time steps as input variables and the next time step as the output variable.

Let’s make this concrete with an example. Imagine we have a time series as follows:

| time | measure |
|---|---|
| 1 | 100 |
| 2 | 110 |
| 3 | 108 |
| 4 | 115 |
| 5 | 120 |

We can restructure this time series dataset as a supervised learning problem by using the value at the previous time step to predict the value at the next time-step. Re-organizing the time series dataset this way, the data would look as follows:

| X | y |
|---|---|
| ? | 100 |
| 100 | 110 |
| 110 | 108 |
| 108 | 115 |
| 115 | 120 |
| 120 | ? |

Take a look at the above transformed dataset and compare it to the original time series. Here are some observations:

- We can see that the previous time step is the input (**X**) and the next time step is the output (**y**) in our supervised learning problem.
- We can see that the order between the observations is preserved, and must continue to be preserved when using this dataset to train a supervised model.
- We can see that we have no previous value that we can use to predict the first value in the sequence. We will delete this row as we cannot use it.
- We can also see that we do not have a known next value to predict for the last value in the sequence. We may want to delete this value while training our supervised model also.
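The transformation above can be sketched in a few lines of plain Python. This is only an illustration under the assumption that the series is held in a list; the variable names are made up for the example.

```python
# Univariate series from the example above ("measure" at time 1..5).
series = [100, 110, 108, 115, 120]

# Pair each observation (input X) with the one that follows it (output y).
# Slicing drops the two unusable rows: the first value has no previous
# observation, and the last value has no known next observation.
rows = list(zip(series[:-1], series[1:]))

for x, y in rows:
    print(x, y)
```

The four remaining rows are in the same order as the original series, which is what the model must see during training.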

The use of prior time steps to predict the next time step is called the sliding window method. For short, it may be called the window method in some literature. In statistics and time series analysis, this is called a lag or lag method.

The number of previous time steps is called the window width or size of the lag.

This sliding window is the basis for how we can turn any time series dataset into a supervised learning problem. From this simple example, we can notice a few things:

- We can see how this can work to turn a time series into either a regression or a classification supervised learning problem for real-valued or labeled time series values.
- We can see how once a time series dataset is prepared this way that any of the standard linear and nonlinear machine learning algorithms may be applied, as long as the order of the rows is preserved.
- We can see how the width of the sliding window can be increased to include more previous time steps.
- We can see how the sliding window approach can be used on a time series that has more than one value, or so-called multivariate time series.

We will explore some of these uses of the sliding window, starting next with using it to handle time series with more than one observation at each time step, called multivariate time series.

## Sliding Window With Multivariate Time Series Data

The number of observations recorded for a given time in a time series dataset matters.

Traditionally, different names are used:

- **Univariate Time Series**: These are datasets where only a single variable is observed at each time, such as temperature each hour. The example in the previous section is a univariate time series dataset.
- **Multivariate Time Series**: These are datasets where two or more variables are observed at each time.

Most time series analysis methods, and even books on the topic, focus on univariate data. This is because it is the simplest to understand and work with. Multivariate data is often more difficult to work with. It is harder to model and often many of the classical methods do not perform well.

Multivariate time series analysis considers simultaneously multiple time series. … It is, in general, much more complicated than univariate time series analysis

— Page 1, Multivariate Time Series Analysis: With R and Financial Applications.

The sweet spot for using machine learning for time series is where classical methods fall down. This may be with complex univariate time series, and is more likely with multivariate time series given the additional complexity.

Below is another worked example to make the sliding window method concrete for multivariate time series.

Assume we have the contrived multivariate time series dataset below with two observations at each time step. Let’s also assume that we are only concerned with predicting **measure2**.

| time | measure1 | measure2 |
|---|---|---|
| 1 | 0.2 | 88 |
| 2 | 0.5 | 89 |
| 3 | 0.7 | 87 |
| 4 | 0.4 | 88 |
| 5 | 1.0 | 90 |

We can re-frame this time series dataset as a supervised learning problem with a window width of one.

This means that we will use the previous time step values of **measure1** and **measure2**. We will also have available the next time step value for **measure1**. We will then predict the next time step value of **measure2**.

This will give us 3 input features and one output value to predict for each training pattern.

| X1 | X2 | X3 | y |
|---|---|---|---|
| ? | ? | 0.2 | 88 |
| 0.2 | 88 | 0.5 | 89 |
| 0.5 | 89 | 0.7 | 87 |
| 0.7 | 87 | 0.4 | 88 |
| 0.4 | 88 | 1.0 | 90 |
| 1.0 | 90 | ? | ? |

We can see that as in the univariate time series example above, we may need to remove the first and last rows in order to train our supervised learning model.
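The same framing can be sketched in plain Python. This assumes the two measures are stored as a list of `(measure1, measure2)` tuples, which is a choice made for the example rather than a requirement.

```python
# Contrived multivariate series: (measure1, measure2) at each time step.
data = [(0.2, 88), (0.5, 89), (0.7, 87), (0.4, 88), (1.0, 90)]

X, y = [], []
for prev, curr in zip(data[:-1], data[1:]):
    # Inputs: measure1 and measure2 at the previous step,
    # plus measure1 at the current step (assumed to be available).
    X.append([prev[0], prev[1], curr[0]])
    # Output: measure2 at the current step.
    y.append(curr[1])

for inputs, target in zip(X, y):
    print(inputs, target)
```

Pairing consecutive rows with `zip` drops the first and last rows of the table automatically, since neither is complete.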

This example raises the question of what if we wanted to predict both **measure1** and **measure2** for the next time step?

The sliding window approach can also be used in this case.

Using the same time series dataset above, we can phrase it as a supervised learning problem where we predict both **measure1** and **measure2** with the same window width of one, as follows.

| X1 | X2 | y1 | y2 |
|---|---|---|---|
| ? | ? | 0.2 | 88 |
| 0.2 | 88 | 0.5 | 89 |
| 0.5 | 89 | 0.7 | 87 |
| 0.7 | 87 | 0.4 | 88 |
| 0.4 | 88 | 1.0 | 90 |
| 1.0 | 90 | ? | ? |

Not many supervised learning methods can handle the prediction of multiple output values without modification, but some methods, like artificial neural networks, have little trouble.
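A short sketch of the two-output framing follows. Splitting the outputs into one target column each, so that a single-output method can be fit per column, is one possible workaround and is an assumption of this example rather than something the post prescribes.

```python
# Contrived multivariate series: (measure1, measure2) at each time step.
data = [(0.2, 88), (0.5, 89), (0.7, 87), (0.4, 88), (1.0, 90)]

# Window width of one: both measures at t-1 are inputs,
# both measures at t are outputs.
X = [list(row) for row in data[:-1]]
Y = [list(row) for row in data[1:]]

# For a method that supports only one output, split Y into separate
# target columns and fit one model per column on the same X.
y1 = [row[0] for row in Y]  # next-step measure1
y2 = [row[1] for row in Y]  # next-step measure2
```

A method that handles multiple outputs natively, such as a neural network, could be given `X` and `Y` directly.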

We can think of predicting more than one value as predicting a sequence. In this case, we were predicting two different output variables, but we may want to predict multiple time-steps ahead of one output variable.

This is called multi-step forecasting and is covered in the next section.

## Sliding Window With Multi-Step Forecasting

The number of time steps ahead to be forecasted is important.

Again, it is traditional to use different names for the problem depending on the number of time-steps to forecast:

- **One-Step Forecast**: This is where the next time step (t+1) is predicted.
- **Multi-Step Forecast**: This is where two or more future time steps are to be predicted.

All of the examples we have looked at so far have been one-step forecasts.

There are a number of ways to model multi-step forecasting as a supervised learning problem. We will cover some of these alternate ways in a future post.

For now, we are focusing on framing multi-step forecasts using the sliding window method.

Consider the same univariate time series dataset from the first sliding window example above:

| time | measure |
|---|---|
| 1 | 100 |
| 2 | 110 |
| 3 | 108 |
| 4 | 115 |
| 5 | 120 |

We can frame this time series as a two-step forecasting dataset for supervised learning with a window width of one, as follows:

| X1 | y1 | y2 |
|---|---|---|
| ? | 100 | 110 |
| 100 | 110 | 108 |
| 110 | 108 | 115 |
| 108 | 115 | 120 |
| 115 | 120 | ? |
| 120 | ? | ? |

We can see that the first row and the last two rows cannot be used to train a supervised model.

It is also a good example to show the burden on the input variables. Specifically, that a supervised model only has **X1** to work with in order to predict both **y1** and **y2**.

Careful thought and experimentation are needed on your problem to find a window width that results in acceptable model performance.
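One way to experiment with different window widths and forecast horizons is a framing function that takes both as parameters. The sketch below is a hypothetical helper (`frame_supervised`, `n_in`, and `n_out` are names invented for this example) that keeps only the complete rows.

```python
def frame_supervised(series, n_in=1, n_out=1):
    """Frame a univariate series as supervised learning rows.

    n_in  -- window width (number of lag observations used as input)
    n_out -- number of future time steps to predict
    Rows with unknown ("?") values are never generated.
    """
    X, Y = [], []
    for i in range(n_in, len(series) - n_out + 1):
        X.append(series[i - n_in:i])
        Y.append(series[i:i + n_out])
    return X, Y

series = [100, 110, 108, 115, 120]

# Two-step forecast with a window width of one, as in the table above.
X, Y = frame_supervised(series, n_in=1, n_out=2)
for inputs, targets in zip(X, Y):
    print(inputs, targets)
```

Calling the function with a larger `n_in` gives the model more inputs to work with for each pair of outputs, at the cost of fewer training rows.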

## Further Reading

If you are looking for more resources on how to work with time series data as a machine learning problem, see the following two papers:

- Machine Learning for Sequential Data: A Review (2002) [PDF]
- Machine Learning Strategies for Time Series Forecasting (2013) (also slides PDF)

For Python code for how to do this, see the post:

## Summary

In this post, you discovered how you can re-frame your time series prediction problem as a supervised learning problem for use with machine learning methods.

Specifically, you learned:

- Supervised learning is the most popular way of framing problems for machine learning as a collection of observations with inputs and outputs.
- Sliding window is the way to restructure a time series dataset as a supervised learning problem.
- Multivariate and multi-step forecasting time series can also be framed as supervised learning using the sliding window method.

Do you have any questions about the sliding window method or about this post?

Ask your questions in the comments below and I will do my best to answer.

Thanks for the article. I understand the transformation. Now how do you separate the data into training and testing sets? Also, will the next article be working a simple example through to building a predictive model?

Great question Robert, I will have a post on this soon.

Can you please post a link to the article (if you created one) which you mentioned in this comment?

Sure, see this post:

https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/

Machine learning methods are not suitable for time series analysis. They do not take into account the relationship that exists between data values.

Interesting perspective Leo.

Machine learning methods require that this relationship be exposed to them explicitly in the form of a moving average, lag obs, seasonality indicators, etc. Just like linear regression does in ARIMA. Not really a big leap here.

Classical methods (like MA/AR/ARMA/ARIMA/and friends) break down when relationships are non-linear, obs are not iid, residuals are not Gaussian, etc. Sometimes the complexity of the problem requires that we try alternate methods.

Finally, there are newer methods that can learn sequence, like LSTM recurrent neural networks. These methods have the potential to redefine an industry, just like has been done in speech recognition and computer vision.

Machine learning methods require that there is no correlation between variables. This breaks down for time series where the lagged values are correlated.

Moreover, there are many nonlinear time series methods like GARCH and its variants.

Great point, thanks Leo.

The point about correlated inputs is true for many statistical methods, less true for others like trees, instance-based methods and even some neural nets (cnn and rnn).

I think you’re spot on – most small univariate time series datasets will be satisfied with a classical statistical method. Perhaps LSTMs or decision trees on lagged vars can add something, perhaps not.

When things get hairy in data with a time component (like movement prediction, gesture classification, …) perhaps ML is the way to go. I need to do a better job of fleshing out this detail.

I tried it half a month ago, but it didn’t work well

Is that not a bit bombastic?

There are several quant hedge funds that have made and continue to make mind blowing returns through the use of ML methods and correlated variables in multivariate TS data.

Maybe I’m missing something ?

ML does NOT require that there is no correlation between variables… nor does any regression model.

Regression models prefer uncorrelated input variables for model stability.

Not a requirement (we can still do it…), more of a strong preference.

See this article on Multicollinearity

https://en.wikipedia.org/wiki/Multicollinearity

Good point Jason. I guess I need to study LSTM.

When will you publish something about multi-step forecasting? 🙂

They are scheduled for later this month or early next month.

The data generated from sensors of IoT or industrial machines are also typical time series, and usually of huge amount, aka industrial big data.

For this type of TS, many digital signal processing methods are used in the analysis, such as FFT, wavelet transform, and Euclidean distance.

It seems that books discussing ML on TS usually don’t cover this DSP area. What do you think?

I agree Dehai.

We can view these methods as data preparation/data transforms in the project process.

Use of more advanced methods like FFT and wavelets requires knowledge of DSP which might be a step too far for devs looking to get into machine learning with little math background.

Thanks!

I had a project where I had to predict the likelihood of equipment failure from an event log. What worked pretty well was creating a training set from the event log with temporal target features that included whether or not a piece of equipment failed in the next 30, 60 days, etc. I also added temporal features for a piece of equipment’s past history, e.g., frequency of maintenance over different periods, variance in measurements, etc. Could then apply any machine learning technique. Test set was created from last 20% of samples.

— Jay Urbain
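A chronological split like the one described in this comment can be sketched as follows. The function name `chronological_split` and the integer stand-in rows are assumptions for illustration; the key point is that the rows are not shuffled.

```python
def chronological_split(rows, test_fraction=0.2):
    """Split time-ordered rows without shuffling: the most recent
    `test_fraction` of the rows becomes the test set."""
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[:-n_test], rows[-n_test:]

rows = list(range(10))  # stand-in for 10 time-ordered training patterns
train, test = chronological_split(rows)
print(train, test)
```

Keeping the test set strictly later in time than the training set avoids evaluating the model on data that precedes what it was trained on.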

Very nice, thanks for sharing Jay!

Hi Jay,

I am interested in finding out more about the predictive task you were involved with. Any chance you have a blog or can share more by email?

Hi Jason and Jay

We are also trying to predict device failure based on temporal signals like temperature, humidity, power consumption, events/alarms, etc.

How does one relate 5 temporal data signals to one single fail/pass result at the end of the period?

Most examples seem to be about predicting the signal itself, whereas in our case we probably need to find patterns in the relation between the signals. For example, if it is using a lot of power and the ambient temperature is low but the temperature is not decreasing, something is wrong with the compressor.

Any tips would be highly appreciated.

Perhaps this example of multivariate forecasting will help as a starting point:

https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/

Jason, is using multi-step time lags with multivariate KNN or Random Forest equivalent to transforming the feature space in a similar way to kernel functions?

I will also be curious to see how SVM can be used on multivariate problems.

Thanks for the post.

I don’t think so Ziad, do you have a specific idea in mind?

In activity prediction application, the activity can be predicted only after multiple sequence of steps (multivariate time series data). Kindly suggest how to handle this problem for predicting the activity

Nice problem Kavitha. Sorry, I don’t have any examples of activity prediction. I don’t want to give you uninformed advice.

How would the time series restructuring be affected if we have 2 level or n level categorization within a time series. For example in case of sensor data we get it on each day and with-in the day say at every 5 seconds. The correlation may exist at the outer level i.e at day level but may not at internal level i.e at next sample (in seconds).

Day1 Measure

5PM 20

5PM5Sc 22

Day2 Measure

5pm 25

5pm5sc 27

so on.

Great question pankaj.

I would suggest resampling the data to a few different time scales and building a model on lag signals of each, then ensemble the predictions. Also, build bigger models on lagged signals at each scale. You want to give your models every opportunity to exploit the temporal structure in the problem.

It was a great article. My question is not really on this topic.

How can I capture the errors of a neural network for each instance of the data and print them out in Java, and then interpolate on the captured errors to predict the errors?

This is great. Though the multi-step forecast somewhat bothers me. If we make a data model with features, for example, 3 continuous lags, then it shows that somehow the next step would be built upon the value of these 3 data points, like X(t) = a1.X(t-1) + a2.X(t-2) + a3.X(t-3). And what’s more, to predict further into the future, do we have to extend the width of the window? In that case, as the number of features is also extended, the size of the training data must also be extended, right?

That is correct.

There are two general approaches for a multi-step forecast: direct (one model for each future time step to be predicted) and recursive (use the one-step model again and again with predictions as inputs).
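As a rough sketch of the recursive strategy, the function below applies a one-step model repeatedly and feeds each prediction back in as an input. The names `recursive_forecast` and `mean_model` are hypothetical, and `mean_model` is only a stand-in for whatever trained one-step model you have.

```python
def recursive_forecast(one_step_model, history, width, n_steps):
    """Use a one-step model again and again, with predictions as inputs."""
    window = list(history[-width:])
    forecasts = []
    for _ in range(n_steps):
        yhat = one_step_model(window)
        forecasts.append(yhat)
        # Slide the window forward over the value just predicted.
        window = window[1:] + [yhat]
    return forecasts

# Hypothetical stand-in for a trained model: the mean of the window.
def mean_model(window):
    return sum(window) / len(window)

series = [100, 110, 108, 115, 120]
forecasts = recursive_forecast(mean_model, series, width=2, n_steps=3)
print(forecasts)
```

The direct strategy would instead train a separate model for each future step, so no prediction is ever used as an input.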

Nice article. You are proposing supervised learning for complex time series, instead of classical forecasting methods. Do you have any particular supervised learning method in mind? If so, what makes you think it will work better than NN based LSTM.

You also mentioned, in response to a comment, that some ML techniques are not adversely impacted by correlated input. Can you please shed some light on your comment.

Hi Pranab,

No specific method in mind, more of a methodology of framing time series forecasting as supervised learning, making it available to the suite of linear and nonlinear machine learning algorithms and ensemble methods. Not a new idea for sure.

Sure, often decision trees are unflappable when it comes to irrelevant features and correlated features. In fact, often when there are unknown nonlinear interactions across features, accepting pairwise multicollinearity in input features results in better performing models.

Hello Jason,

This is a very interesting topic. Have you considered forecasting one step ahead as a function of multiple steps before? This will represent an output which is a function of several variables. The question of interest, by analogy to the traditional multivariate function, is how many variables (back steps) to use and which ones are most significant to use through a variable selection process. Variable selection could identify which time periods influence the analysis and forecast.

This approach can greatly benefit the forecasting and analysis of time series using all of the machine learning algorithms.

A colleague and I applied this approach. Four published papers on this work can be “googled” using my name (Hassine Saidane).

Happy continuation and thanks for sharing the article.

Thanks for sharing Hassine.

Hi Jason,

I am trying to predict customer attrition based on revenue trend as time series

Month1 –> $ ; month2 –> $ as training data set.

How can i use predictive algorithm to predict customer attrition based on the above training data ?

Thanks

Sam

I would encourage you to re-read this post, it spells out exactly how to frame your problem Sam.

Thanks for your response Jason. I understood the above example. The above example seems to be predicting Y as a regression value. But I am trying to predict Y as a classification value (attrition = 1 or non-attrition = 0).

Example : Below is the time series of revenue where 1,2,3.. are the months and Y tell us if the customer attrited or not. Y will have only 2 values 1 or 0.

So can i use the below format for my test data ?

revenue1 revenue2 revenue3 …Y

100 50 -25 1

200 100 300 ….. 0

Appreciate your help.

Thanks

Sam

Looks good Sam.

Thanks a ton Jason for your quick response.You made my day 🙂

I’m here to help Sam.

Dear Dr Jason,

Two topics please

(1) On cropping data and applying the model ‘to the real world’. I understand that cropping is done on the 0th and kth data points to get a 1:1 correspondence between data values at t and t-1. I assume from previous posts that you crop say the (k-10)th to kth data points, perform the successive 1 step ahead predictions and select the model based on the min(set of mse of all selected models) of the difference between the test and predicted models.

(a) Is the idea to use that model to predict the (k+1)th unknown?

(b)Can we assume that the model you ‘trained’ will be acceptable when more data is acquired. In other words, what happens if you collect another x data points, and you want to predict the (k + x + 1) data point, can we assume that the model trained at k data points will work for the model at k + x data points? Or in other words, when do you ‘retrain’ the model.

(2) On windowing the data: based on this blog, is the purpose of windowing the data to find the differences and train the differenced data to find the model. How can we make the assumption that the (k+1)th differenced observation can be predicted from the kth differenced observation.

Thank you,

Anthony from Sydney Australia

Hi Anthony,

Sorry, I don’t understand what you mean by cropping. Perhaps you could give an example?

Generally, we use all available historical data to make a one-step prediction (t+1) or a multi-step prediction (t+1, t+2, …, t+n). This applies when evaluating a model and when new data becomes available.

Windowing is about framing a univariate time series into a supervised learning problem with lag obs as input features. This allows us to use traditional supervised learning algorithms to model the problem and make predictions.

I hope that helps.

Dear Dr Jason,

I will rephrase both (1) and (2) into one.

Perhaps I wasn’t very clear at all.

Cropping. by cropping I mean remove the earliest, the 0th and the latest kth data points because there are no corresponding lagged values by virtue of lagging.

eg

| data point value | lagged data point | array reference |
|---|---|---|
| 1 | ? (this is cropped/pruned) | 0 |
| 2 | 1 | 1 |
| 3 | 2 | 2 |
| 44 | 3 | 3 |
| 5 | 4 | 4 |
| . | . | |
| 560 | 1234 | k-1 |
| ? | X (this is cropped/pruned) | k |

dataset available for processing

| datapoint | lagged data point | array ref (based on original data) |
|---|---|---|
| 2 | 1 | 1 |
| 3 | 2 | 2 |
| 44 | 3 | 3 |
| 5 | 44 | 4 |
| . | . | |
| 560 | 1234 | k-1 |

This is the above dataset with the 0th and kth elements cropped/pruned from the original.

I should have been clearer. I apologise.

My questions

(a) Based on the ‘new’ lagged dataset, how can you make a prediction for the (k + 1)th data point given that the kth data point is not available? In other words, are we making a prediction for the (k+1)th data point based on the (k-1)th data point?

(b) Perhaps I’m missing something, having read the other posts on ARIMA. How can we make the assumption that predicting the next data point is based on the previous data point when there may well be MA or AR or other kinds of processes on the data? Or in other words, how can we assume that differencing or windowing as in this tutorial/blog will be the basis of our training model?

(c) Suppose you trained your model based on the original dataset. Suppose that as your system acquires more datapoints, won’t the original model that you trained become invalid. Say you got an extra 10 or 1000 datapoints, do you have to retrain your data because the coefficients of the original model may not be an adequate predictor for a larger dataset.

Thank you again and I hope I have been clearer,

Anthony of Sydney Australia

Hi Anthony,

What is k? Is that a time step t? I think it is given context.

If you want to forecast a new data point that is out of sample (t+1) beyond the training dataset, your model will use t-1, … t-n as inputs to make the forecast.

This applies regardless of the type of model used. E.g. if you are using an AR, the inputs will be lagged obs. If MA, the inputs will be an autoregression of the lagged error series.

If differencing is performed in the preparation of the model, it will have to be performed on any new data. The decision to difference or seasonally adjust is based on the data itself and your analysis of temporal structure like trends and seasonality.

Yes, as new data comes in the model will need to be refit. This is not a requirement for all problems, but a good idea. To mimic this real world expectation, we evaluate models in the same way using walk-forward validation that does exactly this – refits a model each time a new ob is available and predicts the next out of sample ob.

I hope this helps. I do cover all of this in my book, lesson by lesson.
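The walk-forward procedure described in this reply can be sketched roughly as below. The function and parameter names are hypothetical, and the persistence "model" (predict the last observed value) is only a placeholder so the example runs end to end.

```python
def walk_forward_validation(series, n_test, fit, predict):
    """Refit on all data seen so far, predict the next ob, then repeat."""
    history = list(series[:-n_test])
    errors = []
    for actual in series[-n_test:]:
        model = fit(history)            # refit each time a new ob is available
        yhat = predict(model, history)  # one-step, out-of-sample forecast
        errors.append(abs(actual - yhat))
        history.append(actual)          # the true value now becomes history
    return sum(errors) / len(errors)    # mean absolute error

# Placeholder model (an assumption for illustration): persistence.
def fit_persistence(history):
    return None  # nothing to learn

def predict_persistence(model, history):
    return history[-1]

series = [100, 110, 108, 115, 120]
mae = walk_forward_validation(series, 2, fit_persistence, predict_persistence)
print(mae)
```

Swapping in a real `fit` and `predict` pair leaves the evaluation loop unchanged.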

Dear Dr Jason, apologies again, my original spaced data set example did not appear neat.

In both the original and the cropped/pruned/windowed datasets, there are meant to be three columns consisting of the data, data lagged by 1, and the array index based on the original dataset.

I don’t know how to get nicely spaced tabbed data when posting replies on this blog

Regards

Anthony of Sydney

You can use the pre HTML tag, e.g.:


Dear Jason,

have you planned any blog on forecasting Multivariate Time Series? I went through your ARIMA post and it was good start point for me.

Thanks,

#student #aspiring data analyst

Thanks Nirikshith.

Yes, I hope to cover multivariate time series forecasting in depth soon.

Jason,

I am new to machine learning. I have a problem type and I was wondering if you could point me to the right area to study so I can learn and apply the appropriate model/technique. I have a set of time series data(rows), composed of a number of different measurements from a process(columns). Think hundreds of sensors, measured each second. I have a hunch that there is a relationship between the columns that is offset in time. Say something happens at time t1 in column 1 and 10 seconds later there is a change in column 2. My desire is to find the columns that have this time relationship and the time between when a change in one column is reflected in the related column(s). My goal would be to then train a model to indicate predictions based on changes in the earlier in time variable prior to the later in time variable changing. Your article is helpful to understand how I might try to train a model to forecast within a single column, but how do I train or dig out the relationships between columns?

If you could point me to what parts of machine learning I should focus my learning efforts I would appreciate it.

Thanks

Bruce

Hi Bruce, time series analysis is a big field. I’d recommend picking up a good practical book.

Generally, consider looking for correlations between specific lags and your output variable. (e.g. correlation plots).

I hope that helps as a start.
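One simple way to look for a time-offset relationship between two columns is to compute the correlation at a range of lags and see which lag stands out. The sketch below uses made-up data in which the second column repeats the first three steps later; `pearson` and `best_lag` are hypothetical helper names, and Pearson correlation only captures linear relationships.

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def best_lag(leader, follower, max_lag):
    """Correlate `leader` at time t with `follower` at time t + lag and
    return the lag with the strongest absolute correlation."""
    scores = {}
    for lag in range(1, max_lag + 1):
        scores[lag] = pearson(leader[:-lag], follower[lag:])
    return max(scores, key=lambda k: abs(scores[k])), scores

# Made-up sensor columns: col2 repeats col1 three time steps later.
col1 = [0, 1, 0, 3, 5, 2, 0, 1, 4, 6, 1, 0]
col2 = [0, 0, 0] + col1[:-3]

lag, scores = best_lag(col1, col2, max_lag=5)
print(lag, scores[lag])
```

With hundreds of columns, the same check could be run over each pair to shortlist candidates before training a model on the lagged inputs.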

Jason,

Thank you, do you have a suggestion for a good book to start with?

Bruce

Yes, my book:

https://machinelearningmastery.com/introduction-to-time-series-forecasting-with-python/

If you are interested in R, here are more books:

http://machinelearningmastery.com/books-on-time-series-forecasting-with-r/

Hi Jason,

Very Nice Article, Just had a question whether there is a forecasting technique for Region/Branch based forecasting.

What do you mean exactly?

Hi Jason,

Thanks for this article. I have 2 questions:

1. Is there a way to avoid removing the rows altogether? If we are creating lag (t-2), (t-3) etc then we will have to remove more rows. I have seen kaggle masters use XGB with missing = NA option so that it handles missing data but not sure what can be done with other models.

2. Can you please shed some light on the fact that data may not be i.i.d. P(Y|X) (may be identical but y|x may not be independent for rows). I think most ML models should fail in this scenario. Am i thinking in the right direction? Also is there a way to check the iid hypothesis?

Regards,

Rishi

Thanks Rishi.

Yes, you can mark the values as NaN values, some algorithms can support this, or set them to 0.0 and mask them. Like xgboost or neural nets.
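A small sketch of keeping every row by padding the missing lags with NaN is shown below. The helper name `lag_features` is hypothetical, and whether a given algorithm accepts NaN inputs needs to be checked for that algorithm.

```python
def lag_features(series, width, fill=float("nan")):
    """Build (lags, target) rows for every observation, padding lags
    that fall before the start of the series with `fill`."""
    rows = []
    for i in range(len(series)):
        lags = [series[i - k] if i - k >= 0 else fill
                for k in range(width, 0, -1)]
        rows.append((lags, series[i]))
    return rows

series = [100, 110, 108]
rows = lag_features(series, width=2)
for lags, target in rows:
    print(lags, target)
```

Passing `fill=0.0` instead gives the zero-padded variant, which would then need a mask so the model can ignore the padded values.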

Great point. Classical methods would not fail, but may fare worse than methods that are adjusted for the dependence. I’d still recommend spot checking a suite of methods on a problem as a baseline. ARIMA is corrected for the dependence (as far as I remember).

Thanks for the reply Jason. I was reading up on auto correlation correction in regression ( detected using Durbin Watson) but that was applicable for continuous data – Cochrane orcutt. Is there in general any way to correct for it? I think most of the problems that we work on in real world are time series such as customer churn etc. And I feel time series regression is what we (unknowingly) do as well, as in use X such as performance in last month etc. Please suggest some material.

Regards,

Rishabh

Sorry, I’m not sure what you’re asking, can you restate your question Rishi?

Let’s say we pick a real life case study: predict a customer’s retail spend this month. In this case a person’s spending amount this month might depend on whether he had a big spend last month or not. Obviously we can have lagged y as X in the model to capture the info, but do you think that data will be iid? Residual analysis should give some insight into it for sure (Durbin Watson should also help detect that).

Also problems like customer churn, I always use this approach: fix a timeline lets say 1 Jan, Target is customer who churned in Jan – Feb and X are information from past (spend in last 2 months Dec and Nov for all customers). Variables used are like spend in last x months etc. Does this approach seem right for time series kind of classification?

Sorry for a long post, just wanted to clarify my thoughts.

Yes, I would encourage you to test it empirically rather than getting too bogged down in analysis.

You cannot pick the best algorithm for a specific prediction problem analytically.

Hi Jason,

Superb post!

I have a query. I am working on a real life problem of forecasting 3 days sales for a Retail store.

I am thinking of applying a hybrid model(ARIMAX+Neural network) i.e Dynamic regression with regressors using auto.arima,then fitting Neural network model on the residuals.The final forecast will be y= L+N where L=forecast from ARIMAX and N= forecast of residuals from NNETAR. What do you think of this approach?

Also, I need your input on applying cross validation techniques. I have daily sales data from Jan14-June17. Would it be worthwhile to tune the parameters using cross validation techniques (adding months/quarters), or should I go ahead and train the model only once (let’s say on Jan14-Dec16) and measure the accuracy on the rest (test and validation)? What could be the best approach, as I need only 3-day forecasts?

It does not matter what I think, use data to make decisions – e.g. the model or combination of models that get the best skill on a robust test harness.

Use walk forward validation on time series, more here:

http://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/
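As a rough illustration of walk-forward validation, here is a minimal sketch. It assumes a univariate list of observations and uses a persistence forecast as a placeholder model; the helper name `walk_forward_splits` and the `sales` values are made up for the example and are not taken from the linked post.

```python
def walk_forward_splits(n_obs, min_train):
    """Yield (train_indices, test_index) pairs with an expanding training window."""
    for t in range(min_train, n_obs):
        yield list(range(t)), t


# Hypothetical daily sales values, used only to show the mechanics.
sales = [10, 12, 13, 12, 15, 16, 18, 17]

errors = []
for train_idx, test_idx in walk_forward_splits(len(sales), min_train=5):
    history = [sales[i] for i in train_idx]
    forecast = history[-1]  # persistence: predict the last known value
    errors.append(abs(sales[test_idx] - forecast))

mae = sum(errors) / len(errors)
print(f"walk-forward MAE: {mae:.3f}")
```

Any model can be swapped in where the persistence forecast is made, as long as it is fit only on `history` at each step.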

Jason

Excellent article about time series forecasting. I have a fair understanding of traditional statistical ML techniques and their application. I have a couple of questions on applying NN/LSTM to time series forecasting.

1. To what extent do we need to worry about overfitting?

2. Are there ensemble techniques that apply different models for different time horizons?

Overfitting is always a problem in applied machine learning.

Not sure I follow. If you have different time horizons, then you will need different models to make those predictions. Perhaps you can use outputs from one model as inputs to another, but I have not seen a structured way to do this – I’d encourage you to experiment.

Hi Jason,

Thanks for the nice and helpful article you have shared. There is a research paper I am trying to implement, based on predicting cloud resource usage. The sliding window technique is required for preprocessing the data, and the data is fed to the LSTM as input. For example, while predicting CPU usage of a particular VM, I have the time series data at an interval of 1 minute in the following format:

Timestamp | CPU usage

t | value1

t+1 | value2

t+2 | value3

…

and so on, similarly for other parameters as well, such as RAM, DISK, etc.

Could you please guide me with what should be the format of my training and testing sets, if I use LSTM.

Thanks in advance.

This post on backtesting models for time series data might give you some ideas:

https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/
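For the input format itself, Keras-style LSTM layers generally expect a 3D array shaped [samples, timesteps, features]. The sketch below is only an illustration under that assumption: the helper `make_windows`, the window length of 4, and the synthetic `readings` array (columns standing in for CPU, RAM and DISK) are all hypothetical.

```python
import numpy as np


def make_windows(data, n_steps):
    """Turn a [time, features] array into windowed inputs and next-step targets."""
    X, y = [], []
    for i in range(len(data) - n_steps):
        X.append(data[i:i + n_steps])   # n_steps consecutive rows, all features
        y.append(data[i + n_steps, 0])  # next value of the first column (CPU)
    return np.array(X), np.array(y)


# Synthetic readings: 10 one-minute time steps, 3 columns (CPU, RAM, DISK).
readings = np.arange(30, dtype=float).reshape(10, 3)

X, y = make_windows(readings, n_steps=4)
print(X.shape)  # (6, 4, 3) -> [samples, timesteps, features]
print(y.shape)  # (6,)
```

The training set would then be the earliest samples in `X` and `y`, and the test set the latest ones, so that the split respects time order.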

Hi Jason,

I was wondering if it is common/good practice to have two windows/lags in a multivariate analysis? Suppose y is correlated with t-1 on x1, but t-5 on x2. Is this possible?

Thanks,

Jeff

Yes, often a fixed window of lag obs are provided across all features. Zero coefficients can be used to zero out features that do not add value.
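A minimal sketch of that fixed-window idea, assuming two input series and a maximum lag of 5: every feature contributes lags 1 through 5, and the model is left to decide which columns matter. The helper `lag_matrix` and the toy series are hypothetical, and `y` is deliberately constructed from `x1` at t-1 and `x2` at t-5 so the relevant columns are known.

```python
def lag_matrix(features, target, max_lag):
    """Build rows of [every feature at t-1 .. t-max_lag] -> target at t."""
    rows, targets = [], []
    for t in range(max_lag, len(target)):
        row = []
        for name in sorted(features):
            for lag in range(1, max_lag + 1):
                row.append(features[name][t - lag])
        rows.append(row)
        targets.append(target[t])
    return rows, targets


# Toy series; y depends on x1 at t-1 and x2 at t-5 by construction.
x1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x2 = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
y = [0] * 5 + [x1[t - 1] + x2[t - 5] for t in range(5, 10)]

X, targets = lag_matrix({"x1": x1, "x2": x2}, y, max_lag=5)
print(len(X), len(X[0]))  # number of rows, number of lag columns
```

Here column 0 holds `x1` at t-1 and column 9 holds `x2` at t-5; a linear model fit on these rows could in principle assign near-zero coefficients to the other eight columns.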

Hi Jason, I am working on multivariate time series data for anomaly detection. Could you please suggest some algorithms? I have tried isolation forest and ARIMA, but ARIMA works only for a single variable.

Please help

Perhaps this process will help you work through your problem systematically:

https://machinelearningmastery.com/start-here/#process

Hello Jason,

I have read your article. I would assume, as you have said, that forecasting a time series as shown might work with certain algorithms, such as LSTM. However, I am analyzing a multivariate regression with random forests, predicting a final output value based on attribute vectors. The nature of RF is that it is not time dependent, so I believe this time window is not required?

Nevertheless, the lag obs can be framed as input variables, and sometimes stateless (time-unaware) methods can achieve impressive results. Try it and see on your problem.

Hi,

Thank you for a great post! I enjoyed reading it 🙂

This is my first time trying to solve a time series problem, so your explanations really ease the “where to start” issue.

I have a problem in which I’m trying to find a correlation between:

1. 9 facial expression scores (given: joy 0.9, happy 0.77, angry 0.5, etc.) every 3 milliseconds

2. the participant’s actions – move, spin, play music, stop music (a closed list of options with time stamps)

3. Curiosity score – as measured by various means (questionnaires, behavioral measures) – one score per participant.

The study question: Is there a correlation between the user’s facial expressions and his behavior and his curiosity?

Can you suggest a way to work on this kind of data?

Can you refer me to a post about it? or an article?

Any idea / suggestion / solution will help 🙂

Thank you very much,

Shani

Sounds like you need to load the data as a DataFrame and calculate the correlation between columns.

This post might help:

http://machinelearningmastery.com/understand-machine-learning-data-descriptive-statistics-python/

You can even calculate the correlation directly on the DataFrame:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html
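A small sketch of that DataFrame approach follows. It assumes the high-frequency expression scores and action logs have already been summarised to one row per participant, which is an assumption about how the data would be prepared; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical per-participant summary; names and values are made up.
df = pd.DataFrame({
    "mean_joy": [0.9, 0.4, 0.7, 0.2, 0.6],
    "n_actions": [12, 5, 9, 3, 8],
    "curiosity": [4.5, 2.0, 3.8, 1.5, 3.1],
})

corr = df.corr()  # pairwise Pearson correlation by default
print(corr.round(2))
```

Each cell of `corr` is the correlation between the two named columns, so the `curiosity` row shows how each summary measure relates to the curiosity score.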

Very useful!

I used your technique (Multivariate Time Series) to prepare my data.

After running a regression model on it, I get awesome prediction precision for daily industrial electrical consumption. And I swear the energy demand was really not stable!

Thanks Jason !

Glad to hear it!

Hello, I don’t understand the following statements:

“We can see that the order between the observations is preserved, and must continue to be preserved when using this dataset to train a supervised model.”

“We can see how once a time series dataset is prepared this way that any of the standard linear and nonlinear machine learning algorithms may be applied, as long as the order of the rows is preserved.”

Why does the order of the rows have to be preserved when training the data? Haven’t you essentially converted the time series data to cross-sectional data once you have included the relevant lags in a given row?

Thank you,

Andrea

No, we are exposing temporal structure as inputs.

I don’t understand the same statement of Andrea and I have one more question.

1) Why does the order of the instances (rows) have to be preserved when training the data?

2) Does this mean that we can not perform k-fold cross validation on the prepared dataset?

Thanks.

In time series the order between observations is important, we want to harness this in the model. It is also a constraint, e.g. we cannot use obs from the future to predict the future.

Correct, we cannot perform k-fold cross validation. We can use walk forward validation instead:

https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/

Thanks for the patience, but I have this specific problem. I have a univariate time series and I want to train an SVM (regression) in order to predict one step ahead. Suppose we have the sequence: 1, 2, 3, 4, 5, 6, 7, 8, 9.

As you suggest, I create the following representation in order to perform supervised learning:

1 2 3 | 4

2 3 4 | 5

3 4 5 | 6

4 5 6 | 7

5 6 7 | 8

6 7 8 | 9

Where the last column is the target. Now I want to train an SVM and I have to choose hyperparameters such as C and the best number of input features, so I need k-fold cross validation. I don’t understand the point when you say that the order of the instances (single rows of the dataset above) must be preserved during training, so we can’t create random samples as folds of k-fold cross validation. In general, if we pick the dataset and train the SVM using instances in reversed or random order (first instance is vector 6, 7, 8 with target 9, second vector is 5, 6, 7 with target 8, and so on), we must obtain the same model.

I found an article in which authors use SVM and ANN for time series forecasting problem and in order to achieve supervised learning they transform time series according to your idea but also they perform k-fold cross validation (random samples) in order to choose best hyperparameters. What do you think about this article (PAGE 7)? http://docsdrive.com/pdfs/ansinet/jas/2010/950-958.pdf

I understand we can’t perform k-fold cross validation of raw time series if we use statistical models (ARIMA, Exponential Smoothing, etc.), so we use walk forward validation and I accept it. But in the case of general purpose algorithms such as SVM and ANN, if we transform time series data into a data frame for supervised learning with input variables (features) and output variables (target), we can use it as a “normal” dataset for a regression problem where the order is not important in training, so we can randomly split it into train and test.

Thanks for your support!

Yes, excellent point.

If the model has no state (e.g. not an LSTM), then it is just working with input/output pairs. In which case, using k-fold cross-validation may be defensible. It might even be preferred.

This is true as long as the train/test sets were prepared in such a way as to preserve the order of obs through time. E.g. that the model is not learning about the test set during training.
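To make that condition concrete, here is a minimal sketch using the 1 to 9 sequence from the question. It frames the series with a window of 3, splits the rows chronologically, and checks that no test target appears anywhere in the training rows. The split point and helper name are assumptions chosen for illustration.

```python
def sliding_window(series, n_lags):
    """Return (inputs, target) rows: n_lags past values -> next value."""
    return [(series[i:i + n_lags], series[i + n_lags])
            for i in range(len(series) - n_lags)]


series = [1, 2, 3, 4, 5, 6, 7, 8, 9]
rows = sliding_window(series, n_lags=3)

# Chronological split: fit on earlier rows, score on later ones.
n_train = 4  # assumed split point
train, test = rows[:n_train], rows[n_train:]

# The values here are unique, so they double as time indices for this check.
train_values = {v for inputs, target in train for v in inputs + [target]}
leak = [target for _, target in test if target in train_values]
print(train[-1], test[0], leak)
```

With a random shuffle instead, a row such as 3 4 5 | 6 could land in the test fold while 4 5 6 | 7 sits in the training fold, so the test target 6 would be visible during training.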

Hi Jason,

Thank you very much for this contribution. It helped me a lot to understand how to use two powerful tools together. But I have a question. I have a series of data which shows seasonality. Would there be a problem in using this technique, or should I first apply a SARIMA model before applying your advice?

Thank you!

I would recommend removing the seasonality first.
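The reply does not say which method to use; one common option is seasonal differencing, sketched below under that assumption. The period of 4 and the toy series are hypothetical, and the function names are made up for the example.

```python
def seasonal_difference(series, period):
    """Subtract the value from one season earlier: d[t] = x[t] - x[t - period]."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


def invert_seasonal_difference(history, diff_value, period):
    """Add back the value from one season earlier to return to the original scale."""
    return diff_value + history[-period]


# Toy series: a repeating pattern of length 4 plus a trend of +1 per step.
pattern = [10, 20, 15, 5]
series = [pattern[t % 4] + t for t in range(12)]

diffed = seasonal_difference(series, period=4)
print(diffed)
```

The differenced series can then be framed with the sliding window, and forecasts mapped back to the original scale with `invert_seasonal_difference`.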

Hi Jason! I have three questions regarding the way I’m modeling my problem.

INTRO

I’m trying to predict the demand of different products for a company. I have the day at which the order was registered, the price of the product, size of the order, client id, etc, etc, etc for each order in the past 5 years or more.

EXAMPLE

Here is an oversimplified example I wrote to make it clear:

day | price | size

1 | 80 | 3

2 | 85 | 10

3 | 90 | 5

4 | 100 | 8

5 | 110 | 10

6 | 100 | 12

7 | 90 | 1 <– small size in t=7, maybe this caused the increase in t=10

8 | 100 | 21

9 | 95 | 18

10 | 90 | 50 <– increase

11 | 100 | 25

12 | 100 | 20

13 | 110 | 1 <– small size in t=13, maybe this caused the increase in t=14

14 | 110 | 60

15 | 110 | 27

QUESTION 1

I first tried regression but it's hard to know how well it performs, the model can easily be predicting that the value in t+1 is equal to the value in t plus/minus a random number and the chart would look pretty good anyway, in fact I can approximate the value in t+1 as a simple moving average and that would do it in most cases except during rapid increases which is what I'm trying to detect. How do you evaluate the performance of regression model in this problem?

QUESTION 2

I also tried to model it as a classification problem and here is where I'm stuck. I decided to have two labels: increase and decrease. How do you decide what window size you use? In other words, if I see a rapid increase in t, should I label the sample in t-1 as "increase"? I don't think so, maybe the clue for such a rapid increase is in t-2, or t-10.

I'm afraid that whatever window size I choose, I will be forcing the network to look for a correlation between my inputs and the label at points in which maybe there isn't any correlation to look at. Maybe sometimes the label should be in t-1, other times in t-10, t-9, t-8, …, t-1, who knows.

QUESTION 3

The price may change due to inflation and other factors, so the same product may have a price of $30 1 year ago, and $200 next year and that's fine. If I train a model as I described above, shouldn't I do something so all prices are comparable to one another?

Ideally, after I train the model I want it to be able to give good predictions regardless of the price level at that time, especially because the test dataset has samples from different periods! This is even worse if I train the model using data of different products, where for the same period I would have two products at $100 and $1000, or demands that look completely different. I have the feeling I should be relativizing those values somehow.

Thank you a lot!

You must choose a way to evaluate a forecast for your problem. It must be meaningful technically and to the stakeholders.

Find out what matters to the stakeholders about a forecast. They might say minimum error. In that case, you could use RMSE or MAE of a forecast to estimate and present the skill of a model.

Start with simple methods such as persistence and moving averages. If a ML method cannot do better than these, it is not skilful and you can move on. More on that here:

https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/
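A minimal sketch of those two baselines, scored with RMSE on the order sizes from the example table in the question. The starting index and the moving-average window of 3 are assumptions picked for illustration.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def persistence_forecasts(series, start):
    """Predict each value from index `start` onward as the previous observation."""
    return [series[t - 1] for t in range(start, len(series))]


def moving_average_forecasts(series, start, window):
    """Predict each value as the mean of the previous `window` observations."""
    return [sum(series[t - window:t]) / window for t in range(start, len(series))]


# Order sizes from the example table in the question.
size = [3, 10, 5, 8, 10, 12, 1, 21, 18, 50, 25, 20, 1, 60, 27]
start = 3  # first index with a full window of 3 prior observations
actual = size[start:]

print("persistence RMSE:", round(rmse(actual, persistence_forecasts(size, start)), 2))
print("moving avg  RMSE:", round(rmse(actual, moving_average_forecasts(size, start, 3)), 2))
```

A candidate model evaluated on the same points would need a lower error than both of these to be considered skillful.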

I would encourage you to explore as many different framings of the problem as you can think up. Framing as a classification problem is a clever idea. See how far you can push it. How to best frame the data or set window size in your case? No one knows, design experiments and discover the answers.

Perhaps look at ACF and PACF plots to get an idea of significant correlations that you can use to help design window sizes. More on that here:

https://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/

Inflation is a small effect. Nevertheless, you might need to correct data prior to modeling. E.g. transform all dollars to 2018 dollars or similar.

Also, consider modeling by product, by product groups, by all products, etc. Get creative, see what sticks. There are no right answers, only the best results you can discover on your problem given the time and resources you have available.

Does that help as a start?

Thank you for your answers and your prompt reply. I’m not sure about some things you mention, let me ask you some details.

> Find out what matters to the stakeholders about a forecast.

They want to predict spikes in the demand before they occur, but the spikes only appear sporadically, so in general if you use a moving average the error (RMSE or MAE) is pretty low. But such a simple model also misses all spikes, of course.

I’m guessing that’s what the network does for regression. Maybe there’s a loss function I can use in order to heavily penalize differences in the trend (it predicts the demand will go up while it goes down, whatever the value).

> No one knows, design experiments and discover the answers.

I’m arguing that for this problem there should be a more reliable approach that I’m not aware of. In my example no window size will make the labeling correct. This is how I -as a human- would label it assuming a small demand size implies a big demand size in the near future.

day | price | size | label

1 | 80 | 3 | normal

2 | 85 | 10 | normal

3 | 90 | 5 | normal

4 | 100 | 8 | normal

5 | 110 | 10 | normal

6 | 100 | 12 | normal

7 | 90 | 1 | increase (window size 3)

8 | 100 | 20 | normal

9 | 95 | 18 | normal

10 | 90 | 50 | decrease (window size 2)

11 | 100 | 25 | normal

12 | 110 | 1 | increase (window size 2)

13 | 100 | 20 | normal

14 | 110 | 60 | decrease (window size 1)

15 | 110 | 27 | –

As you can see I had to use different window sizes. The problem is in this silly example the labeling is pretty obvious but in reality it’s not, so I thought there was something I can do.

One idea would be to mark the previous n samples before a rapid increase as “increase”, but then the network will look at t=8 and t=9 for instance, and it will try to get some kind of pattern where there’s none. The score will be random and the performance (as in precision/recall) difficult to read!

Makes sense? No idea how to tackle this.

> you might need to correct data prior to modeling

Regarding adding multiple products in the same dataset (or one product in different periods): I’m not sure it’s used in the industry, but I thought about subtracting from a given value the moving average of the previous N values, so if the price or demand tends to increase over time as a natural process, I’ll only see the difference against previous values.

I can’t think of any other way to put together products of different price ranges in the same dataset.

Interesting.

Matt, it’s supposed to be a slog/hard work, this is the job: figuring out how to frame the problem and what works best. Running code is the easy part.

Based on this info, I would recommend looking into framing the problem as anomaly detection, perhaps a classification problem where you predict whether a spike is expected in the next interval of time. This might allow you to capture the precursors to the spike and simplify the spike such that you are not predicting the magnitude only the occurrence (simpler). If this works to any degree, you can then later see if the magnitude can be predicted also.
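One way to sketch that framing: label each time step by whether the next observation is a spike, then pair the label with a window of recent observations. The threshold of 40 and the window of 3 are assumptions chosen only so that the example data produces a couple of positive labels; the order sizes come from the first table in the question.

```python
def label_next_spike(sizes, threshold):
    """Label each time step 1 if the NEXT observation is at or above threshold."""
    return [1 if sizes[t + 1] >= threshold else 0 for t in range(len(sizes) - 1)]


def window_rows(sizes, labels, n_lags):
    """Pair the last n_lags observations up to time t with the label for t+1."""
    return [(sizes[t - n_lags + 1:t + 1], labels[t])
            for t in range(n_lags - 1, len(labels))]


size = [3, 10, 5, 8, 10, 12, 1, 21, 18, 50, 25, 20, 1, 60, 27]
labels = label_next_spike(size, threshold=40)  # 40 is an assumed cut-off
rows = window_rows(size, labels, n_lags=3)

for inputs, label in rows:
    print(inputs, "->", "spike next" if label else "no spike")
```

Because spikes are rare, the resulting classes are imbalanced, so precision and recall on the spike class are likely to be more informative than plain accuracy.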

Also, some problems are not predictable. Or not predictable with the data/resources available. Keep this in mind.

Let me know how you go.

On second thought, I think this problem is analogous to predicting movements in the stock market.

Labeling my samples would be equivalent to labeling bars before a spike in the price of a stock. In that case I guess the correct place to put the “spike” label is right before it occurs and not an arbitrary amount of time before it (let’s say 15 minutes).

LSTM should be able to learn the correct dependency even if the catalyst for the spike is not the bar I labeled as “spike”. Right?

LSTMs are poor at autoregression and I am not knee deep in your data. I cannot say anything will work for sure. You’re the expert on your problem and you must discover these answers.

LSTMs **may** be useful at classifying a sequence of obs and indicating whether an event is imminent. A ton of prior examples would be required though.

Hello Jason,

Is there a way to predict the state variance using LSTMs?

Thank you in advance.

What do you mean by state variance?

Hi Jason,

Great post.

I want to predict the turnover (in percentage) of candidates for HR analytics for the next 6 months. The factors are joining date, age, gender, overtime, commute time, rewards in the last year, years in current service, etc. Now I want to ask:

1) Is this a time series problem or a classification problem?

If I do classification, then how can I proceed with turnover predictions for upcoming months? And if I proceed with time series, then how will I take the other factors into consideration? Please advise.

You could frame this as sequence prediction or not. I would recommend exploring both approaches and see what works best for your specific data.

Jason, thanks for the reply, but the main question is how we can predict for, let’s say, the future 1st, 2nd and 3rd months consecutively, as I need to predict the percentage turnover for the next 3 months. Could you please guide me? I have different independent variables like date of joining, date of leaving, gender, salary, overtime, etc.

Does it go like this: if I have the data for the past 3 months, then the prediction is for the 4th month. Now, to predict the 5th month, do I need to merge the past 3 months with the predicted 4th month of data?

I have gone through a lot of blogs but nowhere it is clearly mentioned. I think it will also help others.

I think you are describing multi-step forecasting. You can learn more here:

https://machinelearningmastery.com/multi-step-time-series-forecasting/
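As an illustration of one multi-step framing, the sketch below builds rows where the last 3 observations map to the next 3 values at once (often called a direct or multi-output approach). The helper name and the monthly turnover values are hypothetical.

```python
def multi_step_rows(series, n_in, n_out):
    """Frame a series as n_in past values -> the next n_out values."""
    rows = []
    for i in range(len(series) - n_in - n_out + 1):
        past = series[i:i + n_in]
        future = series[i + n_in:i + n_in + n_out]
        rows.append((past, future))
    return rows


# Hypothetical monthly turnover percentages.
turnover = [4.0, 4.5, 5.0, 4.8, 5.2, 5.5, 6.0, 5.8]

for past, future in multi_step_rows(turnover, n_in=3, n_out=3):
    print(past, "->", future)
```

The alternative described in the question, predicting month 4 and then feeding that prediction back in to predict month 5, is a recursive strategy that reuses a single one-step model.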

Does that help?

Hello Jason,

I am enjoying your blogs and the two ebooks on time-series. I have been attempting to train an LSTM with a look_back value assigned. After training and testing, the plotted results have a gap equal to the look_back interval between the final training result and the first test result. Is there a way to use the train/test size split instruction to force overlap between the training dataset and the test dataset? I would like to see the first few results from the test data, even though it would be exposing the network to data previously trained on (at least part of the look_back range). I could prepare separate .csv files for training and test, but was wondering if there was a simpler way to accomplish this. Thanks

Great question.

I wrote a function to prepare data for this, you can see it here:

https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/

Let me know how you go.

The time series data samples generated by the sliding window method could not be expected to be i.i.d. (independent, identically distributed random variables) in general, so that strategy for turning time series data into training data for a standard supervised learning classifier seems questionable. At least one other seems to have brought this up in another comment above (but stated it somewhat differently).

They will not be IID, and many supervised learning methods do not make this assumption directly.

Further, the approach can prove very effective for some problems.

Hello Jason,

I find your articles in https://machinelearningmastery.com/time-series-forecasting-supervised-learning/ and https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/ superb! It’s my first time encountering articles talking about lagged values as detailed and concise as yours.

However, after reading your article in here -> https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/, I became a bit confused.

I hope you won’t be too bothered by my question since I’m a newbie in this area. Trivial as it may seem, I’ve been stuck with this problem for the longest time.

You see, I’m using a sliding window method on my univariate time series dataset, which will be fed to feed-forward ANN for forecasting.

The problem is that, when using ANN, we’re required to split the data into train and test sets. So, I was wondering if I should first restructure the data into a supervised learning problem and then split the data into train and test sets, or should I split the data first and then use sliding windows on the train and test data separately? You mentioned respecting the “temporal order of observations” in your other article, but I couldn’t quite catch the meaning behind the words.

I hope you can shed some light on this problem for me. Your help is very much appreciated. Thank you in advance!

Thanks.

Yes, structure the data as a supervised learning problem then split it into train/test.
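A small sketch comparing the two orders of operations on a toy series of 10 observations with a window of 3. The 70/30 split and helper name are assumptions for illustration only.

```python
def to_supervised(series, n_lags):
    """Return (inputs, target) rows: n_lags past values -> next value."""
    return [(series[i:i + n_lags], series[i + n_lags])
            for i in range(len(series) - n_lags)]


series = list(range(1, 11))  # toy series of 10 observations
n_lags = 3

# Option A: frame first, then split the rows chronologically (assumed 70/30).
rows = to_supervised(series, n_lags)
split = len(rows) * 7 // 10
train, test = rows[:split], rows[split:]

# Option B: split the raw series first, then frame each part separately.
raw_split = len(series) * 7 // 10
train_alt = to_supervised(series[:raw_split], n_lags)
test_alt = to_supervised(series[raw_split:], n_lags)

print(len(train), len(test))
print(len(train_alt), len(test_alt))
```

Framing first keeps the rows whose inputs straddle the split point, while framing each part separately loses the first `n_lags` targets of the test portion. In both cases the rows stay in time order, so the test rows always come after the training rows.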

Does that help? Any further confusions?

Hi Dr. Jason,

Thanks for the wonderful article.

I was wondering if there is an algorithm which will forecast based on independent variables.

I have 12 months of data with 30 features. I want to predict the next 3-6 months (dependent variable), but I don’t have independent variables for the future, so I can’t use conventional forecasting techniques like a multivariate forecasting model.

I recommend framing the data as a supervised learning problem then test a suite of machine learning algorithms.

This post will help you to get started:

https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/

Uahuuuh, Thanks Jason and all the community. I am learning from both the post and all the questions/answers ! I really appreciate it.

I am completely a newbie and I am tackling a capacity plan problem. Basically I have to create a ML/AI system that can forecast how many Compute instances need to run during the day based on previous data to cope with all the incoming requests. Because the instance will take some time to be ready I cannot rely on real-time autoscaling.

The problem is surely a multi-variate because in the game I have multiple regions ( 3 ) and the capacity plan should consider that one region can completely fail while the others would manage the increased traffic.

In my idea the features will be:

– QPS (queries per second) x Region

– Total QPS worldwide

– Day of the week

– Day of the year

The forecast would be how many QPS I should have to manage all the incoming traffic. After I have the QPS I can say how many instances I need.

I think that a time-series forecast would help me. Can you give me any hints or suggestion on how to tackle the problem?

Another concern I have is how to transfer the knowledge from the previous data analysis to the next analysis without crunching all the data from the beginning. Imagine that I run the model every hour and need to do a multistep (2-3) forecast. I already have years of back data, and I would like to avoid, if possible, crunching all of it from the beginning at every run.

Thanks

Alessandro

Sounds like a great problem. I recommend starting with a simple ml method, e.g. frame as supervised learning and test a ton of methods from sklearn.

I am currently writing a ton of tutorials on this topic. They should be up soon.

HI Jason,

Can you share the titles of the tutorials you have in mind, so I can check them out?

Thanks

Thanks for the offer.

Hi Jason,

I have a question for you. Assume there is a correlation between attributes in time series data. Is there then any restriction on the choice of algorithms to apply?

What solutions would you recommend if there are missing values in time series data? why?

Here’s help with missing data in time series:

https://machinelearningmastery.com/handle-missing-timesteps-sequence-prediction-problems-python/

hi Jason,

Suppose we have multivariate time series data but the quantity of data is small. Could you suggest any semi-supervised deep learning model for the following problems?

1 ) Regression problem

2 ) Classification problem

Perhaps try transfer learning with a model fit on a lot more time series data?