Autoregression is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step.

It is a very simple idea that can result in accurate forecasts on a range of time series problems.

In this tutorial, you will discover how to implement an autoregressive model for time series forecasting with Python.

After completing this tutorial, you will know:

- How to explore your time series data for autocorrelation.
- How to develop an autocorrelation model and use it to make predictions.
- How to use a developed autocorrelation model to make rolling predictions.

Let’s get started.

**Update May/2017**: Fixed small typo in autoregression equation.

## Autoregression

A regression model, such as linear regression, models an output value based on a linear combination of input values.

For example:

1 |
yhat = b0 + b1*X1 |

Where yhat is the prediction, b0 and b1 are coefficients found by optimizing the model on training data, and X is an input value.

This technique can be used on time series where input variables are taken as observations at previous time steps, called lag variables.

For example, we can predict the value for the next time step (t+1) given the observations at the last two time steps (t-1 and t-2). As a regression model, this would look as follows:

1 |
X(t+1) = b0 + b1*X(t-1) + b2*X(t-2) |

Because the regression model uses data from the same input variable at previous time steps, it is referred to as an autoregression (regression of self).

## Stop learning Time Series Forecasting the *slow way*

#### Sign-up and get a FREE 7-day Time Series Forecasting Mini-Course

**You will get:**

...one lesson each day delivered to your inbox

...exclusive PDF ebook containing all lessons

...confidence and skills to work through your own projects

Download Your FREE Mini-Course

## Autocorrelation

An autoregression model makes an assumption that the observations at previous time steps are useful to predict the value at the next time step.

This relationship between variables is called correlation.

If both variables change in the same direction (e.g. go up together or down together), this is called a positive correlation. If the variables move in opposite directions as values change (e.g. one goes up and one goes down), then this is called negative correlation.

We can use statistical measures to calculate the correlation between the output variable and values at previous time steps at various different lags. The stronger the correlation between the output variable and a specific lagged variable, the more weight that autoregression model can put on that variable when modeling.

Again, because the correlation is calculated between the variable and itself at previous time steps, it is called an autocorrelation. It is also called serial correlation because of the sequenced structure of time series data.

The correlation statistics can also help to choose which lag variables will be useful in a model and which will not.

Interestingly, if all lag variables show low or no correlation with the output variable, then it suggests that the time series problem may not be predictable. This can be very useful when getting started on a new dataset.

In this tutorial, we will investigate the autocorrelation of a univariate time series then develop an autoregression model and use it to make predictions.

Before we do that, let’s first review the Minimum Daily Temperatures data that will be used in the examples.

## Minimum Daily Temperatures Dataset

This dataset describes the minimum daily temperatures over 10 years (1981-1990) in the city Melbourne, Australia.

The units are in degrees Celsius and there are 3,650 observations. The source of the data is credited as the Australian Bureau of Meteorology.

Learn more about the dataset here.

Download the dataset into your current working directory with the filename “*daily-minimum-temperatures.csv*“.

**Note**: The downloaded file contains some question mark (“?”) characters that must be removed before you can use the dataset. Open the file in a text editor and remove the “?” characters. Also remove any footer information in the file.

The code below will load the dataset as a Pandas Series.

1 2 3 4 5 6 |
from pandas import Series from matplotlib import pyplot series = Series.from_csv('daily-minimum-temperatures.csv', header=0) print(series.head()) series.plot() pyplot.show() |

Running the example prints the first 5 rows from the loaded dataset.

1 2 3 4 5 6 7 |
Date 1981-01-01 20.7 1981-01-02 17.9 1981-01-03 18.8 1981-01-04 14.6 1981-01-05 15.8 Name: Temp, dtype: float64 |

A line plot of the dataset is then created.

## Quick Check for Autocorrelation

There is a quick, visual check that we can do to see if there is an autocorrelation in our time series dataset.

We can plot the observation at the previous time step (t-1) with the observation at the next time step (t+1) as a scatter plot.

This could be done manually by first creating a lag version of the time series dataset and using a built-in scatter plot function in the Pandas library.

But there is an easier way.

Pandas provides a built-in plot to do exactly this, called the lag_plot() function.

Below is an example of creating a lag plot of the Minimum Daily Temperatures dataset.

1 2 3 4 5 6 |
from pandas import Series from matplotlib import pyplot from pandas.tools.plotting import lag_plot series = Series.from_csv('daily-minimum-temperatures.csv', header=0) lag_plot(series) pyplot.show() |

Running the example plots the temperature data (t) on the x-axis against the temperature on the previous day (t-1) on the y-axis.

We can see a large ball of observations along a diagonal line of the plot. It clearly shows a relationship or some correlation.

This process could be repeated for any other lagged observation, such as if we wanted to review the relationship with the last 7 days or with the same day last month or last year.

Another quick check that we can do is to directly calculate the correlation between the observation and the lag variable.

We can use a statistical test like the Pearson correlation coefficient. This produces a number to summarize how correlated two variables are between -1 (negatively correlated) and +1 (positively correlated) with small values close to zero indicating low correlation and high values above 0.5 or below -0.5 showing high correlation.

Correlation can be calculated easily using the corr() function on the DataFrame of the lagged dataset.

The example below creates a lagged version of the Minimum Daily Temperatures dataset and calculates a correlation matrix of each column with other columns, including itself.

1 2 3 4 5 6 7 8 9 10 |
from pandas import Series from pandas import DataFrame from pandas import concat from matplotlib import pyplot series = Series.from_csv('daily-minimum-temperatures.csv', header=0) values = DataFrame(series.values) dataframe = concat([values.shift(1), values], axis=1) dataframe.columns = ['t-1', 't+1'] result = dataframe.corr() print(result) |

This is a good confirmation for the plot above.

It shows a strong positive correlation (0.77) between the observation and the lag=1 value.

1 2 3 |
t-1 t+1 t-1 1.00000 0.77487 t+1 0.77487 1.00000 |

This is good for one-off checks, but tedious if we want to check a large number of lag variables in our time series.

Next, we will look at a scaled-up version of this approach.

## Autocorrelation Plots

We can plot the correlation coefficient for each lag variable.

This can very quickly give an idea of which lag variables may be good candidates for use in a predictive model and how the relationship between the observation and its historic values changes over time.

We could manually calculate the correlation values for each lag variable and plot the result. Thankfully, Pandas provides a built-in plot called the autocorrelation_plot() function.

The plot provides the lag number along the x-axis and the correlation coefficient value between -1 and 1 on the y-axis. The plot also includes solid and dashed lines that indicate the 95% and 99% confidence interval for the correlation values. Correlation values above these lines are more significant than those below the line, providing a threshold or cutoff for selecting more relevant lag values.

1 2 3 4 5 6 |
from pandas import Series from matplotlib import pyplot from pandas.tools.plotting import autocorrelation_plot series = Series.from_csv('daily-minimum-temperatures.csv', header=0) autocorrelation_plot(series) pyplot.show() |

Running the example shows the swing in positive and negative correlation as the temperature values change across summer and winter seasons each previous year.

The statsmodels library also provides a version of the plot in the plot_acf() function as a line plot.

1 2 3 4 5 6 |
from pandas import Series from matplotlib import pyplot from statsmodels.graphics.tsaplots import plot_acf series = Series.from_csv('daily-minimum-temperatures.csv', header=0) plot_acf(series, lags=31) pyplot.show() |

In this example, we limit the lag variables evaluated to 31 for readability.

Now that we know how to review the autocorrelation in our time series, let’s look at modeling it with an autoregression.

Before we do that, let’s establish a baseline performance.

## Persistence Model

Let’s say that we want to develop a model to predict the last 7 days of minimum temperatures in the dataset given all prior observations.

The simplest model that we could use to make predictions would be to persist the last observation. We can call this a persistence model and it provides a baseline of performance for the problem that we can use for comparison with an autoregression model.

We can develop a test harness for the problem by splitting the observations into training and test sets, with only the last 7 observations in the dataset assigned to the test set as “unseen” data that we wish to predict.

The predictions are made using a walk-forward validation model so that we can persist the most recent observations for the next day. This means that we are not making a 7-day forecast, but 7 1-day forecasts.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
from pandas import Series from pandas import DataFrame from pandas import concat from matplotlib import pyplot from sklearn.metrics import mean_squared_error series = Series.from_csv('daily-minimum-temperatures.csv', header=0) # create lagged dataset values = DataFrame(series.values) dataframe = concat([values.shift(1), values], axis=1) dataframe.columns = ['t-1', 't+1'] # split into train and test sets X = dataframe.values train, test = X[1:len(X)-7], X[len(X)-7:] train_X, train_y = train[:,0], train[:,1] test_X, test_y = test[:,0], test[:,1] # persistence model def model_persistence(x): return x # walk-forward validation predictions = list() for x in test_X: yhat = model_persistence(x) predictions.append(yhat) test_score = mean_squared_error(test_y, predictions) print('Test MSE: %.3f' % test_score) # plot predictions vs expected pyplot.plot(test_y) pyplot.plot(predictions, color='red') pyplot.show() |

Running the example prints the mean squared error (MSE).

The value provides a baseline performance for the problem.

1 |
Test MSE: 3.423 |

The expected values for the next 7 days are plotted (blue) compared to the predictions from the model (red).

## Autoregression Model

An autoregression model is a linear regression model that uses lagged variables as input variables.

We could calculate the linear regression model manually using the LinearRegession class in scikit-learn and manually specify the lag input variables to use.

Alternately, the statsmodels library provides an autoregression model that automatically selects an appropriate lag value using statistical tests and trains a linear regression model. It is provided in the AR class.

We can use this model by first creating the model AR() and then calling fit() to train it on our dataset. This returns an ARResult object.

Once fit, we can use the model to make a prediction by calling the predict() function for a number of observations in the future. This creates 1 7-day forecast, which is different from the persistence example above.

The complete example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
from pandas import Series from matplotlib import pyplot from statsmodels.tsa.ar_model import AR from sklearn.metrics import mean_squared_error series = Series.from_csv('daily-minimum-temperatures.csv', header=0) # split dataset X = series.values train, test = X[1:len(X)-7], X[len(X)-7:] # train autoregression model = AR(train) model_fit = model.fit() print('Lag: %s' % model_fit.k_ar) print('Coefficients: %s' % model_fit.params) # make predictions predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False) for i in range(len(predictions)): print('predicted=%f, expected=%f' % (predictions[i], test[i])) error = mean_squared_error(test, predictions) print('Test MSE: %.3f' % error) # plot results pyplot.plot(test) pyplot.plot(predictions, color='red') pyplot.show() |

Running the example first prints the chosen optimal lag and the list of coefficients in the trained linear regression model.

We can see that a 29-lag model was chosen and trained. This is interesting given how close this lag is to the average number of days in a month.

The 7 day forecast is then printed and the mean squared error of the forecast is summarized.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
Lag: 29 Coefficients: [ 5.57543506e-01 5.88595221e-01 -9.08257090e-02 4.82615092e-02 4.00650265e-02 3.93020055e-02 2.59463738e-02 4.46675960e-02 1.27681498e-02 3.74362239e-02 -8.11700276e-04 4.79081949e-03 1.84731397e-02 2.68908418e-02 5.75906178e-04 2.48096415e-02 7.40316579e-03 9.91622149e-03 3.41599123e-02 -9.11961877e-03 2.42127561e-02 1.87870751e-02 1.21841870e-02 -1.85534575e-02 -1.77162867e-03 1.67319894e-02 1.97615668e-02 9.83245087e-03 6.22710723e-03 -1.37732255e-03] predicted=11.871275, expected=12.900000 predicted=13.053794, expected=14.600000 predicted=13.532591, expected=14.000000 predicted=13.243126, expected=13.600000 predicted=13.091438, expected=13.500000 predicted=13.146989, expected=15.700000 predicted=13.176153, expected=13.000000 Test MSE: 1.502 |

A plot of the expected (blue) vs the predicted values (red) is made.

The forecast does look pretty good (about 1 degree Celsius out each day), with big deviation on day 5.

The statsmodels API does not make it easy to update the model as new observations become available.

One way would be to re-train the AR model each day as new observations become available, and that may be a valid approach, if not computationally expensive.

An alternative would be to use the learned coefficients and manually make predictions. This requires that the history of 29 prior observations be kept and that the coefficients be retrieved from the model and used in the regression equation to come up with new forecasts.

The coefficients are provided in an array with the intercept term followed by the coefficients for each lag variable starting at t-1 to t-n. We simply need to use them in the right order on the history of observations, as follows:

1 |
yhat = b0 + b1*X1 + b2*X2 ... bn*Xn |

Below is the complete example.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
from pandas import Series from matplotlib import pyplot from statsmodels.tsa.ar_model import AR from sklearn.metrics import mean_squared_error series = Series.from_csv('daily-minimum-temperatures.csv', header=0) # split dataset X = series.values train, test = X[1:len(X)-7], X[len(X)-7:] # train autoregression model = AR(train) model_fit = model.fit() window = model_fit.k_ar coef = model_fit.params # walk forward over time steps in test history = train[len(train)-window:] history = [history[i] for i in range(len(history))] predictions = list() for t in range(len(test)): length = len(history) lag = [history[i] for i in range(length-window,length)] yhat = coef[0] for d in range(window): yhat += coef[d+1] * lag[window-d-1] obs = test[t] predictions.append(yhat) history.append(obs) print('predicted=%f, expected=%f' % (yhat, obs)) error = mean_squared_error(test, predictions) print('Test MSE: %.3f' % error) # plot pyplot.plot(test) pyplot.plot(predictions, color='red') pyplot.show() |

Again, running the example prints the forecast and the mean squared error.

1 2 3 4 5 6 7 8 |
predicted=11.871275, expected=12.900000 predicted=13.659297, expected=14.600000 predicted=14.349246, expected=14.000000 predicted=13.427454, expected=13.600000 predicted=13.374877, expected=13.500000 predicted=13.479991, expected=15.700000 predicted=14.765146, expected=13.000000 Test MSE: 1.451 |

We can see a small improvement in the forecast when comparing the error scores.

## Further Reading

This section provides some resources if you are looking to dig deeper into autocorrelation and autoregression.

- Autocorrelation on Wikipedia
- Autoregressive model on Wikipedia
- Chapter 7 – Regression-Based Models: Autocorrelation and External Information, Practical Time Series Forecasting with R: A Hands-On Guide.
- Section 4.5 – Autoregressive Models, Introductory Time Series with R.

## Summary

In this tutorial, you discovered how to make autoregression forecasts for time series data using Python.

Specifically, you learned:

- About autocorrelation and autoregression and how they can be used to better understand time series data.
- How to explore the autocorrelation in a time series using plots and statistical tests.
- How to train an autoregression model in Python and use it to make short-term and rolling forecasts.

Do you have any questions about autoregression, or about this tutorial?

Ask your questions in the comments below and I will do my best to answer.

Thank you Jason for the awesome article

In case anyone hits the same problem I had –

I downloaded the data from the link above as a csv file.

It was failing to be imported due to three rows in the temperature column containing ‘?’.

Once these were removed the data imported ok.

Thanks for the heads up Gary.

Hey Jason, thanks for the article. How would you go about forecasting from the end of the file when expected value is not known?

Hi Tim, you can use mode.predict() as in the example and specify the index of the time step to be predicted.

Hi Jason,

Thanks for all of your wonderful blogs. They are really helping a lot. One question regarding this post is that I believe that AR modeling also presume that time series is stationary as the observations should be i.i.d. .Does that AR function from statsmodels library checks for stationary and use the de-trended de-seasonalized time series by itself if required? Also, if we use sckit learn library for AR model as you described do we need to check for and make adjustments by ourselfs for this?

Hi Farrukh, great question.

The AR in statsmodels does assume that the data is stationary.

If your data is not stationary, you must make it stationary (e.g. differencing and other transforms).

Thanks for the answer. Though we did not conduct proper test here for trend/seasonal stationarity check in the example above but from figure apparently it seems like that there is a seasonal effect. So in that case whether applying AR model is good to go?

Great question Farrukh.

AR is designed to be used on stationary data, meaning data with no seasonal or trend information.

Or to be specific, is it OK to apply AR model direct here on the given data without checking the seasonality and removing it if present which is showing some signs in first graph apparently?

Dear Dr Jason,

I had a go at the ‘persistence model’ section.

The dataset I used was the sp500.csv dataset.

From your code

As soon as I try to compute the “test_score”, I get the following error,

Any idea,

Anthony of Sydney NSW

It looks like you need to convert your data to floating point values.

E.g. in numpy this might be:

Dear Dr Jason,

Fixed the problem. You cannot assume that all *.csv numbers are floats or ints. For some reason, the numbers seem to be enclosed in quotes. Note that the data is is for the sp500.csv not the above exercise.

Note the numbers in the output are enclosed in quotes:

It works now,

Regards

Anthony from Sydney

Glad to hear you fixed it Anthony.

Dear Dr Jason,

The problem has been fixed. The values in the array were strings, so I had to convert them to strings.

So I converted the strings in each array to float.

Hope that helps others trying to convert values to appropriate data types in order to do numerical calculations.

Anthony from Sydney NSW

how can I show my prediction as date, instead of log, for example I have data set incident number for each week, I want to predict the following week

week1 669

week2 443

week3 555

so on april week 1 I want to show time series the prediction of week1 week2 of April

Sorry, I’m not sure I understand. Perhaps you can give a fuller example of inputs and outputs?

Thanks for this wonderful tutorial. I discovered your articles on facebook last year. Since then I have been following all your tutorials and I must confess that, though I started learning about machine learning in less than a year, my knowledge base has tremendously increased as a result of this free services you have been posting for all to see on the website.

Thanks once more for this generosity Dr Jason.

My question is that, I have carefully followed all your procedures in this article on my data set that was recorded every 10mins for 2 months. Please I would like to know which time lag is appropriate for forecasting to see the next 7 days value or more or less.

My time series is a stationary one according to the test statistics and histogram I have applied on it. but I still don’t know if a reasonable conclusion can be reached with a data set that was recorded for 2 months every 10mins daily.

Well done on your progress!

This post gives you some ideas on how to select suitable q and p values (lag vars):

http://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/

I hope that helps as a start.

Dr Jason,

Thank you so much for this post. I have finally learned how to go from theory to practice.

I’m so glad to hear that Soy, thanks!